An Approach to Text Mining using Information Extraction
Mining, Indexing, and Querying Historical Spatiotemporal Data

Nikos Mamoulis (University of Hong Kong), Marios Hadjieleftheriou (University of California, Riverside), Huiping Cao (University of Hong Kong), Yufei Tao (City University of Hong Kong), George Kollios (Boston University), David W. Cheung (University of Hong Kong)

ABSTRACT

In many applications that track and analyze spatiotemporal data, movements obey periodic patterns; the objects follow the same routes (approximately) over regular time intervals. For example, people wake up at the same time and follow more or less the same route to their work every day. The discovery of hidden periodic patterns in spatiotemporal data, apart from unveiling important information to the data analyst, can facilitate data management substantially. Based on this observation, we propose a framework that analyzes, manages, and queries object movements that follow such patterns. We define the spatiotemporal periodic pattern mining problem and propose an effective and fast mining algorithm for retrieving maximal periodic patterns. We also devise a novel, specialized index structure that can benefit from the discovered patterns to support more efficient execution of spatiotemporal queries. We evaluate our methods experimentally using datasets with object trajectories that exhibit periodicity.

Categories & Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining
Keywords: Spatiotemporal data, Trajectories, Pattern mining, Indexing

1. INTRODUCTION

The efficient management of spatiotemporal data has gained much interest during the past few years [10, 13, 4, 12], mainly due to the rapid advancements in telecommunications (e.g., GPS, cellular networks, etc.), which facilitate the collection of large datasets of such information. Management and analysis of moving object trajectories is challenging due to the vast amount of collected data and novel types of spatiotemporal queries.

(This work was supported by grant HKU 7149/03E from Hong Kong RGC and partially supported by NSF grants IIS-0308213 and Career Award IIS-0133825.)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'04, August 22-25, 2004, Seattle, Washington, USA. Copyright 2004 ACM 1-58113-888-1/04/0008...$5.00.

In many applications, the movements obey periodic patterns; i.e., the objects follow the same routes (approximately) over regular time intervals. Objects that follow approximate periodic patterns include transportation vehicles (buses, boats, airplanes, trains, etc.), animal movements, mobile phone users, etc. For example, Bob wakes up at the same time and then follows, more or less, the same route to his work every day.
Based on this observation, which has been overlooked in past research, we propose a framework for mining, indexing and querying periodic spatiotemporal data.

The problem of discovering periodic patterns from historical object movements is very challenging. Usually, the patterns are not explicitly specified, but have to be mined from the data. The patterns can be thought of as (possibly non-contiguous) sequences of object locations that reappear in the movement history periodically. Moreover, since we do not expect an object to visit exactly the same locations at every time instant of each period, the patterns are not rigid but differ slightly from one occurrence to the next. The pattern occurrences may also be shifted in time (e.g., due to traffic delays or Bob waking up late again). The approximate nature of patterns in the spatiotemporal domain increases the complexity of mining tasks. We need to discover, along with the patterns, a flexible description of how they vary in space and time. Previous approaches have studied the extraction of patterns from long event sequences [5, 7]. We identify the difference between the two problems and propose novel techniques for mining periodic patterns from a large historical collection of object movements.

In addition, we design a novel indexing scheme that exploits periodic pattern information to organize historical spatiotemporal data, such that spatiotemporal queries are efficiently processed. Since the patterns are accurate approximations of object trajectories, they can be managed in a lightweight index structure, which can be used for pruning large parts of the search space without having to access the actual data from storage. This index is optimized for providing fast answers to range queries with temporal predicates. Effective indexing is not the only application of the mined patterns; since they are compact summaries of the actual trajectories, we can use them to compress and replace historical data to save space. Finally, periodic patterns can predict future movements of objects that follow them.

The rest of the paper is organized as follows. Section 2 presents related work. In Section 3, we give a concrete formulation of periodic patterns in object trajectories and propose effective mining techniques. Section 4 presents the indexing scheme that exploits spatiotemporal patterns. We present a concise experimental evaluation of our techniques in Section 5. Finally, Section 6 concludes with a discussion about future work.

2. RELATED WORK

Our work is related to two research problems. The first is data mining in spatiotemporal and time-series databases.
The second is management of spatiotemporal data.Previous work on spatiotemporal data mining focuses on two types of patterns:(i)frequent movements of objects over time and(ii) evolution of natural phenomena,such as forest coverage.[14] studies the discovery of frequent patterns related to changes of natural phenomena(e.g.,temperature changes)in spatial regions.In general,there is limited work on spatiotemporal data mining,which has been treated as a generalization of pattern mining in time-series data(e.g.,see[14,9]).The locations of objects or the changes of natural phenomena over time are mapped to sequences of values.For instance, we can divide the map into spatial regions and replace the location of the object at each timestamp,by the region-id where it is located.Similarly,we can model the change of temperature in a spatial region as a sequence of tempera-ture values.Continuous domains of the resulting time-series data are discretized,prior to mining.In the case of multi-ple moving objects(or time-series),trajectories are typically concatenated to a single long sequence.Then,an algorithm that discovers frequent subsequences in a long sequence(e.g., [16])is applied.Periodicity has only been studied in the context of time-series databases.[6]addresses the following problem.Given a long sequence S and a period T,the aim is to discover the most representative trend that repeats itself in S every T timestamps.Exact search might be slow;thus,[6]pro-poses an approximate search technique based on sketches. However,the discovered trend for a given T is only one and spans the whole periodic interval.In[8],the problem offind-ing association rules that repeat themselves in every period of a data sequence is addressed.The discovery of multiple par-tial periodical patterns that do not appear in every periodic segment wasfirst studied in[5].A version of the well-known Apriori algorithm[1]was adapted for the problem offinding patterns of the form*AB**C,where A,B,and C are specific symbols(e.g.,event types)and*could be any symbol(T= 6,in this example).This pattern may not repeat itself in ev-ery period,but it must appear at least min sup times,where min sup is a user-defined parameter.In[5],a faster mining method for this problem was also proposed,which uses a tree structure to count the support of multiple patterns at two database scans.[7]studies the problem offinding sets of events that appear together periodically.In each qualifying period,the set of events may not appear in exactly the same positions,but their occurrence may be shifted or disrupted, due to the presence of noise.However,this work does not consider the order of events in such patterns.On the other hand,it addresses the problem of mining patterns and their periods automatically.Finally,[15]studies the problem of finding patterns,which appear in at least a minimum num-ber of consecutive periodic intervals and groups of such in-tervals are allowed to be separated by at most a time interval threshold.A number of spatial access methods,which are variants of the R–tree[3]have developed for the management of moving object trajectories.[10]proposes3D variants of this access method,suitable for indexing historical spatiotemporal data. 
Time is modeled as a third dimension and each moving object trajectory is mapped to a polyline in this 3D space. The polyline is then decomposed into a sequence of 3D line segments, tagged with the object-id they correspond to. The segments, in turn, are indexed by variants of the 3D R-tree, which differ in the criteria they use to split their nodes. Although this generic method is always applicable, it stores redundant information if the positions of the objects do not constantly change. Other works [13, 4] propose multi-version variants of the R-tree, which share similar concepts to access methods for time-evolving data [11]. Recently [12], there is an increasing interest in (approximate) aggregate queries on spatiotemporal data, e.g., "find the distinct number of objects that were in region r during a specific time interval".

3. PERIODIC PATTERNS IN OBJECT TRAJECTORIES

In our model, we assume that the locations of objects are sampled over a long history. In other words, the movement of an object is tracked as an n-length sequence S of spatial locations, one for each timestamp in the history, of the form {(l_0, t_0), (l_1, t_1), ..., (l_{n-1}, t_{n-1})}, where l_i is the object's location at time t_i. If the difference between consecutive timestamps is fixed (locations are sampled every regular time interval), we can represent the movement by a simple sequence of locations l_i (i.e., by dropping the timestamps t_i, since they can be implied). Each location l_i is expressed in terms of spatial coordinates. Figure 1a, for example, illustrates the movement of an object in three consecutive days (assuming that it is tracked only during specific hours, e.g., working hours). We can model it with sequence S = {<4, 9>, <3.5, 8>, ..., <6.5, 3.9>, <4.1, 9>, ...}. Given such a sequence, a minimum support min_sup, and an integer T, called period, our problem is to discover movement patterns that repeat themselves every T timestamps. A discovered pattern P is a T-length sequence of the form r_0 r_1 ... r_{T-1}, where r_i is a spatial region or the special character *, indicating the whole spatial universe. For instance, pattern AB*C** implies that at the beginning of the cycle the object is in region A, at the next timestamp it is found in region B, then it moves irregularly (it can be anywhere), then it goes to region C, and after that it can go anywhere, until the beginning of the next cycle, when it can be found again in region A. The patterns are required to be followed by the object in at least min_sup periodic intervals in S.

Existing algorithms for mining periodic patterns (e.g., [5]) operate on event sequences and discover patterns of the above form. However, in this case, the elements r_i of a pattern are events (or sets of events). As a result, we cannot directly apply these techniques to our problem, unless we treat the exact locations l_i as discrete categorical values. Nevertheless, it is highly unlikely that an object will repeat an identical sequence of (x, y) locations precisely. Even if the spatial route is precise, the location transmissions at each timestamp are unlikely to be perfectly synchronized. Thus, the object will not reach the same location at the same time every day, and as a result the sampled locations at specific timestamps (e.g., at 9:00 a.m. sharp, every day) will be different. In Figure 1a, for example, the first daily locations of the object are very close to each other; however, they will be treated differently by a straightforward mining algorithm.

[Figure 1: Periodic patterns with respect to predefined spatial regions. (a) An object's movement over three days; (b) a set of predefined regions A-O; (c) event-based patterns for the event sequence AACCCG | AACBDG | AAACHG, e.g., support(AAC**G) = 2, support(AA***G) = 3, support(AA*C*G) = 2.]
One way to handle the noise in object movement is to replace the exact locations of the objects by the regions (e.g., districts, mobile communication cells, or cells of a synthetic grid) which contain them. Figure 1b shows an example of an area's division into such regions. Sequence {A, A, C, C, C, G, A, ...} can now summarize the object's movement, and periodic sequence pattern mining algorithms, like [5], can directly be applied. Figure 1c shows three (closed) discovered patterns for T = 6 and min_sup = 2. A disadvantage of this approach is that the discovered patterns may not be very descriptive, if the space division is not very detailed. For example, regions A and C are too large to capture in detail the first three positions of the object in each periodic instance. On the other hand, with detailed space divisions, the same (approximate) object location may span more than one region. For example, in Figure 1b, observe that the third object positions for the three days are close to each other; however, they fall into different regions (A and C) on different days. Therefore, we are interested in the automated discovery of patterns and their descriptive regions. Before we present methods for this problem, we will first define it formally.

3.1 Problem definition

Let S be a sequence of n spatial locations {l_0, l_1, ..., l_{n-1}}, representing the movement of an object over a long history. Let T << n be an integer called period (e.g., day, week, month). A periodic segment s is defined by a subsequence l_i l_{i+1} ... l_{i+T-1} of S, such that i modulo T = 0. Thus, segments start at positions 0, T, ..., (m-1)·T, and there are exactly m = floor(n/T) periodic segments in S. (If n is not a multiple of T, then the last n modulo T locations are truncated and the length n of sequence S is reduced accordingly.) Let s^j denote the segment starting at position l_{j·T} of S, for 0 <= j < m, and let s^j_i = l_{j·T+i}, for 0 <= i < T. A periodic pattern P is defined by a sequence r_0 r_1 ... r_{T-1} of length T, such that r_i is either a spatial region or *. The length of a periodic pattern P is the number of non-* regions in P. A segment s^j is said to comply with P if, for each r_i in P, r_i = * or s^j_i is inside region r_i. The support |P| of a pattern P in S is defined by the number of periodic segments in S that comply with P. We sometimes use the same symbol P to refer to a pattern and to the set of segments that comply with it. Let min_sup <= m be a positive integer (minimum support). A pattern P is frequent if its support is larger than min_sup.
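To make these definitions concrete, the following is a minimal sketch (ours, not the authors' code) of segment extraction, pattern compliance, and support counting. Regions are represented here simply as axis-aligned boxes; in the paper the regions are dense clusters, so this is only an illustration of the definitions.

    def segments(S, T):
        """Split the location sequence S into m = len(S) // T periodic segments,
        truncating the last len(S) % T locations as described above."""
        m = len(S) // T
        return [S[j * T:(j + 1) * T] for j in range(m)]

    def complies(segment, pattern):
        """A segment complies with P = r_0 ... r_{T-1} if, at every position i,
        r_i is '*' or the segment's i-th location falls inside region r_i.
        Here a region is an axis-aligned box ((xmin, ymin), (xmax, ymax))."""
        for loc, r in zip(segment, pattern):
            if r == '*':
                continue
            (xmin, ymin), (xmax, ymax) = r
            if not (xmin <= loc[0] <= xmax and ymin <= loc[1] <= ymax):
                return False
        return True

    def support(S, T, pattern):
        """|P|: the number of periodic segments of S that comply with P."""
        return sum(complies(s, pattern) for s in segments(S, T))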
A problem with the definition above is that it imposes no control over the density of the pattern regions r_i. In other words, if the pattern regions are too relaxed (e.g., each r_i is the whole map), the pattern may always be frequent. Therefore, we impose an additional constraint as follows. Let S^P be the set of segments that comply with a pattern P. Then each region r_i of P is valid if the set of locations R^P_i := {s^j_i | s^j in S^P} forms a dense cluster. To define a dense cluster, we borrow the definitions from [2] and use two parameters, eps and MinPts. A point p in the spatial dataset R^P_i is a core point if the circular range centered at p with radius eps contains at least MinPts points. If a point q is within distance eps from a core point p, it is assigned to the same cluster as p. If q is a core point itself, then all points within distance eps from q are assigned to the same cluster as p and q. If R^P_i forms a single, dense cluster with respect to some values of the parameters eps and MinPts, we say that region r_i is valid. If all non-* regions of P are valid, then P is a valid pattern. We are interested in the discovery of valid patterns only. In the following, we use the terms valid region and dense cluster interchangeably; i.e., we will often use the term dense region to refer to a spatial dense cluster and the points in it. Figure 2a shows an example of a valid pattern, if eps = 1.5 and MinPts = 4. Each region at positions 1, 2, and 3 forms a single, dense cluster and is therefore a dense region.

Notice, however, that it is possible for two valid patterns P and P' of the same length to (i) have the same * positions, (ii) be such that every segment that complies with P also complies with P', and (iii) satisfy |P| < |P'|. In other words, P implies P'. For example, the pattern of Figure 2a implies the one of Figure 2b (denoted by the three circles). A frequent pattern P' is redundant if it is implied by some other frequent pattern P.

[Figure 2: Redundancy of patterns. (a) A valid pattern; (b) a redundant pattern.]

The mining periodic patterns problem searches for all valid periodic patterns P in S which are frequent and non-redundant with respect to a minimum support min_sup. For simplicity, we will use 'frequent pattern' to refer to a valid, non-redundant frequent pattern.

3.2 Mining periodic patterns

In this section, we present techniques for mining frequent periodic patterns and their associated regions in a long history of object trajectories. We first address the problem of finding frequent 1-patterns (i.e., of length 1). Then, we propose two methods to find longer patterns: a bottom-up, level-wise technique and a faster top-down approach.

3.2.1 Obtaining frequent 1-patterns

Including automatic discovery of regions in the mining task does not allow for the direct application of techniques that find patterns in sequences (e.g., [5]), as discussed. In order to tackle this problem, we propose the following methodology. We divide the sequence S of locations into T spatial datasets, one for each offset of the period T. In other words, locations {l_i, l_{i+T}, ..., l_{i+(m-1)·T}} go to set R_i, for each 0 <= i < T. Each location is tagged by the id j in [0, ..., m-1] of the segment that contains it. Figure 3a shows the spatial datasets obtained after decomposing the object trajectory of Figure 1a. We use a different symbol to denote locations that correspond to different periodic offsets and different colors for different segment-ids.

[Figure 3: Locations and regions per periodic offset. (a) T-based decomposition; (b) dense clusters in the R_i's.]

Observe that a dense cluster r in dataset R_i corresponds to a frequent pattern, having * at all positions and r at position i. Figure 3b shows examples of five clusters discovered in datasets R_1, R_2, R_3, R_4, and R_6. These correspond to five 1-patterns (i.e., r_11*****, *r_21****, etc.). In order to identify the dense clusters for each R_i, we can apply a density-based clustering algorithm like DBSCAN [2]. Clusters with fewer than min_sup points are discarded, since they are not frequent 1-patterns according to our definition.
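A minimal sketch of this decomposition step (our own illustration, not the authors' implementation): each location is routed to the dataset R_i of its offset within the period and tagged with the id of the segment that contains it, after which any per-R_i clustering routine can be applied.

    from collections import defaultdict

    def decompose(S, T):
        """Build the T spatial datasets R_0, ..., R_{T-1}: location l_k goes to
        R_{k mod T}, tagged with the id j = k // T of the periodic segment that
        contains it. Locations of the truncated final, incomplete segment are ignored."""
        m = len(S) // T
        R = defaultdict(list)
        for k, loc in enumerate(S[:m * T]):
            R[k % T].append((loc, k // T))      # (location, segment-id)
        return R

    # Dense clusters with fewer than min_sup tagged points would then be discarded,
    # since they cannot correspond to frequent 1-patterns.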
Clustering is quite expensive, and it is a frequently used module of the mining algorithms, as we will see later. DBSCAN [2] has quadratic cost in the number of clustered points, unless an index (e.g., an R-tree) is available. Since R-trees are not available for every set of arbitrary points to be clustered, we use a hash-based method that divides the 2D space using a regular grid with cell extent eps/sqrt(2) x eps/sqrt(2). This grid is used to hash the points into buckets according to the cell that contains them. The rationale for choosing this cell size is that if one cell contains at least MinPts points, we know for sure that it is dense and need not perform any range queries for the objects in it. The remainder of the algorithm merges dense cells that contain points within distance eps (using inexpensive minimum bounding rectangle tests or a spatial join, if required) and applies eps-range queries from objects located in sparse cells to assign them to clusters and potentially merge clusters. Our clustering technique is fast because not only does it avoid R-tree construction, but it also minimizes expensive distance computations. The details of this algorithm are omitted for the sake of readability.

3.2.2 A level-wise, bottom-up approach

Starting from the discovered 1-patterns (i.e., clusters for each R_i), we can apply a variant of the level-wise Apriori-TID algorithm [1] to discover longer ones, as shown in Figure 4. The input of our algorithm is a collection L_1 of frequent 1-patterns, discovered as described in the previous paragraph; for each R_i, 0 <= i < T, and each dense region r in R_i, there is a 1-pattern in L_1. Pairs <P_1, P_2> of (k-1)-patterns in L_{k-1}, with their first k-2 non-* regions in the same position and a different (k-1)-th non-* position, create candidate k-patterns (lines 4-6). For each candidate pattern P_cand, we then perform a segment-id join between P_1 and P_2 and, if the number of segments that comply with both patterns is at least min_sup, we run a pattern validation function to check whether the regions of P_cand are still clusters. After the patterns of length k have been discovered, we find the patterns at the next level, until there are no more patterns at the current level, or there are no more levels.

Algorithm STPMine1(L_1, T, min_sup);
1)  k := 2;
2)  while (L_{k-1} != empty and k < T)
3)    L_k := empty;
4)    for each pair of patterns (P_1, P_2) in L_{k-1}
5)      such that P_1 and P_2 agree on the first k-2
6)      and have different (k-1)-th non-* position
7)        P_cand := candidate_gen(P_1, P_2);
8)        if (P_cand != null) then
9)          P_cand := P_1 join_{P_1.sid = P_2.sid} P_2;   // segment-id join
10)         if |P_cand| >= min_sup then
11)           validate_pattern(P_cand, L_k, min_sup);
12)   k := k + 1;
13) return P := union of all L_k, for 1 <= k < T;

Figure 4: Level-wise pattern mining

In order to facilitate fast and effective candidate generation, we use the MBRs (i.e., minimum bounding rectangles) of the pattern regions. For each common non-* position i, the intersection of the MBRs of the regions for P_1 and P_2 must be non-empty, otherwise a valid superpattern cannot exist. The intersection is adopted as an approximation for the new pattern P_cand at each such position i. During candidate pruning, we check for every (k-1)-subpattern of P_cand whether there is at least one pattern in L_{k-1} which agrees in the non-* positions with the subpattern and whose MBR-intersection with it is non-empty at all those positions. In such a case, we accept P_cand as a candidate pattern. Otherwise, we know that P_cand cannot be a valid pattern, since some of its subpatterns (with common space covered by the non-* regions) are not included in L_{k-1}.

Function validate_pattern takes as input a k-length candidate pattern P_cand and computes a number of actual k-length patterns from it. The rationale is that the points at all non-* positions of P_cand may not form a cluster anymore after the join of P_1 and P_2. Thus, for each non-* position of P_cand we re-cluster the points.
If for some position the points can be grouped into more than one cluster, we create a new candidate pattern for each cluster and validate it. Note that, from a candidate pattern P_cand, it is possible to eventually generate more than one actual pattern. Even if no position of P_cand is split into multiple clusters, we may need to re-cluster the non-* positions of P_cand, since some points (and segment-ids) may be eliminated during clustering at some position.

To illustrate the algorithm, consider the 2-length patterns P_1 = r_1x r_2y * and P_2 = r_1w * r_3z of Figure 5a. Assume that MinPts = 4 and eps = 1.5. The two patterns have a common first non-* position and MBR(r_1x) overlaps MBR(r_1w). Therefore, a candidate 3-length pattern P_cand is generated. During candidate pruning, we verify that there is a 2-length pattern with non-* positions 2 and 3 which is in L_2. Indeed, such a pattern can be spotted in the figure (see the dashed lines). After joining the segment-ids in P_1 and P_2 at line 9 of STPMine1, P_cand contains the trajectories shown in Figure 5b. Notice that the locations of the segment-ids in the intersection may no longer form clusters at some positions of P_cand. This is why we have to call validate_pattern, in order to identify the valid patterns included in P_cand. Observe that the segment-id corresponding to the lowermost location of the first position is eliminated from the cluster as an outlier. Then, while clustering at position 2, we identify two dense clusters, which define the final patterns r_1a r_2b r_3c and r_1d r_2e r_3f.

[Figure 5: Example of STPMine1. (a) 2-length patterns; (b) generated 3-length patterns.]

3.2.3 A two-phase, top-down algorithm

Although the algorithm of Figure 4 can find all partial periodic patterns correctly, it can be very slow due to the huge number of region combinations to be joined. If the actual patterns are long, all their subpatterns have to be computed and validated. In addition, a potentially huge number of candidates need to be checked and evaluated. In this section, we propose a top-down method that can discover long patterns more efficiently.

After applying clustering on each R_i (as described in Section 3.2.1), we have discovered the frequent 1-patterns with their segment-ids. The first phase of the STPMine2 algorithm replaces each location in S with the cluster-id it belongs to, or with an 'empty' value (e.g., *) if the location belongs to no cluster. For example, assume that we have discovered clusters {r_11, r_12} at position 1, {r_21} at position 2, and {r_31, r_32} at position 3. A segment {l_1, l_2, l_3}, such that l_1 is in r_12, l_2 is not in r_21, and l_3 is in r_31, is transformed to the subsequence {r_12 * r_31}. Therefore, the original spatiotemporal sequence S is transformed to a symbol-sequence S'.
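The following is a minimal sketch (ours, not the authors' implementation) of this first phase: each location of a segment is replaced by the id of the cluster that contains it at that offset, or by '*' if no cluster contains it.

    def symbolize(segment, clusters_per_position, member_of):
        """Replace each location of a T-length segment by the id of the cluster
        containing it at that position, or '*' if no cluster contains it.
        clusters_per_position[i] is a dict {cluster_id: cluster} and
        member_of(location, cluster) is the membership test; both are assumptions
        of this sketch, not structures defined in the paper."""
        symbols = []
        for i, loc in enumerate(segment):
            hit = next((cid for cid, c in clusters_per_position[i].items()
                        if member_of(loc, c)), '*')
            symbols.append(hit)
        return symbols   # e.g. ['r12', '*', 'r31'], as in the example above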
Now, we could use the mining algorithm of [5] to quickly discover all frequent patterns of the form r_0 r_1 ... r_{T-1}, where each r_i is a cluster in R_i or *. However, we do not know whether the results of the sequence-based algorithm are actual patterns, since the contents of each non-* position may not form a cluster. For example, {r_12 * r_31} may be frequent; however, if we consider only the segment-ids that qualify this pattern, r_12 may no longer be a cluster or may form different actual clusters (as illustrated in Figure 5). We call the patterns P' which can be discovered by the algorithm of [5] pseudopatterns, since they may not be valid.

To discover the actual patterns, we apply some changes to the original algorithm of [5]. While creating the max-subpattern tree, we store with each tree node the segment-ids that correspond to the pseudopattern of the node after the transformation. In this way, one segment-id goes to exactly one node of the tree. However, S' could be too large to manage in memory. In order to alleviate this problem, while scanning S', for every segment s we encounter we perform the following operations.

- First, we insert the segment into the max-subpattern tree, as in [5], increasing the counter of the candidate pseudopattern P' that s corresponds to after the transformation. An example of such a tree is shown in Figure 6. This node can be found by finding the (first) maximal pseudopattern that is a superpattern of P' and following its children, recursively. If the node corresponding to P' does not exist, it is created (together with any non-existent ancestors). Notice that the dotted lines are not implemented and not followed during insertion (thus, we materialize the tree instead of a lattice). For instance, for a segment with P' = {* r_21 r_31}, we increase the counter of the corresponding node at the second level of the tree.

- Second, we insert an entry <P'.id, s.sid> into a file F, where P'.id is the id of the node of the lattice that corresponds to pseudopattern P' and s.sid is the id of segment s. At the end, file F is sorted on P'.id to bring together segment-ids that comply with the same (maximal) pseudopattern. For each pseudopattern with at least one segment, we insert a pointer to the file position where the first segment-id is located. Nodes of the tree are labeled in breadth-first search order, for reasons we will explain shortly.

[Figure 6: Example of a max-subpattern tree. The maximal pseudopatterns (e.g., r_11 r_21 r_31, r_11 r_21 r_32, r_12 r_21 r_31, r_12 r_21 r_32) appear at the top level, with their subpatterns (e.g., * r_21 r_31, r_11 * r_31, r_11 r_21 *) below them; each node points into a sorted segment-ids file that stores the ids of the segments complying with that pseudopattern.]

Now, instead of finding frequent patterns in a bottom-up fashion, we traverse the tree in a top-down, breadth-first order.
Text Mining:The state of the art and the challengesAh-Hwee TanKent Ridge Digital Labs21 Heng Mui Keng TerraceSingapore 119613Email: ahhwee@.sgAbstractText mining, also known as text data mining or knowledge discovery from textualdatabases, refers to the process of extracting interesting and non-trivial patterns orknowledge from text documents. Regarded by many as the next wave of knowledgediscovery, text mining has very high commercial values. Last count reveals that there aremore than ten high-tech companies offering products for text mining. Has text miningevolved so rapidly to become a mature field? This article attempts to shed some lights tothe question. We first present a text mining framework consisting of two components:Text refining that transforms unstructured text documents into an intermediate form; andknowledge distillation that deduces patterns or knowledge from the intermediate form.We then survey the state-of-the-art text mining products/applications and align thembased on the text refining and knowledge distillation functions as well as the intermediateform that they adopt. In conclusion, we highlight the upcoming challenges of text miningand the opportunities it offers.1. IntroductionText mining, also known as text data mining [3] or knowledge discovery from textual databases [2], refers generally to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. It can be viewed as an extension of data mining or knowledge discovery from (structured) databases [1,4].As the most natural form of storing information is text, text mining is believed to have a commercial potential higher than that of data mining. In fact, a recent study indicated that 80% of a company’s information is contained in text documents. Text mining, however, is also a much more complex task (than data mining) as it involves dealing with text data that are inherently unstructured and fuzzy. Text mining is a multidisciplinary field, involving information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining.This article presents a general framework for text mining consisting of two components: Text refining that transforms free-form text documents into an intermediate form; and knowledge distillation that deduces patterns or knowledge from the intermediate form. We then use the proposed framework to study and align the state-of-the-art text mining products and applications based on the text refining and knowledge distillation functions as well as the intermediate form that they adopt.The rest of this paper is organized as follows. Section 2 presents the proposed text mining framework, that bridges the gap between text mining and data mining. Section 3 gives an overview of the current text mining products and applications in the light of the proposed framework. The final section discusses open problems and research directions.2. A Framework of text miningText mining can be visualized as consisting of two phases: Text refining that transforms free-form text documents into a chosen intermediate form, and knowledge distillation that deduces patterns or knowledge from the intermediate form. 
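To make the two phases just named concrete, here is a minimal, hypothetical sketch (not any product's API): text refining builds a simple document-based intermediate form, which can then be projected onto a concept-based form for a specific domain, as in the "company" example discussed later in this section. The bag-of-words form, the known_companies list, and all function names are assumptions of this illustration.

    from collections import Counter
    import re

    def refine(doc):
        """Text refining: turn a free-form document into a simple document-based IF
        (here a bag-of-words plus the raw text; real systems may use richer forms
        such as conceptual graphs or relational records)."""
        return {"text": doc, "bag": Counter(re.findall(r"[a-z]+", doc.lower()))}

    def project_to_company_if(doc_ifs, known_companies):
        """Project the document-based IF onto a concept-based IF for the 'company'
        domain: collect, per company, the documents that mention it. Knowledge
        distillation (e.g., associative discovery) could then run on this table."""
        table = {c: [] for c in known_companies}
        for i, d in enumerate(doc_ifs):
            for c in known_companies:
                if c.lower() in d["text"].lower():
                    table[c].append(i)
        return table

    docs = ["Acme Corp releases a new database product", "Football results from Sunday"]
    print(project_to_company_if([refine(d) for d in docs], ["Acme Corp"]))
    # {'Acme Corp': [0]}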
Intermediate form (IF) can be semi-structured, such as the conceptual graph representation, or structured, such as the relational data representation. Intermediate form can be document-based, wherein each entity represents a document, or concept-based, wherein each entity represents an object or concept of interest in a specific domain. Mining a document-based IF deduces patterns and relationships across documents; document clustering/visualization and categorization are examples of mining from a document-based IF. Mining a concept-based IF derives patterns and relationships across objects or concepts; data mining operations, such as predictive modeling and associative discovery, fall into this category. A document-based IF can be transformed into a concept-based IF by realigning or extracting the relevant information according to the objects of interest in a specific domain. It follows that document-based IF is usually domain-independent and concept-based IF is domain-dependent.

[Figure 1: A text mining framework. Text refining converts unstructured text documents into an intermediate form (IF), which can be document-based or concept-based. Knowledge distillation from a document-based IF (clustering, categorization, visualization, ...) deduces patterns or knowledge across documents. A document-based IF can be projected onto a concept-based IF by extracting object information relevant to a domain; knowledge distillation from a concept-based IF (predictive modeling, associative discovery, visualization, ...) deduces patterns or knowledge across objects or concepts.]

For example, given a set of news articles, text refining first converts each document into a document-based IF. One can then perform knowledge distillation on the document-based IF for the purpose of organizing the articles, according to their content, for visualization and navigation purposes. For knowledge discovery in a specific domain, the document-based IF of the news articles can be projected onto a concept-based IF depending on the task requirement. For example, one can extract information related to "company" from the document-based IF and form a company database. Knowledge distillation can then be performed on the company database (company-based IF) to derive company-related knowledge.

3. A Survey of text mining products

Table 1 shows an illustrative list of text mining products and applications based on the text refining and knowledge distillation functions as well as the intermediate form adopted. One group of products focuses on document organization, visualization, and navigation. Another group focuses on text analysis functions, notably information retrieval, information extraction, categorization, and summarization. While we see that most text mining systems are based on natural language processing, none of the products has integrated data mining functions for knowledge distillation across concepts or objects.

3.1. Document visualization

There are a good number of text mining products that fall into this category. The general approach is to organize the documents based on their similarities and present the groups or clusters of the documents in a certain graphical representation. The following list is by no means exhaustive but is sufficient to illustrate the variety of the representation schemes available. Cartia's ThemeScape is an enterprise information mapping application that presents clusters of documents in a landscape representation.
Canis's cMap is a document clustering and visualization tool based on Self-Organizing Map. IBM’s Technology Watch, developed jointly with Synthema in Italy, is a text mining application in the scientific domain. It performs document clustering plus visualization in the form of maps for patent databases and technical publications. Inxight also offers a visualization tool, known as VizControls, that performs value-added post-processing of search results by clustering the documents into groups and displaying based on a hyperbolic tree representation. Semio Corp's SemioMap employs a three-dimensional graphical interface that maps the links between concepts in the document collection. Note that SemioMap is concept-based in the sense that it explores the relationships between concepts whereas most other visualization tools are document-based.3.2.Text analysis and understandingThe second group of the text mining products is mainly based on natural language processing techniques, including text analysis, text categorization, information extraction, and summarization.Knowledge Discovery System's Concept Explorer is a visual search tool that helps to find precisely related content on the web. It "learns" relationships between words and phrases automatically from sample documents and visually guides you to construct searches. Inxight's LinguistX is another document retrieval tool with some text analysis and summarizationcapabilities. IBM’s Intelligent Miner is probably one of the most comprehensive text mining products around. It offers a set of text analysis tools, including a feature extraction tool, a set of clustering tools, a summarization tool, and a categorization tool. Also incorporated are the IBM’s text search engine, NetQuestion Solution and the IBM web crawler package. TextWise, an R&D company based in Syracuse University, offers various text mining products. DR-LINK is an information retrieval system based on automatic concept expansion. CINDOR is its cross lingual version. CHESS is a text analysis and information extraction tool. Also an information extraction tool is the Data Junction's Cambio, which extracts data in the form of relational attributes from text. Megaputer's TextAnalyst uses a semantic net representation of documents and performs automated indexing, topic assignment, text abstraction, and semantic search.Company/ Organization Product/ApplicationText RefiningFunctionsIntermediateFormKnowledgeDistillationFunctionsCartia ThemeScape Document-based Clustering,visualizationCanis cMap Document-basedword histograms Clustering, visualizationIBM/ Synthema TechnologyWatchDocument-based Clustering,visualizationInxight VisControls Document-basedHyperbolic treeVisualization Semio Corp SemioMap Concept-based VisualizationKnowledge Discovery System ConceptExplorerInfo retrieval Concept-basedInxight Linguist Info retrieval,text analysis,summarizationDocument-basedIBM iMiner Info retrieval,summarization Document-based Clustering,categorizationTextWise DR_LINKCINDORCHESS Info retrievalInfo extractionConcept-basedCambio Data Junction Info extraction Concept-basedMegaputer TextAnalyst Info retrieval,summarization Document-basedsemantic netClassificationTable 1: A summary of illustrative text mining products and applications based on the text refining and knowledge distillation functions as well as the intermediate form adopted.4. Open problems and future directions4.1.Intermediate formIntermediate forms with varying degrees of complexity are suitable for different mining purposes. 
For a fine-grain domain-specific knowledge discovery task, it is necessary to perform semantic analysis to derive a sufficiently rich representation to capture the relationship between the objects or concepts described in the documents. However, semantic analysis methods are computationally expensive and often operate in the order of a few words per second. It remains a challenge to see how semantic analysis can be made much more efficient and scalable for very large text corpora.4.2.Multilingual text refiningWhereas data mining is largely language independent, text mining involves a significant language component. It is essential to develop text refining algorithms, that process multilingual text documents and produce language-independent intermediate forms. While most text mining tools focus on processing English documents, mining from documents in other languages allows access to previously untapped information and offers a new host of opportunities.4.3.Domain knowledge integrationDomain knowledge, not catered for by any current text mining tools, could play an important role in text mining. Specifically, domain knowledge can be used as early as in the text refining stage. It is interesting to explore how one can take advantage of domain information to improve parsing efficiency and derive a more compact intermediate form.Domain knowledge could also play a part in knowledge distillation. In a classification or predictive modeling task, domain knowledge helps to improve learning/mining efficiency as well as the quality of the learned model (or mined knowledge) [5]. It is also interesting to explore how a user’s knowledge can be used to initialize a knowledge structure and make the discovered knowledge more interpretable.4.4.Personalized autonomous miningCurrent text mining products and applications are still tools designed for trained knowledge specialists. Future text mining tools, as part of the knowledge management systems, should be readily usable by technical users as well as management executives. There have been some efforts in developing systems that interpret natural language queries and automatically perform the appropriate mining operations. Text mining tools could also appear in the form of intelligent personal assistants [6]. Under the agent paradigm, a personal miner would learn a user’s profile, conduct text mining operations automatically, and forward information without requiring an explicit request from the user.References[1]Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996). From data mining to knowledgediscovery: An Overview. In Advances in Knowledge Discovery and Data Mining, U.Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., MIT Press, Cambridge, Mass., 1-36.[2]Feldman, R. & Dagan, I. (1995) Knowledge discovery in textual databases (KDT). Inproceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, Canada, August 20-21, AAAI Press, 112-117.[3]Hearst, M. A. (1997) Text data mining: Issues, techniques, and the relationship toinformation access. Presentation notes for UW/MS workshop on data mining, July 1997.[4]Simoudis, E. (1996). Reality check for data mining. IEEE Expert, 11(5).[5]Tan, A.-H. (1997). Cascade ARTMAP: Integrating neural computation and symbolicknowledge processing. IEEE Transactions on Neural Networks, 8(2), 237-250.[6]Tan, A.-H. & Teo, C. (1998). Learning user profiles for personalized informationdissemination. 
In Proceedings of the International Joint Conference on Neural Networks (IJCNN'98), Alaska, 183-188.
Senior-Three English — Academic Research Methods and Innovative Thinking: 30 Multiple-Choice Questions

1. In academic research, a method that involves collecting and analyzing data from a large number of people is called:
A. Qualitative research
B. Quantitative research
C. Descriptive research
D. Exploratory research
Answer: B.
This question tests the definitions of academic research methods. Option A, "Qualitative research", focuses on describing and interpreting the nature, characteristics, and meaning of things. Option C, "Descriptive research", mainly gives a detailed description of phenomena. Option D, "Exploratory research", aims to explore new areas or problems. Option B, "Quantitative research", by contrast, draws conclusions by collecting and analyzing large amounts of data.
2. Which of the following is NOT a type of academic research method?
A. Historical research
B. Experimental research
C. Hypothetical research
D. Comparative research
Answer: C.
In academic research, option A, "Historical research", gains knowledge by studying past events and circumstances. Option B, "Experimental research", tests hypotheses by controlling variables. Option D, "Comparative research", compares and analyzes different things. "Hypothetical research", however, is not a common type of academic research method.
3. The research method that focuses on understanding the meaning and experience of individuals is:
A. Mixed-method research
B. Grounded theory research
C. Action research
D. Phenomenological research
Answer: D.
To Appear in Proceedings of the AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM-2004),San Jose, CA, July 2004.Using Soft-Matching Mined Rules to Improve Information ExtractionUn Yong Nahm and Raymond J.MooneyDepartment of Computer SciencesThe University of Texas at Austin1University Station C0500Austin,TX78712-1188{pebronia,mooney}@AbstractBy discovering predictive relationships between different pieces of extracted data,data-mining algorithms can be used to improve the accuracy of information extraction.How-ever,textual variation due to typos,abbreviations,and other sources can prevent the productive discovery and utilization of hard-matching rules.Recent methods for inducing soft-matching rules from extracted data can more effectivelyfind and exploit predictive relationships in textual data.This pa-per presents techniques for using mined soft-matching asso-ciation rules to increase the accuracy of information extrac-tion.Experimental results on a corpus of computer-science job postings demonstrate that soft-matching rules improve information extraction more effectively than hard-matching rules.IntroductionInformation extraction(IE)and knowledge discovery and data mining(KDD)are both useful tools for discovering knowledge from textual corpora.However,there has been relatively little research on exploring the productive integra-tion of traditional IE and KDD methods.Nahm&Mooney (2000)introduced a mutually-beneficial framework for inte-grating IE and KKD.IE benefits KDD by extracting struc-tured data from textual documents,which can then be mined using traditional methods.KDD benefits IE by discovering rules that support predictions that can improve the accuracy of subsequent information extraction.The predictive relationships between different IE slot fillers discovered by KDD can provide additional clues about what information should be extracted from a doc-ument.For example,suppose we discover the following rule from data on programming languages and topic areas extracted from a corpus of computer-science job postings:“SQL”∈language→“Database”∈area.If the IE system extracted“SQL”for the language slot but failed to extract “Database”for the area slot,we may want to assume there was an extraction error and add“Database”to the area slot. 
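As a concrete illustration of this idea, the following is a minimal sketch (our own, not the authors' code) of forward-chaining on one mined rule: if the antecedent filler was extracted and the predicted filler literally appears in the document, it is added to the template. The rule tuple, dictionary template, and example posting are all hypothetical.

    def apply_rule(template, document, rule):
        """rule = (antecedent_slot, antecedent_filler, consequent_slot, consequent_filler).
        If the antecedent was extracted and the predicted filler occurs verbatim in the
        document, add it to the consequent slot. This is the hard-matching variant; the
        soft-matching version described later would replace the substring test with a
        string-similarity threshold."""
        a_slot, a_filler, c_slot, c_filler = rule
        if a_filler in template.get(a_slot, []) and c_filler not in template.get(c_slot, []):
            if c_filler.lower() in document.lower():
                template.setdefault(c_slot, []).append(c_filler)
        return template

    extracted = {"language": ["SQL"], "area": []}
    posting = "Seeking developer with SQL experience; strong database skills required."
    apply_rule(extracted, posting, ("language", "SQL", "area", "Database"))
    print(extracted)   # {'language': ['SQL'], 'area': ['Database']}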
Since typically the recall(percentage of correct slotfillers Copyright c 2004,American Association for Artificial Intelli-gence().All rights reserved.extracted)of an IE system is significantly lower than its pre-cision(percentage of extracted slotfillers which are correct) (DARPA1998),such predictive relationships can be produc-tively used to improve recall by suggesting additional infor-mation to extract.Nahm&Mooney(2000)employed C4.5rules(Quinlan 1993)to induce predictive rules which were then used to im-prove the recall of subsequent information extraction.Un-fortunately,extracted text often exhibits variations that can prevent mining algorithms from discovering important reg-ularities.Variations can arise from typographical errors, misspellings,abbreviations,as well as other sources.For example,in data on local job offerings that we automati-cally extracted from newsgroup postings,the Windows op-erating system is variously referred to as“Microsoft Win-dows”,“MS Windows”,“Windows95/98/ME”,etc..To ad-dress such textual variation,we have developed two new KDD algorithms,T EXT RISE(Nahm&Mooney2001)and S OFT A PRIORI(Nahm&Mooney2002),that induce soft-matching rules appropriate for variable text data.The result-ing rules use a text-similarity measure such as string edit-distance(Gusfield1997)or vector-space cosine similarity (Salton1989)toflexibly match textual items.In this paper,we demonstrate that using such soft-matching rules to predict potential extractions improves the accuracy of IE more than using hard-matching rules(as in our previous work).Specifically,we compare the hard-matching rules mined with A PRIORI(Agrawal&Srikant 1994)to soft-matching rules mined with S OFT A PRIORI with respect to their ability to improve information extraction from a corpus of job postings.BackgroundInformation ExtractionThe goal of an IE system is to locate specific data in natural-language text.The data to be extracted is typically given by a template which specifies a list of slots to befilled with sub-strings taken from the document.IE is useful for a variety of applications,particularly given the recent proliferation of Internet and web documents.In particular,machine learn-ing techniques have been suggested for extracting informa-•city:Austin•state:TX•language:Java;C;C++•platform:MS Windows•area:Technical writing;QA;technical support •application:Microsoft Office97/2000;Outlook97/98;Word-1-2-3•hardware:Laptops;Printers;Palm Pilots•major:Computer Sciences•required degree:BSFigure1:Samplefilled templatetion from text documents in order to create easily searchable databases from the information,thus making the online text more accessible(Califf&Mooney1999).For instance,in-formation extracted from job postings on company websites can be used to build a searchable database of jobs.1In this paper,we consider the task of extracting a database from postings to the USENET newsgroup,austin.jobs with BWI(Boosted Wrapper Induction)(Freitag&Kushm-erick2000).Figure1shows afilled computer-science job template where several slots may have multiplefillers.For example,slots such as languages,platforms,applications, and areas usually have more than onefiller,while slots re-lated to the city or state have only one.BWI learns extraction rules by boosting a wrapper induc-tion system.A wrapper is a contextual pattern that is sim-ple but highly accurate.Wrapper induction,the automated process of learning wrappers,has been traditionally used to extract data from highly structured text such as web pages generated by CGI 
scripts.Since individual patterns typi-cally have high precision and low recall,BWI uses boost-ing(Freund&Schapire1996)to create accurate extractors by combining many high-precision patterns.In BWI,IE is treated as a classification problem,and contextual patterns are learned to identify the beginning and end of relevant text segments.Boundary patterns are repeatedly learned and training examples are reweighted using the A DA B OOST al-gorithm(Schapire&Singer1998)so that subsequent pat-terns identify examples missed by previous rules.BWI has been shown to perform reasonably well in several domains, from traditional free text to structured HTML documents. The S OFT A PRIORI AlgorithmData mining methods generally require terms in discovered rules to exactly match database entries.Normal variation in textual data items can therefore prevent the discovery of im-portant regularities.Variations can arise from typographical errors,misspellings,abbreviations,as well as other sources. Variations are particularly pronounced in data that is au-tomatically extracted from unstructured or semi-structured documents or web pages.S OFT A PRIORI(Nahm&Mooney 1/•database(databases,database sys.)∈area⇒oracle(oracle7)∈application[3.2%,43.2%]•mfc∈area⇒windows(windows nt,windows95,windows3.1, windows3.x,windowsnt,windows95,windows’95)∈platform[2.7%,39.0%]•linux(linux-pc)∈platform⇒html(dhtml,d/html,shtml)∈language[11.0%,50.0%]Figure2:Sample soft association rules mined from600jobpostings2002)discovers soft matching association rules whose an-tecedents and consequents are evaluated based on sufficientsimilarity to database entries.S OFT A PRIORI generalizes the standard A PRIORI algo-rithm for discovering association rules(Agrawal&Srikant1994)by allowing soft matching based on a given similar-ity metric for eachfield.Similarity of textual entries inS OFT A PRIORI is measured using standard edit-distance orbag-of-words measures.In our experiments,we used edit-distance with affine gap cost(Needleman&Wunsch1970),incurring one penalty for starting a new gap(i.e.sequenceof character deletions)and a smaller penalty for continuingan existing gap(i.e.contiguous character deletions).In S OFT A PRIORI,soft relations on items sets are de-fined,assuming that a function,similarity(x,y),is givenfor measuring the similarity between two items x and y.LetI={i1,i2,...,i m}be a set of literals,called items.Let D be a set of baskets,where each basket B⊆I is a set ofitems.A soft association rule is an implication of the formX⇒Y,where X⊂I,Y⊂I and no item in X is similar to an element of Y.The problem of mining soft association rules isfinding all soft association rules,X⇒Y,such that their soft-support and the soft-confidence are greater than user-defined minimum values.Formal definitions of soft-support and soft-confidence are straightforward generaliza-tions of the traditional ones that allow items to match as long as their similarity exceeds a pre-specified threshold.In order tofind association rules for extracted data,wefirst map each extractedfiller to an item.A document is rep-resented as a basket of items where each item is a slot-fillerpair extracted from the document.By applying S OFT A PRI-ORI to job postings,we mined relationships between items such as“If a computer-related job posting requres knowl-edge of MFC then it also lists Windows in the list of required skills.”Sample rules mined from a database of600job post-ings extracted from a USENET newsgroup are shown in Fig-ure2.Items found to be similar to a given item during 
training are shown in parentheses, and values for soft-support and soft-confidence are shown in brackets.

Using Soft Association Rules to Improve Information Extraction

The general DISCOTEX framework described in (Nahm & Mooney 2000) serially combines an information extraction system and a KDD module. After constructing an IE system that extracts the desired set of slots for a given application, a database is constructed from a corpus of texts by applying the extractor to each document to create a collection of structured records. A standard rule mining algorithm is then applied to the resulting database to discover interesting relationships. Specifically, we mine rules for predicting the presence or absence of each database item given information about all other items.

Parameters: minconf, minsup - minimum confidence/support; T_sim - similarity threshold; T_ex - extraction threshold.
Input: D_train - set of labeled documents; D_test - set of n unlabeled documents.
Output: L - set of new labels for D_test.

Function InformationExtraction(D_train, D_test)
  Build an information extraction rule base, RB_IE
    (by applying BWI to D_train)
  Let L_train := set of labeled slot fillers of D_train.
  Build a soft association rule base, RB
    (by applying SoftApriori to L_train with parameters minconf, minsup, and T_sim)
  For each unlabeled document D_test(i) do
    Extract slot fillers from D_test(i) using RB_IE.
    Let L(i) := set of extracted slot fillers of D_test(i).
    Until no change obtained on L(i) do
      For each rule R (X => Y) in RB do
        If R fires on L(i)
          For each matching substring Y' in D_test(i)
            (with similarity(Y, Y') >= T_sim) do
            score(Y') := similarity(Y, Y') x conf(R).
            If score(Y') >= T_ex, add Y' to L(i).
  Let L := (L(1), L(2), ..., L(n)).
  Return L.

Figure 3: Algorithm specification

However, for data sets with significant textual variation, hard-matching rules discovered by standard data mining algorithms may not work. We propose mining soft-matching rules instead, which allow non-standardized database entries to match antecedents and consequents based on relevant similarity metrics. A benefit of association-rule mining instead of classification-rule induction is that consequents of rules are not predetermined, resulting in efficient mining of all potential associations as part of a single process. When inducing classification rules, a separate learning step must be run for predicting each possible item. For instance, classification rule induction is not as efficient for our job-postings data set, with as many as 700 items in 600 documents.

Tests of IE systems usually consider two accuracy measures, precision and recall. Precision is the number of correctly extracted items divided by the total number of extractions, while recall is the number of correct extractions divided by the total number of items actually present in the documents. Also, F-measure, the harmonic mean of precision and recall, was introduced to combine them. Many extraction systems provide relatively high precision, but recall is typically much lower. Currently, BWI's search is also biased toward high precision. Experiments in the job postings domain show that BWI's precision (e.g., high 50%'s) is higher than its recall (e.g., low 40%'s) (Freitag & Kushmerick 2000). Although several methods have been developed that allow a rule learner to trade off precision and recall (Cohen 1996), this typically leaves the overall F-measure unchanged.

By forward-chaining on extracted data using mined soft-association rules, we can derive additional probable extractions, thereby improving recall without unduly sacrificing precision. For example, suppose we discover the rule “MS
Access2000”∈application⇒“Database”∈area.If the IE system extracted“Access2000”∈application but failed to extract“Database”∈area,we may want to assume there was an extraction error and add“Database”to the area slot,potentially improving recall.Therefore,after mining knowledge from extracted data,the discovered soft-matching rules are used to predict additional potential ex-tractions during subsequent extraction.Pseudocode shown in Figure3describes the use of mined rules in information extraction.Thefinal decision whether or not to extract a predicted filler is based on whether thefiller(or any of its“syn-onyms”)occurs in the document.If a string equal or similar to the predictedfiller is found in the text,the extractor considers its prediction confirmed and extracts the string.In the previous example,even if the string “Database”is not found in the document,a similar string such as“databases”is still considered for extraction since similarity(“Database”,“databases”)≥T sim where T sim is the prespecified threshold for determining a match.The con-fidence of the rule is also considered in confirming that the rule is strong enough to extract thefiller,combined with the similarity information indicating how close the actual string is to the predicted one.In summary,documents which the user has annotated with extracted information are used to create a database.The S OFT A PRIORI rule miner then processes this database to construct a knowledge base of soft-matching rules for pre-dicting slot values.These rules are then used during testing to improve the recall of the existing IE system by proposing additional slotfillers whose similar strings are confirmed to be present in the document before adding them to thefinal extraction template.The overall architecture of thefinal sys-tem is presented in Figure4.EvaluationDataset and Experimental MethodologyTo test the overall system,600computer-science job postings to the newsgroups austin.jobs and misc.jobs.offered were annotated with information to be extracted.We used a simple web-crawler for spidering the web site in order to collect documents from the newsgroups.Not all the documents collected from the Internet are rel-evant documents.Non-computer-science job postings or re-sum´e postings,spam,and other irrelevant documents were filtered out by a trained text categorizer.A bag-of-words Naive-Bayes text categorizer(McCallum&Nigam1998)isFigure 4:The system architectureSlots AvgNumFillerAvgNumDocNumFillerlanguage 0.13 2.3080platform 0.177.11104application 0.30 3.76179area 0.60 1.17361total1.211.38724Table 1:Statistics on slot-fillersused to identity relevant documents.Classification accuracy as measured by ten-fold cross-validation is 91.17%,with 91.99%precision and 86.0%recall.Ten-fold cross validation was used to generate training and test sets for extraction from the set of documents.Rules were mined for predicting the fillers of the languages ,platforms ,applications ,and areas slots,since these are usually filled with multiple items that have poten-tial predictive relationships.Statistics on these slot-fillers are shown in Table 1,including the average number of fillers per document,average number of documents per filler,and the total number of distinct filler strings in the corpus.Parameters for BWI are set to the default values,a look-ahead of 3and 20boosting iterations.The default set of wildcards is used.The similarity threshold,minimum sup-port,and minimum confidence for A PRIORI and S OFT A PRI -ORI were set to 0.70,5%,and 
10%,respectively.Associ-ation rules without antecedents (e.g.⇒C++)are also em-ployed.The minimum support and confidence values are de-termined by validating on the training data set.The minium confidence value is set to a low value because the final ex-traction of a filler is confirmed by checking if same (hard rules)or similar (soft rules)strings are found in the docu-ment or not.The match cost,mismatch cost,gap-start cost,and gap-extend cost parameters for the affine-gap edit dis-tance were set to 0,3,3,and 1,respectively.All white spaces in strings are considered as blank characters and upper andR e c a l l (%)Number of Training ExamplesFigure 5:Recall on job postingslower cases are distinguished only in the IE phase.ResultsTo evaluate our system,we compared the performance of BWI alone,BWI aided with hard-matching rules mined by standard A PRIORI ,and BWI with soft-matching association rules.Figures 5and 6and show the learning curves for recall and and F-measure for the job-postings data.The same set of human-annotated training data was provided to both BWI and the rule miner as shown in Figure 4.As a benchmark,we show the performance of a simple baseline (Memorizing)for increasing recall that always ex-tracts substrings that are known fillers for a particular slot.This baseline remembers all slot-fillers that appear at least once in the training data.Whenever a known filler string,e.g.Java ,is contained in a test document,it is extracted as a filler for the corresponding slot,nguage .This method has good recall but limited precision since a filler string contained in a document is not necessarily the correct filler for the corresponding slot.For instance,“www”can appear in a document,not in a list of required skills but in a URL of the company’s homepage.We also tested a “soft”version of this baseline (Soft-Memorizing)that extracts all strings that are sufficiently similar to known items in the training data.Although this increases recall,it decreases precision even further.For example,“Peoplesoft”remembered as a filler for the application slot can cause the system to extract the spu-rious filler “people”.The fact that the F-measure of these baselines are worse than the proposed system demonstrates the additional value of rule mining for improving extraction performance.As hypothesized,BWI with soft-matching rules provides higher recall,and in spite of decreasing precision some-what,overall F-measure is moderately increased.For each training set size,systems were compared to determine if their differences in recall were statistically significant us-F -m e a s u r e (%)Number of Training ExamplesFigure 6:F-measure on job postingsing a two-tailed,paired t -test (p <0.05).For all set of training examples,using soft-matching rules is significantly better than both using hard-matching rules and unaided ex-traction while using hard-matching rules is also significantly better than unaided extraction.Although the differences are somewhat small,these results demonstrate the advantage of mining soft-matching rules for improving extraction accu-racy.We believe that the small gains especially in F-measure are due to the relatively low accuracy of the underlying IE learner,BWI,and therefore plan to apply our method to a more accurate IE learning system,such as R APIER (Califf &Mooney 1999).Related ResearchAlthough there is a growing interest in the general topic of text mining (Berry 2003;Grobelnik 2003;Hearst 2003),there has been relatively little research exploring the integra-tion of IE and 
KDD.Several rule-induction and association-rule-mining algorithms have been applied to databases of corporations or product reviews automatically extracted from the Web (Ghani et al.2000;Ghani &Fano 2002;Pierre 2002);however,these projects do not use the mined knowledge to improve subsequent extraction.Recently,a probabilistic framework for unifying information extraction and data mining has been proposed (McCallum &Jensen 2003).A general approach for using statistical relational models to integrate IE and KDD is presented,but an actual implementation and experimental results for this approach are still forthcoming.A boosted text classification system based on link analysis (Cohen 2003),and a feedback combi-nation of an HTML parser and a high-level wrapper (Cohen,Hurst,&Jensen 2002)are related to our work in spirit since they also attempt to improve the underlying learners by uti-lizing feedback from KDD modules.The use of Web-based statistics (search-engine hit counts)to improve the precision of information extraction has been also proposed recently (Soderland et al.2004).Future WorkOur current experiments mine associations only from human-annotated data.An interesting question is whether the performance of an IE system can be further improved by discovering associations in automatically extracted data.Although adding information extracted from unannotated documents to the database may result in a larger database and therefore some better prediction rules,it may also cre-ate noise in the database due to extraction errors and conse-quently cause some inaccurate prediction rules to be discov-ered as well.Currently,we only consider improving the recall of IE.Methods for using mined knowledge to improve extraction precision are also needed.Simply eliminating extracted fillers that are not predicted is too coarse and would likely severely damage recall.One possible solution is to use a variation of negative association rules (Savasere,Omiecin-ski,&Navathe 1998;Wu,Zhang,&Zhang 2002).By con-fidently predicting the absence of certain slot values given other extracted information,both precision and recall could potentially be improved.A related issue is combining the confidence measures for prediction rules with the extraction confidences from the IE system to produce an overall confi-dence in final extraction decisions.The procedure for selecting the slots to be used in rule mining needs to be automated.In the current experiments,we manually chose four slots from the computer-science job template.By identifying and quantifying the correlations between slot values,this decision could be automated.Auto-matic learning of threshold values for mining soft rules and dynamic setting of minimum support and confidence values on each training data set could also be explored.ConclusionsIn previous work,we introduced an approach to using pre-dictive rules mined from extracted data to improve the re-call of information extraction.However,this approach was limited by the requirement that the antecedents and conse-quents of mined rules exactly match textual items.The nor-mal variation that occurs in textual information frequently prevents such an approach from effectively exploiting many of the potentially-useful predictive relationships in the data.In more recent work,we have developed techniques for min-ing soft-matching rules that employ standard text-similarity metrics to discover more subtle relationships in variable tex-tual data.By combining these ideas,we have developed a method for using soft-matching 
mined rules to further im-prove the accuracy of information extraction.Empirical ex-periments on a real text corpus showed that our new method can more effectively use automatically-discovered knowl-edge to improve the recall (and F-measure)of information-extraction.AcknowledgementsThis research was supported by the National Science Foun-dation under grant IIS-0117308.ReferencesAgrawal,R.,and Srikant,R.1994.Fast algorithms for mining association rules.In Proceedings of the20th Inter-national Conference on Very Large Databases(VLDB-94), 487–499.Berry,M.W.,ed.2003.Proceedings of the Third SIAM In-ternational Conference on Data Mining(SDM-2003)Work-shop on Text Mining.Califf,M.E.,and Mooney,R.J.1999.Relational learning of pattern-match rules for information extraction.In Pro-ceedings of the Sixteenth National Conference on Artificial Intelligence(AAAI-99),328–334.Cohen,W.W.;Hurst,M.;and Jensen,L.S.2002.Aflexi-ble learning system for wrapping tables and lists in HTML documents.In Proceedings of the Eleventh International World Wide Web Conference(WWW-2002),232–241.Hon-olulu,HI:ACM.Cohen,W.W.1996.Learning to classify English text with ILP methods.In De Raedt,L.,ed.,Advances in Inductive Logic Programming.Amsterdam:IOS Press.124–143. Cohen,W.W.2003.Improving a page classifier with an-chor extraction and link analysis.In Becker,S.;Thrun,S.; and Obermayer,K.,eds.,Advances in Neural Information Processing Systems15,1481–1488.Cambridge,MA:MIT Press.DARPA.,ed.1998.Proceedings of the Seventh Mes-sage Understanding Evaluation and Conference(MUC-98).Fairfax,V A:Morgan Kaufmann.Freitag,D.,and Kushmerick,N.2000.Boosted wrapper in-duction.In Proceedings of the Seventeenth National Con-ference on Artificial Intelligence(AAAI-2000),577–583. Austin,TX:AAAI Press/The MIT Press.Freund,Y.,and Schapire,R.E.1996.Experiments with a new boosting algorithm.In Saitta,L.,ed.,Proceed-ings of the Thirteenth International Conference on Ma-chine Learning(ICML-96),148–156.Morgan Kaufmann. Ghani,R.,and Fano,ing text mining to infer semantic attirbutes for retail data mining.In Proceedings of the2002IEEE International Conference on Data Min-ing(ICDM-2002),195–202.Maebash City,Japan:IEEE Computer Society.Ghani,R.;Jones,R.;Mladeni´c,D.;Nigam,K.;and Slat-tery,S.2000.Data mining on symbolic knowledge ex-tracted from the Web.In Mladeni´c,D.,ed.,Proceedings of the Sixth International Conference on Knowledge Dis-covery and Data Mining(KDD-2000)Workshop on Text Mining,29–36.Grobelnik,M.,ed.2003.Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence(IJCAI-2003)Workshop on Text Mining and Link Analysis(TextLink-2003).Gusfield,D.1997.Algorithms on Strings,Trees and Se-quences.New York:Cambridge University Press. Hearst,M. A.2003.What is text mining? /˜hearst/text-mining.html.McCallum,A.,and Jensen,D.2003.A note on the uni-fication of information extraction and data mining using conditional-probability,relational models.In Proceedings of the IJCAI-2003Workshop on Learning Statistical Mod-els from Relational Data.McCallum,A.,and Nigam,K.1998.A comparison of event models for naive Bayes text classification.In Papers from the AAAI-98Workshop on Text Categorization,41–48.Nahm,U.Y.,and Mooney,R.J.2000.A mutually benefi-cial integration of data mining and information extraction. 
In Proceedings of the Seventeenth National Conference on Artificial Intelligence(AAAI-2000),627–632.Nahm,U.Y.,and Mooney,R.J.2001.Mining soft-matching rules from textual data.In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence(IJCAI-2001),979–984.Nahm,U.Y.,and Mooney,R.J.2002.Mining soft-matching association rules.In Proceedings of the Eleventh International Conference on Information and Knowledge Management(CIKM-2002),681–683.Needleman,S.B.,and Wunsch,C.D.1970.A gen-eral method applicable to the search for similarities in the amino acid sequences of two proteins.Journal of Molecu-lar Biology48:443–453.Pierre,J.M.2002.Mining knowledge from text collec-tions using automatically generated metadata.In Karagian-nis,D.,and Reimer,U.,eds.,Proceedings of the Fourth International Conference on Practical Aspects of Knowl-edge Management(PAKM-2002),537–548.Vienna,Aus-tria:Springer.Lecture Notes in Computer Sicnece V ol. 2569.Quinlan,J.R.1993.C4.5:Programs for Machine Learn-ing.San Mateo,CA:Morgan Kaufmann.Salton,G.1989.Automatic Text Processing:The Trans-formation,Analysis and Retrieval of Information by Com-puter.Addison-Wesley.Savasere,A.;Omiecinski,E.;and Navathe,S.B.1998. Mining for strong negative associations in a large database of customer transactions.In Proceedings of the14th In-ternational Conference on Data Engineering(ICDE-98), 494–502.Orlando,FL:IEEE Computer Society. Schapire,R.E.,and Singer,Y.1998.Improved boosting al-gorithms using confidence-rated predictions.In Piatetsky-Shapiro,G.;Agrawal,R.;and Stolorz,P.E.,eds.,Pro-ceedings of the11th Annual Conference on Computational Learning Theory,80–91.Madison,WI:ACM. Soderland,S.;Etzioni,O.;Shaked,T.;and Weld,D.S. 2004.The use of Web-based statistics to validate in-formation extraction.To appear in Papers from the AAAI-2004Workshop on Adaptive Text Extraction and Mining(ATEM-2004)Workshop,San Jose,CA.Wu,X.;Zhang,C.;and Zhang,S.2002.Mining both pos-itive and negative association rules.In Sammut,C.,and Hoffmann,A.G.,eds.,Proceedings of the Ninteenth Inter-national Conference on Machine Learning(ICML-2002), 658–665.Sydney,Australia:Morgan Kaufmann.。
An Evaluation of Unstructured Text Mining SoftwareMicah J. Crowsey, Amanda R. Ramstad, David H. Gutierrez, Gregory W. Paladino, and K. P. White,Jr., Member, IEEEAbstract— Five text mining software tools were evaluated by four undergraduate students inexperienced in the text mining field. The software was run on the Microsoft Windows XP operating system, and employed a variety of techniques to mine unstructured text for information. The considerations used to evaluate the software included cost, ease of learning, functionality, ease of use and effectiveness. Hands on mining of text files also led us to more informative conclusions of the software. Through our evaluation we found that two software products (SAS and SPSS) had qualities that made them more desirable than the others.I NTRODUCTIONUnstructured data exists in two main categories: bitmap objects and textual objects. Bitmap objects are non-language based (e.g. image, audio, or video files) whereas textual objects are “based on written or printed language” and predominantly include text documents [1]. Text mining is the discovery of previously unknown information or concepts from text files by automatically extracting information from several written resources using computer software [15]. In text mining, the files mined are text files which can be in one of two forms. Unstructured text is usually in the form of summaries and user reviews whereas structured text consists of text that is organized usually within spreadsheets. This evaluation focused specifically on mining unstructured text files.Many industries are relying on the field of text mining to solve specific applications using a variety of software. Currently, there does not exist a comprehensive review or comparison of the top software suites. Our research was performed so users would have an unbiased reference for looking at the different software features and the pros, cons, and methods for how to most efficiently use them. In comparing the software, we looked at the features that the software had as well as the ease of use and learning of the software. We used a test corpus that we developed for this purpose.Manuscript received April 5, 2007. This work was supported in part by the University of Virginia under a grant provided by an anonymous consultancy.M. J. Crowsey is with the University of Virginia, Charlottesville, VA, 22902. (phone: 434-924-5393. e-mail: mjc5s@).A. R. Ramstad is with the University of Virginia, Charlottesville, VA, 22902. ( e-mail: arr9p@).D. H. Gutierrez is with the University of Virginia, Charlottesville, VA, 22902. (e-mail: dhg9u@)G. W. Paladino is with the University of Virginia, Charlottesville, VA, 22902. ( e-mail: gwp2z@)K. P. White is with the University of Virginia, Charlottesville, VA, 22902. (e-mail: kpw8h@)The five software suites we reviewed were Leximancer,SAS Enterprise Miner, several products of SPSS, Polyanalyst, and Clarabridge. Most of the software was acquired through an anonymous consulting firm. We also obtained software through the University of Virginia and by requesting online software demos.The evaluators of the software were four fourth year Systems Engineering students at the University of Virginiawho had some previous experience with data mining, butlittle experience with text mining. 
This inexperience proved useful in determining the ease of use and learning of the software, though it sometimes posed challenges when attempting to find out how the software operates.I.T EXT M INING S OFTWAREA.General Information and PricingThe chart below shows the company, product, version, andcost of the software. Unfortunately, certain vendors wereunwilling to disclose cost information for their products.Table 1Company Product Version CostSAS EnterpriseMiner4.3 notprovideddue to companypolicySPSS Text mining forClementine(needClementine)Text MiningBuilderLexiquestCategorize10.12.03.2$4,534$18,136Server (1 CPU)$69,824Megaputer Polyanalyst 6.0 ProfessionalServer $80,000Client $5,000Leximancer Leximancer pro $2500Clarabridge ClarabridgeContent MiningPlatform2.1 $75,000B.LearnabilityThe criteria used to assess the ease of learning were based on whether software had the following:• A demo version•Tutorials•User’s manual• Sample Solutions • Online helpThe functionality results of the software were recorded using a spots and dots methodology. The ease of learning results can be seen in Table 2.The project team attempted to learn how to use each of the software suites solely by referencing the help files, tutorials, and other learning tools that were provided along with the software package. These materials provided information on the general process each software suite uses to mine text and on the specific features offered at different steps in the text mining process.Table 2: LearnabilitySoftwareDemo version Tutorial User’s manual OnlinehelpClarabridge X X SAS X X Clementine X X Polyanalyst X X X X Leximancer X X X XAs we were unable to obtain working copies of Clarabridge and Polyanalyst, we were unable to gain experience using these software suites. The evaluation of this software was done by looking at product features, going through live internet demonstrations, and performing research on the software companies’ websites.Overall, the help documentation which accompanied the SAS, SPSS, and Leximancer software was sufficient to learn basic processes and features employed by each suite. SAS and SPSS both offer traditional help files which serve as a good starting point for learning the process that each software suite uses to mine text. These resources provide both an introduction to text mining as well as the process flows that each of the software suites use to accomplish different text mining functions such as text extraction, text link analysis, and categorization. At a more detailed level, the help documentation of both software suites provide information on how a user can manipulate various features at different nodes in the text mining process in order to affect the results yielded. Finally, the documentation also provides information on how to view, edit, and make use of results.SPSS help documentation presents basic information which provides as user with a quick start to text mining. Example projects also are provided to show how different nodes can work together to accomplish various text mining functions.SAS documentation presents information on the text mining process and its unique methods and features in much more detail. Step-by-step examples of how to achieve a few text mining functions are also very helpful.Leximancer presents its help documentation in a slightlydifferent fashion than SAS and SPSS. Leximancer provides several help topics describing how to accomplish certain text mining functions within its architecture. 
Leximancer’s text mining process consists of a preset stream of nodes which a user cannot alter, and information on how a user can manipulate the features of these nodes to achieve different results resides within the nodes themselves.The fact that Leximancer presents information on the features offered by different nodes in the text mining process within the nodes themselves makes for quick referencing, and the content is very helpful in general. Leximancer also allows the user to adjust the level of complexity of the features it offers.Although help documentation and tutorials are sufficient for the beginning of the learning curve with these software suites, the group found that more advanced knowledge of how to get the most out of each product is best achieved through an iterative process of hands on experimentation. Manipulating the unique features of each software suite in various ways provides the best knowledge of how to achieve desired results. Also, experimenting with a variety of nodes in sequence allows a user to see how these nodes can interact to achieve more advanced text mining functions.C. Data PreparationThe importance of data preparation in text mining cannot be stressed enough. Given the importance of proper data preparation to the success of a data mining effort, it is advised that the user perform some “cleaning” on the data to put it into a semi-structured form. Although text mining seeks to find relationships between concepts in unstructured data, we found through our evaluation that mining technology does not eliminate the need for data preparation. If the user wishes to achieve a high level of reliability and extract useful concepts from the data, then structuring the data, even in small ways, is helpful.To achieve useful information for our evaluation, we ran text files through the software in order to see how they were processed. We also looked at the quality of the results after the mining was completed. For this task, we used HTML documents that were gathered from the University of Virginia Cavalier Daily newspaper website, . A software program that copies entire websites was used to gather approximately 1200 HTML pages from the website, and these formed the corpus that we ran through the software.D. Software Pros and ConsLeximancer is a software suite that focuses on extracting concepts and showing the relationships between those concepts, along with the relationship strength. Although its Java-based user interface is somewhat different from the other software suites evaluated, Leximancer still offers many of the same features that allow a user to manipulate stages in the text mining process. Leximancer’s results browser is very effective at presenting several pieces ofinformation at once and allowing a user to browse extracted concepts and concept relationships. Leximancer’s resultsbrowser is shown in Figure 1 below.Figure 1: Leximancer concept map and list of linked termsSAS Enterprise miner uses a unique mining approach called SEMMA (Sampling, Exploration, Modification, Modeling, and Assessment). This package offers features that compare the results of the different types of modeling through statistical analysis and in business terms. The integrated environment allows for the statistical modeling group, business managers and the information technology department to work together. 
Although SAS supports loading text in a variety of file formats, all externally stored text files must first be converted to a SAS data set via the use of a prewritten macro, adding additional complexity and time consumption to this step in the text mining process. SAS’ user interface is less intuitive than other software, but it still offers many of the same features as other products which affect how text is parsed. In addition to offering these basic features, SAS also offers a user the ability to affect how its algorithm is run which represents parsed documents in a structured, quantitative form. SAS’ term extraction generally yields a larger number of terms than other software, but its results browser allows for easy browsing and filtering of these terms. SAS also simplifies the text mining process somewhat by automatically clustering documents and identifying links between terms when a term extraction is executed. Figure 2 below shows SAS’ text mining browser for documents, extracted terms, and clusters.Figure 2: SAS text mining browserSPSS is flexible in that it supports many text formats, including: plain text, PDF, HTML, Microsoft Office, and XML text files. It has open architecture which allows the program to join together with other text analytics applications including Text Mining Builder, LexiQuest Categorize, and all other SPSS data mining and predictive analytics applications. Clementine’s text extraction does well to offer the use several methods for limiting the scope of an extraction and therefore tends to yield the most informative terms and phrases in its results. Figure 3 below shows Clementine’s text extraction results browser.Figure 3: Ranked list of extracted concepts in ClementineClementine also includes a text link analysis which has two main functions. The first is that it recognizes and extracts sentiments (i.e. likes and dislikes) from the text with the help of dictionary files created in Text Mining Builder.Some of these dictionary files are already provided in the software package while others can be developed by the user. The second is that it detects correlations between things such as people/events, or diseases/genes. SPSS also allows a user to manually define a customized taxonomy with the help of LexiQuest Categorize, which uses training data to learn the characteristics of the taxonomy predict the classification of documents.Polyanalyst, manufactured by Megaputer, can process large scale databases. The software also has a low learning curve and step by step tutorials. The integration of an analysis performed on both structured and unstructured text is available. The results of the analysis can be incorporated into existing business processes.Clarabridge provides analysis of data and is used in conjunctions with commercially available business intelligence tools which are used to view and interpret the results of this analysis. It also allows parallel processing of large amounts of data. During the processing, entities, relationships, sections, headers, and topics, as well as proximal relationships, tables, and other data are recognized. This information is stored into the capture schema thus maintaining metadata and linking it back to its source. Clarabridge can contain large amounts of data and maintain high throughput. The GUI requires a minimal amount of coding from users and processes can be done without human intervention.Table 3 shows the different functions that the software have. 
For some of these functions, such as extraction, all of the software possesses some form of the function. For others, such as clustering, not all of the software have the feature. A table of this form is useful for a quick visual software functionality comparison.Table 3: FunctionalitySOFTWAREFUNCTIONS SPSS SASClarabridge Polyanalyst Leximancer Extraction X XX X X Summarization X Categorization X X X X XClustering X X XConceptLinking X XX X X DecisionTrees X X XDocumentLinkage XX X X MemorybasedReasoning X X X XRegression X X X Exports data for further analysis in other packagesTime Series X X X Exportsdata forfurtheranalysis inotherpackagesII.R ESULTSUnfortunately we were unable to run the Cavalier Dailyfiles through Clarabridge because this software packagerequires additional business intelligence tools in order toachieve readable results.After running the sample corpus through the other foursoftware suites and comparing the subjective ease of use ofthe software, two products rose to the top: SAS and SPSS.The immediate advantage of these pieces of software isthat they were developed for large projects, so whileprocessing 1200 documents took a significant amount oftime, they were able to display meaningful results. Thechoice between these products, however, rests on theparticular application in which they are used.Because of the non-intuitive interface and steep learningcurve, SAS is best used in situations where the user alreadyhas a general understanding of text mining. It is also anexcellent choice for processing large amounts of documents,however, it only gives truly meaningful information if theinput has been pre-processed and made to be semi-structured.When running the files through, SPSS proved to be thequickest at mining the files. SPSS also is a good choice forprocessing large amounts of documents and provides moreuseful results if the input has been pre-processed and madeto be semi-structured. Another benefit that SPSS has overSAS is that SAS extracts a large amount of useful terms.All of the software products tested primarily extractedsport-related concepts from the given corpus in theexplorative analysis that was done. This indicates that in theCavalier Daily newspaper, sports are the main topics that arereported on. Again, because we were unable to obtainworking copies of Clarabridge and Polyanalyst, we wereunable to test them using our sample corpus. Further resultscould be obtained with a deeper analysis, but as we wereusing our corpus to get only a preliminary idea of thefeatures of the software, we did not pursue a more advancedinvestigation.III.F UTURE W ORKFuture work with text mining software is alreadyunderway. While the test corpus that was used to evaluatethe software in this report was large, the problem that wasattempted to be solved was not well-defined. Therefore, anew problem has been proposed that is well-defined, andwork is underway to analyze and solve it.There is a current project underway in which a group isattempting to extract relationships between the results fromsocial security disability claims in the court systems and thecontent of the claims that are filed with the courts. This is aproblem that is semi-structured and well-defined, and isperfect for further testing of the SAS and SPSS suites.The data for these cases are being gathered from variousstate and federal websites that have cases on record havingto do with social security disability claims. 
This data will becollected, parsed, inputted into a database table, and thenprocessed by SAS and SPSS in order to extract relationships in the data.The hope is that this processing will lead to discoveries about what types of claims are most often approved or rejected by the courts, if there is such a relationship. For example, it might be the case that if a person mentions “Lou Gehrig’s disease” in their claims, that they are almost always approved for their claim. If such a relationship were true, then text mining software like SAS and SPSS should be able to extract it through predictive capabilities.IV.C ONCLUSIONThe following goals were achieved by the conclusion of this project:•Identified the common needs of users of textmining tools through researching the industries thatuse text mining and the applications for which textmining is employed.•Addressed the current state in text mining through background research in the field and hands onexperience.•Evaluated and compared text mining software. This goal can be improved upon in future projects byconsidering an expanded set of evaluation criteria.A PPENDIXThis Appendix provides a glossary of terms commonly used in discussions of text mining software.KDD-knowledge discovery and data miningQueries-a common way of extracting information from databasesTuples-finite sequence or ordered list of objectEase of Learning-how easy or hard it is to learn how to use the softwareEase of Use-once the software is learned, how easy or hard it is to use the softwareClustering-Clustering algorithms find groups of items that are similar. For example, clustering could be used by an insurance company to group customers according to income, age, types of policies purchased and prior claims experience.Decision tree-A tree-like way of representing a collection of hierarchical rules that lead to a class or value.Regression tree-A decision tree that predicts values of continuous variables. Time series model-A model that forecasts future values of a time series based on past values.Extraction-locating specific pieces of data and extracting it from the documentSummarization- summarization extracts the most relevant phrases or even sentences from a document.Concept Linking- Usually comes in the form of some web-like visualization in which the links between extracted concepts are shown based on their co-occurrence and proximity within documents.Document Linkage – The ability to view in the results wherein the documents the concept occurs. Results link back toinput documents.Categorization- Organization of documents into predefined categories based on existence of specified indicator concepts within the documents.Memory-based Reasoning- MBR uses training records totrain a neural network to learn to predict certain characteristics of new documentsA CKNOWLEDGMENTThe Capstone team would like to thank their technical advisor, Professor K. Preston White for guiding them through the capstone project. Also, the team would like to acknowledge Elder Research Incorporated for allowing themto participate in such a rewarding project through funding it. The team would like to also thank Debbie and Jordan for everything they have done for the team.R EFERENCES[1]Weglarz, G. (2004). Two worlds of data – unstructured and structured.DM Review, 14(9), 19-22.[2]J. Elder et al., An Evaluation of High-end Data Mining Tools forFraud Detection. Available:/Portals/0/tool eval articles/ smc98abbot mat eld.pdf.[3]Megaputer, (2002), Polyanalyst Powerpoint.[4] S. Grimes, The Word on Text Mining. 
Available: .[5]Saving Lives, Cutting Patient Costs: Louisville Hospitals Advancewith SAS Text Miner, SAS, 2006. Available: /success/louisville.html.[6]R. Rao, From Unstructured Data to Actionable Intelligence, IT Pro,9202(03), 2003, pp. 1-7.[7]P. Fule, J. Roddick, Detecting Privacy and Ethical Sensitivity in DataMining Results, School of Informatics and Engineering, FlindersUniversity, South Australia.[8]Text Mining Terms, SPSS White Paper, Sept. 2006.[9]K. Michel, J. Elder, Evaluation of Fourteen Desktop Data MiningTools, 1998.[10]M. Goebel, L. Gruenwald, A Survey of Data Mining and KnowledgeDiscovery Software Tools, SIGKDD Explorations, pp. 20-33, 1999. [11]Apte, C. (2002). Business applications of data mining.Communications of the ACM, 45(8), 49-53.[12]Blumberg, R. (2003). The problem with unstructured data. DMReview, 13(2), 42-4.[13]Grimes, S. (2005). The Developing Text Mining Market. [White paper,electronic version]. Retrieved October 12, 2006 from .[14]Hand, D., Mannila, H., Smyth, P. (2001). Principles of Data Mining.MIT Press: Cambridge, MA.[15]Marti Hearst. “What is Text Mining?” 17 October 2003./~hearst/text-mining.html (29 October2006).。
Text Knowledge Mining:An Alternative to Text Data MiningD.S´a nchez,M.J.Mart´ın-Bautista,I.Blancoputer Science and A.I.,University of GranadaE.T.S.I.I.T.,C/Periodista Daniel Saucedo Aranda s/n18071Granada,Spain{daniel,mbautis,iblanco}@decsai.ugr.esC.Justicia de la TorreCenter of Informatics and Communication NetworksUniversity of Granada,18071Granada,Spainconsuelo@ugr.esAbstractIn this paper we introduced an alternative view of text mining and we review several alternative views proposed by different authors.We propose a classification of text min-ing techniques into two main groups:techniques based on inductive inference,that we call text data mining(TDM, comprising most of the existing proposals in the literature), and techniques based on deductive or abductive inference, that we call text knowledge mining(TKM).To our knowl-edge,the TKM view of text mining is new though,as we shall show,several existing techniques could be considered in this group.We discuss about the possibilities and chal-lenges of TKM techniques.We also discuss about the appli-cation of existing theories in possible future research in this field.1INTRODUCTIONText mining refers to the discovery of non-trivial,pre-viously unknown,and potentially useful knowledge from a collection of texts.Since its origin,text mining has been considered an analog of data mining(interpreted as Knowledge Discovery in Databases,or KDD)applied to text repositories.Text mining is very important since nowa-days,around80%of the information stored in computers (not considering audio,video,and images)consists of text.What seems to give text mining a clear distinction from data mining is that the latter deals with structured data, whereas text presents special characteristics and its explicit appearance is basically unstructured.Following this vision of text mining as data mining on unstructured data,most of the approaches to text mining have been mainly concerned with obtaining structured datasets from text,called inter-mediate forms,on which usual data mining techniques are applied.Table1shows the most usual intermediate forms that have been proposed and the minimal text units(atomic data)they are comprised of.For example,the bag of words represents a piece of text by means of a set of weighted terms,like that employed in Information Retrieval.Data mining techniques employed on the intermediate forms of Table1include clustering,dependence analysis,and asso-ciation rules among others.Text Unit Intermediate FormWord Bag-of-WordsN-gramsConcept Concept HierarchyConceptual GraphSemantic GraphConceptual DependencePhrase N-phrasesMulti-term text phrasesTrendsParagraph ParagraphN-phrasesMulti-term text phrasesTrendsDocument DocumentTable1.Minimal text units and their relatedIntermediate formsHowever,textual data is not inherently unstructured.On the contrary,text is characterized by a very complex implicit structure that has defeated many representation attempts,2008 IEEE International Conference on Data Mining Workshopswith a very rich semantics.On this basis,several authors disagree with the most usual vision of text mining as apply-ing data mining techniques,because these are designed to work on very simple data structures with a very limited ex-pressive power.By employing these data structures as inter-mediate forms,we are losing most of the semantics of text; in addition,when using data mining techniques on richer semantic structures,we are exploiting a very small part of their expressive power.Following this discussion,Hearst[7]distinguishes two 
types of text mining;first,the aforementioned vision of text mining as data mining techniques providing simple patterns and regularities.Second,what she calls real text mining, that obtains new and useful knowledge pieces with seman-tics as richer as that of text.As an example of the latter, she cites the work of Swanson and Smalheiser[20]in the Arrowsmith project.They provide a software tool able to employ the“causal”relationships that appear in the medical literature,as appearing in titles of papers in the MEDLINE database,in order to generate new,previously unknown hy-pothesis.The following example is a quote from Hearst[7]:“For example,when investigating causes of mi-graine headaches,he[Swanson]extracted variouspieces of evidence from titles of articles in thebiomedical literature.Some of these clues can beparaphrased as follows:•stress is associated with migraines•stress can lead to loss of magnesium•calcium channel blockers prevent some mi-graines•magnesium is a natural calcium channelblocker•spreading cortical depression(SCD)is im-plicated in some migraines•high levels of magnesium inhibit SCD•migraine patients have high platelet aggre-gability•magnesium can suppress platelet aggrega-bilityThese clues suggest that magnesium deficiencymay play a role in some kinds of migraineheadache;a hypothesis which did not exist in theliterature at the time Swanson found these links.”The discussion about what is and what is not text min-ing has been also considered by other authors,together with reflections about the importance of knowledge and its rep-resentation in text mining.In this paper we offer our own view of text mining.In our view,text mining techniques can be classified into two main groups:techniques based on inductive inference,that we call text data mining(TDM, comprising most of the existing proposals in the literature), and techniques based on deductive or abductive inference, that we call text knowledge mining(TKM).Usual data min-ing techniques,based on induction,fall in thefirst group. 
The technique for generating new hypothesis proposed by Swanson and Smalheiser can be considered,as we shall see, as based on abduction;in this sense,is an example of tech-nique of the second group.To our knowledge,the TKM view of text mining is new though,as we shall show,several existing techniques could be considered in this group.We discuss about the possibil-ities and challenges of TKM techniques.We also discuss about the application of existing theories in possible future research in thisfield.2TEXT KNOWLEDGE MINING2.1MOTIV ATION AND DEFINITIONData mining techniques for KDD are based on induc-tive inference,i.e.,starting from a set of particular cases described by some(structured)data model,they obtain gen-eral patterns consistent with the particular examples.These patterns are subject to objective and/or subjective assess-ment;the former is based on statistics in general,while the latter takes into account the background knowledge in the domain in order to assess the novelty,usefulness,and non-triviality of the patterns,in accordance with the objectives of KDD.Inductive inference is the natural way to obtain new knowledge from databases since they contain particu-lar examples,simple facts,arranged in data structures.As we have seen,the usual view of text mining consists of obtaining a structured dataset from text and then applying on it the usual inductive data mining techniques.However, in our opinion,this strong association between mining and inductive learning that seems to dominate the research in text mining,comes from the fact that data mining was the first“mining”paradigm.Remarkably,the word induction does not appear in the definition of text mining.Text mining is concerned with obtaining new,non-trivial,and potentially useful knowledge from text repositories stored in comput-ers.Though inductive approaches have been shown to be very useful,they do not seem to be the“natural”choice for text mining as they are for data mining.In our view,the main difference between data and text is that the former is represented by means of data mod-els whilst the latter is represented by natural language. And these representations have very different characteris-tics.Data models have a limited expressive power,mainly restricted to the structured representation of simple facts. On the contrary,natural language has a far richer expressivepower,and it is able to represent not only simple facts like those stored in a database,but general knowledge of almost any kind,from relations like rules to complex procedures. 
Unfortunately,natural language has some well-known dis-advantages like lack of a simple,computationally easy to manage structure,vagueness,ambiguity,etc.Summarizing,and leaving apart the issue of structure and manageability,the main difference between data and text can be stated as follows:•Data represents a collection of simple facts,particu-lar cases,from where knowledge can be obtained in a natural way by means of inductive inference.•On the contrary,text expressed in natural language isa representation of general knowledge,a knowledgebase[7],whose semantics is usually much richer than simple facts,on which the natural way of obtaining new knowledge is by means of deductive or abductive inference.Let us point out that•the idea of using deductive and/or abductive inference does not contradict the definition of text mining,since there is no restriction about what kind of techniques should be employed to obtain the new knowledge,and •though we consider deductive and abductive inference to be the natural choice for text mining,this does not means that text mining based on inductive inference is not text mining.Almost all the text mining approaches existing in the literature,that have been shown to be very useful in practice,are based on induction.That is,by proposing text mining based on non-inductive inference,we propose a new direction that,as we shall show,leads to techniques able to provide new knowledge from text that cannot be obtained by approaches based on induction.In summary,we propose to classify text mining techniques into two groups:•Text Data Mining(TDM)techniques obtain new knowledge by applying inductive inference on a struc-tured data representation obtained from text.Most ex-isting text mining techniques fall in this category.•Text Knowledge Mining(TKM)techniques obtain new knowledge by applying deductive and/or abduc-tive inference on a structured knowledge representa-tion of the semantic content of text.Notice that we employ TKM meaning“mining by rea-soning using knowledge contained in text”,i.e.,the pro-cess of obtaining new knowledge taking the knowledge con-tained in the texts as the starting point.We understand TKM as a particular case of what we call knowledge mining, the latter defined as“obtaining non-trivial,previously un-known,and potentially useful knowledge from knowledge repositories”.Notice that the main difference between our definition of knowledge mining and the usual definition of data mining is that the former intends to discover knowledge from knowledge repositories,whilst the latter tries to do the same from data repositories.This difference in the start-ing point,the basic material from which new knowledge is to be obtained,makes clear that the natural way to knowl-edge mining is deductive or abductive inference,whilst the natural way to data mining is inductive inference.We think that TKM techniques are examples of what Hearst[7]call“real text mining”because i)the real knowl-edge contained in text is employed instead of text metadata or intermediate forms with a poorer semantic content,and ii)new,non trivial knowledge is derived from data,with the same expressive power.The phases of the TKM process are exactly the same as those of TDM,i.e.,first text refining for obtaining a com-putationally manageable intermediate form representing the text,a mining procedure to obtain new knowledge,and afi-nal step for assessing andfiltering the knowledge obtained. 
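To make these three phases concrete, the following small sketch (ours, purely illustrative and not taken from any of the systems cited here) shows how such a loop could be organized in Python. It assumes that text refining has already produced a structured intermediate form, here just atomic facts and Horn-style rules with invented names; the mining step is plain forward chaining, and the assessment step merely discards what the analyst already knew.

    def tkm_pipeline(texts, background_knowledge):
        """Illustrative-only skeleton of the three TKM phases."""
        # 1) Text refining: here we simply pool knowledge that is assumed to be
        #    already structured as atomic facts and Horn rules (premises -> conclusion);
        #    a real system would do the hard NLP/IE work at this point.
        facts, rules = set(), []
        for t in texts:
            facts |= set(t.get("facts", ()))
            rules += list(t.get("rules", ()))

        # 2) Mining by deduction: forward-chain the rules over the pooled facts.
        derived = set()
        changed = True
        while changed:
            changed = False
            known = facts | derived
            for premises, conclusion in rules:
                if set(premises) <= known and conclusion not in known:
                    derived.add(conclusion)
                    changed = True

        # 3) Assessment and filtering: keep only conclusions that are new with
        #    respect to the texts themselves and to the analyst's prior knowledge.
        return derived - facts - background_knowledge

    # Toy run with invented atoms: one text contributes a rule, another a fact.
    texts = [
        {"rules": [(("uses_medline",), "biomedical_domain")]},
        {"facts": ("uses_medline",)},
    ]
    print(tkm_pipeline(texts, background_knowledge=set()))   # {'biomedical_domain'}

Even this toy run illustrates the intended difference from induction: the derived atom appears in none of the input texts, yet it follows deductively from their combined content.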
However,there are substantial differences between TKM and TDM in the way these procedures are performed and the challenges they pose.We shall go into details in section 3.2.2KM:WHAT WE DO NOT MEANThe term“knowledge mining”has been employed by some authors in a different sense from that introduced in the previous section.In[16],knowledge mining is em-ployed to specify that the intermediate form employed con-tains knowledge richer than simple data structures(though inductive data mining techniques are used on the intermedi-ate forms).In[9],knowledge mining is employed to indi-cate that background knowledge and user knowledge is in-corporated in the mining process in order to ensure that the intermediate form and thefinal patterns contain only those concepts interesting for the user;again,inductive data min-ing techniques are used.In[18]knowledge mining refers to using background knowledge to evaluate the novel and interesting patterns after an inductive process.Another term appearing in the literature with a differ-ent meaning with respect to our proposal is“deductive min-ing”.We have not found a proper definition,but references to[10].In this paper,deductive mining refer to a group of text mining techniques,where the better known exam-ple is said to be Information rmation ex-traction is frequently referred as“the mapping of natural language texts into predefined,structured representation,or templates,which,whenfilled,represent an extract of keyinformation from the original text”;or alternatively“the process offilling thefields and records of a database from unstructured text”.Information extraction appears in relation to text mining in many papers[8,14,15],but it is concerned mainly with the task of obtaining structured intermediate representations from text,that is,with translating the knowledge existing in text into another,structured,representation;however,no new knowledge is derived.The term“deductive”is applica-ble to information extraction since its starting point is some knowledge about the kind of structures to befilled and rules about how tofind instances of values in text tofill them.However,the knowledge employed for thefilling task is not the knowledge contained in the text,but metaknowledge about text and how to translate it into a predefined structure, and the deductive inference performed is translating from one representation to another,but no new knowledge is dis-covered.This fact is pointed out by Hearst[7]in relation to other techniques coming from computational linguistics and applied to particular problems like automatic augmentation of Wordnet relations.In[6],metaknowledge is employed for these purposes for(so-called)knowledge mining.In our view,information extraction performs TDM but not TKM; however,it can play a very important role in obtaining in-termediate representations for TKM.The use of deduction in combination with data mining has been proposed by some authors referring to the incor-poration of deductive capabilities to mining tools in order to improve the assessment of the knowledge obtained[19,5]. 
In[12],the term“deductive data mining”is employed to indicate that the knowledge should be obtained mathemati-cally and taking into account the hidden assumptions about data.Finally,in[17],a“deductive data mining”paradigm is proposed,where the term“deductive”refer to the fact that the discovery process is performed on the result of user queries(seen as the result of a deductive process)that limit the possibility to work on corrupt data.In all these cases, the generation of new knowledge is purely inductive.2.3RELATED WORKRelated to TKM,an interesting summary and discussion about[7]is provided in[11].In this paper,the authors dis-tinguish between non-novel investigation,like information retrieval,in which no new knowledge is generated;semi-novel investigation,like standard text mining and KDD, in which patterns/trends that already exist in data are dis-covered,and novel investigation or knowledge creation,in which new knowledge,telling something about the world outside the data collection itself,is created.The latter in-cludes solutions to problems like,“how can the linguis-tic features of text be used to create knowledge about the outside world?”,“does a new theme reflect reality?”,and “which business decisions are implied?”.For achieving these goals of knowledge creation,the au-thors[11]propose what they call Intelligent Text Mining, i.e.,the use of i)interaction between users and mining tools, and ii)AI techniques(no particular technique is specified). However,these techniques are proposed for the assessment of knowledge only.No indication about the kind of in-ference employed for obtaining the new knowledge is pro-vided.Some advanced text mining techniques that can be con-sidered TKM are related to literature-based discovery[3, 20,13,22].The objective of these techniques is to pro-vide hypothesis in the form of possible relations between concepts appearing in the literature about a specific topic. In this sense,they can be seen as abduction-based TKM techniques,though some authors consider that their work is not about discovery but about assisting human experts in formulating new hypothesis by means of an interactive process[13,22].The process is performed usually on the title and abstract of papers in a bibliographic database.The main work has been made in the medicalfield by using the MEDLINE database[20,13,22],plus some additional background knowledge like the UMLS thesaurus in some cases[22];however,there are results in other areas[3].The basic idea of literature-based discovery[3,20,22] is the following:let{A1,...,A n,B1,...,B m,C}be con-cepts and suppose we are interested infinding hypothesis involving possible relations between C and some of the A i. Suppose that no A i and C appear simultaneously in the col-lection,but they can appear simultaneously with some of the B i.Then,if the set of B j that appear simultaneously with A i and C(obviously,in different documents)is large enough,a relation between A i and C can be hypothesized. 
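A minimal sketch of this co-occurrence heuristic, written by us for illustration only, is given below. The documents are assumed to be already reduced to sets of identified concepts (which, as discussed next, is the hard part), the bridging threshold is arbitrary, and the toy corpus simply paraphrases the migraine clues quoted in the introduction.

    def abc_hypotheses(docs, target_c, candidate_as, min_bridges=2):
        """Swanson-style A-B-C heuristic: hypothesize a link between A and C when
        they never co-occur in a document but share enough bridging concepts B.

        docs: list of sets, each document already reduced to the concepts it mentions.
        """
        # Concepts that co-occur with the target concept C somewhere in the collection.
        with_c = set()
        for d in docs:
            if target_c in d:
                with_c |= d - {target_c}

        hypotheses = {}
        for a in candidate_as:
            # Skip pairs that already co-occur: there is nothing new to hypothesize.
            if any(a in d and target_c in d for d in docs):
                continue
            with_a = set()
            for d in docs:
                if a in d:
                    with_a |= d - {a}
            bridges = with_a & with_c
            if len(bridges) >= min_bridges:
                hypotheses[a] = sorted(bridges)
        return hypotheses

    # Toy corpus paraphrasing the migraine clues (concept vocabulary invented).
    docs = [
        {"stress", "migraine"},
        {"stress", "magnesium"},
        {"calcium channel blockers", "migraine"},
        {"magnesium", "calcium channel blockers"},
        {"SCD", "migraine"},
        {"magnesium", "SCD"},
    ]
    print(abc_hypotheses(docs, "migraine", ["magnesium"]))
    # {'magnesium': ['SCD', 'calcium channel blockers', 'stress']}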
In[13],several statistical measures are employed.This pro-cedure has provided new hypothesis that have been later confirmed by experimental studies,as that in the introduc-tion of this paper or that in[22].Let us remark that C and restrictions about what concepts can play the role A or B are to be specified in advance.A key point in this approach is the identification of con-cepts to obtain the intermediate forms of documents.The main problem with this point is that a lot of background knowledge about the text is necessary,and varies a lot from one domain to another.In addition,let us remark that i)the relation between concepts is based in most of the cases on their simultaneous appearance in the same title or abstract only,and ii)the generation of hypothesis does not follow a formal procedure.However,it is inspired on one of the ways humans generate hypothesis.As stated in[22],using a better semantical analysis seems to be promising.Remark-ably,some formal model for conjectures,consequences and hypothesis are available in the literature[21,23],potentially providing a formal tool for literature-based discovery andTKM.A text mining technique that we also consider as TKM was proposed in[4]for discovering contradictions in a col-lection of texts.The starting point is a set of text documents T={t1,...,t n}of arbitrary length.The objective is to discover possible contradictions that may be deduced from the content of these texts.For this purpose,afirst order logic representation in clausal normal form of the content of text is employed as intermediate form,and a deductive inference procedure based on resolution is performed.The intermediate form is obtained by a semi-automatic procedure in two steps.First,an interlingua representa-tion based on United Nation’s standard UNL[1,4]is ob-tained from text.This has the advantages that there are UNL encoders available for different languages,hence eliminat-ing the multilinguism problem typical of text mining,and that UNL is a computationally manageable representation. From the UNL representation,thefirst order clausal form is obtained by means of information extraction techniques. 
Background knowledge, in the form of either text or clauses, can be incorporated into the process as an additional document. The mining process is performed by a resolution-based deductive inference procedure, which is applied to every subset of texts in search of contradictions. The procedure is similar to the levelwise, candidate-generation exploration used in frequent itemset discovery, but the minimum support criterion is replaced by non-contradiction of the knowledge base obtained as the union of the intermediate forms of the texts. First, individual texts are evaluated to discard the possibility that they contain a contradiction on their own. This corresponds to exploring the first level of the lattice of P(T) with respect to inclusion. Texts containing contradictions are not considered at the following level. Candidate generation for the next level is performed in the same way as in frequent itemset mining at the successive levels. The final result is the collection of subsets of texts that generate a contradiction. In addition, the intermediate clauses obtained by the resolution process are themselves new pieces of knowledge produced by this TKM technique.

More specifically, let res(t_i) be a procedure performing resolution on the set of clauses obtained from t_i, returning the value false in case it finds a contradiction. Let t_i t_j be the concatenation of the texts (that is, of the corresponding clauses) t_i and t_j. Let P_T be the set of all possible concatenations of subsets of T of any size, and let V represent the set of concatenations that do not contain contradictions. Then the algorithm proposed in [4] for the discovery of contradictions is that of Figure 1.

1. V_1 ← { t_i ∈ T | res(t_i) ≠ false }
2. V ← V_1
3. For k = 2 to n
   (a) V_k ← { t′t′′ | t′ ∈ V_{k−1}, t′′ ∈ V_1, t*t′′ ∈ V ∀ t* ⊂ t′, and res(t′t′′) ≠ false }
   (b) V ← V ∪ V_k
4. Contradictions ← P_T \ V

Figure 1. Algorithm for discovering contradictions proposed in [4].

A problem with this approach is that the first order logic representation is monotonic and has some well-known drawbacks. However, the basic procedure could be extended to other kinds of logics and the corresponding reasoning for contradictions.

Looking for contradictions is very useful. It can be employed to assess the consistency of the text collection, or to assess the validity of a new text to be incorporated, in terms of the knowledge already contained in the collection. In addition, if we consider collections in which contradictions are known to exist (for example, repositories of papers containing opinions about topics), the search for contradictions can be employed to separate different groups of texts and, consequently, of authors in terms of the consistency of their ideas. Finally, reasoning with ontologies can ultimately be reduced to checking the consistency (i.e., non-contradiction) of a knowledge base, so looking for contradictions can be seen as a basic procedure for performing different types of reasoning.

3 CHALLENGES OF TKM

Just as data mining owes a lot to machine learning, statistics, and other previously existing areas, TKM would benefit from the large body of existing results in areas such as knowledge representation, reasoning algorithms for deductive and abductive inference, and knowledge-based systems, as well as from natural language processing techniques. As in the case of data mining, these techniques must be adapted for TKM, and new, specific research results are needed as well. In the following we detail some of the challenges that, in our opinion, TKM must address.

3.1 OBJECTIVES

Knowledge-based systems and TKM systems have very
different objectives that affect the way techniques com-ing from knowledge representation and reasoning can be adapted to TKM:•In a knowledge-based system,a knowledge base is built containing the knowledge needed by the system to solve a specific problem.In TKM,the knowledge managed by the systems is collected from a collection of texts,each of one regarded as a particular knowl-edge base;however,these knowledge bases in texts are not generated for the specific purpose of building an effective knowledge-based system,but generally as historical or normative reports.•Knowledge-based systems intend to give answer to ev-ery possible question,or to solve any possible problem posed to the system.On the contrary,the objective of TKM is just to provide new pieces of non-trivial, previously unknown,and potentially useful knowledge derived from the collection of text.A TKM is not re-quired to obtain every such piece of knowledge;in par-ticular,such set of knowledge pieces is unknown by definition.•A knowledge-based systems performs a reasoning pro-cess as a result of a query,stated by the user of a con-sultive system,or by the necessity to perform an action in a certain situation in the case of intelligent agents.In this sense,the inference procedures employed by knowledge-based systems are in general oriented and optimized to give an answer to specific questions.On the contrary,a TKM system is most generally asked tofind new knowledge without specifying a specific question.Of course,the TKM process can also be focused by means of specific queries(e.g.“find new knowledge involving X”,with X being and object,a kind of action,or anything else).3.2KNOWLEDGE REPRESENTATIONText mining requires to translate text into a computation-ally manageable intermediate form.We have introduced some of the most employed intermediate forms for TDM in Table1.Techniques coming form natural language pro-cessing and information retrieval have been employed for obtaining intermediate forms from text.This step of the text mining process is crucial and poses several challenges, common to TDM and TKM,like obtaining intermediate forms from multilingual text collections,defining structures for information extraction,identifying semantic links be-tween new concepts,relating different identifiers,etc.It is important to remark that the intermediate forms and inference procedures employed are obviously linked to each other,e.g.,in order to perform logical deduction,a logic-based representation of the knowledge in texts is needed. 
Usually,abductive and(specially)deductive inference re-quire a more expressive,semantically richer intermediate form than inductive inference,that are more difficult to ob-tain from texts in general.However,let us remark that many TDM techniques perform inductive inference on intermedi-ate forms that are also suitable for TKM,as they contain more information about the semantics of texts,like semantic graphs and conceptual dependencies in Table1,that allow certain kind of non-inductive inference.A key problem with obtaining intermediate forms for TKM is that the current state-of-the-art techniques for trans-lating text into those intermediate forms are mainly semi-automatic,i.e.,they require the participation of human ex-perts in order to verify that the translation is correct.On the other hand,several ongoing projects deal with the possibil-ity of expressing all or at least key parts of the knowledge in text by means of knowledge representation models di-rectly.This is the case for example of work on the Semantic Web,where ontologies represented by description logics or languages with the same expressive power are employed in order to represent the semantics of concepts and their re-lations[2].TKM in the Semantic Web seems a promising direction since efficient deductive inference techniques are also available.The objective though of an intermediate form for TKM is not to represent every possible aspect of the semantic con-tent of text,but those related to the kind of inference we are interested in,that are on its turn determined by the kind of knowledge we want to discover.Even the lack of a text an-alyzer able to translate without loss from text to the IF of our choice has no more negative effect that,at most,pre-venting some pieces of knowledge from being discovered. However,as we discussed in the previous section,TKM is not expected to be exhaustive in obtaining new knowledge and,in this sense,this is not a major problem.Of course, better analyzers provide more effective TKM systems.On the other hand,if the analyzer obtain an inexact rep-resentation of text in an intermediate form,this can af-fect the knowledge discovered,leading to false discover-ies.Minimizing such misleading discoveries(that,in data mining and TDM,use to be the larger part of the patterns discovered,together with the uninteresting ones)is an im-portant challenge.In this sense,not only the misfunctioning of text analyzers,but also the reliability of text itself and the adequacy of the intermediate forms introduce new problems in TKM.The knowledge in a knowledge-based system is as-sumed to be reliable and consistent,but the same cannot be assumed in general in a collection of texts,even before the intermediate form of each one is obtained.Coping with the reliability of knowledge in the inference process is a major challenge in TKM.In this point,existing techniques for the assessment of opinions,importance,reliability and confi-dence of knowledge may be very useful.。
A Foundational Approach to Mining Itemset Utilities from DatabasesHong Yao,Howard J.Hamilton,and Cory J.ButzDepartment of Computer ScienceUniversity of ReginaRegina,SK,Canada,S4S0A2{yao2hong,hamilton,butz}@cs.uregina.caAbstractMost approaches to mining association rules implicitly con-sider the utilities of the itemsets to be equal.We assume that the utilities of itemsets may differ,and identify the high utility itemsets based on information in the transac-tion database and external information about utilities.Our theoretical analysis of the resulting problem lays the foun-dation for future utility mining algorithms.1IntroductionWe describe utility mining,whichfinds all itemsets in a transaction database with utility values higher than the minutil threshold.Standard methods for mining association rules[1,7]are based on the support-confidence model.Theyfirstfind all frequent itemsets, i.e.,itemsets with support of at least minsup,and then, from these itemsets,generate all association rules with confidence of at least minconf.The support measure is used because it is assumed that only highly frequent itemsets are likely to be of interest to users.The frequency of an itemset may not be a sufficient indicator of interestingness,because it only reflects the number of transactions in the database that contain the itemset.It does not reveal the utility of an itemset, which can be measured in terms of cost,profit,or other expressions of user preference.For example,the small transaction database shown in Figure1indicates the quantity sold of each item in each transaction. This information is of high utility.In association rule mining,support is defined over the binary domain {0,1},where1indicates the presence of an item in a transaction,and0its absence.The share measure[2]was proposed to overcome the shortcomings of support.It reflects the impact of the quantity sold on the cost or profit of an itemset.Lu et al.proposed a scheme for weighting each item using a constant value without regard to the significance of transactions[5].In their scheme,the utilities are attached to the the items rather than the transactions. Wang et al.[9]suggested that it remains unclearTID Item A Item B Item C Item DT110114T20060T31024T40040T50031T600113T70080T84007T901110T1000018Figure1:An Example Transaction Databaseto what extent patterns can be used to maximize the business profit for an enterprise.For example, two association rules{Perfume}−→Lipstick and {Perfume}−→Diamond may suggest different utilities to a sales manager,although both are interesting rules. They assume that the same item can have several utility functions with corresponding profit margins. The goal for profit mining is to recommend a reasonable utility function for selling target items in the future. Chan et al.[3]recently described an alternate approach to mining high utility itemsets.In this paper,we generalize previous work on itemset share[2].We define two types of utilities for items,transaction utility and external utility.The transaction utility of an item in a transaction is defined according to the information stored in the transaction. For example,thequantity of an item sold in the transaction might be used as the transaction utility. 
The external utility of an item is based on information not available in the transaction.For example,it might be stored in a utility table,such as that shown in Figure2,which indicates the maximum profit for each item.The external utility is proposed as a new measure for describing user preferences.By processing a transaction database and a utilityItem Name Profit($)Item A3Item B150Item C10Item D1Figure2:An Example Utility Tabletable together,data mining can be guided by the utilities of itemsets.The discovered knowledge may be useful for maximizing a user’s goal.For example,if the goal of a supermarket manager is to maximize profit, the itemset utilities should be decided by the quantity of items sold and the unit profit on these items.The quantity sold is a transaction utility,and the unit profit is a external utility.The discovered knowledge is the itemsets that produce the largest profits.The reminder of the paper is organized as follows. In Section2,the problem of utility mining is stated. In Section3,we propose a theoretical model for our utility mining approach.In Section4,conclusions are stated.2Statement of ProblemIn this section,a formal description of the problem of utility mining is given and related concepts are described.The utility mining problem is defined as follows. Definition2.1.(Utility Mining).Let I= {i1,i2,...i m}be a set of items,D be a transaction database,and UT I,U be a utility table,where U is a subset of the real numbers that reflect the utilities of the items.The utility mining problem is to discover all itemsets in a transaction database D with utility values higher than the minutil threshold,given a utility table UT.According to the above problem statement,we need to describe and define the utility of an item and an itemset.In the remainder of this section,wefirst define the utility of an item and then give the definition of the utility of an itemset.We also extend definitions from[2]to define transaction utility.Definition2.2.The transaction utility value in a transaction,denoted o(i p,T q),is the value of an item i p in a transaction T q.The transaction utility reflects the utility in a trans-action database.The quantity sold values in Figure1 are the transaction utility values of the items in each transaction.For example,o(D,T1)=14.In general,a transaction utility value may need to be scaled or normalized to correctly reflect thetransaction utility of items.For example,if the items represent temperatures,we cannot simply say that thetransaction utility of item A is six times that of item B because o(A)=6and o(B)=1.For simplicity,weignore the required transformations in this paper. 
Definition2.3.The external utility value of an item is a numerical value s(i q)associated with an item i qsuch that s(i q)=U(i q),where U is a utility function,a function relating specific values in a domain to user preferences.The external utility reflects the utility per item that is independent of transaction.Figure2shows profit values, e.g.,s(A)=3and s(B)=150.It reveals that the supermarket can earn$3for each item A that is sold.Often,in practice,it is feasible and more convenient to specify the utility function by roster as a utility table.Definition2.4.(Utility Table).A utility table is a two dimensional table UT I,U over a collection of items I= i1,i2,...i m ,where U is the set of real numbers such that s(i q)=U(i q)for each i q∈I.Before defining the utility value of an item,theutility function needs to be defined.Piatetsky-Shapiro et al.[8]and Kl¨o sgen[4]suggested that a quantitative measure of a rule may be computed as a function and introduced axioms for such quantitative measures.We define a utility function based on their axioms. Definition2.5.A utility function f(o,s)is a two variable function,that satisfies:(1)f(o,s)monotonically increases in f(o,s)forfixed o.(2)f(o,s)monotonically increases in f(o,s)forfixed s.The utility value of an item is defined as follows. Definition2.6.The utility of an item i q in a trans-action T q,denoted u(i q,T q),is f(o(i q,T q),s(i q)),where o(i q,T q)is the transaction utility value of i q,s(i q)is the external utility value of i q,and f is a utility function.The utility of an item can be an integer value,such as the quantity sold,or a real value,such as a profit margin,total revenue,or unit cost.Example2.1.Given the transaction database in Fig-ure1,the utility table in Figure2,and utility func-tion f(o(i q,T q),s(i q))=Amount×Item Profit,we obtain u(A,T1)=1×3=3.The supermarket earns $3by selling one item A in transaction T1.Similarly, u(B,T1)=0,u(C,T1)=10,and u(D,T1)=14.Example2.1shows that the utility of a item depends on the external item utility and the transaction utility as well as the frequency.The utility of item D,a freq uent item sold67times in total,is less than the utility of item B,an infrequent item sold only once.Association rule mining approaches based on the support measure could miss such high utility items.The utility of an itemset is defined according the sum of the utilities of its items.Definition2.7.[2]A k−itemset is an itemset,X= {x1,x2,...,x k},X⊆I,1≤k≤m,of k distinct items. Each itemset X has an associated set of transactions T X={T q∈D|T q⊇X},which is the set of transactions that contain itemset X.Definition2.8.The local utility of an item x i in an itemset X,denoted l(x i,X),is the sum of the utilities of the item x i in all transactions containing X,i.e.,l(x i,X)=T q∈T X u(x i,T q).(2.1)For example,using the transaction database in Figure1 and the utility table in Figure2,l(A,ACD)=2×3=6. 
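These definitions are easy to check on the running example. The sketch below is our illustration, not the authors' implementation: it encodes Figure 1 as per-transaction quantity dictionaries and Figure 2 as a profit table, takes f(o, s) = o × s, and reproduces u(A, T1) = 3 and l(A, ACD) = 6.

# Figure 1: quantity sold per item per transaction (zero entries omitted).
db = {
    "T1": {"A": 1, "C": 1, "D": 14},  "T2": {"C": 6},
    "T3": {"A": 1, "C": 2, "D": 4},   "T4": {"C": 4},
    "T5": {"C": 3, "D": 1},           "T6": {"C": 1, "D": 13},
    "T7": {"C": 8},                   "T8": {"A": 4, "D": 7},
    "T9": {"B": 1, "C": 1, "D": 10},  "T10": {"D": 18},
}
# Figure 2: external utility (profit per unit).
profit = {"A": 3, "B": 150, "C": 10, "D": 1}

def u_item(item, tid):
    """u(i_q, T_q) = f(o, s) = quantity sold x unit profit."""
    return db[tid].get(item, 0) * profit[item]

def transactions_containing(itemset):
    """T_X: the transactions that contain every item of the itemset."""
    return [tid for tid, row in db.items() if set(itemset).issubset(row)]

def local_utility(item, itemset):
    """l(i_p, X): utility of i_p summed over all transactions containing X."""
    return sum(u_item(item, tid) for tid in transactions_containing(itemset))

print(u_item("A", "T1"))          # 3
print(local_utility("A", "ACD"))  # 6  (A occurs together with C and D only in T1 and T3)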
Definition2.9.The utility of an itemset X,denotedu(X),is the sum of the local utilities of each item in X in all transactions containing X,i.e.,u(X)=ki=1l(x i,X).(2.2)If f(o(i q,T q),s(i q))=Amount×Item Profit,then for itemset ACD,u(ACD)=l(A,ACD)+l(C,ACD)+ l(D,ACD)=2×3+3×10+18×1=54.3Theoretical Model of Utility MiningA key property of itemsets is the Apriori property(or downward closure property)[1,7],which states that if an itemset is frequent by support,then all its subsets must also be frequent by support.It guarantees that the set of frequent itemsets is downward closed with respect to the lattice of all its subsets[2].This closure property has permitted the development of efficient algorithms that traverse only a portion of the itemset lattice.However,when calculating the utility of an itemset,the itemset utility can increase or decrease as the itemset is extended by adding items.For example, u(A)=3×6=18but u(AD)=43>u(A), u(B)=1×150=150but u(AB)=0<u(B).Thus, itemset utility does not satisfy downward closure,and it is necessary to discover other properties of itemset utility to enable the design of efficient algorithms.In this section,we describe two important proper-ties of utility that allow an upper bound on the utility of a k−itemset to be calculated from the utilities of the discovered(k−1)−itemsets.Furthermore,a heuristic model for estimating itemset utility is proposed that prunes the search space by predicting whether item-sets should be counted.These results provide the the-oretical foundation for efficient algorithms for utility mining.Definition3.1.Given a k−itemset I k={i1,i2,...i k}and i p∈I k,we define I k−1i p=I k−{i p} as the(k−1)−itemset that includes all items in I k except item i p.For the4-itemset I4={A,B,C,D},we have I3A= {B,C,D},I3B={A,C,D},I3C={A,B,D},and I3D={A,B,C}.Definition3.2.Given a k-itemset I k={i1,i2,...i k},we define S k−1={I k−1i1,I k−1i2,...,I k−1i k} as the set of all(k−1)−subsets of I k.For the4-itemset I4={A,B,C,D},we have S3= {BCD,ACD,ABD,ABC}.Lemma3.1.The cardinality of S k−1,denoted|S k−1|, is k.Lemma3.1indicates that the number of the subsets of size(k−1)of I k is k.Definition3.3.Let I k={i1,i2,...i k}be a k-itemset,and let S k−1be the set of all subsets of I kof size(k−1).For a given item i p,the set S k−1i p ={I k−1|i p∈I k−1and I k−1∈S k−1}is the set of (k−1)−itemsets that each includes i p.For the4-itemset I4={A,B,C,D},we have S3A= {ACD,ABD,ABC}.BCD/∈S3A because A/∈BCD.Lemma3.2.The cardinality of S k−1i p,denoted|S k−1i p|, is k−1.Lemma3.2indicates that among the k subsets of I k of size(k−1),there are(k−1)subsets that include item i p and only one that does not.Theorem3.1.Given a k-itemset I k={i1,i2,...i k} and a(k−1)−itemset I k−1such that I k−1⊂I k,then ∀i p∈I k−1,l(i p,I k)≤l(i p,I k−1).Proof:since I k−1is a subset of I k,any transaction T q∈T I k must satisfy T q∈T I k−1.Thus,according to Equation2.1in Definition2.8,l(i p,I k)≤l(i p,I k−1) holds.For the4-itemset I4={A,B,C,D}in Figure1and the utility table in Figure2,we have a3-itemset ACD and a2-itemset AD,and AD⊂ACD.Here,l(A,ACD)=2×3=6,l(A,AD)=6×3=18,and thus l(A,ACD)≤l(A,AD).Theorem3.1indicates that the local utility value of an item i p in an itemset I must be less than or equal to the local utility value of the item i p in any subset of I that includes item i ing Lemma 3.2 and Theorem3.1,we obtain Theorem3.2. 
Theorem 3.2. The local utility of an item i_p in a k-itemset I^k must be less than or equal to the local utility of i_p in any of its (k−1)-subsets that contain i_p. Formally, the local utility satisfies

l(i_p, I^k) ≤ min_{I^{k−1} ∈ S^{k−1}_{i_p}} { l(i_p, I^{k−1}) }        (3.3)
            ≤ ( Σ_{I^{k−1} ∈ S^{k−1}_{i_p}} l(i_p, I^{k−1}) ) / (k−1)   (3.4)

where i_p ∈ I^{k−1}.

Proof: According to Theorem 3.1, the first inequality holds. According to Lemma 3.2, there are (k−1) subsets of size (k−1) that contain i_p. By substituting each l(i_p, I^{k−1}) for min_{I^{k−1} ∈ S^{k−1}_{i_p}} { l(i_p, I^{k−1}) }, the second inequality is obtained.

Example 3.1. For the 4-itemset I^4 = {A, B, C, D}, we have

l(A, I^4) ≤ min{ l(A, ABC), l(A, ABD), l(A, ACD) } ≤ ( l(A, ABC) + l(A, ABD) + l(A, ACD) ) / 3.

Theorem 3.2 allows the calculation of an upper bound for the utility of an item in a k-itemset I from the utilities of the subsets of I of size (k−1). Using Theorem 3.2, an upper bound for the utility of an itemset can also be inferred.

Theorem 3.3. (Utility Bound Property). The utility of a k-itemset I^k must satisfy

u(I^k) ≤ ( Σ_{i=1}^{k} u(I^{k−1}_i) ) / (k−1)        (3.5)

Proof: Equation 3.4 holds for each i_p ∈ I^k; summing over all items and applying Equation 2.2, Equation 3.5 is obtained.

Theorem 3.3 indicates that the utility of a k-itemset is bounded above in terms of the utilities of all its subsets of size (k−1). Thus, a level-wise method, such as that of Mannila et al. [6], can be used to prune itemsets with utilities below a threshold. As a result, the search space can be reduced.

Example 3.2. For the 4-itemset I^4 = {A, B, C, D}, we have

u(I^4) ≤ ( u(ABC) + u(ACD) + u(ABD) + u(BCD) ) / 3.

Although the utility bound property limits the utility of a k-itemset I to a fraction of the sum of the utilities of all its subsets of size (k−1), this upper bound is still high. Thus, we further tighten the bound by considering the support of the itemset, denoted sup(I), the percentage of all transactions that contain the itemset [1, 7].

Theorem 3.4. (Support Bound Property). If I^k = {i_1, i_2, ..., i_k} is a k-itemset and I^{k−1}_{i_p} = I^k − {i_p} is a (k−1)-itemset, where i_p ∈ I^k, then

sup(I^k) ≤ min_{∀ I^{k−1}_{i_p} ⊂ I^k} { sup(I^{k−1}_{i_p}) }        (3.6)

Proof: Since I^{k−1}_{i_p} is a subset of I^k, any transaction T_q ∈ T_{I^k} must satisfy T_q ∈ T_{I^{k−1}_{i_p}}. Thus, according to the definition of support, sup(I^k) ≤ sup(I^{k−1}_{i_p}) holds.

Theorem 3.4 indicates that the support of an itemset always decreases as its size increases. It also indicates that if the support of an itemset is zero, then the support of any of its supersets is also zero; that is, if sup(I^{k−1}) = 0 and I^{k−1} ⊂ I^k, then sup(I^k) = 0. For example, for the 4-itemset I^4 = {A, B, C, D} in Figure 1, sup(ABCD) is less than or equal to min{ sup(ABC), sup(ABD), sup(ACD), sup(BCD) }.

Based on the utility bound property and the support bound property, a heuristic model to predict the expected utility value of a k-itemset I^k, denoted u′(I^k), is given as follows:

u′(I^k) = ( supmin / (k−1) ) × Σ_{i=1}^{k} u(I^{k−1}_i) / sup(I^{k−1}_i)        (3.7)

where

supmin = min_{∀ I^{k−1}_{i_p} ⊂ I^k} { sup(I^{k−1}_{i_p}) }        (3.8)

Since the support bound property only allows us to estimate the number of transactions, Equation 3.7 may sacrifice some accuracy; our proposed model is therefore a heuristic model.

Example 3.3. For the 3-itemset I^3 = {B, C, D} in Figure 1 and the utility table in Figure 2, we have supmin = min{ sup(BC), sup(BD), sup(CD) } = min{ 0.1, 0.1, 0.5 } = 0.1 and

u′(BCD) = ( supmin / (3−1) ) × ( u(BC)/sup(BC) + u(BD)/sup(BD) + u(CD)/sup(CD) )
        = (0.1 / 2) × ( (150+10)/0.1 + (150+10)/0.1 + (24+24+31+23+20)/0.5 )
        = (0.1 / 2) × ( 160/0.1 + 160/0.1 + 122/0.5 )
        = 172.2

We can directly obtain u(BCD) = 1×150 + 1×10 + 10×1 = 170 from Figure 1 and Figure 2. Equation 3.7 requires that the utilities of all subsets of size (k−1) take part in the calculation, which leads to inefficiencies in algorithms. Next, we relax this constraint.
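Before the constraint is relaxed, the basic estimator of Equation 3.7 can be sketched directly. The code below is our illustration on the same toy data, and its output can be checked against Example 3.3 (u′(BCD) = 172.2 versus the exact u(BCD) = 170); it assumes every (k−1)-subset has non-zero support, as in the example.

db = {"T1": {"A": 1, "C": 1, "D": 14}, "T2": {"C": 6}, "T3": {"A": 1, "C": 2, "D": 4},
      "T4": {"C": 4}, "T5": {"C": 3, "D": 1}, "T6": {"C": 1, "D": 13}, "T7": {"C": 8},
      "T8": {"A": 4, "D": 7}, "T9": {"B": 1, "C": 1, "D": 10}, "T10": {"D": 18}}
profit = {"A": 3, "B": 150, "C": 10, "D": 1}

def support(itemset):
    """sup(X): fraction of transactions containing every item of X."""
    return sum(set(itemset).issubset(row) for row in db.values()) / len(db)

def utility(itemset):
    """u(X), Equation 2.2: exact utility of an itemset."""
    return sum(row[i] * profit[i]
               for row in db.values() if set(itemset).issubset(row)
               for i in itemset)

def estimated_utility(itemset):
    """u'(I^k), Equation 3.7: support-scaled sum over the (k-1)-subsets.

    Assumes every (k-1)-subset has non-zero support.
    """
    k = len(itemset)
    subsets = [itemset.replace(i, "") for i in itemset]   # the k (k-1)-subsets
    supmin = min(support(s) for s in subsets)             # Equation 3.8
    return supmin / (k - 1) * sum(utility(s) / support(s) for s in subsets)

print(utility("BCD"), estimated_utility("BCD"))
# 170  172.2 (up to floating-point rounding; cf. Example 3.3)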
Definition 3.4. If I is an itemset with utility u(I) ≥ minutil, where minutil is the utility threshold, then I is called a high utility itemset; otherwise I is called a low utility itemset.

The following theorem guarantees that only the high utility itemsets at level (k−1) are required to estimate the utility of a k-itemset.

Theorem 3.5. Let u′(I^k) be the expected utility of I^k as described in Equation 3.7, and let u(I^{k−1}_1), u(I^{k−1}_2), ..., u(I^{k−1}_k) be the k utility values of all subsets of I^k of size (k−1). Suppose I^{k−1}_i (1 ≤ i ≤ m) are high utility itemsets and I^{k−1}_i (m+1 ≤ i ≤ k) are low utility itemsets. Then

u′(I^k) ≤ ( supmin / (k−1) ) × Σ_{i=1}^{m} u(I^{k−1}_i) / sup(I^{k−1}_i) + ( (k−m) / (k−1) ) × minutil

where

supmin = min_{I^{k−1}_i ⊂ I^k, 1 ≤ i ≤ m} { sup(I^{k−1}_i) }        (3.9)

Proof: Since u(I^{k−1}_i) ≤ minutil when m+1 ≤ i ≤ k, we can substitute minutil for each u(I^{k−1}_i) term with m+1 ≤ i ≤ k in Equation 3.7, obtaining the desired result.

Example 3.4. For the 3-itemset I^3 = {B, C, D} in Figure 1 and the utility table in Figure 2, suppose minutil = 130. Since u(CD) = 122 < minutil, we have supmin = min{ sup(BC), sup(BD) } = min{ 0.1, 0.1 } = 0.1. The estimated utility of BCD is calculated as follows:

u′(BCD) ≤ ( supmin / (3−1) ) × ( u(BC)/sup(BC) + u(BD)/sup(BD) ) + ( (3−2) / (3−1) ) × minutil
        = (0.1 / 2) × ( (150+10)/0.1 + (150+10)/0.1 ) + (1/2) × 130
        = (0.1 / 2) × ( 160/0.1 + 160/0.1 ) + (1/2) × 130
        = 225

Theorem 3.5 is the mathematical model of utility mining that we will use to design an algorithm that estimates the expected utility of a k-itemset from the known utilities of its high utility itemsets of size (k−1).

4 Conclusions

In this paper, we defined the problem of utility mining. By analyzing the utility relationships among itemsets, we identified the utility bound property and the support bound property. Furthermore, we defined the mathematical model of utility mining based on these properties. In the future, we will design an algorithm and compare it to other itemset mining algorithms.

References

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, USA, 1993.
[2] B. Barber and H. J. Hamilton. Extracting share frequent itemsets with infrequent subsets. Data Mining and Knowledge Discovery, 7(2):153-185, 2003.
[3] R. Chan, Q. Yang, and Y.-D. Shen. Mining high utility itemsets. In Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM 2003), pages 19-26, Melbourne, FL, November 2003.
[4] W. Klosgen. Explora: a multipattern and multistrategy discovery assistant. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 249-271. AAAI/MIT Press, 1996.
[5] S. Lu, H. Hu, and F. Li. Mining weighted association rules. Intelligent Data Analysis, 5(3):211-225, 2001.
[6] H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241-258, 1997.
[7] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases (KDD'94), pages 181-192, Seattle, Washington, July 1994. AAAI Press.
[8] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 229-248. AAAI/MIT Press, 1991.
[9] K. Wang, S. Q. Zhou, and J. W. Han. Profit mining: From patterns to actions. In Advances in Database Technology, 8th International Conference on Extending Database Technology (EDBT 2002), pages 70-87, Prague, Czech Republic, 2002. Springer.
Design of a Text Information Mining System in the Big Data Environment
Zhao Yizhi; Zhang Yunfeng

Abstract: The traditional text information mining technology system can carry out systematic information mining for text information, but in the big data environment it easily produces garbled data identification and data interference. Aiming at these problems, a design scheme for a text information mining system in the big data environment is put forward. A data reducer is added to the hardware of the system, which can filter the data, ensure the accuracy of the data entering the recognition stage, and improve the efficiency of data mining. A prime number matrix model is used in the information mining process to mine the text information deeply. The Apriori computing method is optimized to ensure the priority recognition of text information and to avoid the data chaos and data interference of the traditional methods. In order to verify the effectiveness of the text information mining system in the big data environment, a comparative simulation experiment was designed. The experimental data verify that the proposed system is effective and avoids the data chaos and data interference of the traditional methods.

Journal: Modern Electronics Technique, 2018, 41(1), pp. 125-128 (4 pages)
Keywords: big data environment; text information; association density; Apriori computing method; mining system
Authors: Zhao Yizhi; Zhang Yunfeng (North China Institute of Aerospace Engineering, Langfang, Hebei 065000, China)
Original language: Chinese. Classification: TN911.1-34; TP391

With the rapid rise of the Internet era, the data and information on the Internet can only be described as massive [1-2].
Opinion Mining and Sentiment Analysis

Opinion mining and sentiment analysis are two integral components of natural language processing (NLP) that aim to identify and extract subjective information from textual data. These techniques have gained immense popularity in recent years due to their wide range of applications, including market research, social media analysis, customer feedback analysis, and brand reputation management.

Opinion mining, also known as sentiment analysis, involves determining the sentiment or polarity of a given piece of text, whether it is positive, negative, or neutral. It helps organizations and businesses understand the sentiment of their customers, whether it is towards their products, services, or brand. By analyzing customer opinions, businesses can make informed decisions about their marketing strategies, product development, and customer support.

There are various approaches to sentiment analysis, including rule-based methods, machine learning techniques, and hybrid models. Rule-based methods involve using a pre-defined set of linguistic rules to determine sentiment. These rules are developed based on linguistic patterns, such as the presence of positive or negative words in the text. While rule-based methods are relatively straightforward to implement and interpret, they often struggle with understanding sarcasm, irony, and context-dependent sentiment.

Machine learning techniques, on the other hand, involve training a model on a large dataset of labeled examples. The model then learns to predict sentiment based on patterns and features in the text. Common machine learning algorithms used for sentiment analysis include Naive Bayes, Support Vector Machines (SVM), and Recurrent Neural Networks (RNN). Machine learning models exhibit better performance than rule-based approaches in dealing with complex linguistic patterns and context.

Hybrid models combine both rule-based and machine learning approaches to leverage their respective advantages. These models utilize linguistic rules for initial sentiment classification and then employ machine learning techniques to refine the sentiment analysis process. Hybrid models have shown promising results in achieving higher accuracy and better contextual understanding compared to standalone rule-based or machine learning methods.

The application of opinion mining and sentiment analysis extends beyond customer feedback and brand reputation management. They are also widely used in social media analysis to understand public sentiment towards various topics, events, or public figures. Political surveys often employ sentiment analysis to gauge public opinion before elections. In addition, sentiment analysis is utilized in stock market prediction by analyzing news articles, social media posts, and financial reports.

There are several challenges and limitations associated with opinion mining and sentiment analysis. One common challenge is dealing with ambiguity and sarcasm in text. Since sentiment analysis mainly relies on text-based features, it often struggles to identify and interpret sarcasm or ironic statements accurately. The context of the text is also crucial, as sentiment can change based on the surrounding words and phrases. Additionally, sentiment analysis models might be biased due to imbalanced training data or misclassified samples.

In conclusion, opinion mining and sentiment analysis play a vital role in understanding and quantifying subjective information in textual data.
By harnessing these techniques, businesses can gain valuable insights into customer opinions, market trends, and brand reputation. Ongoing research and advancements in NLP, such as the integration of deep learning and contextual understanding, hold great potential for further enhancing the accuracy and applicability of sentiment analysis in various domains.
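To make the rule-based end of the spectrum concrete, the sketch below scores polarity with a small lexicon and a simple negation rule. The word lists, the negation window and the decision thresholds are invented for illustration; a learning-based system (Naive Bayes, SVM, RNN) would instead induce such evidence from labelled examples.

POSITIVE = {"good", "great", "excellent", "love", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "disappointing"}
NEGATORS = {"not", "never", "no", "hardly"}

def polarity(text, window=3):
    """Rule-based sentiment: count lexicon hits, flipping polarity after a negator."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE or tok in NEGATIVE:
            hit = 1 if tok in POSITIVE else -1
            # Flip the hit if a negation word occurs shortly before it.
            if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
                hit = -hit
            score += hit
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("The support staff were helpful and the product is great."))  # positive
print(polarity("Not a great phone, and the battery life is terrible."))      # negative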
An Approach to Text Mining using Information ExtractionHaralampos Karanikas, Christos Tjortjis and Babis TheodoulidisCentre for Research in Information ManagementDepartment of Computation, UMIST, P.O. Box 88,Manchester, M60 1QD, UKEmail: {karanik, christos and babis}@ AbstractIn this paper we describe our approach to Text Mining by introducing TextMiner. We perform term and event extraction on each document to find features that are likely to have meaning in the domain, and then apply mining on the extracted features labelling each document. The system consists of two major components, the Text Analysis component and the Data Mining component. The Text Analysis component converts semi structured data such as documents into structured data stored in a database. The second component applies data mining techniques on the output of the first component. We apply our approach in the financial domain (financial documents collection) and our main targets are: a) To manage all the available information, for example classify documents in appropriate categories and b) To “mine”the data in order to “discover” useful knowledge. This work is designed to primarily support two languages, i.e. English and Greek.1.IntroductionThe explosive growth of databases in almost every area of human activity has created a great demand for new, powerful tools for turning data into useful knowledge. To satisfy this need researchers from various technological areas, such as machine learning, pattern recognition, statistical data analysis, data visualization, neural networks, econometrics, information retrieval, information extraction etc., have been exploring ideas and methods. All these efforts have led to a new research area often called data mining (DM) or Knowledge Discovery in Databases (KDD). We can define it as the nontrivial extraction of implicit, previously unknown, and potentially useful information from given data (Piatetsky-Shapiro and Frawley [1991]).Until now, most of the work in Knowledge Discovery in Databases (KDD) has been concerned with structured data. However, a lot of information nowadays is available in the form of text, for example in documents, manuals, email, presentations on the Web, and so forth. Unlike the tabular information typically stored in Traditional Databases, this form of information has only limited internal structure. In order to take advantage of all this information we apply Text Mining (TM) technology.2.The innovation of our approachText Mining uses unstructured textual information and examines it in attempt to discover structure and implicit meanings “hidden” within the text. Most approaches to TM apply mining algorithms on labels associated with each document. These labels may be keywords extracted from the document or just a list of words within the document of interest.In our approach the main contribution is that we apply mining algorithms on terms (meaningful sequence of words e.g. Department of Computation) combined with events (meaningful set of terms e.g. take-over, company1, company2) extracted from the documents. We believe that the most characteristic factors, which describe a document, are the terms and events mentioned within the document. It is very important to extract the “right” features in order to have accurate and useful results. Information Extraction (IE) is the technology [6], which involves in this pre-processing step. The resulting document representations are used as input to a clustering algorithm. 
We have developed a clustering algorithm appropriate for categorical data. Using this algorithm we are in a position to discover structure within the document collection. In addition classification techniques are used to further validate the results from clustering.3.Case studyAs we have mentioned above the domain in which we are going to apply the TextMiner approach is the stock market. The domestic capital markets are of the most valuable financial institutions for the global economy. In micro-economic level the capital market offers to companies the opportunity to gather low priced investment capital. In macro-economic level the domestic capital market accumulates significant valuable income, as a result of the daily operations, which through the taxation assists the reduction of the state deficits. In addition in macro-economic level the investments contribute to the improvement of the economy’s competitiveness, create new jobs and forward the exportation of products and services. As a consequence they increase the inflow of exchange for the country.Users of financial news (Financial newspapers, news agencies, etc.), financial analysts, employees of Securities Companies and individual’ users with financial knowledge, have to deal daily with vast amounts of information. All these users know that the progress of the capital market depends a lot on relevant news. That’s why they have to analyse all the news that appear on newspapers magazines and other textual resources. Until now the processing of the unstructured textual databases has to be done manually by the users.Our target is to automate much of the users’ work. Firstly we want to manage information stored in textual databases (collection of documents) and secondly to extract useful knowledge.rmation ExtractionInformation Extraction is the mapping of natural language texts (such as newswire reports, newspaper and journal articles, electronic mail, World Wide Web pages, any textual database, etc.) into predefined, structured representation, or templates, which, when filled, represent an extract of key information from the original text [14]. The information concerns entities of interest in the application domain (e.g. companies or persons), or relations between such entities, usually in the form of events in which the entities take part (e.g. company takeovers, management successions etc.). Once extracted, the information can then be stored in databases to be queried, data mined, summarised in natural language, etc.The first necessary step is the linguistic pre-processing. It consists of a number of linguistic techniques such as tokenization, part of speech tagging, lemmatization etc in order to feed the information extraction system.We aim to extract from the documents, financial terms (Company names, actions between companies, people, places, exchanges etc.) and financial events of interests. We are interested in specific events between specific companies/persons. In other words, a specific event (e.g. take-over) between company A and company B will be used as a feature to label the document as well as the same event (i.e. Take-over) between different companies.In our case we define a number of events within the financial domain. We keep this information in a table called Event Type. 
Each event has a different number of attributes to be described.For example:Event type: Take-over.Date:March 15Company Predator:Neurosoft SA.Company Target:IBM.Type of takeover:FriendlyValue:240,000,000 poundsDocument ID:Doc1…We have described this particular event (Takeover) with these attributes (Date, Company Predator, Company Target, Type of takeover, Value, Document ID…); another event could be described with different attributes.After extracting from the document collection all the events, which concerns us, we fill a table, which has the following structure.Event ID EventTypeDate Company(increasedcapital)CompanyPredatorCompanyTargetType oftakeoverValue DocIDEvent 1TakeoverMarch15NeurosoftSA.Microsoft Friendly240,000,000Doc1Event 2SharecapitalincreaseApril10Microsoft250,000,000Doc2Table 1 Extracted EventsAfter extracting the financial terms and events we are going to construct the table, which will be used as input for the clustering algorithm. Documents in the database can be viewed as records, and the corresponding terms/events (that belong to each document) as attributes of the record. This is demonstrated better in the following table.Documents Extracted financial terms and eventDoc1{Microsoft, Neurosoft, Take-over, Department of computation, …Event1…}Doc2{Microsoft, IBM, Share capital increase, … Event2…}……Table 2 Data input for the Clustering algorithm.5.Clustering algorithmWe apply clustering algorithms in the resulting database in order to discover structure within the document collection. Clustering in TM means the creation of subsets from a collection of documents. A cluster is here defined as a group of documents having features, which are more similar to each other than to the features of any other group. In other words, documents from one cluster share some common features, which distinguish them from the other documents.Clustering does not need any predefined categories in order to group the documents. Thus, the aim of this analysis is to produce a set of clusters in which the internal similarity of documents (inside the cluster) is maximised and the external similarity is minimised.Many algorithms have been applied depending on the data collection and the task to be accomplished. Hierarchical Clustering and the Binary Relational Clustering are of the well known ones. In the former approach, clusters are arranged in cluster trees, where related clusters appear in the same branch of the tree, while the latter creates a flat cluster structure.This automatic analysis can be useful in many tasks. First of all, it can provide an overview of the contents of a large document collection. Another task is identification of hidden structures within groups of objects. The process of finding related information can then become easy and accurate. In addition, new trends (or new policies, depending on the content of the document), which have not been mentioned in other documents, can be discovered in documents. Finally duplicate documents in a collection can be detected. These are just few of the tasks, which can be supported by clustering.Cluster analysis is a traditional statistical technique. Objects can be partitioned into groups, which are easy to discover by algorithms making use of similarity, as measured by the numeric distance between objects. But, not every kind of similarity can be estimated numerically. For example, the distance between the concepts, UMIST and LSE. 
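To connect the extracted features of Table 2 with the clustering step, the sketch below (illustrative only; the document identifiers follow the running example and the helper names are ours) turns each document into a row of Boolean attributes, one per extracted term or event. This is exactly the categorical representation assumed in the rest of this section, and the count of shared features is the raw material from which neighbour- and link-based measures such as ROCK's, discussed below, are built.

# Extracted terms/events per document, as in Table 2.
docs = {
    "Doc1": {"Microsoft", "Neurosoft", "Take-over", "Department of Computation", "Event1"},
    "Doc2": {"Microsoft", "IBM", "Share capital increase", "Event2"},
}

# Boolean document x feature table: True iff the document contains the feature.
features = sorted(set().union(*docs.values()))
table = {d: {f: (f in feats) for f in features} for d, feats in docs.items()}

def shared_features(d1, d2):
    """Number of extracted terms/events two documents have in common."""
    return len(docs[d1] & docs[d2])

print(table["Doc1"]["Take-over"], table["Doc2"]["Take-over"])  # True False
print(shared_features("Doc1", "Doc2"))                         # 1 (only 'Microsoft')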
Even though these distances can be transformed into numbers, any such transformation would be difficult and subjective. So traditional distance-based cluster analysis is not possible to produce meaningful results when processing concepts.In Conceptual clustering clusters are not just collection of entities with numerical similarity. Clusters are understood as groups of objects that together represent a concept. It does not produces only clusters, but also descriptions of the related concepts.For Conceptual clustering we need a set of attribute descriptions of some entities, a description language for characterising clusters of such entities, and a classification quality criterion. Thetarget is to partition entities into clusters in a way that maximises the quality criterion and in the same time to determine general descriptions of these clusters.It is important to mention that in conceptual clustering the properties of cluster descriptions are taken into consideration in the process of determining the clusters. This is a major difference between conceptual clustering and conventional clustering. In conventional clustering method we determine clusters according to a similarity measure. This similarity measure is a function of only the properties of the objects being compared and of no other factor. In contrast conceptual clustering takes into account not only the properties of the objects but also two other factors: The language that the system uses to describe the clusters and the environment, which is the set of neighbouring examples.In our case we consider a document database (textual database) containing financial terms and events per document. These data can be used to cluster the documents such that documents with similar referring patterns (to terms/events) are in a single cluster. For example, one cluster may consist of documents referring to the local market, while another may consist of documents referring to the international, European etc. market. The clusters then can be used to characterize the different documents groups, and these characterization can be used in other data mining tasks, such as relevance analysis, classification etc. In addition it is very important that the user can have a quick overview of a large document collection.In our database the attributes of the data points are non-numeric. Documents in the database can be viewed as records with Boolean attributes, each attribute corresponding to a single financial term/event. The value of the attribute is True if and only if the document contains the corresponding term/event; otherwise it is False. Boolean attributes are a special case of categorical attributes. Of course categorical attributes are not always true or false, but could have any finite set of values.Traditional clustering algorithms that use distances or vector products methods between points for clustering are not appropriate for Boolean and categorical attributes. This observation forces us to apply the general framework of the ROCK algorithm [17] and the concept of links in orderto overcome the limitation of the traditional algorithms according to categorical data. As we have mentioned above we are going to use categorical attributes in order to characterize the documents.An additional important factor is the concepts of Relative Closeness (RC) and Relative Interconnectivity (RI) presented in Chameleon algorithm [16]. Existing clustering algorithms find clusters that fit some static model. 
Although effective in some cases, these algorithms can break down; that is, cluster the data incorrectly, if the user does not select appropriate static-model parameters. The problem is that they use this static model of the clusters and they do not use information about the nature of individual clusters as they are merged. Rock belongs in the scheme of algorithms that ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters. This last observation leads us to inherit these concepts (RI, RC) and to try combining them with the Rock algorithm, which will be the general framework.In our approach we apply the clustering algorithm Rock [17]. R ock introduces a novel concept of clustering that is based on links in order to measure the similarity between a pair of data points. Clustering points based on only the closeness or similarity between them is not strong enough to distinguish two “not well-separated” clusters because it is possible for points in different clusters to be neighbors. As link(pi, pj) is defined the number of common neighbours between pi and pj. From this definition follows that if link(pi, pj) is large, then it is more probable that pi and pj belong to the same cluster.The algorithm (Figure 1) accepts as input the set S of n sampled points to be clustered, and the number of desired clusters k.Procedure Cluster (S, k)Begin1.link := compute_links(S)2.for each s ∈ S do3.q[s] := build_local_heap(link, s)4.Q := build_global_heap(S, q)5.while size(Q) > k do {6. u := extract_max(Q)7. v := max(q[u])8. delete(Q,v)9. w := merge(u,v)10. for each x ∈ q[u] ∪ q[v] do {11. link[x, w] := link[x, w] + link[x, v]12. delete(q[x] , u ); delete(q[x],v)13. insert(q[x], w, g(x, w));insert(q[w], x, g(x, w))14. update(Q, x, q[x])15. }16. insert(Q, w, q[w])17. deallocate(q[u]); deallocate(q[v])18.}EndFigure 26.Classification algorithmOur approach makes further use of KDD techniques to exploit the knowledge discovered after applying clustering. More specifically classification is applied using the descriptions derived by clustering as the class attribute. This gives further insight of the document collection and possibly verifies the previous findings.By applying decision tree classification algorithms on the templates under examination, we can retrieve hierarchies of concepts and test the validity of the discovered descriptions. This can possibly iterate a further clustering step following refinement of the database by feature subset selection.7.Greek, a particular languageModern Greek is a language difficult to describe computationally. One reason is that several different inflectional endings, prefixes, infixes and a stress marker participate to the inflection orconjugation of Modern Greek words, resulting to 4-7 word-forms for a Noun and up to 250 word-forms for a Verb. Another reason is that, the sentence constituents can be found at various positions (e.g. subject, verb, and objects). The syntactic role of word-forms can, most of the time, be extracted from the syntactic information attached to their inflectional endings, without considering their position in the sentence. One last reason, which makes Greek language difficult to describe computationally, is the existence of numerous characteristics carried over from Ancient Greek. 
Many old forms are in use interchangeably to new forms, along with old syntactic structures and archaic expressions.Modern Greek inflection system exhibits a narrower range of morph syntactic ambiguity, compared to the corresponding ambiguity revealing in low inflected languages (e.g. in English almost every Noun can be a verb), and is mainly attributed to the frequent occurrence of functional words, such as Articles and Pronouns.8.Conclusion and future workIn this paper, we introduced our approach for data mining textual data (Text Mining). We presented TextMiner, a text-mining tool designed and developed at UMIST. We are currently in the stage of optimising the performance of the tool and also currying out extensive testing. At the same time we carry out tests in other domains.Our innovation is that we label the documents, in order to cluster them, by using the terms (sequence of words with particular meaning e.g. Department of Computation) and events (set of terms with a particular meaning) mentioned within the documents. Additionally we have developed a clustering algorithm appropriate for categorical data. We apply further mining algorithms in the extracted information.In the future, our target is to extract valuable knowledge by correlating information from the structured and the unstructured databases (see overall architecture in figure 3). For example in the financial case study we will investigate how the share price behave according to the announcement of the event “split”. We believe that the need for analysing data from diverse types of data sources is significant.Unstructured Database (Collection of Documents)StructuredDatabase with financial dataevents of interestWe provide facilities that will automate many of the tasks that the user had to do manuallyon documents (Information management)Classification Clustering Summarization Correlationextracted data from knowledge.Input 1Input 2Output 1Figure 3References[1] Cutting D., Karger D., Pedersen J., and Tukey J. (1992), “Scatter/gather: A cluster-based approach tobrowsing large document collections”, In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.[2] El-Hamdouchi A. and Willet P. (1986), “ Hierarchic document clustering using ward’s method”, Inproceedings of the Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.[3] El-Hamdouchi A. and Willet P. (1989), “Comparison of hierarchic agglomerative clustering methodsfor document retrieval”, The Computer Journal, 32(3)[4] Feldman R., and Hirsh h. (1996), “Exploiting Background Information in Knowledge Discovery fromText”, Journal of Intelligent Information Systems.[5] Feldman R. and Dagan I. (1995), “Knowledge Discovery in Texts”, In Proceedings of the FirstInternational Conference on Knowledge Discovery.[6] Grishman R. (1997), “Information Extraction: Techniques and Challenges”, International SummerSchool, SCIE-97[7] Gaizauskas R. and Humphreys K. (1997), “Concepticons vs. Lexicons: An Architecture forMultilinqual Information Extraction”, International Summer School, SCIE-97[8] Hahn U., and Schnattinger K. (1997), “Deep Knowledge Discovery from Natural Language Texts”, InProceedings of the 3rd International Conference of Knowledge Discovery and Data Mining.[9] Lent B., Agrawal R. and Srikant R. (1997), “Discovering Trends Text Databases”, In Proceedings ofthe 3rd International Conference on Knowledge Discovery.[10] Michalski R., Bratko I., and Kubat M. 
(1998), "Machine Learning and Data Mining", Wiley.
[11] Piatetsky-Shapiro G. and Frawley W. (Eds.) (1991), "Knowledge Discovery in Databases", AAAI Press, Menlo Park, CA.
[12] Rajman M. and Besancon R. (1997), "Text Mining: Natural Language Techniques and Text Mining Applications", In Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Chapman & Hall IFIP Proceedings series, Leysin, Switzerland, Oct 7-10, 1997.
[13] Raymond T. Ng and Jiawei Han (1994), "Efficient and Effective Clustering Methods for Spatial Data Mining", In Proc. of the VLDB Conference, Santiago, Chile, Sept 1994.
[14] Wilks Yorick (1997), "Information Extraction as a Core Language Technology", International Summer School, SCIE-97.
[15] Willet P. (1988), "Recent Trends in Hierarchic Document Clustering: A Critical Review", Information Processing and Management.
[16] Karypis G., Eui-Hong (Sam) Han and Vipin Kumar (1999), "Chameleon: Hierarchical Clustering Algorithm Using Dynamic Modeling", Computer Magazine, August 1999.
[17] Guha Sudipto, Rastogi Rajeev, and Kyuseok Shim, "Rock: A Robust Clustering Algorithm for Categorical Attributes".
[18] G. Koundourakis, M. Saraee and B. Theodoulidis, "Data Mining in Temporal Databases".