当前位置:文档之家› top-k闭序列图模式挖掘(IJIEEB-V8-N4-1)

top-k闭序列图模式挖掘(IJIEEB-V8-N4-1)

top-k闭序列图模式挖掘(IJIEEB-V8-N4-1)
top-k闭序列图模式挖掘(IJIEEB-V8-N4-1)

I.J. Information Engineering and Electronic Business, 2016, 4, 1-9

Published Online July 2016 in MECS (https://www.doczj.com/doc/ee12682656.html,/)

DOI: 10.5815/ijieeb.2016.04.01

Top-k Closed Sequential Graph Pattern Mining

K. Vijay Bhaskar

GITAM University/CSE, Visakhapatnam, 530045, India

E-mail: vbreddy.vijay@https://www.doczj.com/doc/ee12682656.html,

Dr. R.B.V Subramanyam

NIT Warangal/CSE, Warangal, 506004, India

E-mail: rbvs66@nitw.ac.in

Dr. K. Thammi Reddy

GITAM University/CSE, Visakhapatnam, 530045, India

E-mail: thammireddy@https://www.doczj.com/doc/ee12682656.html,

S. Sumalatha

NIT Warangal/CSE, Warangal, 506004, India

E-mail:katam.suma@https://www.doczj.com/doc/ee12682656.html,

Abstract—Graphs have become increasingly important in modeling structures with broad applications like Chemical informatics, Bioinformatics, Web page retrieval and World Wide Web. Frequent graph pattern mining plays an important role in many data mining tasks to find interesting patterns from graph databases. Among different graph patterns, frequent substructures are the very basic patterns that can be discovered in a collection of graphs. We extended the problem of mining frequent subgraph patterns to the problem of mining sequential patterns in a graph database. In this paper, we introduce the concept of Sequential Graph-Pattern Mining and proposed two novel algorithms SFG(Sequential Frequent Graph Pattern Mining) and TCSFG(Top-k Closed Sequential Frequent Graph Pattern Mining). SFG generates all the frequent sequences from the graph database, whereas TCSFG generates top-k frequent closed sequences. We have applied these algorithms on synthetic graph database and generated top-k frequent graph sequences.

Index Terms—Data mining, graph mining, frequent sequential patterns, closed sequential patterns.

I.I NTRODUCTION

Frequent graph patterns are substructures that appear in a graph data set to a frequency not less than a user-specified threshold [9]. Structural forms such as subtrees, subgraphs, sublattices are referred as substructures. A frequent structural pattern is a substructure that appears frequently in a graph database. Sequential pattern mining, the mining of frequently occurring sub-sequences as patterns, was introduced in [13] and has become an important problem in data mining. Sequential pattern mining discovers frequent subsequences as patterns. According to the GSP algorithm [14], sequential pattern mining is to find all sequences whose support is greater than the user-specified minimum support, such sequence is called a frequent sequence. In SPADE [8], the set of all frequent sequences is discovered by reducing database scans and it minimizes I/O costs. The GSP and SPADE follow candidate generation and test approach.

The PrefixSpan algorithm [5] is a pattern-growth approach to sequential pattern mining and it follows divide-and-conquer strategy. BIDE [6] is an algorithm used for mining frequent closed sequences without candidate generation. These algorithms [5,6] grow patterns by constructing projected databases to reduce the search space for a pattern. In this paper, we investigated the problem of mining sequential patterns in graph databases. Our approach finds sequential patterns occurring in graph databases.

Fig.1. Three graphs where each vertex represents a web page

In ―Fig. 1‖each node represents a web page. Let us consider that the edges of the given graphs represent the order of visiting the web pages. For example, Graph1 represents the sequence q,s,t. Similarly Graph2 represents the sequence q,r,t and Graph3 represents the sequence q,p,t. Observing the three given graphs, it is clear that page t is accessed after page q(not immediately after q). The resulting pattern is which is not a subgraph. Given the above three graphs as input, any existing subgraph mining algorithm will not generate as a pattern. Many graph mining algorithms exist to find the frequent subgraphs, maximal frequent subgraphs, closed frequent subgraphs, and constraint-based closed frequent subgraphs. The sequence is not considered as an

output pattern by the existing graph mining algorithms. Though is not a subgraph, it represents a pattern showing a sequence q followed by t. In this paper, we proposed an algorithm to find sequential patterns from graph databases known as Sequential Graph-Pattern Mining.

We proposed SFG(Sequential Frequent Graph pattern mining) algorithm to mine frequent sequences in a graph database. As the number of edges in a graph increases, the number of sequential patterns also increases exponentially. This requires further analysis of frequent sequential graph patterns. To overcome this difficulty, we proposed TCSFG(Top-k Closed Sequential Frequent Graph pattern mining) algorithm to generate top-k closed sequential graph patterns.

The rest of the paper is organized as follows. Section II describes the related work in subgraph mining. Section III present problem definition and section IV describe SFG, TCSFG algorithms. Our experimental results and performance analysis are presented in section V and our work is concluded in section VI.

II. R ELATED W ORK

Many algorithms exist in the literature for mining frequent graph patterns [1,7,9,16,18]. AGM [1] algorithm efficiently mines frequently appearing induced subgraphs from a graph data database. Experimental results in [1] show that AGM finds all frequent induced subgraphs containing 300 chemical compounds in 40 minutes to 8 days, as the minimum support threshold varied from 20% to 10%. FSG[9] algorithm uses a sparse graph representation which minimizes both storage and computation. The performance of FSG was worse where the number of vertex and edge labels was small. AGM [1] and FSG [9] are Apriori-based algorithms for mining frequent substructures from graph data. Apriori-based algorithms generate subgraph candidates from frequent subgraphs and prunes false positives. These algorithms [1,9] suffer from the problem of subgraph isomorphism test and candidate generation. Candidate generation and pruning false positives are costly.

The gSpan [18] was the first algorithm to discover all the frequent subgraphs without candidate generation and it prunes false positives. gSpan discovers frequent substructures without candidate generation. A new lexicographic ordering among graphs was built in [18] and it maps each with a unique minimum DFS code as its canonical label. gSpan finds both frequent subgraphs and frequent induced subgraphs. It [18] uses rightmost extension technique to minimize the generation of the same subgraphs and explores the depth-first search in frequent subgraph mining.

Frequent subgraph mining is a challenging issue due to the generation of an exponential number of frequent subgraphs. Closed graph pattern mining was introduced in [16] to mine only closed frequent graph patterns. A graph g is closed if there do not exist any proper super graph of g with the same support as g. CloseGraph [16] mines closed frequent graph patterns using the concepts of equivalent occurrence and early termination to prune the pattern search space. It [16] reduces unnecessary subgraphs and also increases the efficiency of the mining process in the presence of large graph patterns. TSP [12], TFP [4] and CloSpan [17] are the recent work for closed sequential pattern mining. CloSpan is the first algorithm to solve closed sequential pattern mining problem. It [17] mines frequent closed sequential patterns and produces significantly less number of discovered sequences. The CloSpan mines long frequent sequences with low minimum support and without any information loss. Mining top-k frequent closed patterns without minimum support was introduced in [4]. Top-down and bottom-up combined Fp-tree mining strategy was developed in [4] to raise the support and to discover closed frequent patterns. TSP algorithm mines top-k frequent closed sequential patterns of length no less than min_len, the minimum length of each pattern. It [12] doesn’t require the minimum support threshold, it makes use of the length constraint and outperforms the closed sequential pattern mining algorithm that make use of minimum support threshold.

Several algorithms were proposed previously ranging from mining graph patterns with constraints [2,11,19] and without constraints [6,18] for mining closed graph patterns [16].Frequent graph-pattern mining may result in a large number of patterns. To make the mining more effective, constraints are often used to confine the pattern search. In gPrune algorithm[2] the frequent graph patterns are generated based on the constraints such as density and diameter. The Density of a graph is defined as the ratio of the edge set to vertex set. The diameter of a graph is the maximum length of the shortest path over all pairs of vertices. The problem of incorporating structural constraints in mining frequent graph patterns was solved in gPrune.

Finding closed frequent graphs with connectivity constraints in relational graphs was introduced in [19]. Two approaches, namely CLOSECUT and SPLAT were proposed to speed up the mining process. The former was pattern growth approach and it has better performance on patterns with high support and low connectivity, later approach was pattern reduction approach and it achieves better performance for the high connectivity constraints. These [19] algorithms mines closed frequent graphs with connectivity constraints in relational graphs. DS-search algorithm [11] mines frequent subgraphs of limited diameter and symmetry. It [11] employs the tree decomposition structure of database graphs during the mining process and generates more structurally interesting patterns in the database. Recently two sequential techniques[3,10] were developed to solve the problem of graph-coloring. Graph-coloring concept [10 ] was used in processor allocation to represent the busy processor and the busy processors are mapped in the graph using different colors. The authors in [3] investigated the problem of Star coloring and they used DNA sequence to construct a solution space for the star coloring problem

Two algorithms RP-FP and RP-GD were proposed in

[7]. These algorithms mine a representative set that summarizes frequent subgraphs. In these algorithms [7] a representative set RS is mined such that any frequent subgraph is covered by a representative in RS, and the value of |RS| is minimized. RP-FP and RP-GD algorithms have been proposed to summarize graph patterns efficiently using the concept of δ-cover and generates δ-jump patterns for a user specified δ. Even though there exist many subgraph mining algorithms, the problem of sequential pattern mining of graphs has not been investigated in the literature.

In this paper, we proposed a new algorithm SFG to mine sequential patterns from a graph database using pattern growth approach. We also extended our algorithm to mine top-k closed sequential patterns. We demonstrate that generating top-k closed frequent sequences reduce the number of output graphs without losing much information.

III. P ROBLEM D EFINITION

In this section, we define sequential graph pattern and frequent sequential graph pattern mining problem.

A Labeled directed Multigraph can be represented as a tuple, G= (V,E,L,F) where V denotes a set of vertices, E denotes a set of edges, L denotes a set of labels and F is a labeling function that assigns labels to the vertices and edges. Every edge is a 4 tuple , where s is the source vertex, d is destination vertex, Gid is Graph id and lb is an edge label. A sample Graph data set is shown in ―Fig. 2‖.

Fig.2. A sample graph data set

Definition 1 (Graph Sequence)

An ordered collection of edges in a graph is called a graph sequence. Every edge is assigned a time constant (time of the creation of the relationship between the vertices) as a label. A Graph Sequence can be represented as,

GS=<(v a,v b),(v b,v c),….(v l,v m),(v m,v n)> where v a, v n are the source and destination of the Graph sequence and each (v i,v j) represents an edge from v i to v j. Compact representation of Graph sequence is given as GS=.

Definition 2 (Length of the Graph Sequence).

Given a Graph Sequence GS = Where each v i represents a vertex. Length of the Graph sequence is the number of edges in the sequence.

Definition 3 (Support of Graph sequence)

Support of Graph Sequence GS is the total number of graphs containing the sequence GS as a subsequence.

Fig.3. A sample graph

Graph sequence for the graph in ―Fig. 3‖ is

<(p-q),(q-r),(r-p),(p-s)> where t1,t2,t3,t4 are edge labels such that t1. Length of the sequence is 4.

Definition 4 (Projected database of a sequence S)

Let S be a sequential pattern in a Graph database GD. The S projected database is the collection of subgraph sequences in GD with S as the first occurring sequence.

Definition 5 (Projected database of a vertex)

Given a Graph Database GD = {G1, G2,… Gn}, The projected database of a vertex V is the set of subgraphs in which the sequence of the subgraph starts with V.

PD v = { SG i / i ? 1,2,..n and v is the starting vertex in each SG i}.For example, in the given graph database, the sequence that starts with vertex p as the source forms projected database of p. In ―Fig. 2‖edge 1 of Graph 1, edge 3 of Graph 2, edge 2 of Graph 3, edge 1 of Graph 4 and edge 2 of Graph 5 forms the sequence starting with p. Instead of storing the projected database separately for every vertex, considering the space constraints, we store only the pairs. Hence p-projected database is <(1,1),(2,3),(3,2),(4,1), (5,2)>

Definition 6 ( Support of a vertex)

Support of a vertex V is the number of sequences that start with V. A graph may contain multiple subsequences starting with V, but the count is only 1 for every graph. Definition 7 (Canonical form)

Given a multigraph G i with n vertices, m edges and a graph sequence GS , Canonical form of G i is denoted as (v1,v2,i,1),(v2,v3,i,2),….(v n-1,v n,i,m) Definition 8 (Sequential Graph Pattern)

Given a graph sequence, a Sequential Graph Pattern is all possible subsequences.

For the Graph given in ―Fig. 3‖ Graph Sequence is

, the possible sequential graph patterns are

,,,,,,,< pqrs>,,,,,,,,,,,,,

Definition 9 (Frequent Sequential Graph Pattern Mining) Given a Graph database and minimum support, Frequent Sequential Graph-Pattern Mining is to find all the frequent sequences from the graph database.

Table 1. Canonical representation of Graph database

Table 2. Projected database of vertices

Definition 10 (Top-k Closed Graph Sequence)

A Graph sequence S is said to be a closed graph sequence if S is frequent and there is no proper super sequence with the same support. A closed graph sequence is said to be a top-k closed sequence of minimum length l if there exists no more than (k-1) closed graph sequences whose length is at least l whose support is higher than that of S.

IV. A LGORITHMS

In this section, we present SFG and TCSFG algorithms to generate frequent sequential graph patterns and Top-k closed frequent sequential graph patterns.

A. Algorithms Description

A Graph database is given as input to each of these algorithms. The concept of SFG algorithm generates set of sequential frequent graph patterns. To reduce redundant patterns, we followed an order based on the frequency of sequences in the database. Sequences with higher frequency are grown first, followed by sequences of lower frequency.

B. SFG

SFG algorithm generates set of sequential frequent graph patterns. The main steps include the generation of the set of all frequent length-1 sequences S’, pruning infrequent edges and vertices from Graph database D, sorting S’ in descending order of its fre quency, and growing each edge in the sequence.

Algorithm 1. SFG

Input: A Graph database GD, a minimum support threshold min_sup.

Output: Set of sequential frequent graph patterns SP. 1: Scan the Graph database once,

a.Find the projected database of each vertex v in

the input vertex set V and let it be PD v. Find

the support count of each vertex.

b.Find the set of vertices V u that can be visited

from each vertex v and also find the support

count of the sequence where u ? V u . 2: Sort the vertex set V in descending order of their support count and insert them in a priority queue Q. Do not insert the vertex whose support count is less than the min_sup.

3: For every vertex v in the priority queue Q

4: For every vertex u ? V u

5: a. If the support count of the sequence

is greater than or equal to

minsup threshold then add the

sequence to S.

b. SP ← SP U S.

6: Sort S in descending order of their supports. 7: SequenceMining (PD v, S)

In Line 1 of the algorithm, single scan of the graph database is done. The projected database of each vertex and their support counts are calculated. For every sequence that grows with a vertex, it is ensured that only its projected database is scanned. This reduces the search space and scan time. Find the set of vertices that can be visited from each vertex v and the support count of length-1 sequences. Line 2 of the algorithm removes the projected database of an infrequent vertex. The vertex whose support count is less than the min_sup threshold is an infrequent vertex. The remaining vertices are sorted in descending order of their support count. Lines (3-5) finds length-1 frequent sequences and calls the subprocedure SequenceMining to find the frequent sequences of length

more than one. SFG algorithm iterates until there are no vertices that can grow the sequence.

Subprocedure1. SequenceMining(PD,S)

Input: Projected database PD, Sequential Pattern S Output: Set of sequential frequent graph patterns SP. 1: if S is empty then return.

2: For every sequence S? in S

PD s←Find_projected_database(PD,S?)

3: Scan the projected database PD s and find the vertices visited from the sequence S? and their

support counts.

4: For every vertex v visited from S?,

a. Add the sequence to S?? only if the

support count of the sequence satisfies

the min_sup threshold.

b. SP← S??.

5: SequenceMining(PD s, S?’,min_sup) SequenceMining is a recursive procedure that grows every sequence in S. Lines 2-3 of the procedure finds the projected database of the sequences in S and scans the projected database of each sequence to find the vertices visited from the sequence and the support count of the new sequence. Line 4 of the procedure grows the sequence and returns if there is no sequence to be grown. The recursive procedure finds length-2 frequent sequences in the first run, length-3 frequent sequences in the second run and so on. It finds the set of all frequent sequences of length more than one and performs a one edge growth of the sequence in each recursive call. Subprocedure2. Find_projected_database(PD,S?) Input: Projected database PD, Sequential Pattern

S?, where S?=

Output: PDS, where PDS is a Projected database of S?. 1: PDS←Φ

2. for every graphid, edgeid pair (i,j) in PD

3: Scan graph i from j to find k, where k is the first occurrence of v n.

4: PDS←PDS ? (i,k)

5: return PDS

Example: Given the graph sequence database as shown in Table 1 with min_sup=2, frequent sequential patterns are mined in the following manner.

1.Scan the graph database once and find the

projected database PD v of each vertex and their

support counts. The database is divided into 5

partitions, the first partition starts with p as the

source vertex, the second partition starts with q as

the source and so on.

2.Find the vertices visited from each vertex and their

support count. Find length 1 frequent graph

sequences. Projected database and frequent length

1 graph sequences are shown in Table 2.

3.Vertices visited from vertex p and their support

count is . Length 1 frequent

graph sequences are .The

sequence represents that vertex r is visited

7 times after vertex p. Similarly represents

that vertex q is visited 5 times after p and so on. 4.Descending order of vertices based on their

support count is ,,,,.

Call the sub procedure SequenceMining with the

Projected database of the vertex p and the

sequences grown from p, ,,, as

input.

5.Grow the sequence with the highest support first.

In our example, grow the sequence first.

Scan the projected database of p to construct the

projected database of . Projected database of

p is { (1,1), (2,3), (3,2), (4,1), (5,2)}.

Subprocedure2 is called to find the projected

database of the sequence as shown below.

a.Scan (1,1)as follows, the first occurrence of p as

a source vertex in graph 1 from the edge 1 is 1.

Now check the first occurrence of r as a source

after the first occurrence of p, it is found at 3.

Note the graph-id, edge-id pair (1,3).

b.Similarly, scan (2,3) as follows, the first

occurrence of p as a source vertex in graph 2

from edge 3 is 3. Now check the first occurrence

of r as source after the first occurrence of p, it is

not found. This indicates that the sequence

cannot be grown in graph 2.

c.Similarly, scan (3,2) and the sequence

cannot be grown in graph 3.

d.Scan (4,1) and note the pair (4,3), the first

occurrence of r as a source in graph 4 after first

occurrence of p is 3.

e.Scan (5,2) and note the pair (5,4), the first

occurrence of r as a source in graph 5 after first

occurrence of p is 4.

Now the list of pairs noted in the steps a to e form the projected database of the sequence , {(1,3),(4,3),(5,4)}.

6.Scan the projected database of and find the

vertices visited from .

pr , vertex t is visited 3 times

after , vertex q is visited 2 times , vertex p is

visited 1 time, and vertex s is visited 1 time.

Remove the infrequent sequences , .

Length-2 frequent sequences obtained so far in

descending order of their supports are

.

7.Grow the sequences , in the similar

manner.

All the frequent sequential graph patterns generated from the vertex p are shown in ―Fig 4‖. The Time complexity of SFG depends on the time taken for scanning a graph database and a number of subsequences generated. It is given by t g+ O ( st p), where t g is the time taken to scan graph database, s is the number of subsequences and t p is the time taken to scan the

projected database of each subsequence.

C. TCSFG

The number of sequential patterns generated for a given min_sup increase exponentially with an increase in the average number of edges in the graph database. To reduce the output search space, we proposed TCSFG algorithm which generates top-k closed sequential graph patterns.

Given a Graph database, minimum support, and k number of patterns, Top-k Closed Sequential Graph Pattern Mining is to find only top-k closed frequent sequences from the graph database.

Fig.4. Frequent sequential graph patterns generated from vertex p.

Algorithm 2. TCSFG

Input: A Graph database GD, minimum support threshold min_sup, minimum length of the

sequence min_len, number of closed patterns

k .

Output: Top-k closed sequential patterns.

1: Scan the Graph database GD to find projected databases PD of all the vertices and their support counts.

2: Sort the vertices in descending order of their supports.

3: Find all frequent Length-1 sequences S and add all of the sequences to the priority queue, Q.

4: Top-kClosedSeqMining(PD,Q)

Line 3 of algorithm2 sorts all frequent length 1 sequences in descending order of their supports. In our example, out of 16 sequences, 15 are frequent as shown in Table 2. Line 4 of algorithm2 calls the sub-procedure Top-kClosedSeqMining to mine top-k sequences whose length is not less than min_len.

Subprocedure3.Top-kClosedSeqMining(PD,S)

Input: Projected database PD, Set of Sequential Patterns S

Output: Top-k Closed sequential frequent graph patterns CSP.

1: if S is empty then return

2: if a number of closed sequential graph patterns

is equal to k then return.

3: For every sequence S? in S

PD s←find_projected_database(PD,S?,

min_sup)

4: Scan the projected database PD s and find the vertices visited from the sequence S? and their support counts.

5: For every vertex v visited from S?,

Add the seq uence to S?? only if the

support count of the sequence satisfies

the min_sup threshold.

6. If there i s no sequence S?? with support equal

to the support of S? and if the length of the

sequence S? is not less than l then add S? to set

of closed patterns.

CSP←CSP U S?.

7: Top-kClosedSequenceMining(PD s,S?’)

Let s? be the number of closed sequences obtained and s?<

Fig.5. Top-k closed graphs

Example: Given the graph sequence database as shown in Table 1 with min_support=2, min_len =3 , k=6, top-k closed sequential patterns are mined as follows:

1.Descending order of frequent length 1 Sequences

are,

,,,,,,

>,,,,,,,

2>,.

2.Select the first sequence and grow.

Length2, length3, sequences generated from

are ,,, . Among

these 4 sequences only and are

added to set of closed patterns. is not

closed as it is having a super sequence

with the same support. is closed, but the

length of the sequence is less than l.The Closed

patterns generated from the sequence are

highlighted as shown in ―Fig.4‖.

3.The next sequence to be grown is . Closed

Sequences whose length greater than or equal to l

are , , , .

The algorithm stops as the number of sequences

obtained till now are 6.

The Top-k closed output graphs are shown in ―Fig. 5‖.

V. R ESULTS

We implemented the SFG and TCSFG algorithms and tested them on synthetic dataset produced by a graph database generator [15]. It is based on the IBM Quest Synthetic Data Generation Code for Association and sequential patterns.

The graph generator generates the data sets based on the four parameters: D be the total number of graphs in the database, V be the number of vertex labels and E be the number of edge labels, T be the average size of each graph based on the number of edges and M be the average density of each graph which is defined as the number of edges in the graph divided by the number of edges in a complete graph. ―Fig. 6‖shows the result where the size of the data set (D) is varied between 100 and 1000 graphs. Other values of the parameters used are: V = 20, E = 20 , T = 20 and M=0.3.

―Fig. 6‖and ―Fig. 7‖shows the variations in the number of sequential graph patterns generated and the time taken by the SFG and TCSFG algorithms as the minimum support is varied. This shows that in the case of SFG algorithm the growth of frequent sequences increases exponentially with reduced minimum support and hence more analysis time. This might result in less scalability for large graphs because the number of subsequences increases exponentially.

Fig.6. Number of patterns generated with respect to change in

minimum support

These results point us to the problem of reducing these output sequences by generating top-k closed sequences as we discussed in TCSFG. This gives us an idea of how important it is to reduce the number of frequent sequences in the output using constraints, some of which are based on the support, graph structure and generating summarized patterns. These constraints are added in a TCSFG algorithm to generate top-k closed graph sequences. The running time and the number of patterns generated for TCSFG for the same input database is reduced by a factor when compared to SFG for smaller values of minimum support. ―Fig. 8‖ shows the running time of SFG and TCSFG algorithms with varying number of graphs. During this experiment, the minimum support threshold is kept 20% of the size of the graph data set. These experimental results show that top-k closed patterns are generated in less time compared to all the sequential patterns with varying number of graphs as input. We also tested performance of the TCSFG algorithm by changing the value of k. We found that running time of TCSFG algorithm is varying linearly with the k value. ―Fig. 9‖ shows the running time of the TCSFG algorithm for different values of k.

Fig.7. Performance of SFG and TCSFG

Fig.8. Running time with 20% minimum support

Fig.9. Performance of TCSFG on different k

As the size of the graph database increases, the number of frequent graph sequences increases much faster than the number of frequent closed graph sequences. The effectiveness of an algorithm depends on the number of useful patterns generated. As shown in the results, useful patterns without much loss of information can be obtained using TCSFG algorithm.

VI. C ONCLUSION

In this paper, we studied the problem of mining sequential patterns and closed sequential patterns in large graph data sets. To the best of our knowledge, the problem of mining closed graph sequence is not dealt with much and this is the first piece of work to mine Top-k closed sequential graph patterns. We proposed two algorithms, SFG generates all the frequent sequences from the graph database, whereas TCSFG generates top-k frequent closed sequences. These algorithms use the concept of projected database for a graph to reduce the search space and an order based on the frequency of the patterns to generate the set of all patterns. These algorithms were verified on a synthetic database [15]. Based on this study, we conclude that mining top-k closed sequential graph patterns are preferable than the traditional closed graph mining. We further extend our study to push constraints in generating top-k closed sequential graph patterns.

R EFERENCES

[1] A. Inokuchi, T. Washio, and H. Motoda. ―An apriori-

based algorithm for mining frequent substructures from

graph data,‖In Proc. PKDD, ser. LNCS, Springer,

Vol.1910,pp. 13-23, July 2002. ―doi:10.1007/3-540-

45372-5_2‖.

[2] F. Zhu, X.Yan, J. Han, and P. S.Yu, ―GPrune: A

constraint pushing framework for graph pattern mining,‖

In Proc. PAKDD, ser. LNCS, Springer, vol. 4426, pp.

388–400, May 2007. ―doi: 10.1007/978-3-540-71701-

0_38‖.

[3]G. Sethuraman, and Kavith a Joseph. ―Star Coloring

Problem:The DNA Solution,‖ International Journal of

Information Technology and Computer Science, Vol. 4,

No.3, pp.31-37, Apr.2012.―doi:

10.5815/ijitcs.2012.03.05‖.

[4]J. Han, J. Wang, Y. Lu, and P. Tzvetkov. ―Mining top-k

frequent closed patterns without minimum support,‖. In Proc. ICDM 2002, Maebashi, Japan, pp. 211-218, Dec.

2002. ‖doi:10.1109/ICDM.2002.1183905‖.

[5]J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q.

Chen, U.Dayal, and M.C.Hsu. ―Mining sequential patterns by pattern growth: Prefix span approach,‖IEEE Transactions on Knowledge and data engineering, Vol.16, No.11,pp.1424-1440,Nov2004, ―doi:10.1109/TKDE.2004.

77 ‖.

[6]Jianyong Wang and Jiawei Han, ―Bide:Efficient mining of

frequent closed sequences,‖In Proc. of 20th IEEE international conference on Data Engineering ,pp. 79-90, ―doi:10.1109/ICDE.2004.1319986‖.

[7]Jianzhong Li, Yong Liu, and Hong Gao ―Efficient

Algorithm for Summarizing Graph Patterns,‖IEEE Transactions on Knowledge and Data Engineering, Vol.23, Issue:9,pp.1388-1405,2011,―doi:10.1109/TKDE.

2010.48‖.

[8]M. Zaki, ―SPADE:An Efficient Algorithm for Mining

Frequent Sequences,‖Machine Learning, Kluwer Academic Publishers, Vol.42, issue:1, pp.31-60, 2001, ―doi: 10.1023/A:1007652502315‖.

[9]M. Kuramochi and G. Karypis, ―Frequent Subgraph

discovery,‖In Proc. 2001. Int. conf. Data Mining 2001, pp. 313-320, San Jose, CA, November 2001, ―doi:

10.1109/ICDM.2001.989534‖.

[10]Mohammed Hasan Mahafzah. ―An Efficient Graph-

coloring Algorithm for processor Allocation,‖

International Journal of Information Technology and

Computer Science, Vol.5, No.7, pp. 43-48, June 2013,

―doi: 10.5815/ijitcs. 2013.07.05‖.

[11]Natalia Vanetik. ‖Mining Graphs with Constraints on

Symmetry and Diameter,‖ In WAIM 2010 International

Workshops, Vol. 6185,Springer, pp. 1-12, July 2010, ―doi:

10.1007/978-3-642-16720-1_1‖.

[12]Petre Tzvetkov, Xifeng Yan, Jiawei Han. ―TSP: Mining

Top-K Closed Sequential Patterns,‖ Third IEEE

International Conference on Data mining, pp. 347-354,

November 2003, ―doi:10.1109/ICDM.2003.1250939‖. [13]R. Agrawal and R. Srikant.‖ Mining sequential patterns‖.

In ICDE’95, Taipei, Taiwan, pp. 3-14, Mar. 1995.

[14]R. Srikant and R. Agrawal. ―Mining sequential patterns:

Generalizations and performance improvements‖. In

EDBT’96 Proceedings of the 5th International Conference

on Extending Database Technology: Advances in

Database Technology,pp.3-17,―doi:10.1007/

BFb0014140‖

[15]Synthetic graph generated by IBM Quest Synthetic Data

Generation Code for Associations and Sequential Patterns.

[https://www.doczj.com/doc/ee12682656.html,t.hk/graphgen/].

[16]X. Yan and J. Han. ―CloseGraph: Mining closed frequent

graph patterns‖. In KDD’03, Washington, D.C., Aug ust

2003, ‖doi:10.1145/956750.956784‖.

[17]X. Yan, J. Han, and R. Afshar. ―CloSpan: Mining closed

sequential patterns in large datasets,‖In SDM’03, San

Fransisco, CA, pp. 166-177 May 2003.

[18]Xifeng Yan, Jiawei Han. ―gSpan: Graph-Based

Substructure Pattern Mining,‖In Proc ICDM 2002, pp.

721-724, ―doi:10.1109/ICDM.2002.1184038‖.

[19]Yan X, Zhou XJ, Han J. ―Mining closed relational graphs

with connectivity constraints‖. In Proc of ACM SIGKDD

international conference on knowledge discovery in

databases (KDD’05), Chicago, IL, pp. 324–333, 2005,

―doi:10.1145/1081870.1081908‖.

Authors’ Profiles

K. Vijay Bhaskar is currently a PhD

student in the department of Computer

Science and Engineering, Gandhi Institute

of Technology (GITAM). His current

research areas of interest include Data

Mining, Graph Databases, Network

security, and Mobile Computing.

Dr. R.B.V.Subramanyam is an associate

professor in the department of computer

science and Engineering, National Institute

of Technology, Warangal. He received his

Master of Technology and Doctor of

Philosophy from Indian Institute of

Technology, Kharagpur. His areas of

interest include Data mining, Distributed data mining, Graph databases, Fuzzy data mining, Big data analytics, Pattern recognition, High performance computing, Soft computing, Game theory, Outlier analysis.

Dr. K. Thammi Reddy is the Director of

Internal Quality Control (IQC) and

Professor of CSE at Gandhi Institute of

Technology(GITAM).He is having Over 18

years of experience in Teaching, Research,

Curriculam Design and consultancy. His

research areas include Data warehousing

and Mining, Distributed computing,

Network Security etc.

S. Sumalatha is currently a PhD student in

the department of Computer Science and

Engineering, National Institute of

Technology, Warangal. Her research areas

of interest include Data mining, Big data

analytics and Graph databases.

How to cite this paper: K. Vijay Bhaskar, R.B.V Subramanyam, K. Thammi Reddy, S. Sumalatha,"Top-k Closed Sequential Graph Pattern Mining", International Journal of Information Engineering and Electronic Business(IJIEEB), Vol.8, No.4, pp.1-9, 2016. DOI: 10.5815/ijieeb.2016.04.01

并行序列模式挖掘

并行序列模式挖掘研究概况 1序列模式挖掘问题 R.Agrawal等人在1995年首先提出了序列模式挖掘的概念[1],其问题描述如下。 在由多个交易组成的交易数据库中,某个交易描述了某个顾客在某时间购买物品的集合。物品的集合叫做项集。相同顾客在不同交易中包含的项集的子集,按时间先后关系排列称为一个序列。每个子集称为一个元素(element)。并且给定由用户确定的最小支持度阈值(min_support threshold),序列模式挖掘就是去发现所有子序列的出现频率不小于给定的最小支持度的频繁子序列。 2序列模式挖掘研究概况 在序列模式挖掘的问题被提出以后,人们开始在时间序列数据库中挖掘序列模式和其他频繁模式的算法方面不断地进行研究和改进。现有的序列模式挖掘算法主要可以分为两类:第一类是基于Apriori特性的算法[2]。它是由R.Agrawal和R.srikant在1994年提出的。其基本思想是任一个频繁模式的子模式必定是频繁的。基于这一特性,人们提出了一系列类APriori的序列模式挖掘算法,这些算法中有采用水平数据格式(horizontal data format)的算法,如AprioriAll算法[1]、GSP算法[3]、PSP算法[4]等;有采取垂直数据格式(vertica data format)的算法,如SPADE算法[5]、SPAM算法[6]等。第二类是J.Han等人提出的基于模式增长(pattern-growth)策略的算法,如Freespan算法[7]、Prefixspan算法[8],[9]等。另外,C.Antunes 等人提出了SPARSE算法[10]。在很多序列模式挖掘任务中,用户不需要找出数据库中所有可能存在的序列模式,而是加入一定的约束,找出用户感兴趣的序列模式[11],[12],Agrawal 等人将序列模式挖掘问题加以泛化,引入了时间约束、滑动时间窗口(sliding time window)和分类约束,并提出了GSP算法[3]。 上述序列模式挖掘算法都是在一维信息中挖掘序列模式。在多维序列模式方面,H.Pint 等人提出的UniSeq算法[13]能有效地挖掘多维序列模式,但在维度较高时其挖掘性能会有所下降。当前国内外学者研究方向还包括增量式(Incermental)序列模式挖掘[14]、闭合(Closed)序列模式挖掘[15]、周期性(Peirodic)模式挖掘[16]、结构(Strueture)模式挖掘[17]、近似(Approximate)序列模式挖掘[18]等。 3并行序列模式挖掘研究概况 由于并行关联规则挖掘方面的研究起步较早,出现了许多优秀的算法[19]~[24]。而序列模式是关联规则的扩展,并行序列模式挖掘也很快被人们重视起来。 当前大多数并行序列模式挖掘算法都是对串行算法并行化的结果。随着研究的深入与问题的需要,序列模式挖掘的经典算法GSP的思想也逐渐被扩展到了并行序列模式挖掘中。Shintani等人提出了基于GSP算法的三种并行策略NPSPM、SPSPM、HPSPM[25]。在算法NPSPM中,将候选序列复制到所有处理器中,每个处理器利用本地数据库计算其本地支持度,并在每次迭代之后执行一个归约操作,最后得到其全局支持度。NPSPM在每个节点上都复制完整的候选集,在处理大型数据库时常常会导致内存溢出;SPSPM策略将候选集划分成大小相等的块后分别置于各个处理器中。但是SPSPM利用系统的聚合内存,容易引起额外的通信开销,因为为了得到序列的全局支持度,每个处理器的本地数据库必须广播到其他所有的处理器上;HPSPM采用了一个更加智能的策略,一方面基于Hash机制对候选序列进行划分,另一方面,它减少了所需的通信时间。相比前面两个并行策略,HPSPM的性能最好。 Guralnik等人提出了一类基于树投影技术的并行序列模式算法[26],[27]。根据并行策略,算法可被分为两种。一种是数据并行模式DPF:原始数据库被划分成p个大小相等的块存于p个处理器上,每个处理器拥有一个相同的字典树,各处理器计算本地支持度然后通过通

数据挖掘考试题库

1.何谓数据挖掘?它有哪些方面的功能? 从大量的、不完全的、有噪声的、模糊的、随机的数据中,提取隐含在其中的、人们事先不知道的、但又是潜在有用的信息和知识的过程称为数据挖掘。相关的名称有知识发现、数据分析、数据融合、决策支持等。 数据挖掘的功能包括:概念描述、关联分析、分类与预测、聚类分析、趋势分析、孤立点分析以及偏差分析等。 2.何谓粒度?它对数据仓库有什么影响?按粒度组织数据的方式有哪些? 粒度是指数据仓库的数据单位中保存数据细化或综合程度的级别。粒度影响存放在数据仓库中的数据量的大小,同时影响数据仓库所能回答查询问题的细节程度。按粒度组织数据的方式主要有: ①简单堆积结构 ②轮转综合结构 ③简单直接结构 ④连续结构 3.简述数据仓库设计的三级模型及其基本内容。 概念模型设计是在较高的抽象层次上的设计,其主要内容包括:界定系统边界和确定主要的主题域。 逻辑模型设计的主要内容包括:分析主题域、确定粒度层次划分、确定数据分割策略、定义关系模式、定义记录系统。 物理数据模型设计的主要内容包括:确定数据存储结构、确定数据存放位置、确定存储分配以及确定索引策略等。在物理数据模型设计时主要考虑的因素有: 存取时间、空间利用率和维护代价等。 提高性能的主要措施有划分粒度、数据分割、合并表、建立数据序列、引入冗余、生成导出数据、建立广义索引等。 4.在数据挖掘之前为什么要对原始数据进行预处理? 原始业务数据来自多个数据库或数据仓库,它们的结构和规则可能是不同的,这将导致原始数据非常的杂乱、不可用,即使在同一个数据库中,也可能存在重复的和不完整的数据信息,为了使这些数据能够符合数据挖掘的要求,提高效率和得到清晰的结果,必须进行数据的预处理。 为数据挖掘算法提供完整、干净、准确、有针对性的数据,减少算法的计算量,提高挖掘效率和准确程度。

时间序列模式挖掘

第6章时间序列和序列模式挖掘(讲稿) 6.1时间序列及其应用 时间序列(Time Series)挖掘是从大量的时间序列数据中提取人们事先不知道的但又是潜在有用的信息和知识,是数据挖掘中的一个重要研究分支,有广泛的应用价值。 近年来,时间序列挖掘在宏观的经济预测、市场营销、客流量分析、太阳黑子数、月降水量、河流流量、股票价格变动(长期的观察,有周期性)等众多领域得到应用。事实上,社会、科学、经济、技术等领域中广泛存在着大量的时间序列数据有待进一步的分析和处理。 时间序列数据挖掘通过研究信息的时间特性,深入洞悉事物进化的机制,是获得知识的有效途径。 从统计意义上来讲,所谓时间序列就是将某一指标在不同时间上的不同数值,按照时间先后顺序排列而成的数列。它可以是观察值也可以是记录值。 这种数列由于受到各种偶然因素的影响。往往表现出某种随机性,彼此之间存在着在统计上的依赖关系。虽然每一时刻上的取值或数据点的位置具有一定的随机性,不可能完全准确地用历史值来预测将来。但前后时刻的数值或数据点的相关性往往呈现某种趋势性或周期性变化----这是时间序列挖掘的可行性之所在。 时间序列挖掘通过对过去历史行为的客观记录分析,揭示其内在规律(如波动周期,振幅,趋势),进而完成预测未来行为等决策性工作。人们希望通过对时间序列的分析,从大量的数据中发现和揭示某一现象的发展变化规律或从动态的角度刻画某一现象与其他现象之间的内在数量关系,以掌握和控制未来行为。 简言之,时间序列数据挖掘就是要从大量的时间序列数据中提取人们事先不知道的、但又是潜在有用的与时间属性相关的信息和知识,并用于短期、中期或长期预测,指导人们的社会、经济、军事和生活等行为。 从数学意义上来讲,如果我们对某一过程中的某一变量进行X(t)观察测量,在一系列时刻t1,t2,…,t n(t为自变量,且t1

研究生论文开题报告--基于隐私保护的多源数据挖掘高效算法研究

研究生学位论文开题报告 题目名称:基于隐私保护的多源数据挖掘高效算法研究 姓名: 学号: 专业名称: 研究方向: 攻读学位: 学院: 导师姓名: 导师职称: 填表时间年月日

填表说明 1.开题报告是研究生培养的重要环节,研究生需在认真完成。 2.完成时间:硕士研究生的开题报告应于第三学期末前完成 3.打印要求:此表用A4纸双面打印。 4.此表与中期考核审核表、成绩单、实践报告、学术活动列表等材料一起交于学院,参加中期考核

一、课题来源,国内外研究现状、水平及发展趋势,选题的研究意义、目的,参考文献(一)课题来源 1、问题的提出 数据挖掘,顾名思义即是从大型数据库中提取人们感兴趣的知识,这些知识是隐含的、事先未知的、潜在的、有用信息,提取的知识表示为概念、规则、规律、模式等形式[1]。数据挖掘要处理的问题,就是在庞大的数据库中寻找有价值的隐藏事件,加以分析,并将这些有意义的信息归纳成结构模式,提供给有关部门决策时参考。目前已经提出的常用方法有关联规则、决策树、聚类、神经网络等方法。 然而,在对数据进行挖掘的时候,都不可避免的会出现敏感信息泄露的问题,随着数据挖掘技术的日益发展,数据隐私和信息安全逐渐引起人们的关注。为了保护数据的隐私,人们不愿提供正确的信息给服务商,以免个人信息泄露造成不必要的麻烦,但是数据挖掘结果准确的重要前提是提供的数据正确。由于数据挖掘主要任务是对汇总数据的模式开发,这使得构造一个不需要访问精确的单个信息而获得准确的模式的挖掘技术成为可能。目前,基于隐私保护的数据挖掘技术已经成为一个新颖热门的研究领域,国内外已有很多成熟的研究算法和技术。 通过众多文献比对我们发现,目前已有的这些基于隐私保护的数据挖掘算法和技术大多是针对单源数据库进行挖掘和保护,而在实际应用中,有很多情况必须面对多个数据源。例如,许多大型企业、跨国公司都拥有过个子公司,每个子公司都有自己相应的数据库。这就迫切需要数据库挖掘系统具有针对多数据源进行挖掘和保护的能力。已有的国内外文献中,针对多源数据进行挖掘的模型和算法已经出现,但是基于隐私保护技术的多源数据挖掘研究却很少提及。这可能是由于多源数据挖掘本身的技术局限性,导致在对多个数据源进行挖掘时,泄露敏感信息都成为了不可避免的操作。因此,本文在对当前已有的多源序列模式挖掘技术研究的基础上,分析结合并行和隐私保护技术的特点,提出新的基于隐私保护的多源数据挖掘高效算法,使得在多源环境下既可以高效率高准确度的挖掘出高投票率模式(全局模式),又可以隐藏敏感序列模式,达到较好的隐私保护效果。 (二)国内外研究现状、水平及发展趋势 1、隐私保护技术的研究进展 关于数据的隐私保护问题,首次是由Adam N等学者在《Security-control methods for statistical databases: A comparison study》[2]一文中提出,文章中提出了一种用扰动的方式来解决数据的隐私保护。所谓“扰动”就是发布数据集失真,数据获得者无法通过其他途径构建出原始数据集,但是这个失真的数据集又仍然保持数据获得者所希望保留的某种特性。基于数据失真的技术还有随机扰动、阻塞和凝聚等。目前常用的隐私保护技术大多都是以统计模型和概率模型为主理论,应用在较低层次的数据隐私保护。在分布式环境中,Clifton C等提出使用SMC (Secure Multi-party Computation)安全多方计算加密技术保证数据的通信安全[3],这种基于加密的隐私保护技术可适用于科学计算、分布式安全查询、几何计算、分布式数据挖掘等应用。当前,关于SMC的研究主要集中在减低计算开销、以SMC为工具解决问题以及优化分布式计算协议。在国内,关于隐私保护技术的研究主要集中在基于数据失真或数据加密技术方面的研究,如基于隐私保护分类挖掘算法、关联规则挖掘、分布式数据的隐私保护协同过滤推荐、网格访问控制等。(国内研究现状) 对数据进行隐私保护,主要可分为在数据发布过程中和在数据挖掘过程中进行。目前已有的针对数据发布的隐私保护技术已经有很多,本文主要讨论数据挖掘中的隐私保护技术。 2、隐私保护数据挖掘的研究进展

4种序列模式挖掘算法的特性研究

第28卷 第2期 2006年2月武 汉 理 工 大 学 学 报JOURNA L OF WUHAN UNIVERSIT Y OF TECHN OLOG Y Vol.28 No.2 Feb.2006 4种序列模式挖掘算法的特性研究 吕 锋,张炜玮 (武汉理工大学信息工程学院,武汉430070) 摘 要: 序列模式挖掘是数据挖掘中的一个重要研究方向,对序列模式挖掘中的4种算法(AprioriAll 、GSP 、FreeSpan 、Prefixspan )的执行过程及其特点进行了研究,并对这几种算法的时空执行效率进行了定性和定量的分析比较,指出了4种算法各自的适用范围,得出的结果对序列模式挖掘系统的设计具有一定的参考价值。 关键词: 序列模式挖掘; AprioriAll ; GSP ; FreeS pan ; Prefixspan 中图分类号: TP 301.6文献标志码: A 文章编号:167124431(2006)022******* R esearch on the Characters of Four Sequential Patterns Mining Algorithms L U Feng ,ZHA N G Wei 2wei (School of Information Engineering ,Wuhan University of Technology ,Wuhan 430070,China ) Abstract :  Sequential patterns mining was a very important data 2mining problem with board application.We researched four sequential patterns mining algorithms namely AprioriAll ,GSP ,FreeS pan ,Prefixspan ,and studied their characters.A qualita 2tive analysis had also been made on the algorithm ’s time and space efficiency.We also pointed out the conditions in which each algorithm was applied.The conclusion of the research would be beneficial to the desi gn of data mining system. K ey w ords : sequential patterns mining ; aprioriAll ; GSP ; freespan ; prefixspan 收稿日期:2005209202. 基金项目:教育部重点实验室开放研究基金(T K LJ0203)1 作者简介:吕 锋(19572),男,教授.E 2mail :lufengwut @https://www.doczj.com/doc/ee12682656.html, 序列模式挖掘即从序列数据库中发现频繁子序列以作为模式,它是一类重要的数据挖掘问题,有着非常广泛的应用前景,被应用在包括顾客购买行为的分析、网络访问模式分析、科学实验的分析、疾病治疗的早期诊断、自然灾害的预测、DNA 序列的破译等方面。 序列模式首先是由Agrawal R 和Srikant R 提出的[1,2],此后,许多这方面的研究都将注意力放在如何提高序列模式挖掘的效率上。但是到目前为止,大多数挖掘序列模式的方法都是Apriori 类方法的改进。下面就对序列模式挖掘中的4种算法进行分析和比较。 1 序列模式挖掘中的4种算法及其特点 1.1 AprioriAll 算法 AprioriAll [1]算法为Apriori 类算法,主要思想为:在每一次扫描(pass )数据库时,利用上一次扫描时产生的大序列生成候选序列,并在扫描的同时计算它们的支持度(support ),满足支持度的候选序列作为下次扫描的大序列。第1次扫描时,长度为1的频繁序列模式作为初始的大1—序列。 AprioriAll 算法的不足:1)容易生成庞大众多的候选序列;2)需要多次扫描数据库。候选序列的长度增加1,就需要扫描1次数据库;3)不易发现长序列模式,因为随着需要挖掘的序列模式长度的增加,侯选序列

序列模式挖掘综述

收稿日期:2007-08-24;修回日期:2007-11-17 作者简介:陈卓,博士,主要研究方向为数据挖掘(chenzhou613@https://www.doczj.com/doc/ee12682656.html,);杨炳儒,教授,博导,主要研究方向为数据挖掘、推理机制与知识发现等. 序列模式挖掘综述 陈 卓,杨炳儒,宋 威,宋泽锋 (北京科技大学信息工程学院,北京100083) 摘 要:综述了序列模式挖掘的研究状况。首先介绍了序列模式挖掘背景与相关概念;其次总结了序列模式挖掘的一般方法,介绍并分析了最具代表性的序列模式挖掘算法;最后展望序列模式挖掘的研究方向。便于研究者对已有算法进行改进,提出具有更好性能的新的序列模式挖掘算法。关键词:数据挖掘;序列模式;周期模式;增量式挖掘 中图分类号:TP 311 文献标志码: A 文章编号:1001-3695(2008)07-1960-04 Sur vey of sequen tial pat ter n m inin g CHE N Zhuo,YAN G Bing-ru,S ON G Wei,S ON G Ze-feng (S chool of Infor mation Engineering,Beijing Univer sity of S cience &Technology,Beijing 100083,C hina) Abst ract :This pa per prov ided a review of the res ea rch of sequential pa tt ern m ining.Firstly,introduced the ba ckground and context .S econdly,sum m a rized the genera l m et hods of sequence pa tt ern m ining,introduced and analyz ed the m os t represent ative a lg orithm to prov ide a basis for im proving old algorit hm s or developing new effect iv e ones.Fina lly,dis cussed som e future re-s ea rch t rends on t his area . Key words:dat a m ining ;sequent ia l pat tern;periodic pa tt ern;increm enta l m ining 数据挖掘作为知识发现的核心步骤,旨在从海量数据中提取有效的、新颖的、潜在有用的、易被理解的知识。序列模式挖掘(sequent ia l pa tt ern m ining)是数据挖掘中非常重要的一个研究领域,最早是由Ra kesh Agraw al 和Ram a krishna n S rikant 在针对超市中购物篮数据的分析提出来的。序列模式挖掘是要找出序列数据库中所有超过最小支持度阈值的序列模式 [1] 。它 有着广泛的应用领域:商业组织利用序列模式挖掘去研究客户购买行为模式特征、计算生物学中序列模式挖掘用来分析不同氨基酸突变模式、用户Web 访问模式预测以及DN A 序列分析和谱分析。序列模式挖掘与关联规则挖掘在许多方面相似,但它更关心数据之间顺序的关联性。 1 序列模式挖掘任务定义 基本概念: 定义1 事务数据库(t ransaction da taba se):以超市数据为例来说明,即由顾客交易记录组成的数据库。Custom_ID 、T ra nsaction_Tim e 、It em set 分别代表顾客标志、交易时间和交易物品集合。 定义2 项集(it em s et):各个项(it em )组成的集合。定义3 序列(sequence):不同项集的有序排列。序列S 可以表示为S =〈s 1,s 2,…,s n 〉。其中:s j (1≤j ≤n )为项集,也称为序列S 的元素。 定义4 序列的元素(elem ent):表示为(x 1,x 2,…,x n )。其中:x k (1≤k ≤m )为不同的项。 定义5 序列长度:一个序列包含的所有项集的个数,长度为1的序列记为1-序列。 定义6 序列的包含:设存在两个序列α,β。其中:α=〈a 1,a 2,…,a n 〉,β=〈b 1,b 2,…,b n 〉。如果存在整数1≤j 1

生物信息学期末考试答案分析解析

一、名词 Bioinformatics:生物信息学——是一门综合运用生物学、数学、物理学、信息科学以及计算机科学等诸多学科的理论方法,以互联网为媒介、数据库为载体、利用数学和计算机科学对生物学数据进行储存、检索和处理分析,并进一步挖掘和解读生物学数据。 Consensus sequence:共有序列——决定启动序列的转录活性大小。各种原核启动序列特定区域内(通常在转录起始点上游-10及-35区域)存在共有序列,是在两个或多个同源序列的每一个位置上多数出现的核苷酸或氨基酸组成的序列。 Data mining:数据挖掘——数据挖掘一般是指从大量的数据中自动搜索隐藏于其中的有着特殊关系性的信息的过程。数据挖掘通常是利用计算方法分析生物数据,即根据核酸序列预测蛋白质序列、结构、功能的算法等,实现对现有数据库中的数据进行发掘。 EST:(Expressed Sequence Tag)表达序列标签——是某个基因cDNA克隆测序所得的部分序列片段,长度大约为200~600bp。 Similarity:相似性——是直接的连续的数量关系,是指序列比对过程中用来描述检测序列和目标序列之间相同DNA碱基或氨基酸残基顺序所占比例的高低。 Homology:同源性——是两个对象间的肯定或者否定的关系。如两个基因在进化上是否曾具有共同祖先。从足够的相似性能够判定二者之间的同源性。 Alignment:比对——从核酸以及氨基酸的层次去分析序列的相同点和不同点,以期能够推测它们的结构、功能以及进化上的联系。或是指为确定两个或多个序列之间的相似性以至于同源性,而将它们按照一定的规律排列。 BLOSUM:模块替换矩阵——是指在对蛋白质数据库搜索时,采用不同的相似性分数矩阵进行检索的相似性矩阵。以序列片段为基础,从蛋白质模块数据库BLOCKS中找出一组替换矩阵,用于解决序列的远距离相关。在构建矩阵过程中,通过设置最小相同残基数百分比将序列片段整合在一起,以避免由于同一个残基对被重复计数而引入的任何潜在的偏差。在每一片段中,计算出每个残基位置的平均贡献,使得整个片段可以有效地被看作为单一序列。通过设置不同的百分比,产生了不同矩阵。 PAM(Point Accepted Mutation):突变数据矩阵PAM即可接受点突变——指1个PAM表示100个残基中发生一个残基突变概率的进化距离。在序列比对中,能够反映一个氨基酸发生改变的概率与两个氨基酸随机出现的概率的比值的矩阵。 Contig:叠连群——是指一组相互两两头尾拼接的可装配成长片段的DNA序列克隆群,也指彼此间可通过重叠序列而连接成连续的、扩展的、不间断的DNA序列的交叠片段产物。通过比对不同的序列,我们能够发现片段的顺序,并且contigs能被添加、删除、重排列来形成新的序列。 Phylogenetic tree:系统发生树又称为演化树(evolutionary tree)——是表明被认为具有共同祖先的各物种间演化关系的树,是一种亲缘分支分类方法。在树中,每个节点代表其各分支的最近共同祖先,而节点间的线段长度对应演化距离(如估计的演化时间)。它用来表示系统发生研究的结果,用它描述物种之间的进化关系。 In Silico Cloning:电子克隆——是近年来发展起来的一门基于表达序列标签(ESTs)的快速克隆基因的新技术,其利用种子序列从EST及UniGene数据库中搜索相似性序列,进行拼装、检索、分析等,以此获得目标基因的全长cDNA,在此基础上也能够实现基因作图定位。 二、问题思考 1、生物信息学这门学科是如何发展起来的? 答:生物学数据爆炸式增长 生物大分子数据库相继建立 生物技术与计算机技术并行飞速发展

基于Hadoop平台的并行数据挖掘算法工具箱与数据挖掘云

基于Hadoop平台的并行数据挖掘算法工具 箱与数据挖掘云 来源:南京大学计算机科学与技术系作者:高阳,杨育彬,商琳时间:2011-06-27 浏览次数:60 一基于云计算的海量数据挖掘 2008年7 月,《Communications of the ACM》杂志发表了关于云计算的专辑,云计算因其清晰的商业模式而受到广泛关注,并得到工业和学术界的普遍认可。目前工业界推出的云计算平台有Amazon公司的EC2和S3,Google公司的Google Apps Engine, IBM公司的Blue Cloud,Microsoft公司的Windows Azure, Salesforce公司的Sales Force, VMware公司的vCloud,Apache软件开源组织的Hadoop等。在国内,IBM与无锡市共建了云计算中心,中石化集团成功应用IBM的云计算方案建立起一个企业云计算平台。阿里巴巴集团于2009年初在南京建立电子商务云计算中心。 严格的讲,云计算是一种新颖的商业计算模型,它可以将计算任务分布在大量互连的计算机上,使各种应用系统能够根据需要获取计算资源、存储资源和其他服务资源。Google公司的云平台是最具代表性的云计算技术之一,包括四个方面的主要技术:Google文件系统GFS、并行计算模型MapReduce、结构化数据表BigTable和分布式的锁管理Chubby。基于以上技术,云计算可以为海量数据处理和分析提供一种高效的计算平台。简单来说,将海量数据分解为相同大小、分布存储,然后采用MapReduce模型进行并行化编程,这种技术使Google公司在搜索引擎应用中得到了极大的成功。 然而MapReduce计算模型适合结构一致的海量数据,且要求计算简单。对于大量的数据密集型应用(如数据挖掘任务),往往涉及到数据降维、程序迭代、

数据挖掘考试题库

1.何谓数据挖掘它有哪些方面的功能 从大量的、不完全的、有噪声的、模糊的、随机的数据中,提取隐含在其中的、人们事先不知道的、但又是潜在有用的信息和知识的过程称为数据挖掘。相关的名称有知识发现、数据分析、数据融合、决策支持等。 数据挖掘的功能包括:概念描述、关联分析、分类与预测、聚类分析、趋势分析、孤立点分析以及偏差分析等。 2.何谓粒度它对数据仓库有什么影响按粒度组织数据的方式有哪些 粒度是指数据仓库的数据单位中保存数据细化或综合程度的级别。粒度影响存放在数据仓库中的数据量的大小,同时影响数据仓库所能回答查询问题的细节程度。按粒度组织数据的方式主要有: ①简单堆积结构 ②轮转综合结构 ③简单直接结构 ④连续结构 3.简述数据仓库设计的三级模型及其基本内容。 概念模型设计是在较高的抽象层次上的设计,其主要内容包括:界定系统边界和确定主要的主题域。 逻辑模型设计的主要内容包括:分析主题域、确定粒度层次划分、确定数据分割策略、定义关系模式、定义记录系统。 物理数据模型设计的主要内容包括:确定数据存储结构、确定数据存放位置、确定存储分配以及确定索引策略等。在物理数据模型设计时主要考虑的因素有: I/O存取时间、空间利用率和维护代价等。 提高性能的主要措施有划分粒度、数据分割、合并表、建立数据序列、引入冗余、生成导出数据、建立广义索引等。4.在数据挖掘之前为什么要对原始数据进行预处理 原始业务数据来自多个数据库或数据仓库,它们的结构和规则可能是不同的,这将导致原始数据非常的杂乱、不可用,即使在同一个数据库中,也可能存在重复的和不完整的数据信息,为了使这些数据能够符合数据挖掘的要求,提高效率和得到清晰的结果,必须进行数据的预处理。 为数据挖掘算法提供完整、干净、准确、有针对性的数据,减少算法的计算量,提高挖掘效率和准确程度。 5.简述数据预处理方法和内容。 ①数据清洗:包括填充空缺值,识别孤立点,去掉噪声和无关数据。 ②数据集成:将多个数据源中的数据结合起来存放在一个一致的数据存储中。需要注意不同数据源的数据匹配问题、数值冲 突问题和冗余问题等。 ③数据变换:将原始数据转换成为适合数据挖掘的形式。包括对数据的汇总、聚集、概化、规范化,还可能需要进行属性的 重构。 ④数据归约:缩小数据的取值范围,使其更适合于数据挖掘算法的需要,并且能够得到和原始数据相同的分析结果。 6.简述数据清理的基本内容。 ①尽可能赋予属性名和属性值明确的含义; ②统一多数据源的属性值编码; ③去除无用的惟一属性或键值(如自动增长的id); ④去除重复属性(在某些分析中,年龄和出生日期可能就是重复的属性,但在某些时候它们可能又是同时需要的) ⑤去除可忽略字段(大部分为空值的属性一般是没有什么价值的,如果不去除可能造成错误的数据挖掘结果) ⑥合理选择关联字段(对于多个关联性较强的属性,重复无益,只需选择其中的部分用于数据挖掘即可,如价格、数据、金额) ⑦去掉数据中的噪音、填充空值、丢失值和处理不一致数据。 7.简述处理空缺值的方法。 ①忽略该记录; ②去掉属性; ③手工填写空缺值; ④使用默认值; ⑤使用属性平均值; ⑥使用同类样本平均值; ⑦预测最可能的值。 8.常见的分箱方法有哪些数据平滑处理的方法有哪些 分箱的方法主要有: ①统一权重法(又称等深分箱法) ②统一区间法(又称等宽分箱法) ③最小熵法 ④自定义区间法 数据平滑的方法主要有:平均值法、边界值法和中值法。

数据挖掘试卷一

数据挖掘整理(熊熊整理-----献给梦中的天涯) 单选题 1.下面哪种分类方法是属于神经网络学习算法?() A. 判定树归纳 B. 贝叶斯分类 C. 后向传播分类 D. 基于案例的推理 2.置信度(confidence)是衡量兴趣度度量( A )的指标。 A、简洁性 B、确定性 C.、实用性 D、新颖性 3.用户有一种感兴趣的模式并且希望在数据集中找到相似的模式,属于数据挖掘哪一类任务?(A) A. 根据内容检索 B. 建模描述 C. 预测建模 D. 寻找模式和规则 4.数据归约的目的是() A、填补数据种的空缺值 B、集成多个数据源的数据 C、得到数据集的压缩表示 D、规范化数据 5.下面哪种数据预处理技术可以用来平滑数据,消除数据噪声? A.数据清理 B.数据集成 C.数据变换 D.数据归约 6.假设12个销售价格记录组已经排序如下:5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215 使用如下每种方法将它们划分成四个箱。等频(等深)划分时,15在第几个箱子内? (B) A 第一个 B 第二个 C 第三个 D 第四个 7.下面的数据操作中,()操作不是多维数据模型上的OLAP操作。 A、上卷(roll-up) B、选择(select) C、切片(slice) D、转轴(pivot) 8.关于OLAP和OLTP的区别描述,不正确的是: (C) A. OLAP主要是关于如何理解聚集的大量不同的数据.它与OTAP应用程序不同. B. 与OLAP应用程序不同,OLTP应用程序包含大量相对简单的事务. C. OLAP的特点在于事务量大,但事务内容比较简单且重复率高. D. OLAP是以数据仓库为基础的,但其最终数据来源与OLTP一样均来自底层的数据库系统,两者面对的用户是相同的 9.下列哪个描述是正确的?() A、分类和聚类都是有指导的学习 B、分类和聚类都是无指导的学习

每章习题

每章习题 第一章 1、数据挖掘是从大量数据中( )或( )知识。 2、数据挖掘的知识发现由哪些步骤组成? 3、数据挖掘系统具有哪些主要成分? 4、数据库管理系统的概念? 5、什么是事务数据库? 6、什么是面向对象数据库? 第二章 1、数据库的主要特征; 2、什么是数据方? 3、数据仓库模型有哪些? 4、什么是概念分层? 5、从结构的角度看,有三种数据仓库模型:()数据集市、() 6、OLAP服务器实现包括哪些? 第三章 1、数据变换将数据转换成适合于挖掘的形式,数据变换可能涉及哪些内容? 2、数据归约的策略? 3、多元回归是()的扩充,响应变量是多维特征向量的线性函数。 4、数据归约技术有哪些表示? 5、什么是数据集成? 6、数据域处理的概念? 第四章 1、定义数据挖掘任务或查询需要哪些原语? 2、模式分层的概念;`x 3、兴趣度度量评估模式的性质:简洁性、()、()和新颖性; 4、为什么有一个数据挖掘查询语言很重要? 5、数据挖掘GUI可能包含哪些成分? 6、基于不同的结构设计应选哪些耦合模式可以将DM系统与DB/DW系统集成? 第五章 1、概念描述的概念; 2、大型数据库的概念描述和数据仓库的联机分析处理有何不同? 3、什么是数据泛化? 4、对于面向属性归纳,现在数据已经准备好,如何进行面向属性归纳? 5、对基本方法稍加扩充,概念描述挖掘可以增量地、()、或分布地进行。 6、概念描述的属性相关分析执行步骤; 第六章 1、如何由大型数据库挖掘关联规则?

2、购物篮分析只是关联规则挖掘的一种形式,还有其他哪些方法? 3、怎样能够提高Apriori的有效性? 4、对于购物篮分析中经常用() 5、对于具有递减支持度的多层关联规则挖掘,有许多可用的搜索策略如: 6、挖掘多维关联规则的技术可以根据量化属性的处理分为哪三类。 第七章 1、分类和预测方法可以根据哪些标准进行比较和评估? 2、分类和预测包括哪些广泛的应用()、()、()、()。 3、什么叫做判定树? 4、广义线性模型的常见形式包括()、()。 5、如何设计神经网络拓扑? 6、贝叶斯定理公式 第八章 1、聚类的典型应用是什么? 2、数据挖掘对聚类的典型要求? 3、聚类分析中的两个数据类型是什么? 4、满足对距离函数的四点数学要求。 5、描述凝聚的和分裂的两种层次聚类方法。 6、基于模型的聚类方法有哪两个()、()。 第九章 1、名词解释空间分类 2、什么是多媒体数据库? 3、基于小波的特征标识的相似检索方法是怎样的? 4、多媒体数据中可以挖掘的三种关联()、()、()。 5、四种主要的变化成分用于特化时序数据是哪些? 6、许多有关序列模式挖掘的研究主要针对()模式。 第十章 1、电信数据分析中可视化工具的使用有()可视化()可视化()可视化。 2、零售业中的几个数据挖掘的例子是什么? 3、商用数据挖掘系统的例子有哪些请简单介绍。 4、数据可视化和数据挖掘的相同和不同之处有哪些? 5、查询应答机制可以根据它们反应方式的不同分为哪两类()、()。 6、数据挖掘整个生命周期应包含哪几个阶段请简单介绍。 习题答案

相关主题
文本预览