top-k闭序列图模式挖掘(IJIEEB-V8-N4-1)
- 格式:pdf
- 大小:449.84 KB
- 文档页数:9
数据挖掘中的关联规则与序列模式挖掘技术随着互联网和大数据技术的发展,数据挖掘技术在各个领域得到了广泛的应用。
其中,关联规则与序列模式挖掘技术是数据挖掘中的两个重要内容。
本文将介绍关联规则与序列模式挖掘技术的基本概念、应用场景以及挖掘方法,以帮助读者更好地理解数据挖掘中的这两种技术。
一、关联规则挖掘技术1.1基本概念关联规则挖掘是一种发现数据集中变量之间相互关联的方法,其目标是找出一组频繁出现在一起的物品或属性。
在关联规则挖掘中,我们可以使用频繁项集和支持度、置信度等指标来描述变量之间的关联规则。
1.2应用场景关联规则挖掘技术在市场营销、交叉销售、协同过滤等领域有着广泛的应用。
例如,在电商平台中,可以利用关联规则挖掘技术来分析用户购买行为,从而推荐相关商品或提供个性化的服务。
在医疗领域,可以利用关联规则挖掘技术来发现疾病之间的关联规律,从而辅助医生提出诊断和治疗方案。
1.3挖掘方法常见的关联规则挖掘方法包括Apriori算法、FP-growth算法等。
Apriori算法是一种基于候选集生成的方法,其基本思想是先找出频繁1项集,然后利用频繁1项集生成频繁2项集,再利用频繁2项集生成频繁3项集,依次类推。
FP-growth算法是一种基于条件模式基与频繁模式树的方法,其基本思想是利用频繁模式树来存储数据集,并通过条件模式基来高效地挖掘频繁项集。
二、序列模式挖掘技术2.1基本概念序列模式挖掘是一种发现数据序列中频繁出现的模式的方法,其目标是找出一组经常出现在一起的事件序列。
在序列模式挖掘中,我们可以使用频繁序列、支持度、长度等指标来描述事件序列之间的模式。
2.2应用场景序列模式挖掘技术在时间序列分析、生产流程优化、网络行为分析等领域有着广泛的应用。
例如,在生产流程中,可以利用序列模式挖掘技术来发现生产线上的优化模式,从而提高生产效率和节约成本。
在网络行为分析中,可以利用序列模式挖掘技术来发现用户在互联网上的行为模式,从而改善用户体验和提供个性化服务。
序列模式数据挖掘算法研究摘要:序列模式的发现是数据挖掘领域一个活跃的研究分支,即在序列数据库中找出所有的频繁子序列。
本文先介绍序列模式挖掘中的一些基本概念,然后详细描述FreeSpan和PrefixSpan2个基于投影、分治的模式增长的重要算法。
关键词:序列模式;算法一、基本术语和定义设I={i1,i2,……,in}是一个项目集合,项目集或者项集(items)就是各种项目组成的集合,即I 的所有子集。
一个序列就是若干项集的有序列表,一个序列S可表示为〈s1,s2,……,sn〉,其中sj为项集,也称作S的元素。
元素由不同的项组成,可表示为(x1,x2,……,xn)。
当元素只包含一项时,一般省去括号,如(x2)一般表示为x2。
元素之间是有顺序的,但元素内的项是无序的,一般定义为词典序。
序列包含项的个数称为序列的长度,长度为L的序列记为L- 序列. 序列数据库就是元组(tuples)〈sid, s 〉的集合,其中s是序列,sid 是该序列的序列号,元组的个数称为序列数据库的大小,记作|SDB|。
1、 FreeSpan算法思想FreeSpan ,即频繁模式投影的序列模式挖掘,其基本思想为:利用频繁项递归地将序列数据库投影到更小的投影数据库集中,在每个投影数据库中生成子序列片段.这一过程对数据和待检验的频繁模式集进行了分割,并且将每一次检验限制在与其相符合的更小的投影数据库中.2、FreeSpan 算法分析:它将频繁序列和频繁模式的挖掘统一起来,把挖掘工作限制在投影数据库中,还能限制序列分片的增长.它能有效地发现完整的序列模式,同时大大减少产生候选序列所需的开销,比基于Apriori 的GSP算法快很多.不足之处,它可能会产生许多投影数据库,如果一个模式在数据库中的每个序列中出现,该模式的投影数据库将不会缩减;另外,一个长度为k 的序列可能在任何位置增长,那么长度为k + 1的候选序列必须对每个可能的组合情况进行考察,这样所需的开销是比较大的. 对FreeSpan 中频繁项矩阵F占用存储空间的定量分析如下:设序列数据库中有m个频繁项,频繁项矩阵共需要|M|=m +32×(m-1)×(m-2)个计数单元。
并行序列模式挖掘研究概况1序列模式挖掘问题R.Agrawal等人在1995年首先提出了序列模式挖掘的概念[1],其问题描述如下。
在由多个交易组成的交易数据库中,某个交易描述了某个顾客在某时间购买物品的集合。
物品的集合叫做项集。
相同顾客在不同交易中包含的项集的子集,按时间先后关系排列称为一个序列。
每个子集称为一个元素(element)。
并且给定由用户确定的最小支持度阈值(min_support threshold),序列模式挖掘就是去发现所有子序列的出现频率不小于给定的最小支持度的频繁子序列。
2序列模式挖掘研究概况在序列模式挖掘的问题被提出以后,人们开始在时间序列数据库中挖掘序列模式和其他频繁模式的算法方面不断地进行研究和改进。
现有的序列模式挖掘算法主要可以分为两类:第一类是基于Apriori特性的算法[2]。
它是由R.Agrawal和R.srikant在1994年提出的。
其基本思想是任一个频繁模式的子模式必定是频繁的。
基于这一特性,人们提出了一系列类APriori的序列模式挖掘算法,这些算法中有采用水平数据格式(horizontal data format)的算法,如AprioriAll算法[1]、GSP算法[3]、PSP算法[4]等;有采取垂直数据格式(vertica data format)的算法,如SPADE算法[5]、SPAM算法[6]等。
第二类是J.Han等人提出的基于模式增长(pattern-growth)策略的算法,如Freespan算法[7]、Prefixspan算法[8],[9]等。
另外,C.Antunes 等人提出了SPARSE算法[10]。
在很多序列模式挖掘任务中,用户不需要找出数据库中所有可能存在的序列模式,而是加入一定的约束,找出用户感兴趣的序列模式[11],[12],Agrawal 等人将序列模式挖掘问题加以泛化,引入了时间约束、滑动时间窗口(sliding time window)和分类约束,并提出了GSP算法[3]。
不产生候选项集的TOP-K高效用模式挖掘算法
王乐;冯林;王水
【期刊名称】《计算机研究与发展》
【年(卷),期】2015(052)002
【摘要】目前TOP-K高效用模式挖掘算法需要产生候选项集,特别是当数据集比较大或者数据集中包含较多长事务项集时,算法的时间和空间效率会受到更大的影响.针对此问题,通过将事务项集和项集效用信息有效地保存到树结构HUP-Tree,给出一个不需要候选项集的挖掘算法TOPKHUP; HUP-Tree树能保证从中计算到每个模式的效用值,不需要再扫描数据集来计算模式的效用值,从而使挖掘算法的时空效率得到较大的提高.采用7个典型数据集对算法的性能进行测试,实验结果证明TOPKHUP的时间和空间效率都优于已有算法,并对K值的变化保持平稳.
【总页数】11页(P445-455)
【作者】王乐;冯林;王水
【作者单位】宁波大红鹰学院信息工程学院浙江宁波315175;大连理工大学电子信息与电气工程学部计算机科学与技术学院辽宁大连116024;大连理工大学电子信息与电气工程学部计算机科学与技术学院辽宁大连116024;宁波大红鹰学院信息工程学院浙江宁波315175
【正文语种】中文
【中图分类】TP311.13
【相关文献】
1.一种不产生候选项集的关联规则挖掘算法 [J], 刘晓玲;李玉忱
2.一种不产生候选项集的关联规则挖掘算法 [J], 李重周;杨君锐
3.一种不产生候选项集的关联规则挖掘算法 [J], 李重周; 杨君锐
4.基于DBP的TOp-k高效用项集挖掘算法 [J], 蒋华;路昕宇;王慧娇;宋佳璐
5.含负项top-k高效用项集挖掘算法 [J], 孙蕊;韩萌;张春砚;申明尧;杜诗语
因版权原因,仅展示原文概要,查看原文内容请购买。
闭合序列模式挖掘算法
沙金;邓成玉;张翠肖;刘伟峰
【期刊名称】《计算机工程与设计》
【年(卷),期】2006(27)3
【摘要】提出了一种新的挖掘闭合序列模式的PosD算法,该算法利用位置数据保存数据项的顺序信息,并基于位置数据列表保存数据项的顺序关系提出了两种修剪方法:逆向超模式和相同位置数据.为了确保栅格存储的正确性和简洁性,另外还针对一些特殊情况做处理.试验结果表明,在中大型数据库和小支持度的情况下该算法比CloSpan算法更有效.
【总页数】5页(P514-518)
【作者】沙金;邓成玉;张翠肖;刘伟峰
【作者单位】石家庄铁道学院,计算机系,河北,石家庄,050043;燕山大学,信息工程学院,河北,秦皇岛,066004;石家庄铁道学院,计算机系,河北,石家庄,050043;石家庄铁道学院,计算机系,河北,石家庄,050043
【正文语种】中文
【中图分类】TP311.13
【相关文献】
1.闭合序列模式的一种增量挖掘算法 [J], 林颖
2.闭合序列模式的一种增量挖掘算法 [J], 林颖
3.一种无候选项的闭合序列模式挖掘算法 [J], 杨斐;张万桢;陆垂伟
4.基于二级索引结构无候选项闭合序列模式挖掘算法 [J], 缪裕青;吴孔玲;朱晓雁;张锦杏
5.基于项位置索引的闭合连续序列模式挖掘算法 [J], 矫春兰;刘建宾
因版权原因,仅展示原文概要,查看原文内容请购买。
关联规则挖掘与序列模式挖掘关联规则挖掘(Association Rule Mining)和序列模式挖掘(Sequence Pattern Mining)都是数据挖掘中的重要技术。
它们可以从大规模的数据集中发现隐藏的关联关系和序列模式,帮助人们对数据进行深入分析和决策支持。
一、关联规则挖掘关联规则挖掘是一种数据挖掘技术,用于发现事物之间潜在的相关性、依赖性和关联性。
它通常用于市场篮子分析、交叉销售和推荐系统等领域。
关联规则通过挖掘出频繁项集(Frequent Itemset)来实现。
频繁项集是在数据集中频繁出现的项目组合。
一旦频繁项集被发现,关联规则就可以通过计算置信度(Confidence)和支持度(Support)来评估项目之间的关联性。
举个例子,假设我们有一个超市的销售数据集,其中包含了顾客购买的商品清单。
通过关联规则挖掘,我们可以找到一些频繁项集,比如“牛奶”和“面包”,意味着这两个商品经常被一起购买。
然后,我们可以计算置信度来评估关联规则,比如“牛奶->面包”的置信度是70%,表示在购买牛奶的情况下,有70%的概率会购买面包。
关联规则挖掘的一些常用算法包括Apriori算法和FP-Growth算法。
Apriori算法是一种基于候选生成和剪枝的方法,通过逐层搜索来发现频繁项集。
FP-Growth算法利用FP树(Frequent Pattern Tree)来存储和挖掘频繁项集,具有较高的效率。
二、序列模式挖掘序列模式挖掘是一种针对有序数据的挖掘技术,用于发现数据中的序列模式。
它通常用于日志分析、网络访问分析和生物信息学等领域。
序列模式可以定义为有序项目的序列,这些项目在数据中以特定顺序出现。
序列模式挖掘的目标是发现频繁序列模式(Frequent Sequence Pattern),即在数据中频繁出现的序列模式。
和关联规则挖掘类似,序列模式挖掘也需要计算支持度和置信度来评估模式的重要性。
简述序列模式挖掘的一般步骤
序列模式挖掘是数据挖掘领域中的一个重要技术,它用于从序列数据集中发现频繁出现的模式。
以下是序列模式挖掘的一般步骤:
1.数据预处理
在进行序列模式挖掘之前,需要对原始数据进行预处理。
这包括数据清洗、去噪、缺失值处理等操作。
确保数据的质量和完整性。
2.序列表示
将预处理后的数据转换为适合挖掘的序列表示形式。
常见的序列表示方法包括序列编码、序列索引和序列矩阵表示等。
3.模式提取
使用合适的算法或方法从序列数据集中提取频繁出现的模式。
常用的序列模式挖掘算法包括Apriori、FP-growth、PrefixSpan等。
4.模式评估
对挖掘得到的序列模式进行评估和分析。
常见的评估指标包括支持度、置信度、序列长度等。
通过评估可以筛选出具有实际意义的模式。
5.模式解释
根据领域知识和分析结果对挖掘得到的模式进行解释和理解。
将模式转化为可理解的业务规则,为决策提供支持。
以上是序列模式挖掘的一般步骤。
通过对数据的预处理、序列表示、模式提取、模式评估和模式解释等环节的处理,可以从序列数据中挖掘出有用的模式,为实际应用提供支持和指导。
I.J. Information Engineering and Electronic Business, 2016, 4, 1-9 Published Online July 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijieeb.2016.04.01
Copyright © 2016 MECS I.J. Information Engineering and Electronic Business, 2016, 4, 1-9 Top-k Closed Sequential Graph Pattern Mining
K. Vijay Bhaskar GITAM University/CSE, Visakhapatnam, 530045, India E-mail: vbreddy.vijay@gmail.com
Dr. R.B.V Subramanyam NIT Warangal/CSE, Warangal, 506004, India E-mail: rbvs66@nitw.ac.in
Dr. K. Thammi Reddy GITAM University/CSE, Visakhapatnam, 530045, India E-mail: thammireddy@gitam.edu
S. Sumalatha NIT Warangal/CSE, Warangal, 506004, India E-mail:katam.suma@gmail.com
Abstract—Graphs have become increasingly important in modeling structures with broad applications like Chemical informatics, Bioinformatics, Web page retrieval and World Wide Web. Frequent graph pattern mining plays an important role in many data mining tasks to find interesting patterns from graph databases. Among different graph patterns, frequent substructures are the very basic patterns that can be discovered in a collection of graphs. We extended the problem of mining frequent subgraph patterns to the problem of mining sequential patterns in a graph database. In this paper, we introduce the concept of Sequential Graph-Pattern Mining and proposed two novel algorithms SFG(Sequential Frequent Graph Pattern Mining) and TCSFG(Top-k Closed Sequential Frequent Graph Pattern Mining). SFG generates all the frequent sequences from the graph database, whereas TCSFG generates top-k frequent closed sequences. We have applied these algorithms on synthetic graph database and generated top-k frequent graph sequences. Index Terms—Data mining, graph mining, frequent sequential patterns, closed sequential patterns. I. INTRODUCTION Frequent graph patterns are substructures that appear in a graph data set to a frequency not less than a user-specified threshold [9]. Structural forms such as subtrees, subgraphs, sublattices are referred as substructures. A frequent structural pattern is a substructure that appears frequently in a graph database. Sequential pattern mining, the mining of frequently occurring sub-sequences as patterns, was introduced in [13] and has become an important problem in data mining. Sequential pattern mining discovers frequent subsequences as patterns. According to the GSP algorithm [14], sequential pattern mining is to find all sequences whose support is greater than the user-specified minimum support, such sequence is called a frequent sequence. In SPADE [8], the set of all frequent sequences is discovered by reducing database scans and it minimizes I/O costs. The GSP and SPADE follow candidate generation and test approach. The PrefixSpan algorithm [5] is a pattern-growth approach to sequential pattern mining and it follows divide-and-conquer strategy. BIDE [6] is an algorithm used for mining frequent closed sequences without candidate generation. These algorithms [5,6] grow patterns by constructing projected databases to reduce the search space for a pattern. In this paper, we investigated the problem of mining sequential patterns in graph databases. Our approach finds sequential patterns occurring in graph databases.
Fig.1. Three graphs where each vertex represents a web page In ―Fig. 1‖ each node represents a web page. Let us consider that the edges of the given graphs represent the order of visiting the web pages. For example, Graph1 represents the sequence q,s,t. Similarly Graph2 represents the sequence q,r,t and Graph3 represents the sequence q,p,t. Observing the three given graphs, it is clear that page t is accessed after page q(not immediately after q). The resulting pattern is which is not a subgraph. Given the above three graphs as input, any existing subgraph mining algorithm will not generate as a pattern. Many graph mining algorithms exist to find the frequent subgraphs, maximal frequent subgraphs, closed frequent subgraphs, and constraint-based closed frequent subgraphs. The sequence is not considered as an 2 Top-k Closed Sequential Graph Pattern Mining Copyright © 2016 MECS I.J. Information Engineering and Electronic Business, 2016, 4, 1-9 output pattern by the existing graph mining algorithms. Though is not a subgraph, it represents a pattern showing a sequence q followed by t. In this paper, we proposed an algorithm to find sequential patterns from graph databases known as Sequential Graph-Pattern Mining. We proposed SFG(Sequential Frequent Graph pattern mining) algorithm to mine frequent sequences in a graph database. As the number of edges in a graph increases, the number of sequential patterns also increases exponentially. This requires further analysis of frequent sequential graph patterns. To overcome this difficulty, we proposed TCSFG(Top-k Closed Sequential Frequent Graph pattern mining) algorithm to generate top-k closed sequential graph patterns. The rest of the paper is organized as follows. Section II describes the related work in subgraph mining. Section III present problem definition and section IV describe SFG, TCSFG algorithms. Our experimental results and performance analysis are presented in section V and our work is concluded in section VI. II. RELATED WORK Many algorithms exist in the literature for mining frequent graph patterns [1,7,9,16,18]. AGM [1] algorithm efficiently mines frequently appearing induced subgraphs from a graph data database. Experimental results in [1] show that AGM finds all frequent induced subgraphs containing 300 chemical compounds in 40 minutes to 8 days, as the minimum support threshold varied from 20% to 10%. FSG[9] algorithm uses a sparse graph representation which minimizes both storage and computation. The performance of FSG was worse where the number of vertex and edge labels was small. AGM [1] and FSG [9] are Apriori-based algorithms for mining frequent substructures from graph data. Apriori-based algorithms generate subgraph candidates from frequent subgraphs and prunes false positives. These algorithms [1,9] suffer from the problem of subgraph isomorphism test and candidate generation. Candidate generation and pruning false positives are costly. The gSpan [18] was the first algorithm to discover all the frequent subgraphs without candidate generation and it prunes false positives. gSpan discovers frequent substructures without candidate generation. A new lexicographic ordering among graphs was built in [18] and it maps each with a unique minimum DFS code as its canonical label. gSpan finds both frequent subgraphs and frequent induced subgraphs. It [18] uses rightmost extension technique to minimize the generation of the same subgraphs and explores the depth-first search in frequent subgraph mining. Frequent subgraph mining is a challenging issue due to the generation of an exponential number of frequent subgraphs. Closed graph pattern mining was introduced in [16] to mine only closed frequent graph patterns. A graph g is closed if there do not exist any proper super graph of g with the same support as g. CloseGraph [16] mines closed frequent graph patterns using the concepts of equivalent occurrence and early termination to prune the pattern search space. It [16] reduces unnecessary subgraphs and also increases the efficiency of the mining process in the presence of large graph patterns. TSP [12], TFP [4] and CloSpan [17] are the recent work for closed sequential pattern mining. CloSpan is the first algorithm to solve closed sequential pattern mining problem. It [17] mines frequent closed sequential patterns and produces significantly less number of discovered sequences. The CloSpan mines long frequent sequences with low minimum support and without any information loss. Mining top-k frequent closed patterns without minimum support was introduced in [4]. Top-down and bottom-up combined Fp-tree mining strategy was developed in [4] to raise the support and to discover closed frequent patterns. TSP algorithm mines top-k frequent closed sequential patterns of length no less than min_len, the minimum length of each pattern. It [12] doesn’t require the minimum support threshold, it makes use of the length constraint and outperforms the closed sequential pattern mining algorithm that make use of minimum support threshold. Several algorithms were proposed previously ranging from mining graph patterns with constraints [2,11,19] and without constraints [6,18] for mining closed graph patterns [16]. Frequent graph-pattern mining may result in a large number of patterns. To make the mining more effective, constraints are often used to confine the pattern search. In gPrune algorithm[2] the frequent graph patterns are generated based on the constraints such as density and diameter. The Density of a graph is defined as the ratio of the edge set to vertex set. The diameter of a graph is the maximum length of the shortest path over all pairs of vertices. The problem of incorporating structural constraints in mining frequent graph patterns was solved in gPrune. Finding closed frequent graphs with connectivity constraints in relational graphs was introduced in [19]. Two approaches, namely CLOSECUT and SPLAT were proposed to speed up the mining process. The former was pattern growth approach and it has better performance on patterns with high support and low connectivity, later approach was pattern reduction approach and it achieves better performance for the high connectivity constraints. These [19] algorithms mines closed frequent graphs with connectivity constraints in relational graphs. DS-search algorithm [11] mines frequent subgraphs of limited diameter and symmetry. It [11] employs the tree decomposition structure of database graphs during the mining process and generates more structurally interesting patterns in the database. Recently two sequential techniques[3,10] were developed to solve the problem of graph-coloring. Graph-coloring concept [10 ] was used in processor allocation to represent the busy processor and the busy processors are mapped in the graph using different colors. The authors in [3] investigated the problem of Star coloring and they used DNA sequence to construct a solution space for the star coloring problem Two algorithms RP-FP and RP-GD were proposed in