SEQUEST mining frequent subsequences using DMA Strips

格式：pdf
大小：325.55 KB
文档页数：11

下载文档原格式

/ 11

孤立森林剔除异常值

孤立森林剔除异常值摘要：1.孤立森林的概念2.孤立森林的作用3.孤立森林剔除异常值的方法4.孤立森林剔除异常值的应用实例5.结论正文：1.孤立森林的概念孤立森林（Isolation Forest）是一种基于树的异常检测算法。

它通过构建一个树结构，将数据集中的各个数据点作为树的叶子节点，然后将这些节点分为不同的类别。

在这个过程中，孤立森林能够自动识别出数据集中的异常值。

2.孤立森林的作用孤立森林的主要作用是检测数据集中的异常值。

异常值是指那些与大多数数据点不同的数据点，它们可能是由于数据收集过程中的误差、数据污染或者数据集中固有的特性等原因造成的。

孤立森林能够有效地识别出这些异常值，从而为数据分析和处理提供更为准确的结果。

3.孤立森林剔除异常值的方法孤立森林剔除异常值的方法主要包括以下两个步骤：（1）构建树结构：首先，孤立森林算法会根据数据集中的各个数据点构建一个树结构。

这个树结构通常是一个决策树，它将数据点分为不同的叶子节点。

（2）计算异常值：在构建好树结构之后，孤立森林算法会根据叶子节点的密度来计算异常值。

具体来说，它将叶子节点的密度作为异常度的度量，密度较低的叶子节点对应的数据点被认为是异常值。

4.孤立森林剔除异常值的应用实例孤立森林剔除异常值的方法在很多领域都有广泛的应用，例如金融、医疗、物联网等。

以金融领域为例，银行在进行信用风险评估时，可能会遇到一些异常值，如欺诈行为等。

通过使用孤立森林算法，银行可以有效地识别出这些异常值，从而降低信用风险。

5.结论孤立森林是一种有效的异常检测算法，它通过构建树结构来识别数据集中的异常值。

sequential-feature-selector

sequential-feature-selectorsequential-feature-selector是一个Python库，它提供了一种基于特征重要性的特征选择方法。

该方法通过逐步选择最重要的特征来构建一个决策树，以避免过度拟合和欠拟合的问题。

使用sequential-feature-selector库，您可以按照以下步骤进行特征选择：导入库和数据集：pythonfrom sklearn.datasets import load_irisfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.feature_selection import SequentialFeatureSelectorX, y = load_iris(return_X_y=True)创建随机森林分类器：pythonclf = RandomForestClassifier()创建SequentialFeatureSelector对象，并指定要选择的特征数量：pythonsfs = SequentialFeatureSelector(clf, n_features_to_select=2)使用fit_transform方法来训练模型并选择特征：pythonsfs.fit_transform(X, y)查看选择的特征：pythonprint(sfs.get_support())查看特征重要性排序：pythonprint(sfs.ranking_)使用sequential-feature-selector库，您可以轻松地进行特征选择，并提高模型的泛化能力和预测性能。

随机森林递归特征消除步骤

随机森林递归特征消除步骤
随机森林递归特征消除（Recursive Feature Elimination with Random Forest, RFE-RF）是一种特征选择的方法，通过随机
森林模型来评估特征的重要性并进行特征消除。

以下是RFE-RF的步骤：
1. 导入必要的库和数据集：首先，导入所需的Python库（如scikit-learn）以及要应用RFE-RF的数据集。

2. 初始化随机森林模型：创建一个随机森林模型对象，并设置模型的参数（如树的数量、叶节点最少样本数等）。

3. 执行RFE-RF：使用RFE-RF算法来选择最重要的特征。

RFE-RF是一个递归的过程，它首先使用随机森林模型对所有
特征进行训练，并计算特征的重要性。

然后，它移除相对不重要的特征，并重复该过程，直到达到设定的特征数量。

4. 评估特征的重要性：计算每个特征的重要性得分，以决定是否保留该特征。

通常，重要性得分越高的特征越重要。

5. 消除特征：根据特征的重要性得分，可选择保留最常见的特征，或者通过设置阈值来保留重要性得分高于某个阈值的特征。

可以根据业务需求进行相应的调整。

6. 重新训练模型：使用经过特征选择后的数据集重新训练随机森林模型，并进行进一步的评估和验证。

7. 查看结果：根据实际需求和模型评估结果，观察选择后的特征对模型性能的影响。

总结来说，RFE-RF算法通过递归地使用随机森林模型来评估特征的重要性，并根据重要性得分来选择和消除特征。

这种方法可以减少数据集中的噪声特征，提高模型的预测性能，并且可根据具体需求进行灵活调整。

javascript中的previousSibling和nextSibling的正确用法

javascript中的previousSibling和nextSibling的正确⽤法我做的时间的验证，格式是不需要验证的，只需要验证起始⽇期与结束⽇期的⼤⼩，但是因为输⼊页⾯是批量的，⽽且每⼀⾏⼜是⾃动⽣成的，这样就不能⽤id来作为参数，只能⽤节点。

这就给验证增加了难度。

以下是jsp页⾯的部分：<td><input id="warrantyStartDateStr" name="warrantyStartDateStr"class="toolbar_button_input_80" type="Text" onClick="WdatePicker()"></td><td><input id="warrantyEndDateStr" name="warrantyEndDateStr" class="toolbar_button_input_80" type="Text" onClick="WdatePicker()" onBlur="checkDateOne(this)"></td>⽽我的js函数最终是这样写的：js函数的⽬的就是不通过id，却能够获得两个input的value,即起始时间和结束时间，然后⽐较两个时间的⼤⼩。

孤立森林剔除异常值

孤立森林剔除异常值孤立森林（Isolation Forest）是一种用于异常检测的无监督学习算法。

它基于以下观察：异常样本在特征空间中通常比正常样本更少，并且异常样本更容易被孤立。

因此，孤立森林利用一个随机生成的树结构来识别和孤立异常样本。

在本文中，我们将介绍孤立森林算法的工作原理，并说明如何使用它来剔除异常值。

孤立森林算法原理孤立森林算法是一种基于树的算法，它通过将样本划分为树的分支来识别异常值。

算法的基本步骤如下：1.随机选择一个特征和一个切割点，将左子树分支归为“正常”样本，将右子树分支归为“异常”样本。

2.重复步骤1，直到每个叶节点都包含一个样本，或者达到预定的树的深度。

3.通过计算样本在树中的路径长度来评估异常值的程度。

路径长度越短，样本越可能是异常值。

4.构建多个随机树，并对每个样本计算路径长度。

最终，将路径长度的平均值作为样本的异常程度。

使用孤立森林剔除异常值孤立森林不仅可以检测异常值，还可以用于剔除这些异常值。

以下是使用孤立森林剔除异常值的步骤：1.将数据集划分为特征向量。

2.使用孤立森林算法对数据集进行训练，构建一组随机树。

3.对每个样本计算路径长度，并根据路径长度对样本进行排序。

4.根据阈值选择需要剔除的异常样本。

可以根据路径长度的分布来确定阈值，例如选择路径长度在上四分位数以上的样本作为异常值。

5.剔除异常样本，得到一个不包含异常值的数据集。

孤立森林的优缺点孤立森林算法具有以下优点：•有效性：孤立森林算法对高维数据和大规模数据集具有较好的可伸缩性，算法的计算复杂度与样本数量近似线性相关。

•鲁棒性：孤立森林算法对于不同数据分布和异常值类型的检测都能表现良好，不需事先假设异常值的分布。

•无监督性：孤立森林算法无需依赖任何标签信息，可以对无标签的数据进行异常检测。

然而，孤立森林算法也存在一些缺点：•对于高维数据，孤立森林可能会出现精度下降的情况，因为维度的增加会使得正常和异常样本之间的分隔空间更稀缺。

孤立森林剔除异常值

孤立森林剔除异常值引言在数据分析和机器学习的过程中，异常值是一个常见的问题。

异常值可能是由于测量误差、数据录入错误或者真实的极端情况引起的。

在对数据进行分析和建模之前，我们通常需要剔除这些异常值，以确保模型的准确性和稳定性。

孤立森林是一种用于异常检测的无监督学习算法，它可以有效地识别和剔除异常值。

孤立森林算法简介孤立森林算法是由Liu等人在2008年提出的一种异常检测算法。

它基于以下两个假设： 1. 异常值在特征空间中的分布比正常值更稀疏。

2. 异常值可以被较少的分割次数所孤立。

基于这两个假设，孤立森林算法通过构建一棵随机划分的二叉树来识别异常值。

算法的主要步骤如下： 1. 随机选择一个特征，并在该特征的最小值和最大值之间随机选择一个划分点。

2. 根据划分点将样本集分成两个子集，左子树包含比划分点小的样本，右子树包含比划分点大的样本。

3. 递归地对左右子树进行划分，直到满足停止条件（如达到最大深度或子树中的样本数小于预设的阈值）。

4. 重复步骤1~3，构建多棵二叉树。

5. 根据构建的多棵二叉树，计算每个样本在树中的平均路径长度（即从根节点到样本所经过的边数）。

6. 根据样本的平均路径长度，定义一个异常得分，用于衡量样本的异常程度。

孤立森林的优点和适用场景孤立森林算法具有以下几个优点： - 无需依赖训练样本的标签，属于无监督学习算法。

- 算法的时间复杂度较低，可以处理大规模数据集。

- 对于高维数据和非线性关系具有较好的适应性。

孤立森林算法适用于以下场景： 1. 异常检测：孤立森林可以用于检测数据集中的异常值，帮助分析人员发现数据中的潜在问题。

2. 数据预处理：在进行数据分析和建模之前，可以使用孤立森林剔除异常值，以提高模型的准确性和稳定性。

3. 离群点检测：孤立森林可以用于发现数据集中的离群点，帮助分析人员发现数据中的异常模式。

孤立森林算法的步骤详解数据预处理在使用孤立森林算法之前，我们通常需要对数据进行预处理。

孤立森林剔除异常值

孤立森林剔除异常值摘要：1.孤立森林的概念及其作用2.异常值的定义和影响3.孤立森林剔除异常值的方法4.孤立森林剔除异常值的优势和应用场景正文：1.孤立森林的概念及其作用孤立森林（Isolation Forest）是一种基于树的异常检测算法，它是由Du et al.于2008 年提出的。

该算法的主要作用是识别数据集中的异常值，即那些与大多数数据不同的数据点。

通过检测这些异常值，可以帮助我们更好地理解数据，发现潜在的问题，以及提高数据分析的准确性。

2.异常值的定义和影响异常值是指在数据集中与大多数数据不同的数据点，它们的出现可能是由于数据收集过程中的错误、数据本身的特性或者数据中的噪声等因素。

异常值的存在可能会对数据分析和建模产生负面影响，例如影响模型的预测精度、降低数据挖掘的效果等。

因此，在数据分析和建模之前，剔除异常值是非常必要的。

3.孤立森林剔除异常值的方法孤立森林算法基于决策树的原理，通过构建多个树来识别异常值。

具体步骤如下：（1）对数据集进行预处理，如缩放、编码等。

（2）生成多个随机子集，每个子集包含的数据点数量约为数据集的1/3。

（3）对每个子集进行决策树构建，生成一棵孤立树。

（4）根据每个数据点在所有孤立树中的叶子节点的平均深度，计算该数据点的异常得分。

（5）根据异常得分的阈值，判断数据点是否为异常值，并将其剔除。

4.孤立森林剔除异常值的优势和应用场景孤立森林算法具有以下优势：（1）高效性：孤立森林算法的计算复杂度较低，可以快速地识别出异常值。

（2）可靠性：孤立森林算法对数据集的噪声具有较强的鲁棒性，且不容易受到异常值的影响。

（3）可扩展性：孤立森林算法可以很容易地应用于大规模数据集。

isolationforest参数解释

isolationforest参数解释孤立森林（Isolation Forest）是一种用于检测异常值的机器学习算法。

它基于孤立样本的思想，通过构建随机树的方式来发现数据中的异常点。

在孤立森林中，有一些关键的参数需要理解和调整，以便更好地适应不同的数据集和问题。

1. 样本子集大小（样本数量）•参数名称：n_samples。

•解释：表示构建每个随机树时所使用的样本数量。

较小的值通常能够更好地定位异常点，但在大多数情况下，对于常见的问题，可以选择默认值。

如果样本量较小，可以考虑减小此参数。

2. 随机树数量•参数名称：n_estimators。

•解释：表示构建孤立森林时所使用的随机树的数量。

增加树的数量通常会提高算法的性能，但也会增加计算成本。

通常选择一个适度的值，例如100。

3. 子树的最大深度•参数名称：max_depth。

•解释：限制每个随机树的最大深度。

较小的深度可以提高算法的训练速度，但可能会减弱其在异常检测上的性能。

选择适当的深度通常需要通过交叉验证来确定。

4. 特征子集大小•参数名称：max_features。

•解释：表示每个随机树在拆分节点时考虑的特征数量。

通常选择"auto"（即sqrt(n_features)）或"log2"（即log2(n_features)）。

这有助于提高算法的多样性。

5. 采样策略•参数名称：bootstrap。

•解释：表示在构建随机树时是否使用采样。

如果设置为True，则使用自助法（bootstrap sampling）。

在大多数情况下，使用自助法可以提高模型的多样性。

6. 异常值分数•参数名称：contamination。

•解释：表示在数据中假设的异常样本的比例。

通常选择一个较小的值，如0.01，以便更敏感地检测异常值。

7. 随机种子•参数名称：random_state。

•解释：设置随机种子，以便实验的可重复性。

如果需要每次运行得到相同的结果，可以设置此参数。

sequential-feature-selector

sequential-feature-selector sequentialfeatureselector（顺序特征选择器）是一种常用的特征选择算法，用于从给定的特征集合中选择最佳的特征子集。

在本文中，我们将一步一步地回答有关sequentialfeatureselector的问题，介绍其原理、使用方法以及相关注意事项。

首先，让我们从sequentialfeatureselector的原理开始。

一、特征选择的重要性在机器学习和数据分析中，特征选择是一个关键步骤。

它可以帮助我们减少特征空间的维度，去除冗余和不相关的特征，从而提高模型的性能和鲁棒性。

特征选择还可以提高模型可解释性和降低模型复杂性。

二、顺序特征选择器的原理sequentialfeatureselector（以下简称SFS）是一种基于子集搜索的特征选择方法。

它的基本思想是从初始特征集合中逐步选择最佳特征，直到满足某种停止准则为止。

SFS算法包括两个主要步骤：前向搜索和后向搜索。

在前向搜索中，算法从空特征集合开始，逐步添加一个特征，每次选择能够最大化模型性能指标（如准确率、F1分数等）的特征。

在后向搜索中，算法从初始特征集合开始，逐步删除一个特征，每次选择能够最大化模型性能指标的特征。

SFS算法通过反复执行前向和后向搜索，最终找到最佳特征子集。

三、如何使用sequentialfeatureselector使用sequentialfeatureselector算法可以通过以下步骤进行：1. 准备数据集：首先，收集和清洗数据，将其转换为适用于机器学习模型的格式。

确保数据集包含特征向量和目标变量。

2. 设置SFS参数：需要设置一些SFS的参数，如选择的模型类型（分类或回归）、性能指标、最大特征子集大小等。

这些参数将影响特征选择的结果。

3. 拟合模型：使用指定的模型类型对数据进行训练。

可以使用交叉验证等技术来评估模型性能。

4. 运行SFS算法：使用第二步中设置的参数，运行SFS算法来选择最佳的特征子集。

贝叶斯节点使用说明

贝叶斯节点使用说明作者：张青松目录1. 贝叶斯节点 (2)1.1. 朴素贝叶斯分类基本原理 (2)1.2. 贝叶斯节点 (2)2. 贝叶斯设置 (3)2.1. 建立贝叶斯节点的工作流 (3)2.1.1. 设置 (3)3. 贝叶斯分类结果 (4)1.贝叶斯节点贝叶斯节点使用了经典的朴实贝叶斯（NaiveBayes）算法对数据进行分类，其核心思想是贝叶斯公式：对于给出的待分类项，求解在此项出现的条件下各类别出现的概率，取概率最大的类别作为对该项的分类结果。

1.1.朴素贝叶斯分类基本原理朴素贝叶斯正式定义如下：1.设x={a1,a2,…,a m}为一个待分类项，而每个a为x的一个特征属性.2.有类别集合C={y1,y2,…y n,}。

3.计算P(y1|x)，P(y2|x)，…，P(y n|x)。

4.如果P(y k|x)=max⁡{P(y1|x)，P(y2|x)，…，P(y n|x)}，则x∈y k。

针对训练数据中，某个类别下的某个特征划分没有出现时，会令分类器的质量大大降低。

为了解决这个问题，引入了Laplace校准。

其思想就是对每类别下所有划分的计数加1，或者概率加上一个很小的值，这样如果训练样本集数据充分大时，并不会对结果产生影响，并且解决了概率为0的尴尬局面。

1.2.贝叶斯节点在DataStudio中，通过设置面板在输入的训练数据表中，选择某个标称量字段作为分类目标字段以及另一部分列作为参与分类的特征属性，训练朴素贝叶斯模型，然后使用该模型对位置的数据进行分类。

2.贝叶斯设置2.1.建立贝叶斯节点的工作流图1 贝叶斯节点工作流首先为贝叶斯节点连接输入数据。

输入数据中必须包含类型为标称量的字段。

以数据集为例。

2.1.1.设置图2 贝叶斯节点数据选择设置选择数据集中的标称量字段作为分类的目标列，并且在下方表格中选中要作为特征属性参与分类的列。

切换到模型页签，如图。

图3 贝叶斯算法参数设置注意：模型页签中的默认概率表示上文中提到的Laplace校准参数，最大分类个数不能小于分类目标列标称量的个数。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

SEQUEST: mining frequent subsequences using DMA StripsHenry Tan1, Tharam S. Dillon1, Fedja Hadzic1, Elizabeth Chang21Faculty of IT, University of Technology Sydney, Sydney, Australia2School of IT, Curtin University, Perth, AustraliaAbstractSequential patterns exist in data such as DNA string databases, occurrences of recurrent illness, etc. In this study, we are proposing an algorithm, SEQUEST, to mine frequent subsequences from sequential patterns. The challenges of mining very large database of sequences are it is computationally expensive and requires large memory space. SEQUEST uses a Direct Memory Access Strips (DMA-Strips) structure to efficiently generate candidate subsequences. DMA-Strips structure provides direct access to each item to be manipulated and thus is optimized for speed and space performance. In addition, the proposed technique uses a hybrid principle of frequency counting by the vertical join approach and candidate generation by structure guided method. The structure guided method is adapted from the TMG approach used for enumerating subtrees in our previous work [8]. Experiments utilizing very large databases of sequences which compare our technique with the existing technique, PLWAP[4], demonstrate the effectiveness of our proposed technique.Keywords: mining frequent subsequences, sequence, phylogenic tree, sequential strips1IntroductionData mining strives to find frequent patterns in data [1,2,3,4,12]. A task in data mining is characterized by the type of data, structures and patterns to be found [8,9]. Advancements in data collection technologies have contributed to the high volumes of sales data. Sales data typically consists of a transaction timestamp and the list of items bought by customers. If one is interested in inter-transactional patterns, sales data is a good example of a sequential pattern. Other examples ofsequential patterns include DNA sequences in DNA databases and events that are ordered in time in real time databases.There are two different sub problems in mining frequent subsequences, i.e. problems of addressing (1) sequences of items and (2) sequences of itemsets. Addressing sequences of itemsets, i.e. sequences whose elements consist of multi items, is a more difficult problem and poses more challenges. In this paper, we will limit our scope of study to the first problem by comparing our technique, SEQUEST, with the PLWAP algorithm [4]. Nonetheless, SEQUEST using DMA-Strips is designed to tackle both sub problems of mining frequent subsequences. Our contributions are as follows. We develop a DMA-Strips that provides a direct memory access to sequences in the database. One way to achieve efficient processing is to develop an intermediate memory construct that helps us to easily process and manipulate something. We showed this in [9] by developing an efficient intermediate memory construct called Embedding List (EL) [9,10] to efficiently mining frequent embedded subtrees. The problem of using such intermediate memory constructs is that it demands some additional memory storage. For mining a very large database of sequences this is more a trade off than an advantage. DMA-Strips does not store hyperlinks of items, instead it segmented the database of sequences systematically so that it allows a direct access processing and manipulation. This contributes to a better speed and space performance than using an intermediate memory construct like EL that reserves extra storage in memory to store a corpus of hyperlinks of items. The idea of using hyperlinks is described in [9,10]. Nevertheless, DMA-Strips resembles EL in that it allows efficient enumeration using a model or structure guided approach. Previously we developed the Tree Model Guided (TMG) [9,10] candidate generation technique that utilizes the tree model to generate candidate subtrees. A unique property of the structure or model guided approach is that it ensures all enumerated patterns are valid patterns. These patterns are valid in the sense that they exist in the database, and hence by ensuring only valid patterns, no extra work is required to prune invalid patterns. The strategy to tackle mining sequence of items and itemsets using DMA-Strips is outlined in detail in chapter 2. Additionally for efficient frequency counting we use a hybrid method. For counting frequent 1-subsequences we transpose the database from horizontal format to vertical format [12]. We combine an efficient horizontal counting using a hashtable [3,8,9] for counting 2-subsequences and the space efficient vertical join counting approach [3,12] for counting k-subsequences where k≥3. This hybrid approach overcomes a non-linear performance for counting 2-subsequences due to the O(n2) complexity of the join enumeration approach [3,8]. The rest of the paper is organized as follows. Section 2 gives the problem decomposition. Section 3 discusses related works. We examine the proposed technique and present the algorithms in section 4. In section 5, we empirically evaluate the performance and scale-up properties of the proposed technique. We summarize the research in section 6.2Problem StatementsA sequence S consists of ordered elements {e1,…, e n}. Sometimes elements of a sequence are referred to as events [12]. Each element in a sequence can be eitheran atomic or a complex type. For a sequence with atomic elements there is one entity or item per element. If the element is of a complex type, each element can have multiple items (itemsets). The itemsets can be ordered or unordered. In this paper we will show how our approach is designed to mine both sequences of ordered itemsets and sequences of ordered items. We refer to sequences of ordered itemsets or items as simply as sequences of itemsets of items.A sequence δ is denoted as e1→e2→…→e n and |e i| refers to the size of itemsets. We refer to the first item of the sequence as its root. In case of sequences of itemsets, the root is the first item of the first itemset in the sequence. Zaki [12] described e1→e2→…→e n to refer to the same notation. A sequence can be identified uniquely through its encoding which is a string of element labels separated by an arrow →. In a case of a sequence of items we assume that each even e i as an atomic item. Encoding of a sequence α is denoted by φ(α). An element with j items is denoted by e:{λ1,…, λj} where λi is a label of item i. A label of an element with j items, denoted by λ(e), is a concatenation of λ1+…+ λj. Say a sequence α:{e1,e2} where e1:{A,B} and e2:{C,D}, the label of e1 , λ(e1), is‘AB’ and the label of element e2,λ(e2), is ‘CD’ and the encoding of sequence α,φ(α), is ‘AB→CD’. This labelling function translates equivalently for an element that consists of an atomic item.We normally refer to a sequence by its encoding without quotes. Consider a sequence with n elements. We refer to a sequence withk items as k-sequence, i.e. k equals to the sums of |e i| for i=1,…,n. For example, AB→CD is a 4-sequence and AB→DF→G is a 5-sequence. A subsequence is a subset of a sequence such that the order of elements in the subsequence is the sameas the order of the elements in the sequence. Accordingly, a subsequence with k elements is called a k-subsequence. We say that a subsequence S is a subset of sequence T if the elements of subsequence S are a subset of elements of sequenceT and the ordering of items is preserved (S ≤ T). A→C is an example of subsequence of AB→CD whereas AC is not. A is a subset of AB and C is a subsetof CD whereas AC is not a subset of either AB or CD. We denote that element α isa subset of element β as α⊆β. Further the ordering of elements in a sequence follows a certain ordering function. The most common ordering function is a time function. If an element α occurs at t1 and element β occurs at t2 where t1 < t2, the position of α and β in a sequence is i and j respectively and i < j. A gap constraint [7] is very similar to the notion of level of embedding [9] between two nodes in a tree with an ancestor-descendant relationship. The gap between two elements in a sequence is determined by the gap distance (∆) between two elements at position i and j, i.e. ∆ is a delta between i and j. A subsequence τ is said to be contiguous if there is no gap between every two consecutive elements. In other words, for a contiguous subsequence every two consecutive elements have a gap distance 1. Subsequences that contain any two consecutive elements with ∆ > 1 are called non-contiguous subsequence. Suppose we have a sequence δ:A→BCD→F→GH→IJ→JK. A→B→F and A→F→K are both subsequences of δ. A→B→F is a contiguous subsequence of δ since the gap distance of every twoconsecutive elements is 1, whereas A→F→K is a non-contiguous subsequence of δ. Because the gap distance between A and F in A→F→K is 2 and between F and K is 3. The database D for sequence mining consists of input-sequences. Each input-sequence can contain one to many itemsets. Each input-sequence is identified through a unique sequence-id (sid) whereas each element (itemsets) is identified through a unique event-id (eid). If D contains sales data, the sid can be a customer-id. A common event identifier in transactional databases is a timestamp. We say that a sequence α has a support count γ if there are γ numbers of input-sequences that support it. An input-sequence supports sequence α if it contains at least 1 occurrence of α. Mining frequent subsequences from a database of input-sequences can be formulated as discovering all frequent subsequences whose support count γ is greater or equal to the user specified minimum support threshold σ.3Related WorksEarly in their work, Agrawal [1] introduced the problem of mining sequential patterns over transactions data. In [1] they proposed 2 algorithms for mining sequential data, AprioriAll and AprioriSome. While the problem of market basket analysis is concerned with finding frequent intra-transaction patterns, mining sequential patterns on the other hand is concerned with finding frequent inter-transaction patterns [5]. Further, the data used in mining sequential patterns has an ordered notion. A sequential pattern consists of ordered events or elements. We are in particular interested in finding non-contiguous subsequences [7]. Mining non-contiguous subsequences can be very expensive when data has long patterns. A concept of a constraint can be introduced. In [7] they proposed an improved algorithm GSP that generalizes the scope of sequential pattern mining to include time constraints, sliding time windows, and user-defined taxonomy. Mannila and Toivonen [6] proposed an approach that addresses the problem of sequential mining where each element of the sequences consists of atomic item. Similarly [11] formulated the problem of mining frequent subsequences from genetic sequences where each element of the sequences consists of an atomic item. Recently, Ezeife & Lu [4] developed an efficient technique PLWAP based on the concept of WAPTree that mines frequent subsequences of items. PLWAP is reported to have desirable scalability properties. Additionally it was reported that it outperforms WAPTree and GSP. Others [7,12] formulated the problem of mining frequent subsequences from database of transaction sequences such that each element can comprise of multiple items. Both [7,12] assume unordered itemsets. Zaki [12] developed an efficient algorithm, SPADE, for mining frequent subsequences. SPADE uses a vertical id-list database format and a lattice-theoretic approach to decompose the original search space into smaller pieces.4SEQUEST AlgorithmDatabase scanning.SEQUEST scans the database S db twice and generate a frequent database that contains only frequent 1-subsequences, S db’ after the first scan (figure 3 [1]). S db’ is obtained by intersecting S db with L1 (figure 3 [2]), the setof frequent 1-subsequences. The first scan and the second scan are executed in two separate runs. S db’ is used to construct DMA-Strips.Constructing DMA-Strips. A DMA-Strip is constructed as follows (figure 3 [3]). For each item in S db’, an ordered list (strip) is generated which stores a sequence of items’ label, scope and eid. Each strip is uniquely identified by a sequence-id sid. To allow mining sequences of itemsets each item in a strip stores event-id eid and a notion of scope. The eid groups items in a strip based on their timestamps. The scope is used to determine the relationship between two consecutive items in a strip. The scope of an item A is determined by the position of the last item that belongs to the same event. If the preceding item’s position is out of the prior item’s scope this tells us for sequence of itemsets that the two items (say A and B) occur at different time, i.e. A→B. Otherwise A and B occurs at the same time the event generated, i.e. AB. Through the use of the scope DMA-Strips can be used for both mining sequences of itemsets or items.sid [0]:: A→CEF→HJLFigure 1: Pseudo-code of SEQUESTFigure 1 shows an example of a strip of a sequence A→CEF→HJL whose sid is 0. The above sequence consists of 3 different events. AC, EF, and HJL belong to event 0, 1 and 2 respectively. The scope of A is equal to the position of the last item whose eid is the same as A, which is C. Hence, A’s scope is determined to be 1. Similarly, using the same rule, H’s scope is determined to be 6. Each strip can be fetched in O(1) by specifying its offset or its sid.Enumeration of subsequences. The enumeration of subsequences is done by iterating a strip and extending one item at the time to the expanding k-subsequence. We call this a model-guided or structured-guided enumeration as the enumeration is guided by the actual model or structure of the data. Figure 2 illustrates how the enumeration of AC, A→E, and A→H is done through the strip. Given that we have 1-subsequence I at position r and the strip S with size |S|, 2-subsequences are generated by extending I with each item J at position t on its right to the end of the strip such that r < t ≤ |S| (figure 3 [4]). This can be generalized to enumerating (k+1)-subsequences from a k-subsequence (figure 3 [5]). For a k-subsequence it is sufficient to store only a pair of sid and the last item’s position (tail) as its occurrence coordinate. Hence, given an occurrence coordinate of a (k-1)-subsequence its occurrence coordinate will be in a form of (sid,tail). Hence the enumeration of (k+1)-subsequences M using a strip S from a k-subsequence s whose encoding is φ(s), and occurrence coordinate is (p,q), where p is its sid and q is the tail can be formalized as follows. Suppose that C is the set of occurrence coordinates of (k+1)-subsequences, C={(p,t)|q< t ≤|S|}. The encoding of the (k+1)-subsequences is computed by performing a scope test. We define two functions, Ψ(p,q) and λ(p,q) that determine the scope and label of anitem in a strip with sid p and at position q . If t > Ψ(p ,q ) then the new encoding ς =φ(s ) + ‘→’ + λ(p ,t ) otherwise ς = φ + λ(p ,t ) (figure 3 [6]&[7]). For example, fromfigure 2, a strip with sid 0, extending A 0 (A at position 0) with A 2 (A at position 2)generates A →A since 2 > Ψ(0,0) for the given strip. We know that Ψ(0,0) fromfigure 1 is 1. Although a constraint represents an important concept, we don’tparticularly focus on it in this paper. However the current scheme can be extendedeasily to incorporate the gap constraint described in the previous section forexample by restricting the extension using the event-id delta.Frequency Counting. We utilize a hybrid strategy enumerating subsequences byextension and counting by the vertical join approach to make use of the desiredproperty from each technique. Enumeration by extension makes sure that thecandidates generated are all valid. What we mean by a valid candidate is acandidate whose support is greater than zero. The vertical counting approach isused to help us achieve a space efficient and less expensive full pruning strategy.Figure 2: Enumeration examplesWe define two operators inter-join and intra-join (figure 3 [10]&[11]) to tackleboth sub problems of mining sequence of itemsets and items. The inter-joinoperator is used whenever the tail items of two joinable (k-1)-subsequences arefrom different events, whereas intra-join operator is used if they are from the sameevent. Intra-join will join any two occurrence coordinates whose sid and eid areequal. Inter-join will join any two occurrence coordinates whose sid are equal butthe prior eid is less than the preceeding eid . If the gap constraint is to beintroduced, it will be to define the inter-join operator by constraining the gapbetween the two eid. Space efficiency is achieved through the vertical joinapproach by freeing the memory sooner than the horizontal counting approachdoes. The details of how the less expensive full pruning is achieved through thevertical join approach is discussed below.Pruning. According to Apriori theory a pattern is frequent if and only if all itssubsets are frequent. To make sure that all generated subsequenc`s do not containinfrequent subsequences, full (k-1) pruning must be performed. Full (k-1) pruningor full pruning implies that at most (k-1) numbers of (k-1)-subsequences need tobe generated from the currently expanding k-subsequences. This process isexpensive. Using the join approach principle, any two (k-1)-subsequence that areobtained from a k-subsequence by removing one node at a time can be joined toform the k-subsequence back. This implies that we could accelerate the fullpruning by only doing root and tail pruning (figure 3 [8]&[9]). Root pruning isdone by removing the root node, i.e. the first node of the sequence and checking ifthe pattern has been generated previously. Similarly, tail pruning is done byremoving the tail node. Since we use a hashtable to perform frequency countingthis can be done by simply checking if the pattern exists in the the k-1 hashtable. Ifit does not exist we can safely prune the generated k-subsequences. However, it isnot completely safe to assume that if both root-pruned and tail-pruned (k-1)-A →HACsubsequences exist means the generated k-subsequences must be frequent. At least,the root and tail pruning will do the partial check and the final check is done byjoining the occurrence coordinates of the root-pruned and tail-pruned (k-1)-subsequences which is done as part of the support counting. This scheme can onlybe done using vertical support counting since vertical support counting counts thesupport of a subsequence vertically, i.e. the result is known immediately.Horizontal counting, on the other hand, increments the support of a subsequenceby one at a time and the final support of a subsequence is not known until thedatabase is traversed completely.Sdb:SequenceDatabaseLk:Frequentk-subsequencesL1= Scan-S db[1] S db’= Intersect(S db∩L1) [2]DMA-STRIPS = Construct-DMA-STRIPS(S db’)[3]L2 = Generate-2-Subsequences(DMA-STRIPS) [4]k=3while(L k > 0)L k= Generate-k-Subsequences(L k-1)[5]k++Generate-k-Subsequences(L k-1):foreach (k-1)-sequence s in L k-1{foreach occurrence-coordinate oc(sid,r) in s{for t=r+1 to |DMA-STRIPS[sid]|{if( t > scope(sid,r) )join = inter-joinφ(s’) = φ(s) + → + label(sid,t) [6]elsejoin = intra-joinφ(s’) = φ(s) + label(sid,t) [7]root-pruned = remove-root(φ(s’))[8]tail-pruned = φ(s’)[9] if(join == intra-join)Intra-Join(root-pruned, tail-pruned) [10]elseInter-Join(root-pruned, tail-pruned) [11]if( InFrequent(φ(s’)) )Prune(φ(s’))elseInsert(φ(s’),L k)Figure 3: Pseudo-code of SEQUEST5Experimental Results & DiscussionsThis section explores the effectiveness of the proposed SEQUEST algorithm, bycomparing it with the PLWAP [4] algorithm that processes sequences of items.The formatting of the datasets used by PLWAP can be found in [4]. SEQUEST, onthe other hand, uses the datasets format that is used by the IBM data generator[1,7]. The minimum support σ is denoted as (Sxx), where xx is shown as theabsolute support count. Experiments were run on Dell Rack Optimized SlimServers (dual 2.8Ghz CPU, 4Gb RAM) running Fedora 4.0 ES Linux. The source code for each algorithm was compiled using GNU g++ version 3.4.3 with –g and –O3 parameters. The datasets chosen for this experiment were previously used in [4], in particular data.ncust_25 (figure 4), data.100K, data.600K, data.ncust_125 (figure 5 & 6(a)), and data.ncust_1000 (figure 6(b)). To obtain the 500K dataset for experiments in figure 5 we cut the last 100K sequences from the data.600K dataset.Figure 4(a) shows the comparison of the algorithms with respect to time, for varying support thresholds. In this experiment we try two schemes for SEQUEST, namely SH (refers to SEQUEST-hybrid) and S (refers to SEQUEST). SH uses a vertical join counting approach only for generating frequent k-sequences where k>3, whereas S uses a horizontal counting approach using a hashtable for all k. The results in figure 4(a) show that SH performs better than S and PLWAP. The performance of SH remains nearly constant for a very small support (σ:2 or 0.001%). The difference between SH and PLWAP performance at σ:2 is in the order of ~800 times. It should be noted as well that the memory usage of SH isFigure 4(b) shows comparison of memory usage between PLWAP, SH and S. We can see that when the horizontal counting approach is used S suffers from memory blow-up and this contributes to S aborting at σ≤135. SH memory profiles on the other hand, show a superior overall performance over the other two, especially when the minimum support is lowered.We perform a scalability test by varying the datasets of different sizes while σ is fixed at 0.005%. In figure 5(a), SH outperformed PLWAP with respect to time in the order of ~250 times faster. In this experiment we try two different methods for generating frequent 2-subsequences. SH uses the structure-guided enumeration approach as described in the previous section, and S-L2J uses a join approach. Figure 6(a) shows that SH has a more linear performance than S-L2J. We observe that if the enumeration for 2-subsequences uses the join approach the time performance is up by roughly n2 if the datasets size is scaled up by n times. On the other hand, it is interesting to see that whenever the join approach is used the memory usage is slashed by almost half.Figure 6(b) shows that SH memory usage is 2x greater than S-L2J. Overall, we witness that all SEQUEST variants, SH and S-L2J have outperformed PLWAP in terms of speed and space performance. Figure 6(a) shows the distribution of candidate subsequences and frequent subsequences generated over different dataset sizes with σ fixed at 0.005%. Figure 6(b) shows the time performance comparison between PLWAP and SH for a dataset of 7.8 million sequences. For this particular dataset, we see that PLWAP outperforms SH by a small margin. However at σ:2500 (0.03%), PLWAP aborted due to a memory blow up problem. At σ:10000 and σ:5000 the number of extracted frequent subsequences is relativelyPLWAP performs well whenever the average length of the extracted subsequences is relatively short. In other words, PLWAP performs well for large sparse databases but when applied to dense databases with long patterns, the performance is significantly reduced. In figure 4 & 5 experiments we have shown that PLWAP performance degrades whenever the support is set at a lower value and the number of frequent patterns extracted is very high. On the other hand, we observe that SH performs well for extracting frequent 1 and 2 subsequences. However the performance started to degrade when generating 3-subsequences. For this very large dataset we see that the bottleneck of SH has shifted from k=2 to k=3. We observe that the number of candidates generated for k=3 is extremely large, i.e.10,137,400. From previous experiment in figure 5(a) we saw a similar property, where counting using the vertical join approach when the number of candidates are very large at k=2, S-L2J, although space efficient, suffered a non-linear performance. From this we can infer that optimal performance can be obtained whenever it is known when to switch from the horizontal to vertical join counting approach. The horizontal counting approach suffers from memory blow-up but is faster whenever the numbers of candidates generated are very large. The vertical counting join approach is space efficient however the performance degrades when the number of candidates to be generated is very large.6Concluding RemarksOverall remarks about the algorithms are as follows. PLWAP is a space efficient technique but it suffers when the support is lowered and the numbers of frequent subsequences extracted are high. SEQUEST, on the other hand, when the horizontal counting approach is used suffers from a memory blow-up problem due to the BFS approach for generating candidate subsequences. It uses two hash-tables and additional space needs to be occupied when the length of the sequences to be enumerated increases. We also show that if SEQUEST uses a vertical join counting approach it performs extremely well for both speed and space performance. If the vertical join counting approach is used, the space can be freed much sooner than if the horizontal counting approach is used. The hybrid method of structure-guided enumeration using DMA-Strips and the vertical join counting approach enables SEQUEST to tackle a very large database and shows a linear scalability performance. However it should be noted that whenever the frequency of extracted subsequences are high the vertical join approach performance could degrade due to the nature of the join approach complexity. Additionally, through using the notion of scope in DMA-Strips, SEQUEST can process sequences of itemsets as well as sequences of items.References[1]Agrawal, R., Srikant, R.: Mining Sequential Patterns. In Proc. IEEE 11thICDE (1995) Vol. 6, No. 10, 3–14[2]Bayardo, R. J., Efficiently Mining Long Patterns from Databases.SIGMOD'98, Seattle, WA, USA (1998)[3]Chi, Y., Nijssen, S., Muntz, R.R., Kok. J.N.: Frequent Subtree Mining AnOverview. Fundamenta Informaticae, Special Issue on Graph & Tree Mining (2005)[4]Ezeife, C.I., Lu. Y.: Mining Web Log sequential Patterns with Position CodedPre-Order Linked WAP-tree. The International Journal of Data Mining and Knowledge Discovery (DMKD) (2005) Vol. 10, No. -, 5-38[5]Feng, L., Dillon, T.S., and Liu, J.: Inter-transactional association rules formulti-dimensional contexts for prediction and their application to studying meteorological data. Data Knowl. Eng. (2001) Vol. 37, No. 1, 85-115.[6]Manilla, H., Toivonen, H.: Discovering generalized episodes using minimaloccurrences. In 2nd Intl. Conf. Knowledge Discovery and Data Mining (1996) [7]Srikant, R., Agrawal, R.: Mining Sequential Patterns: Generalizations andPerformance Improvements. Technical Reports, IBM Research Division, Almaden Research Centre, San Jose, CA (1996)[8]Tan, H., Dillon, T.S., Feng, L., Chang, E., Hadzic, F.: X3-Miner: MiningPatterns from XML Database. In Proc. of Data Mining '05. Skiathos, Greece (2005)[9]Tan, H., Dillon, T.S., Hadzic, F., Feng, L., Chang, E.: IMB3-Miner: MiningInduced/Embedded Subtrees by Constraining the Level of Embedding. In Proc. of PAKDD’06 (2006) Singapore[10]Tan, H., Dillon, T.S., Hadzic, F., Feng, L., Chang, E.: Tree Model GuidedCandidate Generation for Mining Frequent Patterns from XML Documents.ACS TOIS Journal (2005) (Submitted)[11]Wang, J.T-L, Chirn, G.-W, Marr, T.G., Shapiro, B., Shasha, D., Zhang, K.Combinatorial pattern discovery for scientific data: Some preliminary results.In Proc. of the ACM SIGMOD Conference on Management of Data, Mineapolis, May 1994.[12]Zaki, M.J.: SPADE: An Efficient Algorithm for Mining Frequent Sequences.Machine Learning (2000) Vol. 0, 1-31。

SEQUEST mining frequent subsequences using DMA Strips

合集下载

孤立森林剔除异常值

sequential-feature-selector

随机森林递归特征消除步骤

javascript中的previousSibling和nextSibling的正确用法

孤立森林剔除异常值

孤立森林剔除异常值

孤立森林剔除异常值

isolationforest参数解释

sequential-feature-selector

贝叶斯节点使用说明

文档推荐

最新文档