Data Mining: Cluster Analysis — Undergraduate Thesis, Foreign Literature Translation and Original Text
Sample Thesis Proposal: Rationale for the Topic
Sample 1: Rationale. With the development of science and technology, computer, network, and database technologies have come into wide use in day-to-day management, and every industry has accumulated large volumes of data; simple storage and query operations on these databases can no longer meet users' needs.
People need to obtain the more important information behind this mass of data, such as overall descriptive characteristics of the data, associations between events, and predictions of future trends.
Data mining — mining knowledge from data — is the process of extracting hidden, previously unknown, and potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data.
Terms related to data mining include knowledge discovery in databases (KDD), data analysis, knowledge extraction, pattern analysis, information harvesting, data fusion, and decision support.
Data mining can not only query past data but also predict future trends and behavior, automatically detecting previously undiscovered patterns.
University management of teaching and research involves large amounts of data on teachers' instruction, research activities, and teaching quality.
Making full use of data mining technology allows administrators to keep track of teaching conditions, analyze the relationship between teaching and research, and detect anomalies in both, thereby making reform of teaching and its administration better targeted and improving the efficiency and quality of management work.
Through this project, students gain a deeper understanding of the concepts of data mining and, combining data collection, data cleaning, data normalization, association-rule mining, decision trees, and systems analysis and design, can scientifically analyze the latent associations among university teaching and research management data, course scheduling, and instruction, and make predictions from them.
Writing the thesis familiarizes students with the structure of a research paper and with data mining algorithms and their application to undergraduate course data, strengthening their ability to solve practical problems independently.
Research objective: Using data mining (Data Mining) techniques — association-rule mining, decision trees, and clustering — on the college's existing records of students' four-year course work, this project will analyze the college's student learning data, perform association analysis on the courses taken over the four years, and mine the educational data for previously unknown information useful to the college's administrative departments; it will also use the existing data for timely association analysis and prediction, providing decision support for future adjustments to the college's curriculum.
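As a rough sketch of the kind of association analysis proposed here, the following counts course co-occurrences across transcripts and keeps the pairs whose support exceeds a threshold. The course names, records, and threshold are invented for illustration; a real study would use the college's own data and a full Apriori-style algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transcripts: each row is the set of courses one student took.
transcripts = [
    {"calculus", "linear_algebra", "programming"},
    {"calculus", "linear_algebra", "statistics"},
    {"programming", "data_structures", "calculus"},
    {"linear_algebra", "statistics", "calculus"},
]

def frequent_pairs(rows, min_support):
    """Count course pairs and keep those whose support meets the threshold."""
    counts = Counter()
    for row in rows:
        for pair in combinations(sorted(row), 2):
            counts[pair] += 1
    n = len(rows)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transcripts, min_support=0.5)
```

Frequent pairs like (calculus, linear_algebra) would then be candidates for association rules about how courses cluster in students' study paths.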
Data Mining / Cluster Analysis. Abstract: Drawing on an analysis of data mining technology, this paper discusses road traffic flow distribution patterns based on data mining, and concludes with an experiment and its results.
Keywords: data mining; cluster analysis; traffic flow
Road Traffic Flow Distribution Mode Research Based on Data Mining. Chen Yuan (Hunan Vocational and Technical College, Changsha 410004, China). Abstract: Combined with the analysis of data mining technology, the distribution model of traffic flow is discussed, and an experiment is carried out and its related conclusions are made in this paper. Keywords: data mining; clustering analysis; traffic flow
Traffic flows at different locations in a road network exhibit different spatial distribution patterns: "line" patterns are typical of urban arterial roads, while "area" patterns appear mainly in busy districts.
This paper designs a spatial clustering algorithm for road traffic flow to mine traffic flow distribution patterns; experiments on both real and simulated data show that the spanbre algorithm performs well.
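The spanbre algorithm itself is not reproduced here. As a hedged illustration of the general idea — density-based spatial clustering of traffic observations — here is a minimal DBSCAN-style sketch. The coordinates, eps, and min_pts are made up, and real traffic data would need road-network distances rather than straight-line ones.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal density-based clustering: label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not dense enough: provisional noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                # grow the cluster from dense points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)
        cluster += 1
    return labels

# Two dense "road segments" plus one isolated reading.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=2.0, min_pts=2)
```

The two tight groups come out as separate clusters and the isolated point is flagged as noise, which mirrors the "line" vs. isolated-reading distinction in traffic data.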
Data mining, also called knowledge discovery in databases, is the process of extracting implicit, previously unknown, and potentially valuable information and knowledge from large volumes of real application data that are random, fuzzy, and subject to noise.
Data mining is not an isolated concept; it involves many fields and methods, including artificial intelligence, statistics, visualization, and parallel computing.
Data mining can be classified in many ways; by mining task, it divides into model discovery, clustering, association-rule discovery, sequence analysis, deviation analysis, data visualization, and other types.
Clustering Methods in Data Mining and Their Applications

Abstract: Cluster analysis (or clustering) is an unsupervised statistical technique for grouping a set of objects so that objects in the same group (called a cluster) are more similar (or related) to each other than to those in other groups (clusters). The technique is widely used in data mining, machine learning, and other domains for applications such as market segmentation, customer profiling, image processing, and bioinformatics. In this paper, we discuss the basic concepts, algorithms, and applications of cluster analysis in data mining.

Keywords: cluster analysis, clustering, unsupervised learning, data mining, machine learning.

Introduction

Cluster analysis is a fundamental method in data mining for exploring, discovering, and understanding the natural structure of data. Its basic goal is to group similar (or related) objects together and separate dissimilar (or unrelated) objects from one another, based on some distance or similarity measure. The similarity measure may be based on various attributes or variables of the objects, such as their numerical, categorical, or textual values. Cluster analysis is an unsupervised learning technique: it does not require any prior knowledge or labeling of the data. Instead, it relies on the inherent patterns and relationships among the data points to derive meaningful clusters. To this end, various clustering algorithms have been developed, each with strengths and weaknesses that depend on the characteristics of the data and the objectives of the analysis. In this paper, we provide an overview of the main types of clustering algorithms, their advantages and limitations, and some applications of cluster analysis in various domains.

Types of Clustering Algorithms

There are several types of clustering algorithms, depending on the assumptions, criteria, and methods used for clustering the data. Some of the main types are:

1. Partitioning methods: These methods divide the data into k non-overlapping clusters, where k is a predefined number of clusters. The best-known partitioning algorithm is k-means, which iteratively assigns each data point to the nearest centroid (or mean) of a cluster and updates the centroids until convergence.
2. Hierarchical methods: These methods create a tree-like structure (a dendrogram) representing the nested clusters of the data at multiple levels of granularity. The two main types are agglomerative (bottom-up) and divisive (top-down) clustering. The former starts with each data point as its own cluster and merges the closest pairs of clusters until all data points belong to a single cluster; the latter starts with the entire dataset as one cluster and recursively splits it until each cluster contains only one data point.
3. Density-based methods: These methods use the local density of the data points to identify clusters, rather than assuming a fixed number of clusters or a hierarchical structure. The most popular density-based algorithm is DBSCAN, which defines a neighborhood around each data point and groups the points whose density exceeds a threshold.
4. Model-based methods: These methods assume the data points are generated from a probabilistic model or a mixture of models, and use Bayesian inference or maximum-likelihood estimation to estimate the model parameters and assign the data points to clusters. The most common tools here are the Gaussian mixture model (GMM), the Bayesian information criterion (BIC), and the expectation-maximization (EM) algorithm.

Advantages and Limitations of Clustering Algorithms

Each type of clustering algorithm has its advantages and limitations, depending on the nature of the data, the objectives of the analysis, and the computational resources available. Some of the main points are:

1. Partitioning methods are easy to implement and efficient for large datasets, but they may converge to local optima and require the number of clusters to be known in advance.
2. Hierarchical methods provide a complete picture of the cluster structure at various levels of granularity, but they are computationally expensive and may produce clusters of varying shapes and sizes.
3. Density-based methods can handle noise and outliers in the data, but they may not work well for datasets whose clusters differ in density or shape.
4. Model-based methods can capture the underlying generative process of the data and provide probabilistic estimates of cluster membership, but they rest on strong model assumptions and may be unsuitable for high-dimensional, non-linear data.

Applications of Clustering Algorithms

Cluster analysis has numerous applications in various domains, ranging from business to science to engineering. Some of the main applications are:

1. Market segmentation: clustering can segment customers by demographics, behavior, or preferences, so they can be targeted with personalized marketing strategies.
2. Image processing: clustering can group pixels or regions of an image by color, texture, or shape to extract features or objects of interest.
3. Bioinformatics: clustering can classify genes or proteins by expression pattern, sequence, or function to identify biomarkers or drug targets.
4. Social network analysis: clustering can detect communities or groups of users from their interactions or interests, supporting analysis of the network's structure and dynamics.
5. Anomaly detection: clustering can flag outliers or anomalous events in the data, so users can be alerted or corrective action taken.

Conclusion

Cluster analysis is a powerful and flexible data mining technique that can reveal the natural structure of data and support applications in business, science, and engineering. The choice of clustering algorithm depends on the nature of the data, the objectives of the analysis, and the computational resources available. To obtain reliable and meaningful clusters, it is important to preprocess the data, choose an appropriate distance or similarity measure, and validate the clustering results.
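The partitioning method described above — k-means — can be sketched in a few lines of plain Python. The toy 2-D points are invented; a production implementation would add better initialization (e.g. k-means++) and handle larger data.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # naive initialization from the data
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        new = [tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                   # centroids stable: converged
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
```

On these two well-separated groups the algorithm converges in a handful of iterations, illustrating both the strength (simplicity, speed) and the limitation (k must be chosen up front) noted above.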
The Application of Clustering to the Analysis of Student Grades

[Abstract] Data mining is one of the hot topics in information technology research.
Data mining is now widely applied in business, finance, and other areas but less so in education. As university enrollment expands, the distribution of student grades grows more complex; beyond the conclusions of traditional grade analysis, the data hide information that is not easy to spot, so introducing data mining into the analysis of student grades helps target improvements in teaching quality.
Cluster analysis is an important research area within data mining.
It divides data objects into clusters so that objects within the same cluster are similar to one another while objects in different clusters differ greatly.
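The defining property just stated — small distances within a cluster, large distances between clusters — can be checked numerically. The points below are invented toy data.

```python
import math

# Two hand-made clusters of 2-D points.
c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c2 = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

def avg_dist(ps, qs):
    """Average Euclidean distance over all distinct point pairs drawn from ps and qs."""
    pairs = [(p, q) for p in ps for q in qs if p != q]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

intra = avg_dist(c1, c1)   # within-cluster average distance
inter = avg_dist(c1, c2)   # between-cluster average distance
```

A good clustering of this data would report intra well below inter; comparisons like this underlie cluster-validity measures.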
This thesis applies cluster analysis from data mining to student grades, using students' scores in the main subjects taken before choosing a major, and performs data selection, preprocessing, and mining on them.
A clustering algorithm is used to analyze which majors each group of students is strong or weak in, providing students with different grade profiles some guidance on choosing a major and on how to study after the choice is made.
[Keywords] data mining; clustering; student grades; K-means

The Application of Cluster Technology in Analysis for Students' Achievement

Abstract: The technology of data mining is one of the hot issues in the information technology field. Nowadays data mining technology is widely used in business and finance, but it is less used in the education field. With the increase of enrollment in universities, there are more and more students on campus, which makes the distribution of students' records more and more complex. Besides some conclusions from traditional record analysis, a lot of potential information goes undiscovered. Importing data mining technology into the analysis of students' records makes it more convenient and improves teaching quality. Clustering analysis is an important research field in data mining: it partitions data objects into groups so that objects are similar within the same cluster and different across clusters. In this paper, clustering techniques from data mining are applied to the analysis of student performance, using the grade structure of the main subjects taken before students choose a major, with data selection, pretreatment, and mining. Clustering is used to analyze which majors students are strong in, so as to offer students with different grade profiles some reference opinions on choosing a major and on how to study after the choice is made.

Key words: data mining; clustering technology; students' achievement; k-means

Contents: Introduction; 1 Overview (1.1 Background, 1.2 Current state of development, 1.3 Significance, 1.4 Contents of this thesis); 2 Data mining theory (2.1 Overview of data mining: 2.1.1 Definition, 2.1.2 Process; 2.2 Cluster analysis: 2.2.1 Overview, 2.2.2 Principles and methods, 2.2.3 Tools); 3 The algorithm (3.1 The K-means algorithm: 3.1.1 Description, 3.1.2 Characteristics); 4 Applying cluster analysis (4.1 Implementation: 4.1.1 Data preparation, 4.1.2 Data preprocessing, 4.1.3 Applying the algorithm; 4.2 Analysis of results: 4.2.1 Clustering results, 4.2.2 Analysis; 4.3 Conclusions); Summary; Acknowledgements; References; Foreign literature; Chinese translation.

Introduction. In the administration of university student records, many factors influence students' academic performance, so a comprehensive analysis is required.
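Before clustering grade vectors, the preprocessing step mentioned above typically standardizes each subject column so that no single subject dominates the distance calculation. A minimal sketch, with invented grade records (the subject names and values are illustrative only):

```python
from statistics import mean, stdev

# Hypothetical pre-major grade records: (math, physics, english) per student.
grades = [
    (92, 88, 60),
    (85, 90, 65),
    (55, 60, 95),
    (60, 58, 90),
]

def zscore_columns(rows):
    """Standardize each subject column to mean 0 and (sample) standard deviation 1."""
    cols = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

z = zscore_columns(grades)
```

After this transform, the Euclidean distances used by K-means weigh each subject equally, regardless of the original score scales.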
Graduation Project (Thesis) Foreign Literature Translation
Chinese title: Cluster Analysis. English title: Clustering. Source: Data Mining, by Ian H. Witten and Eibe Frank. School (department): . Major: Automation. Class: . Name: ****. Student ID: ****. Supervisor: ******. Date of translation: 2017.02.14.

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in which the two are different. Many definitions of clusters have been proposed:

● Set of like elements. Elements from different clusters are not alike.
● The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward. As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home: homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house. Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics.
Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns. When clustering is applied to a real-world database, many interesting problems occur:

● Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced into some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
● Dynamic data in the database implies that cluster membership may change over time.
● Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. With clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation to each cluster.
● There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
● Another related issue is what data should be used for clustering.
Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning. We can then summarize some basic features of clustering (as opposed to classification):

● The (best) number of clusters is not known.
● There may not be any a priori knowledge concerning the clusters.
● Cluster results are dynamic.

The clustering problem is stated in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster K_j, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss of generality, we view the result of solving a clustering problem as a set of clusters K = {K_1, K_2, ..., K_k}.

DEFINITION 5.1. Given a database D = {t_1, t_2, ..., t_n} of tuples and an integer value k, the clustering problem is to define a mapping f: D → {1, ..., k} where each t_i is assigned to one cluster K_j, 1 ≤ j ≤ k. A cluster K_j contains precisely those tuples mapped to it; that is, K_j = {t_i | f(t_i) = K_j, 1 ≤ i ≤ n, and t_i ∈ D}.

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created.
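Definition 5.1 can be restated in a few lines of code: a clustering is just a map f from the database D onto cluster indices, and each cluster K_j collects exactly the tuples mapped to j. The tuple names and assignments below are placeholders, not data from the text.

```python
# A toy "database" of tuples and a clustering map f: D -> {1, ..., k}.
D = ["t1", "t2", "t3", "t4", "t5"]
k = 2
f = {"t1": 1, "t2": 1, "t3": 2, "t4": 2, "t5": 1}

# K_j = the set of tuples mapped to cluster j (Definition 5.1).
K = {j: {t for t in D if f[t] == j} for j in range(1, k + 1)}
```

Because f assigns every tuple to exactly one index, the clusters K_1, ..., K_k form a partition of D — the nonoverlapping case the chapter focuses on.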
Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures that can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional supervised-learning classification algorithms, in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distances between objects. All algorithms we examine in this chapter fall into the intrinsic class. The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive: "agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it is typically more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous.
If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures. We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES

There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is like tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(t_i, t_l), defined between any two tuples t_i, t_l ∈ D. This provides a stricter, alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated in the second definition is a desirable, although not always obtainable, property. A distance measure, dis(t_i, t_j), as opposed to similarity, is often used in
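To make the sim and dis notation concrete, one common concrete choice is Euclidean distance, with a similarity derived from it. The 1/(1 + d) transform is one conventional option for turning a distance into a similarity in (0, 1], not a prescription from the text.

```python
import math

def euclidean(t1, t2):
    """Euclidean distance dis(t1, t2) between two numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))

def similarity(t1, t2):
    """One common way to derive a similarity sim(t1, t2) in (0, 1] from a distance."""
    return 1.0 / (1.0 + euclidean(t1, t2))
```

Identical tuples get similarity 1, and similarity falls off smoothly as distance grows, which is exactly the ordering a clustering algorithm needs.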
A Survey of Cluster Analysis in Data Mining

Abstract: Clustering in data mining is an unsupervised classification technique.
This paper surveys the data structures and data types used by cluster analysis algorithms, analyzes the significance of cluster analysis and the current state of research, compares the strengths and weaknesses of several clustering algorithms, and, with reference to applications in the communications field, points out the clear advantages of K-means clustering.
Abstract: The clustering technology in data mining is a kind of unsupervised classification technique. The paper analyses the data structures and data types of cluster analysis algorithms, the significance and recent research of cluster analysis, compares the advantages and disadvantages of several kinds of clustering algorithms, and points out the absolute advantages of K-means clustering technology combined with its application in the communication field.
Keywords: data mining; cluster analysis; K-means algorithm
Key words: data mining; clustering analysis; K-Means algorithm
CLC number: TP274. Document code: A. Article ID: 1006-4311(2014)15-0226-02.

0 Introduction

Data mining, also called knowledge discovery in databases (KDD) [1], is the process of extracting previously unknown, implicit, and useful knowledge and information from large volumes of real-world data that are incomplete and noisy.
Data mining is often used by enterprise decision-makers: by mining the potentially valuable information hidden in the large amounts of data an enterprise stores, it helps managers make sound decisions and creates more value for the enterprise.