Clustering Large Categorical Data Set via Genetic Algorithms Algoritmi genetici per il ragg
- Format: PDF
- Size: 99.11 KB
- Pages: 4
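The 4-page PDF itself is not reproduced on this page, but its title names a concrete technique: evolving a partition of a categorical data set with a genetic algorithm. As a minimal sketch of that idea, and not the paper's actual method, the snippet below assumes a k-modes-style objective (one cluster label per record as the chromosome, fitness as total Hamming distance to cluster modes, one-point crossover, random-reset mutation); every function name and parameter is illustrative.

```python
import random
from collections import Counter

# Illustrative GA for clustering categorical records (NOT the paper's
# algorithm): a chromosome assigns one of k cluster labels to each record,
# and fitness is the summed Hamming distance of records to their cluster modes.

def cluster_modes(data, labels, k):
    """Column-wise most frequent value per cluster (the cluster 'mode')."""
    modes = []
    for c in range(k):
        members = [row for row, lab in zip(data, labels) if lab == c]
        if not members:                      # empty cluster: reseed randomly
            modes.append(list(random.choice(data)))
        else:
            modes.append([Counter(col).most_common(1)[0][0]
                          for col in zip(*members)])
    return modes

def cost(data, labels, k):
    """Total Hamming distance from each record to its cluster's mode."""
    modes = cluster_modes(data, labels, k)
    return sum(sum(a != b for a, b in zip(row, modes[lab]))
               for row, lab in zip(data, labels))

def evolve(data, k=2, pop_size=30, generations=100, mut_rate=0.05):
    n = len(data)
    pop = [[random.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: cost(data, ind, k))
        survivors = pop[:pop_size // 2]              # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, n)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [random.randrange(k) if random.random() < mut_rate else g
                     for g in child]                 # random-reset mutation
            children.append(child)
        pop = survivors + children
    best = min(pop, key=lambda ind: cost(data, ind, k))
    return best, cost(data, best, k)

if __name__ == "__main__":
    rows = [["red", "small"], ["red", "medium"], ["blue", "large"], ["blue", "large"]]
    print(evolve(rows, k=2))
```

Truncation selection keeps the sketch short; a fuller implementation would more likely use tournament or roulette selection and a k-modes local search, but the skeleton is the same.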
Glossary of terms (English–Chinese)

social networks 社会网络; abductive reasoning 溯因推理; action recognition(行为识别); active learning(主动学习); adaptive systems 自适应系统; adverse drug reactions(药物不良反应); algorithm design and analysis(算法设计与分析); algorithm(算法); artificial intelligence 人工智能; association rule(关联规则); attribute value taxonomy 属性分类规范; autonomous agent 自动代理; autonomous systems 自动系统; background knowledge 背景知识; bayes methods(贝叶斯方法); bayesian inference(贝叶斯推断); bayesian methods(bayes 方法); belief propagation(置信传播); better understanding 内涵理解; big data 大数据; biological network(生物网络); biological sciences(生物科学); biomedical domain 生物医学领域; biomedical research(生物医学研究); biomedical text(生物医学文本); boltzmann machine(玻尔兹曼机); bootstrapping method 拔靴法; case based reasoning 实例推理; causal models 因果模型; citation matching(引文匹配); classification(分类); classification algorithms(分类算法); clustering algorithms(聚类算法); cloud computing(云计算); cluster-based retrieval(聚类检索); clustering(聚类); cognitive science 认知科学; collaborative filtering(协同过滤); collaborative ontology development 联合本体开发; collaborative ontology engineering 联合本体工程; commonsense knowledge 常识; communication networks(通讯网络); community detection(社区发现); complex data(复杂数据); complex dynamical networks(复杂动态网络); complex network(复杂网络); computational biology 计算生物学; computational complexity(计算复杂性); computational intelligence 智能计算; computational modeling(计算模型); computer animation(计算机动画); computer networks(计算机网络); computer science 计算机科学; concept clustering 概念聚类; concept formation 概念形成; concept learning 概念学习; concept map 概念图; concept model 概念模型; concept modelling 概念模型; conceptual model 概念模型; conditional random field(条件随机场模型); conjunctive queries 合取查询; constrained least squares(约束最小二乘); convex programming(凸规划); convolutional neural networks(卷积神经网络); customer relationship management(客户关系管理); data analysis(数据分析); data center(数据中心); data clustering(数据聚类); data compression(数据压缩); data envelopment analysis(数据包络分析); data fusion 数据融合; data generation(数据生成); data handling(数据处理); data hierarchy(数据层次); data integration(数据整合); data integrity 数据完整性; data intensive computing(数据密集型计算); data management 数据管理; data mining 数据挖掘; data model 数据模型; data partitioning 数据划分; data point(数据点); data privacy(数据隐私); data security(数据安全); data stream(数据流); data structure(数据结构); data visualization(数据可视化); data warehouse(数据仓库); data warehousing(数据仓库); database management systems(数据库管理系统); database management(数据库管理); date interlinking 日期互联; date linking 日期链接; decision analysis(决策分析); decision maker 决策者; decision making(决策); decision models 决策模型; decision rule 决策规则; decision support system 决策支持系统; decision tree(决策树); deep belief network(深度信念网络); deep learning(深度学习); default reasoning 默认推理; density estimation(密度估计); design methodology 设计方法论; dimension reduction(降维); dimensionality reduction(降维); directed graph(有向图); disaster management 灾害管理; disastrous event(灾难性事件); discovery(知识发现); dissimilarity(相异性); distributed databases 分布式数据库; distributed query 分布式查询; document clustering(文档聚类); domain experts 领域专家; domain knowledge 领域知识; domain specific language 领域专用语言; dynamic databases(动态数据库); dynamic logic 动态逻辑; dynamic network(动态网络); dynamic system(动态系统); earth mover's distance(EMD 距离); education 教育; efficient algorithm(有效算法); electronic commerce 电子商务; electronic health records(电子健康档案); entity disambiguation 实体消歧; entity recognition(实体识别); entity resolution 实体解析; event detection 事件检测; event extraction 事件抽取; event identification 事件识别; exhaustive indexing 完整索引; expert systems(专家系统); explanation based learning 解释学习; factor graph(因子图); feature extraction(特征提取); feature selection(特征选择); feature space 特征空间; first order logic 一阶逻辑; formal logic 形式逻辑; formal meaning representation 形式意义表示; formal semantics 形式语义; formal specification 形式描述; frame based system 基于框架的系统; frequent itemsets(频繁项目集); frequent pattern(频繁模式); fuzzy clustering(模糊聚类); fuzzy data mining(模糊数据挖掘); fuzzy logic 模糊逻辑; fuzzy set theory(模糊集合论); fuzzy set(模糊集); fuzzy systems 模糊系统; gaussian processes(高斯过程); gene expression data 基因表达数据; gene expression(基因表达); generative model(生成模型); genetic algorithm 遗传算法; genome wide association study(全基因组关联分析); graph classification(图分类); graph clustering(图聚类); graph data(图数据); graph database(图数据库); graph mining(图挖掘); graph partitioning 图划分; graph query 图查询; graph structure(图结构); graph theory(图论); graph visualization(图形可视化); graphical user interfaces(图形用户界面); health care 卫生保健; heterogeneous data source 异构数据源; heterogeneous data(异构数据); heterogeneous database 异构数据库; heterogeneous information network(异构信息网络); heterogeneous network(异构网络); heterogeneous ontology 异构本体; heuristic rule 启发式规则; hidden markov model(隐马尔可夫模型); hierarchical clustering(层次聚类); homogeneous network(同构网络); human centered computing 人机交互技术; human computer interaction 人机交互; human interaction 人机交互; human robot interaction 人机交互; image classification(图像分类); image clustering(图像聚类); image mining(图像挖掘); image reconstruction(图像重建); image retrieval(图像检索); image segmentation(图像分割); inconsistent ontology 本体不一致; incremental learning(增量学习); inductive learning(归纳学习); inference mechanisms(推理机制); inference rule 推理规则; information cascades(信息级联); information diffusion(信息扩散); information extraction 信息提取; information filtering(信息过滤); information integration(信息集成); information network analysis(信息网络分析); information network mining(信息网络挖掘); information network(信息网络); information processing 信息处理; information resource management(信息资源管理); information retrieval models(信息检索模型); information retrieval(信息检索); information science 情报科学; information sources 信息源; information system(信息系统); information technology(信息技术); information visualization(信息可视化); instance matching 实例匹配; intelligent assistant 智能辅助; intelligent systems 智能系统; interaction network(交互网络); interactive visualization(交互式可视化); kernel function(核函数); kernel operator(核算子); keyword search(关键字检索); knowledge reuse 知识再利用; knowledge acquisition 知识获取; knowledge base 知识库; knowledge based system 知识系统; knowledge building 知识建构; knowledge capture 知识获取; knowledge construction 知识建构; knowledge discovery(知识发现); knowledge extraction 知识提取; knowledge fusion 知识融合; knowledge integration 知识集成; knowledge management systems 知识管理系统; knowledge management(知识管理); knowledge model 知识模型; knowledge reasoning 知识推理; knowledge representation(知识表达); knowledge sharing 知识共享; knowledge storage 知识存储; knowledge technology 知识技术; knowledge verification 知识验证; language model(语言模型); language modeling approach(语言模型方法); large graph(大图); life science 生命科学; linear programming(线性规划); link analysis(链接分析); link prediction(链接预测); linked data(关联数据); location based services(基于位置的服务); logic programming 逻辑编程; logical implication 逻辑蕴涵; logistic regression(logistic 回归); machine learning 机器学习; machine translation(机器翻译); management system(管理系统); manifold learning(流形学习); markov chains 马尔可夫链; markov processes(马尔可夫过程); matching function 匹配函数; matrix decomposition(矩阵分解); maximum likelihood estimation(最大似然估计); medical research(医学研究); mixture of gaussians(混合高斯模型); mobile computing(移动计算); multi agent systems 多智能体系统; multimedia 多媒体; natural language processing(自然语言处理); nearest neighbor(近邻); network analysis(网络分析); network formation(组网); network structure(网络结构); network theory(网络理论); network topology(网络拓扑); network visualization(网络可视化); neural networks(神经网络); nonlinear dynamics(非线性动力学); nonmonotonic reasoning 非单调推理; nonnegative matrix factorization(非负矩阵分解); object detection(目标检测); object oriented 面向对象; object recognition(目标识别); online community(网络社区); online social networks(在线社交网络); ontology alignment 本体映射; ontology development 本体开发; ontology engineering 本体工程; ontology evolution 本体演化; ontology extraction 本体抽取; ontology interoperability 本体互用性; ontology language 本体语言; ontology mapping 本体映射; ontology matching 本体匹配; ontology versioning 本体版本; ontology 本体论; open government data 政府公开数据; opinion analysis(舆情分析); opinion mining(意见挖掘); outlier detection(孤立点检测); parallel processing(并行处理); patient care(病人医疗护理); pattern classification(模式分类); pattern matching(模式匹配); pattern mining(模式挖掘); pattern recognition(模式识别); personal data(个人数据); prediction algorithms(预测算法); predictive models(预测模型); privacy preservation(隐私保护); probabilistic logic(概率逻辑); probabilistic model(概率模型); probability distribution(概率分布); project management(项目管理); pruning technique(修剪技术); quality management 质量管理; query expansion(查询扩展); query language(查询语言); query processing(查询处理); query rewrite 查询重写; question answering system 问答系统; random forest(随机森林); random graph(随机图); random processes(随机过程); random walk(随机游走); range query(范围查询); RDF database 资源描述框架数据库; RDF query 资源描述框架查询; RDF repository 资源描述框架存储库; RDF storage 资源描述框架存储; real time(实时); recommender systems(推荐系统); record linkage 记录链接; recurrent neural network(递归神经网络); regression(回归); reinforcement learning(强化学习); relation extraction 关系抽取; relational database 关系数据库; relational learning 关系学习; relevance feedback(相关反馈); resource description framework 资源描述框架; restricted boltzmann machines(受限玻尔兹曼机); retrieval models(检索模型); rough set theory 粗糙集理论; rough set 粗糙集; rule based system 基于规则系统; rule based 基于规则; rule induction(规则归纳); rule learning(规则学习); schema mapping 模式映射; schema matching 模式匹配; scientific domain 科学域; search problems(搜索问题); semantic (web) technology 语义技术; semantic analysis 语义分析; semantic annotation 语义标注; semantic computing 语义计算; semantic integration 语义集成; semantic interpretation 语义解释; semantic model 语义模型; semantic network 语义网络; semantic relatedness 语义相关性; semantic relation learning 语义关系学习; semantic search 语义检索; semantic similarity(语义相似度); semantic web rule language 语义网规则语言; semantic web(语义网); semantic workflow 语义工作流; semi supervised learning(半监督学习); sensor data(传感器数据); sensor networks(传感器网络); sentiment analysis(情感分析); sequential pattern(序列模式); service oriented architecture 面向服务的体系结构; shortest path(最短路径); similar kernel function(相似核函数); similarity measure(相似性度量); similarity relationship(相似关系); similarity search(相似搜索); similarity(相似性); situation aware 情境感知; social behavior(社交行为); social influence(社会影响); social interaction(社交互动); social learning(社会学习); social life networks(社交生活网络); social machine 社交机器; social media(社交媒体); social network analysis(社会网络分析); social network(社交网络); social science(社会科学); social tagging system(社交标签系统); social tagging(社交标签); social web(社交网页); sparse coding(稀疏编码); sparse matrices(稀疏矩阵); sparse representation(稀疏表示); spatial database(空间数据库); spatial reasoning 空间推理; statistical analysis(统计分析); statistical model 统计模型; string matching(串匹配); structural risk minimization(结构风险最小化); structured data 结构化数据; subgraph matching 子图匹配; subspace clustering(子空间聚类); supervised learning(有监督学习); support vector machines(支持向量机); system dynamics(系统动力学); tag recommendation(标签推荐); taxonomy induction 分类体系归纳; temporal logic 时态逻辑; temporal reasoning 时序推理; text analysis(文本分析); text classification(文本分类); text data(文本数据); text mining technique(文本挖掘技术); text mining(文本挖掘); text summarization(文本摘要); thesaurus alignment 同义对齐; time frequency analysis(时频分析); time series analysis(时间序列分析); time series data(时间序列数据); time series(时间序列); topic model(主题模型); topic modeling(主题模型); transfer learning 迁移学习; triple store 三元组存储; uncertainty reasoning 不精确推理; undirected graph(无向图); unified modeling language 统一建模语言; unsupervised learning(无监督学习); upper bound(上界); user behavior(用户行为); user generated content(用户生成内容); utility mining(效用挖掘); visual analytics(可视化分析); visual content(视觉内容); visual representation(视觉表征); visualisation(可视化); visualization technique(可视化技术); visualization tool(可视化工具); web 2.0(网络2.0); web forum(web 论坛); web mining(网络挖掘); web of data 数据网; web ontology language 网络本体语言; web pages(web 页面); web resource 网络资源; web science 万维科学; web search(网络检索); web usage mining(web 使用挖掘); wireless networks 无线网络; world knowledge 世界知识; world wide web(万维网); xml database 可扩展标志语言数据库

Appendix 2: the Data Mining knowledge graph (15 second-level nodes and 93 third-level nodes in total), organized as domain → second-level category → third-level category.
Machine learning vocabulary (English–Chinese)

activation 激活值; activation function 激活函数; additive noise 加性噪声; autoencoder 自编码器; Autoencoders 自编码算法; average firing rate 平均激活率; average sum-of-squares error 均方差; backpropagation 后向传播; basis 基; basis feature vectors 特征基向量; batch gradient ascent 批量梯度上升法; Bayesian regularization method 贝叶斯规则化方法; Bernoulli random variable 伯努利随机变量; bias term 偏置项; binary classification 二元分类; class labels 类型标记; concatenation 级联; conjugate gradient 共轭梯度; contiguous groups 联通区域; convex optimization software 凸优化软件; convolution 卷积; cost function 代价函数; covariance matrix 协方差矩阵; DC component 直流分量; decorrelation 去相关; degeneracy 退化; dimensionality reduction 降维; derivative 导函数; diagonal 对角线; diffusion of gradients 梯度的弥散; eigenvalue 特征值; eigenvector 特征向量; error term 残差; feature matrix 特征矩阵; feature standardization 特征标准化; feedforward architectures 前馈结构算法; feedforward neural network 前馈神经网络; feedforward pass 前馈传导; fine-tuned 微调; first-order feature 一阶特征; forward pass 前向传导; forward propagation 前向传播; Gaussian prior 高斯先验概率; generative model 生成模型; gradient descent 梯度下降; greedy layer-wise training 逐层贪婪训练方法; grouping matrix 分组矩阵; Hadamard product 阿达马乘积; Hessian matrix Hessian 矩阵; hidden layer 隐含层; hidden units 隐藏神经元; hierarchical grouping 层次型分组; higher-order features 更高阶特征; highly non-convex optimization problem 高度非凸的优化问题; histogram 直方图; hyperbolic tangent 双曲正切函数; hypothesis 估值,假设; identity activation function 恒等激励函数; IID 独立同分布; illumination 照明; inactive 抑制; independent component analysis 独立成份分析; input domains 输入域; input layer 输入层; intensity 亮度/灰度; intercept term 截距; KL divergence 相对熵(KL 散度); k-Means K-均值; learning rate 学习速率; least squares 最小二乘法; linear correspondence 线性响应; linear superposition 线性叠加; line-search algorithm 线搜索算法; local mean subtraction 局部均值消减; local optima 局部最优解; logistic regression 逻辑回归; loss function 损失函数; low-pass filtering 低通滤波; magnitude 幅值; MAP 极大后验估计; maximum likelihood estimation 极大似然估计; mean 平均值; MFCC Mel 倒频系数; multi-class classification 多元分类; neural networks 神经网络; neuron 神经元; Newton's method 牛顿法; non-convex function 非凸函数; non-linear feature 非线性特征; norm 范式; norm bounded 有界范数; norm constrained 范数约束; normalization 归一化; numerical roundoff errors 数值舍入误差; numerically checking 数值检验; numerically reliable 数值计算上稳定; object detection 物体检测; objective function 目标函数; off-by-one error 缺位错误; orthogonalization 正交化; output layer 输出层; overall cost function 总体代价函数; over-complete basis 超完备基; over-fitting 过拟合; parts of objects 目标的部件; part-whole decomposition 部分-整体分解; PCA 主元分析; penalty term 惩罚因子; per-example mean subtraction 逐样本均值消减; pooling 池化; pretrain 预训练; principal components analysis 主成份分析; quadratic constraints 二次约束; RBMs 受限 Boltzmann 机; reconstruction based models 基于重构的模型; reconstruction cost 重建代价; reconstruction term 重构项; redundant 冗余; reflection matrix 反射矩阵; regularization 正则化; regularization term 正则化项; rescaling 缩放; robust 鲁棒性; run 行程; second-order feature 二阶特征; sigmoid activation function S型激励函数; significant digits 有效数字; singular value 奇异值; singular vector 奇异向量; smoothed L1 penalty 平滑的L1范数惩罚; smoothed topographic L1 sparsity penalty 平滑地形L1稀疏惩罚函数; smoothing 平滑; Softmax Regression Softmax回归; sorted in decreasing order 降序排列; source features 源特征; sparse autoencoder 稀疏自编码器; sparsity 稀疏性; sparsity parameter 稀疏性参数; sparsity penalty 稀疏惩罚; square function 平方函数; squared-error 方差; stationary 平稳性(不变性); stationary stochastic process 平稳随机过程; step-size 步长值; supervised learning 监督学习; symmetric positive semi-definite matrix 对称半正定矩阵; symmetry breaking 对称失效; tanh function 双曲正切函数; the average activation 平均活跃度; the derivative checking method 梯度验证方法; the empirical distribution 经验分布函数; the energy function 能量函数; the Lagrange dual 拉格朗日对偶函数; the log likelihood 对数似然函数; the pixel intensity value 像素灰度值; the rate of convergence 收敛速度; topographic cost term 拓扑代价项; topographic ordered 拓扑秩序; transformation 变换; translation invariant 平移不变性; trivial answer 平凡解; under-complete basis 不完备基; unrolling 组合扩展; unsupervised learning 无监督学习; variance 方差; vectorized implementation 向量化实现; vectorization 矢量化; visual cortex 视觉皮层; weight decay 权重衰减; weighted average 加权平均值; whitening 白化; zero-mean 均值为零

Letter A: Accumulated error backpropagation 累积误差逆传播; Activation Function 激活函数; Adaptive Resonance Theory/ART 自适应谐振理论; Additive model 加性学习; Adversarial Networks 对抗网络; Affine Layer 仿射层; Affinity matrix 亲和矩阵; Agent 代理/智能体; Algorithm 算法; Alpha-beta pruning α-β剪枝; Anomaly detection 异常检测; Approximation 近似; Area Under ROC Curve/AUC ROC曲线下面积; Artificial General Intelligence/AGI 通用人工智能; Artificial Intelligence/AI 人工智能; Association analysis 关联分析; Attention mechanism 注意力机制; Attribute conditional independence assumption 属性条件独立性假设; Attribute space 属性空间; Attribute value 属性值; Autoencoder 自编码器; Automatic speech recognition 自动语音识别; Automatic summarization 自动摘要; Average gradient 平均梯度; Average-Pooling 平均池化
Letter B: Backpropagation Through Time 通过时间的反向传播; Backpropagation/BP 反向传播; Base learner 基学习器; Base learning algorithm 基学习算法; Batch Normalization/BN 批量归一化; Bayes decision rule 贝叶斯判定准则; Bayes Model Averaging/BMA 贝叶斯模型平均; Bayes optimal classifier 贝叶斯最优分类器; Bayesian decision theory 贝叶斯决策论; Bayesian network 贝叶斯网络; Between-class scatter matrix 类间散度矩阵; Bias 偏置/偏差; Bias-variance decomposition 偏差-方差分解; Bias-Variance Dilemma 偏差–方差困境; Bi-directional Long-Short Term Memory/Bi-LSTM 双向长短期记忆; Binary classification 二分类; Binomial test 二项检验; Bi-partition 二分法; Boltzmann machine 玻尔兹曼机; Bootstrap sampling 自助采样法/可重复采样/有放回采样; Bootstrapping 自助法; Break-Event Point/BEP 平衡点
Letter C: Calibration 校准; Cascade-Correlation 级联相关; Categorical attribute 离散属性; Class-conditional probability 类条件概率; Classification and regression tree/CART 分类与回归树; Classifier 分类器; Class-imbalance 类别不平衡; Closed-form 闭式; Cluster 簇/类/集群; Cluster analysis 聚类分析; Clustering 聚类; Clustering ensemble 聚类集成; Co-adapting 共适应; Coding matrix 编码矩阵; COLT 国际学习理论会议; Committee-based learning 基于委员会的学习; Competitive learning 竞争型学习; Component learner 组件学习器; Comprehensibility 可解释性; Computation Cost 计算成本; Computational Linguistics 计算语言学; Computer vision 计算机视觉; Concept drift 概念漂移; Concept Learning System/CLS 概念学习系统; Conditional entropy 条件熵; Conditional mutual information 条件互信息; Conditional Probability Table/CPT 条件概率表; Conditional random field/CRF 条件随机场; Conditional risk 条件风险; Confidence 置信度; Confusion matrix 混淆矩阵; Connection weight 连接权; Connectionism 连结主义; Consistency 一致性/相合性; Contingency table 列联表; Continuous attribute 连续属性; Convergence 收敛; Conversational agent 会话智能体; Convex quadratic programming 凸二次规划; Convexity 凸性; Convolutional neural network/CNN 卷积神经网络; Co-occurrence 同现; Correlation coefficient 相关系数; Cosine similarity 余弦相似度; Cost curve 成本曲线; Cost Function 成本函数; Cost matrix 成本矩阵; Cost-sensitive 成本敏感; Cross entropy 交叉熵; Cross validation 交叉验证; Crowdsourcing 众包; Curse of dimensionality 维数灾难; Cut point 截断点; Cutting plane algorithm 割平面法
Letter D: Data mining 数据挖掘; Data set 数据集; Decision Boundary 决策边界; Decision stump 决策树桩; Decision tree 决策树/判定树; Deduction 演绎; Deep Belief Network 深度信念网络; Deep Convolutional Generative Adversarial Network/DCGAN 深度卷积生成对抗网络; Deep learning 深度学习; Deep neural network/DNN 深度神经网络; Deep Q-Learning 深度Q学习; Deep Q-Network 深度Q网络; Density estimation 密度估计; Density-based clustering 密度聚类; Differentiable neural computer 可微分神经计算机; Dimensionality reduction algorithm 降维算法; Directed edge 有向边; Disagreement measure 不合度量; Discriminative model 判别模型; Discriminator 判别器; Distance measure 距离度量; Distance metric learning 距离度量学习; Distribution 分布; Divergence 散度; Diversity measure 多样性度量/差异性度量; Domain adaption 领域自适应; Downsampling 下采样; D-separation (Directed separation) 有向分离; Dual problem 对偶问题; Dummy node 哑结点; Dynamic Fusion 动态融合; Dynamic programming 动态规划
Letter E: Eigenvalue decomposition 特征值分解; Embedding 嵌入; Emotional analysis
情绪分析Empirical conditional entropy 经验条件熵Empirical entropy 经验熵Empirical error 经验误差Empirical risk 经验风险End-to-End 端到端Energy-based model 基于能量的模型Ensemble learning 集成学习Ensemble pruning 集成修剪Error Correcting Output Codes/ECOC 纠错输出码Error rate 错误率Error-ambiguity decomposition 误差-分歧分解Euclidean distance 欧⽒距离Evolutionary computation 演化计算Expectation-Maximization 期望最⼤化Expected loss 期望损失Exploding Gradient Problem 梯度爆炸问题Exponential loss function 指数损失函数Extreme Learning Machine/ELM 超限学习机Letter FFactorization 因⼦分解False negative 假负类False positive 假正类False Positive Rate/FPR 假正例率Feature engineering 特征⼯程Feature selection 特征选择Feature vector 特征向量Featured Learning 特征学习Feedforward Neural Networks/FNN 前馈神经⽹络Fine-tuning 微调Flipping output 翻转法Fluctuation 震荡Forward stagewise algorithm 前向分步算法Frequentist 频率主义学派Full-rank matrix 满秩矩阵Functional neuron 功能神经元Letter GGain ratio 增益率Game theory 博弈论Gaussian kernel function ⾼斯核函数Gaussian Mixture Model ⾼斯混合模型General Problem Solving 通⽤问题求解Generalization 泛化Generalization error 泛化误差Generalization error bound 泛化误差上界Generalized Lagrange function ⼴义拉格朗⽇函数Generalized linear model ⼴义线性模型Generalized Rayleigh quotient ⼴义瑞利商Generative Adversarial Networks/GAN ⽣成对抗⽹络Generative Model ⽣成模型Generator ⽣成器Genetic Algorithm/GA 遗传算法Gibbs sampling 吉布斯采样Gini index 基尼指数Global minimum 全局最⼩Global Optimization 全局优化Gradient boosting 梯度提升Gradient Descent 梯度下降Graph theory 图论Ground-truth 真相/真实Letter HHard margin 硬间隔Hard voting 硬投票Harmonic mean 调和平均Hesse matrix 海塞矩阵Hidden dynamic model 隐动态模型Hidden layer 隐藏层Hidden Markov Model/HMM 隐马尔可夫模型Hierarchical clustering 层次聚类Hilbert space 希尔伯特空间Hinge loss function 合页损失函数Hold-out 留出法Homogeneous 同质Hybrid computing 混合计算Hyperparameter 超参数Hypothesis 假设Hypothesis test 假设验证Letter IICML 国际机器学习会议Improved iterative scaling/IIS 改进的迭代尺度法Incremental learning 增量学习Independent and identically distributed/i.i.d. 
独⽴同分布Independent Component Analysis/ICA 独⽴成分分析Indicator function 指⽰函数Individual learner 个体学习器Induction 归纳Inductive bias 归纳偏好Inductive learning 归纳学习Inductive Logic Programming/ILP 归纳逻辑程序设计Information entropy 信息熵Information gain 信息增益Input layer 输⼊层Insensitive loss 不敏感损失Inter-cluster similarity 簇间相似度International Conference for Machine Learning/ICML 国际机器学习⼤会Intra-cluster similarity 簇内相似度Intrinsic value 固有值Isometric Mapping/Isomap 等度量映射Isotonic regression 等分回归Iterative Dichotomiser 迭代⼆分器Letter KKernel method 核⽅法Kernel trick 核技巧Kernelized Linear Discriminant Analysis/KLDA 核线性判别分析K-fold cross validation k 折交叉验证/k 倍交叉验证K-Means Clustering K – 均值聚类K-Nearest Neighbours Algorithm/KNN K近邻算法Knowledge base 知识库Knowledge Representation 知识表征Letter LLabel space 标记空间Lagrange duality 拉格朗⽇对偶性Lagrange multiplier 拉格朗⽇乘⼦Laplace smoothing 拉普拉斯平滑Laplacian correction 拉普拉斯修正Latent Dirichlet Allocation 隐狄利克雷分布Latent semantic analysis 潜在语义分析Latent variable 隐变量Lazy learning 懒惰学习Learner 学习器Learning by analogy 类⽐学习Learning rate 学习率Learning Vector Quantization/LVQ 学习向量量化Least squares regression tree 最⼩⼆乘回归树Leave-One-Out/LOO 留⼀法linear chain conditional random field 线性链条件随机场Linear Discriminant Analysis/LDA 线性判别分析Linear model 线性模型Linear Regression 线性回归Link function 联系函数Local Markov property 局部马尔可夫性Local minimum 局部最⼩Log likelihood 对数似然Log odds/logit 对数⼏率Logistic Regression Logistic 回归Log-likelihood 对数似然Log-linear regression 对数线性回归Long-Short Term Memory/LSTM 长短期记忆Loss function 损失函数Letter MMachine translation/MT 机器翻译Macron-P 宏查准率Macron-R 宏查全率Majority voting 绝对多数投票法Manifold assumption 流形假设Manifold learning 流形学习Margin theory 间隔理论Marginal distribution 边际分布Marginal independence 边际独⽴性Marginalization 边际化Markov Chain Monte Carlo/MCMC 马尔可夫链蒙特卡罗⽅法Markov Random Field 马尔可夫随机场Maximal clique 最⼤团Maximum Likelihood Estimation/MLE 极⼤似然估计/极⼤似然法Maximum margin 最⼤间隔Maximum weighted spanning tree 最⼤带权⽣成树Max-Pooling 最⼤池化Mean squared error 均⽅误差Meta-learner 元学习器Metric learning 度量学习Micro-P 微查准率Micro-R 微查全率Minimal Description Length/MDL 最⼩描述长度Minimax game 极⼩极⼤博弈Misclassification cost 误分类成本Mixture of experts 混合专家Momentum 动量Moral graph 道德图/端正图Multi-class classification 多分类Multi-document summarization 多⽂档摘要Multi-layer feedforward neural networks 多层前馈神经⽹络Multilayer Perceptron/MLP 多层感知器Multimodal learning 多模态学习Multiple Dimensional Scaling 多维缩放Multiple linear regression 多元线性回归Multi-response Linear Regression /MLR 多响应线性回归Mutual information 互信息Letter NNaive bayes 朴素贝叶斯Naive Bayes Classifier 朴素贝叶斯分类器Named entity recognition 命名实体识别Nash equilibrium 纳什均衡Natural language generation/NLG ⾃然语⾔⽣成Natural language processing ⾃然语⾔处理Negative class 负类Negative correlation 负相关法Negative Log Likelihood 负对数似然Neighbourhood Component Analysis/NCA 近邻成分分析Neural Machine Translation 神经机器翻译Neural Turing Machine 神经图灵机Newton method ⽜顿法NIPS 国际神经信息处理系统会议No Free Lunch Theorem/NFL 没有免费的午餐定理Noise-contrastive estimation 噪⾳对⽐估计Nominal attribute 列名属性Non-convex optimization ⾮凸优化Nonlinear model ⾮线性模型Non-metric distance ⾮度量距离Non-negative matrix factorization ⾮负矩阵分解Non-ordinal attribute ⽆序属性Non-Saturating Game ⾮饱和博弈Norm 范数Normalization 归⼀化Nuclear norm 核范数Numerical attribute 数值属性Letter OObjective function ⽬标函数Oblique decision tree 斜决策树Occam’s razor 奥卡姆剃⼑Odds ⼏率Off-Policy 离策略One shot learning ⼀次性学习One-Dependent Estimator/ODE 独依赖估计On-Policy 在策略Ordinal attribute 有序属性Out-of-bag estimate 包外估计Output layer 输出层Output smearing 输出调制法Overfitting 过拟合/过配Oversampling 过采样Letter PPaired t-test 成对 t 检验Pairwise 成对型Pairwise Markov property 成对马尔可夫性Parameter 参数Parameter estimation 参数估计Parameter tuning 调参Parse tree 
解析树; Particle Swarm Optimization/PSO 粒子群优化算法; Part-of-speech tagging 词性标注; Perceptron 感知机; Performance measure 性能度量; Plug and Play Generative Network 即插即用生成网络; Plurality voting 相对多数投票法; Polarity detection 极性检测; Polynomial kernel function 多项式核函数; Pooling 池化; Positive class 正类; Positive definite matrix 正定矩阵; Post-hoc test 后续检验; Post-pruning 后剪枝; potential function 势函数; Precision 查准率/准确率; Prepruning 预剪枝; Principal component analysis/PCA 主成分分析; Principle of multiple explanations 多释原则; Prior 先验; Probability Graphical Model 概率图模型; Proximal Gradient Descent/PGD 近端梯度下降; Pruning 剪枝; Pseudo-label 伪标记
Letter Q: Quantized Neural Network 量子化神经网络; Quantum computer 量子计算机; Quantum Computing 量子计算; Quasi Newton method 拟牛顿法
Letter R: Radial Basis Function/RBF 径向基函数; Random Forest Algorithm 随机森林算法; Random walk 随机漫步; Recall 查全率/召回率; Receiver Operating Characteristic/ROC 受试者工作特征; Rectified Linear Unit/ReLU 线性修正单元; Recurrent Neural Network 循环神经网络; Recursive neural network 递归神经网络; Reference model 参考模型; Regression 回归; Regularization 正则化; Reinforcement learning/RL 强化学习; Representation learning 表征学习; Representer theorem 表示定理; reproducing kernel Hilbert space/RKHS 再生核希尔伯特空间; Re-sampling 重采样法; Rescaling 再缩放; Residual Mapping 残差映射; Residual Network 残差网络; Restricted Boltzmann Machine/RBM 受限玻尔兹曼机; Restricted Isometry Property/RIP 限定等距性; Re-weighting 重赋权法; Robustness 稳健性/鲁棒性; Root node 根结点; Rule Engine 规则引擎; Rule learning 规则学习
Letter S: Saddle point 鞍点; Sample space 样本空间; Sampling 采样; Score function 评分函数; Self-Driving 自动驾驶; Self-Organizing Map/SOM 自组织映射; Semi-naive Bayes classifiers 半朴素贝叶斯分类器; Semi-Supervised Learning 半监督学习; semi-Supervised Support Vector Machine 半监督支持向量机; Sentiment analysis 情感分析; Separating hyperplane 分离超平面; Sigmoid function Sigmoid函数; Similarity measure 相似度度量; Simulated annealing 模拟退火; Simultaneous localization and mapping 同步定位与地图构建; Singular Value Decomposition 奇异值分解; Slack variables 松弛变量; Smoothing 平滑; Soft margin 软间隔; Soft margin maximization 软间隔最大化; Soft voting 软投票; Sparse representation 稀疏表征; Sparsity 稀疏性; Specialization 特化; Spectral Clustering 谱聚类; Speech Recognition 语音识别; Splitting variable 切分变量; Squashing function 挤压函数; Stability-plasticity dilemma 可塑性-稳定性困境; Statistical learning 统计学习; Status feature function 状态特征函数; Stochastic gradient descent 随机梯度下降; Stratified sampling 分层采样; Structural risk 结构风险; Structural risk minimization/SRM 结构风险最小化; Subspace 子空间; Supervised learning 监督学习/有导师学习; support vector expansion 支持向量展式; Support Vector Machine/SVM 支持向量机; Surrogate loss 替代损失; Surrogate function 替代函数; Symbolic learning 符号学习; Symbolism 符号主义; Synset 同义词集
Letter T: T-Distribution Stochastic Neighbour Embedding/t-SNE T-分布随机近邻嵌入; Tensor 张量; Tensor Processing Units/TPU 张量处理单元; The least square method 最小二乘法; Threshold 阈值; Threshold logic unit 阈值逻辑单元; Threshold-moving 阈值移动; Time Step 时间步骤; Tokenization 标记化; Training error 训练误差; Training instance 训练示例/训练例; Transductive learning 直推学习; Transfer learning 迁移学习; Treebank 树库; Trial-by-error 试错法; True negative 真负类; True positive 真正类; True Positive Rate/TPR 真正例率; Turing Machine 图灵机; Twice-learning 二次学习
Letter U: Underfitting 欠拟合/欠配; Undersampling 欠采样; Understandability 可理解性; Unequal cost 非均等代价; Unit-step function 单位阶跃函数; Univariate decision tree 单变量决策树; Unsupervised learning 无监督学习/无导师学习; Unsupervised layer-wise training 无监督逐层训练; Upsampling 上采样
Letter V: Vanishing Gradient Problem 梯度消失问题; Variational inference 变分推断; VC Theory VC维理论; Version space 版本空间; Viterbi algorithm 维特比算法; Von Neumann architecture 冯·诺伊曼架构
Letter W: Wasserstein GAN/WGAN Wasserstein生成对抗网络; Weak learner 弱学习器; Weight 权重; Weight sharing 权共享; Weighted voting 加权投票法; Within-class scatter matrix 类内散度矩阵; Word embedding 词嵌入; Word sense disambiguation 词义消歧
Letter Z: Zero-data learning 零数据学习; Zero-shot learning 零次学习

A: approximations 近似值; arbitrary 随意的,任意的; affine 仿射的; amino acid 氨基酸; amenable 经得起检验的; axiom 公理,原则; abstract 提取; architecture 架构,体系结构;建造业; absolute 绝对的; arsenal 军火库; assignment 分配; algebra 线性代数; asymptotically 渐近地; appropriate 恰当的
B: bias 偏差; brevity 简短,简洁;短暂; broader 广泛; briefly 简短的; batch 批量
C: convergence 收敛,集中到一点; convex 凸的; contours 轮廓; constraint 约束; constant 常量; commercial 商务的; complementarity 补充; coordinate ascent 同等级上升; clipping 剪下物;剪报;修剪; component 分量;部件; continuous 连续的; covariance 协方差; canonical 正规的,正则的; concave 非凸的; corresponds 相符合;相当;通信; corollary 推论; concrete 具体的事物,实在的东西; cross validation 交叉验证; correlation 相互关系; convention 约定; cluster 一簇; centroids 质心,形心; converge 收敛; computationally 计算(机)的; calculus 计算
D: derive 获得,取得; dual 二元的; duality 二元性;二象性;对偶性; derivation 求导;得到;起源; denote 预示,表示,是…的标志;意味着,[逻]指称; divergence 散度;发散性; dimension 尺度,规格;维数; dot 小圆点; distortion 变形; density 概率密度函数; discrete 离散的; discriminative 有识别能力的; diagonal 对角; dispersion 分散,散开; determinant 决定因素; disjoint 不相交的
E: encounter 遇到; ellipses 椭圆; equality 等式; extra 额外的; empirical 经验;观察; enumerate 例举,计数; exceed 超过,越出; expectation 期望; efficient 生效的; endow 赋予; explicitly 清楚的; exponential family 指数家族; equivalently 等价的
F: feasible 可行的; foray 初次尝试; finite 有限的,限定的; forgo 摒弃,放弃; filter 过滤; frequentist 最常发生的; forward search 前向式搜索; formalize 使定形
G: generalized 归纳的; generalization 概括,归纳;普遍化;判断(根据不足); guarantee 保证;抵押品; generate 形成,产生; geometric margins 几何边界; gap 裂口; generative 生产的;有生产力的
H: heuristic 启发式的;启发法;启发程序; hone 怀恋;磨; hyperplane 超平面
I: initial 最初的; implement 执行; intuitive 凭直觉获知的; incremental 增加的; intercept 截距; intuitions 直觉; instantiation 例子; indicator 指示物,指示器; iterative 重复的,迭代的; integral 积分; identical 相等的;完全相同的; indicate 表示,指出; invariance 不变性,恒定性; impose 把…强加于; intermediate 中间的; interpretation 解释,翻译
J: joint distribution 联合概率
L: lieu 替代; logarithmic 对数的,用对数表示的; latent 潜在的; Leave-one-out cross validation 留一法交叉验证
M: magnitude 巨大; mapping 绘图,制图;映射; matrix 矩阵; mutual 相互的,共同的; monotonically 单调的; minor 较小的,次要的; multinomial 多项的; multi-class classification 多分类问题
N: nasty 讨厌的; notation 标志,注释; naïve 朴素的
O: obtain 得到; oscillate 摆动; optimization problem 最优化问题; objective function 目标函数; optimal 最理想的; orthogonal (矢量,矩阵等)正交的; orientation 方向; ordinary 普通的; occasionally 偶然的
P: partial derivative 偏导数; property 性质; proportional 成比例的; primal 原始的,最初的; permit 允许; pseudocode 伪代码; permissible 可允许的; polynomial 多项式; preliminary 预备; precision 精度; perturbation 不安,扰乱; posit 假定,设想; positive semi-definite 半正定的; parentheses 圆括号; posterior probability 后验概率; pictorially 图像的; parameterize 确定…的参数; poisson distribution 泊松分布; pertinent 相关的
Q: quadratic 二次的; quantity 量,数量;分量; query 疑问的
R: regularization 使系统化;调整; reoptimize 重新优化; restrict 限制;限定;约束; reminiscent 回忆往事的;提醒的;使人联想…的(of); remark 注意; random variable 随机变量; respect 考虑; respectively 各自的;分别的; redundant 过多的;冗余的
S: susceptible 敏感的; stochastic 可能的;随机的; symmetric 对称的; sophisticated 复杂的; spurious 假的;伪造的; subtract 减去;减法器; simultaneously 同时发生地;同步地; suffice 满足; scarce 稀有的,难得的; split 分解,分离; subset 子集; statistic 统计量; successive iterations 连续的迭代; scale 标度; sort of 有几分的; squares 平方
T: trajectory 轨迹; temporarily 暂时的; terminology 专用名词; tolerance 容忍;公差; thumb 翻阅; threshold 阈,临界; theorem 定理; tangent 正切
U: unit-length vector 单位向量
V: valid 有效的,正确的; variance 方差; variable 变量;变元; vocabulary 词汇; valued 经估价的;宝贵的
W: wrapper 包装
Common English vocabulary in machine learning and artificial intelligence: 1.General Concepts (基础概念)•Artificial Intelligence (AI) - 人工智能1)Artificial Intelligence (AI) - 人工智能2)Machine Learning (ML) - 机器学习3)Deep Learning (DL) - 深度学习4)Neural Network - 神经网络5)Natural Language Processing (NLP) - 自然语言处理6)Computer Vision - 计算机视觉7)Robotics - 机器人技术8)Speech Recognition - 语音识别9)Expert Systems - 专家系统10)Knowledge Representation - 知识表示11)Pattern Recognition - 模式识别12)Cognitive Computing - 认知计算13)Autonomous Systems - 自主系统14)Human-Machine Interaction - 人机交互15)Intelligent Agents - 智能代理16)Machine Translation - 机器翻译17)Swarm Intelligence - 群体智能18)Genetic Algorithms - 遗传算法19)Fuzzy Logic - 模糊逻辑20)Reinforcement Learning - 强化学习•Machine Learning (ML) - 机器学习1)Machine Learning (ML) - 机器学习2)Artificial Neural Network - 人工神经网络3)Deep Learning - 深度学习4)Supervised Learning - 有监督学习5)Unsupervised Learning - 无监督学习6)Reinforcement Learning - 强化学习7)Semi-Supervised Learning - 半监督学习8)Training Data - 训练数据9)Test Data - 测试数据10)Validation Data - 验证数据11)Feature - 特征12)Label - 标签13)Model - 模型14)Algorithm - 算法15)Regression - 回归16)Classification - 分类17)Clustering - 聚类18)Dimensionality Reduction - 降维19)Overfitting - 过拟合20)Underfitting - 欠拟合•Deep Learning (DL) - 深度学习1)Deep Learning - 深度学习2)Neural Network - 神经网络3)Artificial Neural Network (ANN) - 人工神经网络4)Convolutional Neural Network (CNN) - 卷积神经网络5)Recurrent Neural Network (RNN) - 循环神经网络6)Long Short-Term Memory (LSTM) - 长短期记忆网络7)Gated Recurrent Unit (GRU) - 门控循环单元8)Autoencoder - 自编码器9)Generative Adversarial Network (GAN) - 生成对抗网络10)Transfer Learning - 迁移学习11)Pre-trained Model - 预训练模型12)Fine-tuning - 微调13)Feature Extraction - 特征提取14)Activation Function - 激活函数15)Loss Function - 损失函数16)Gradient Descent - 梯度下降17)Backpropagation - 反向传播18)Epoch - 训练周期19)Batch Size - 批量大小20)Dropout - 丢弃法•Neural Network - 神经网络1)Neural Network - 神经网络2)Artificial Neural Network (ANN) - 人工神经网络3)Deep Neural Network (DNN) - 深度神经网络4)Convolutional Neural Network (CNN) - 卷积神经网络5)Recurrent Neural Network (RNN) - 循环神经网络6)Long Short-Term Memory (LSTM) - 长短期记忆网络7)Gated Recurrent Unit (GRU) - 门控循环单元8)Feedforward Neural Network - 前馈神经网络9)Multi-layer Perceptron (MLP) - 多层感知器10)Radial Basis Function Network (RBFN) - 径向基函数网络11)Hopfield Network - 霍普菲尔德网络12)Boltzmann Machine - 玻尔兹曼机13)Autoencoder - 自编码器14)Spiking Neural Network (SNN) - 脉冲神经网络15)Self-organizing Map (SOM) - 自组织映射16)Restricted Boltzmann Machine (RBM) - 受限玻尔兹曼机17)Hebbian Learning - 海比安学习18)Competitive Learning - 竞争学习19)Neuroevolutionary - 神经进化20)Neuron - 神经元•Algorithm - 算法1)Algorithm - 算法2)Supervised Learning Algorithm - 有监督学习算法3)Unsupervised Learning Algorithm - 无监督学习算法4)Reinforcement Learning Algorithm - 强化学习算法5)Classification Algorithm - 分类算法6)Regression Algorithm - 回归算法7)Clustering Algorithm - 聚类算法8)Dimensionality Reduction Algorithm - 降维算法9)Decision Tree Algorithm - 决策树算法10)Random Forest Algorithm - 随机森林算法11)Support Vector Machine (SVM) Algorithm - 支持向量机算法12)K-Nearest Neighbors (KNN) Algorithm - K近邻算法13)Naive Bayes Algorithm - 朴素贝叶斯算法14)Gradient Descent Algorithm - 梯度下降算法15)Genetic Algorithm - 遗传算法16)Neural Network Algorithm - 神经网络算法17)Deep Learning Algorithm - 深度学习算法18)Ensemble Learning Algorithm - 集成学习算法19)Reinforcement Learning Algorithm - 强化学习算法20)Metaheuristic Algorithm - 元启发式算法•Model - 模型1)Model - 模型2)Machine Learning Model - 机器学习模型3)Artificial Intelligence Model - 人工智能模型4)Predictive Model - 预测模型5)Classification Model - 分类模型6)Regression Model - 回归模型7)Generative Model - 生成模型8)Discriminative Model - 判别模型9)Probabilistic Model - 概率模型10)Statistical Model - 统计模型11)Neural Network Model - 神经网络模型12)Deep Learning Model - 
深度学习模型13)Ensemble Model - 集成模型14)Reinforcement Learning Model - 强化学习模型15)Support Vector Machine (SVM) Model - 支持向量机模型16)Decision Tree Model - 决策树模型17)Random Forest Model - 随机森林模型18)Naive Bayes Model - 朴素贝叶斯模型19)Autoencoder Model - 自编码器模型20)Convolutional Neural Network (CNN) Model - 卷积神经网络模型•Dataset - 数据集1)Dataset - 数据集2)Training Dataset - 训练数据集3)Test Dataset - 测试数据集4)Validation Dataset - 验证数据集5)Balanced Dataset - 平衡数据集6)Imbalanced Dataset - 不平衡数据集7)Synthetic Dataset - 合成数据集8)Benchmark Dataset - 基准数据集9)Open Dataset - 开放数据集10)Labeled Dataset - 标记数据集11)Unlabeled Dataset - 未标记数据集12)Semi-Supervised Dataset - 半监督数据集13)Multiclass Dataset - 多分类数据集14)Feature Set - 特征集15)Data Augmentation - 数据增强16)Data Preprocessing - 数据预处理17)Missing Data - 缺失数据18)Outlier Detection - 异常值检测19)Data Imputation - 数据插补20)Metadata - 元数据•Training - 训练1)Training - 训练2)Training Data - 训练数据3)Training Phase - 训练阶段4)Training Set - 训练集5)Training Examples - 训练样本6)Training Instance - 训练实例7)Training Algorithm - 训练算法8)Training Model - 训练模型9)Training Process - 训练过程10)Training Loss - 训练损失11)Training Epoch - 训练周期12)Training Batch - 训练批次13)Online Training - 在线训练14)Offline Training - 离线训练15)Continuous Training - 连续训练16)Transfer Learning - 迁移学习17)Fine-Tuning - 微调18)Curriculum Learning - 课程学习19)Self-Supervised Learning - 自监督学习20)Active Learning - 主动学习•Testing - 测试1)Testing - 测试2)Test Data - 测试数据3)Test Set - 测试集4)Test Examples - 测试样本5)Test Instance - 测试实例6)Test Phase - 测试阶段7)Test Accuracy - 测试准确率8)Test Loss - 测试损失9)Test Error - 测试错误10)Test Metrics - 测试指标11)Test Suite - 测试套件12)Test Case - 测试用例13)Test Coverage - 测试覆盖率14)Cross-Validation - 交叉验证15)Holdout Validation - 留出验证16)K-Fold Cross-Validation - K折交叉验证17)Stratified Cross-Validation - 分层交叉验证18)Test Driven Development (TDD) - 测试驱动开发19)A/B Testing - A/B 测试20)Model Evaluation - 模型评估•Validation - 验证1)Validation - 验证2)Validation Data - 验证数据3)Validation Set - 验证集4)Validation Examples - 验证样本5)Validation Instance - 验证实例6)Validation Phase - 验证阶段7)Validation Accuracy - 验证准确率8)Validation Loss - 验证损失9)Validation Error - 验证错误10)Validation Metrics - 验证指标11)Cross-Validation - 交叉验证12)Holdout Validation - 留出验证13)K-Fold Cross-Validation - K折交叉验证14)Stratified Cross-Validation - 分层交叉验证15)Leave-One-Out Cross-Validation - 留一法交叉验证16)Validation Curve - 验证曲线17)Hyperparameter Validation - 超参数验证18)Model Validation - 模型验证19)Early Stopping - 提前停止20)Validation Strategy - 验证策略•Supervised Learning - 有监督学习1)Supervised Learning - 有监督学习2)Label - 标签3)Feature - 特征4)Target - 目标5)Training Labels - 训练标签6)Training Features - 训练特征7)Training Targets - 训练目标8)Training Examples - 训练样本9)Training Instance - 训练实例10)Regression - 回归11)Classification - 分类12)Predictor - 预测器13)Regression Model - 回归模型14)Classifier - 分类器15)Decision Tree - 决策树16)Support Vector Machine (SVM) - 支持向量机17)Neural Network - 神经网络18)Feature Engineering - 特征工程19)Model Evaluation - 模型评估20)Overfitting - 过拟合21)Underfitting - 欠拟合22)Bias-Variance Tradeoff - 偏差-方差权衡•Unsupervised Learning - 无监督学习1)Unsupervised Learning - 无监督学习2)Clustering - 聚类3)Dimensionality Reduction - 降维4)Anomaly Detection - 异常检测5)Association Rule Learning - 关联规则学习6)Feature Extraction - 特征提取7)Feature Selection - 特征选择8)K-Means - K均值9)Hierarchical Clustering - 层次聚类10)Density-Based Clustering - 基于密度的聚类11)Principal Component Analysis (PCA) - 主成分分析12)Independent Component Analysis (ICA) - 独立成分分析13)T-distributed Stochastic Neighbor Embedding (t-SNE) - t分布随机邻居嵌入14)Gaussian Mixture Model (GMM) - 高斯混合模型15)Self-Organizing Maps (SOM) - 自组织映射16)Autoencoder - 自动编码器17)Latent Variable - 潜变量18)Data Preprocessing - 
数据预处理19)Outlier Detection - 异常值检测20)Clustering Algorithm - 聚类算法•Reinforcement Learning - 强化学习1)Reinforcement Learning - 强化学习2)Agent - 代理3)Environment - 环境4)State - 状态5)Action - 动作6)Reward - 奖励7)Policy - 策略8)Value Function - 值函数9)Q-Learning - Q学习10)Deep Q-Network (DQN) - 深度Q网络11)Policy Gradient - 策略梯度12)Actor-Critic - 演员-评论家13)Exploration - 探索14)Exploitation - 开发15)Temporal Difference (TD) - 时间差分16)Markov Decision Process (MDP) - 马尔可夫决策过程17)State-Action-Reward-State-Action (SARSA) - 状态-动作-奖励-状态-动作18)Policy Iteration - 策略迭代19)Value Iteration - 值迭代20)Monte Carlo Methods - 蒙特卡洛方法•Semi-Supervised Learning - 半监督学习1)Semi-Supervised Learning - 半监督学习2)Labeled Data - 有标签数据3)Unlabeled Data - 无标签数据4)Label Propagation - 标签传播5)Self-Training - 自训练6)Co-Training - 协同训练7)Transductive Learning - 直推学习8)Inductive Learning - 归纳学习9)Manifold Regularization - 流形正则化10)Graph-based Methods - 基于图的方法11)Cluster Assumption - 聚类假设12)Low-Density Separation - 低密度分离13)Semi-Supervised Support Vector Machines (S3VM) - 半监督支持向量机14)Expectation-Maximization (EM) - 期望最大化15)Co-EM - 协同期望最大化16)Entropy-Regularized EM - 熵正则化EM17)Mean Teacher - 平均教师18)Virtual Adversarial Training - 虚拟对抗训练19)Tri-training - 三重训练20)MixMatch - 混合匹配•Feature - 特征1)Feature - 特征2)Feature Engineering - 特征工程3)Feature Extraction - 特征提取4)Feature Selection - 特征选择5)Input Features - 输入特征6)Output Features - 输出特征7)Feature Vector - 特征向量8)Feature Space - 特征空间9)Feature Representation - 特征表示10)Feature Transformation - 特征转换11)Feature Importance - 特征重要性12)Feature Scaling - 特征缩放13)Feature Normalization - 特征归一化14)Feature Encoding - 特征编码15)Feature Fusion - 特征融合16)Feature Dimensionality Reduction - 特征维度减少17)Continuous Feature - 连续特征18)Categorical Feature - 分类特征19)Nominal Feature - 名义特征20)Ordinal Feature - 有序特征•Label - 标签1)Label - 标签2)Labeling - 标注3)Ground Truth - 地面真值4)Class Label - 类别标签5)Target Variable - 目标变量6)Labeling Scheme - 标注方案7)Multi-class Labeling - 多类别标注8)Binary Labeling - 二分类标注9)Label Noise - 标签噪声10)Labeling Error - 标注错误11)Label Propagation - 标签传播12)Unlabeled Data - 无标签数据13)Labeled Data - 有标签数据14)Semi-supervised Learning - 半监督学习15)Active Learning - 主动学习16)Weakly Supervised Learning - 弱监督学习17)Noisy Label Learning - 噪声标签学习18)Self-training - 自训练19)Crowdsourcing Labeling - 众包标注20)Label Smoothing - 标签平滑化•Prediction - 预测1)Prediction - 预测2)Forecasting - 预测3)Regression - 回归4)Classification - 分类5)Time Series Prediction - 时间序列预测6)Forecast Accuracy - 预测准确性7)Predictive Modeling - 预测建模8)Predictive Analytics - 预测分析9)Forecasting Method - 预测方法10)Predictive Performance - 预测性能11)Predictive Power - 预测能力12)Prediction Error - 预测误差13)Prediction Interval - 预测区间14)Prediction Model - 预测模型15)Predictive Uncertainty - 预测不确定性16)Forecast Horizon - 预测时间跨度17)Predictive Maintenance - 预测性维护18)Predictive Policing - 预测式警务19)Predictive Healthcare - 预测性医疗20)Predictive Maintenance - 预测性维护•Classification - 分类1)Classification - 分类2)Classifier - 分类器3)Class - 类别4)Classify - 对数据进行分类5)Class Label - 类别标签6)Binary Classification - 二元分类7)Multiclass Classification - 多类分类8)Class Probability - 类别概率9)Decision Boundary - 决策边界10)Decision Tree - 决策树11)Support Vector Machine (SVM) - 支持向量机12)K-Nearest Neighbors (KNN) - K最近邻算法13)Naive Bayes - 朴素贝叶斯14)Logistic Regression - 逻辑回归15)Random Forest - 随机森林16)Neural Network - 神经网络17)SoftMax Function - SoftMax函数18)One-vs-All (One-vs-Rest) - 一对多(一对剩余)19)Ensemble Learning - 集成学习20)Confusion Matrix - 混淆矩阵•Regression - 回归1)Regression Analysis - 回归分析2)Linear Regression - 线性回归3)Multiple Regression - 多元回归4)Polynomial Regression - 多项式回归5)Logistic Regression - 逻辑回归6)Ridge Regression - 
岭回归7)Lasso Regression - Lasso回归8)Elastic Net Regression - 弹性网络回归9)Regression Coefficients - 回归系数10)Residuals - 残差11)Ordinary Least Squares (OLS) - 普通最小二乘法12)Ridge Regression Coefficient - 岭回归系数13)Lasso Regression Coefficient - Lasso回归系数14)Elastic Net Regression Coefficient - 弹性网络回归系数15)Regression Line - 回归线16)Prediction Error - 预测误差17)Regression Model - 回归模型18)Nonlinear Regression - 非线性回归19)Generalized Linear Models (GLM) - 广义线性模型20)Coefficient of Determination (R-squared) - 决定系数21)F-test - F检验22)Homoscedasticity - 同方差性23)Heteroscedasticity - 异方差性24)Autocorrelation - 自相关25)Multicollinearity - 多重共线性26)Outliers - 异常值27)Cross-validation - 交叉验证28)Feature Selection - 特征选择29)Feature Engineering - 特征工程30)Regularization - 正则化2.Neural Networks and Deep Learning (神经网络与深度学习)•Convolutional Neural Network (CNN) - 卷积神经网络1)Convolutional Neural Network (CNN) - 卷积神经网络2)Convolution Layer - 卷积层3)Feature Map - 特征图4)Convolution Operation - 卷积操作5)Stride - 步幅6)Padding - 填充7)Pooling Layer - 池化层8)Max Pooling - 最大池化9)Average Pooling - 平均池化10)Fully Connected Layer - 全连接层11)Activation Function - 激活函数12)Rectified Linear Unit (ReLU) - 线性修正单元13)Dropout - 随机失活14)Batch Normalization - 批量归一化15)Transfer Learning - 迁移学习16)Fine-Tuning - 微调17)Image Classification - 图像分类18)Object Detection - 物体检测19)Semantic Segmentation - 语义分割20)Instance Segmentation - 实例分割21)Generative Adversarial Network (GAN) - 生成对抗网络22)Image Generation - 图像生成23)Style Transfer - 风格迁移24)Convolutional Autoencoder - 卷积自编码器25)Recurrent Neural Network (RNN) - 循环神经网络•Recurrent Neural Network (RNN) - 循环神经网络1)Recurrent Neural Network (RNN) - 循环神经网络2)Long Short-Term Memory (LSTM) - 长短期记忆网络3)Gated Recurrent Unit (GRU) - 门控循环单元4)Sequence Modeling - 序列建模5)Time Series Prediction - 时间序列预测6)Natural Language Processing (NLP) - 自然语言处理7)Text Generation - 文本生成8)Sentiment Analysis - 情感分析9)Named Entity Recognition (NER) - 命名实体识别10)Part-of-Speech Tagging (POS Tagging) - 词性标注11)Sequence-to-Sequence (Seq2Seq) - 序列到序列12)Attention Mechanism - 注意力机制13)Encoder-Decoder Architecture - 编码器-解码器架构14)Bidirectional RNN - 双向循环神经网络15)Teacher Forcing - 强制教师法16)Backpropagation Through Time (BPTT) - 通过时间的反向传播17)Vanishing Gradient Problem - 梯度消失问题18)Exploding Gradient Problem - 梯度爆炸问题19)Language Modeling - 语言建模20)Speech Recognition - 语音识别•Long Short-Term Memory (LSTM) - 长短期记忆网络1)Long Short-Term Memory (LSTM) - 长短期记忆网络2)Cell State - 细胞状态3)Hidden State - 隐藏状态4)Forget Gate - 遗忘门5)Input Gate - 输入门6)Output Gate - 输出门7)Peephole Connections - 窥视孔连接8)Gated Recurrent Unit (GRU) - 门控循环单元9)Vanishing Gradient Problem - 梯度消失问题10)Exploding Gradient Problem - 梯度爆炸问题11)Sequence Modeling - 序列建模12)Time Series Prediction - 时间序列预测13)Natural Language Processing (NLP) - 自然语言处理14)Text Generation - 文本生成15)Sentiment Analysis - 情感分析16)Named Entity Recognition (NER) - 命名实体识别17)Part-of-Speech Tagging (POS Tagging) - 词性标注18)Attention Mechanism - 注意力机制19)Encoder-Decoder Architecture - 编码器-解码器架构20)Bidirectional LSTM - 双向长短期记忆网络•Attention Mechanism - 注意力机制1)Attention Mechanism - 注意力机制2)Self-Attention - 自注意力3)Multi-Head Attention - 多头注意力4)Transformer - 变换器5)Query - 查询6)Key - 键7)Value - 值8)Query-Value Attention - 查询-值注意力9)Dot-Product Attention - 点积注意力10)Scaled Dot-Product Attention - 缩放点积注意力11)Additive Attention - 加性注意力12)Context Vector - 上下文向量13)Attention Score - 注意力分数14)SoftMax Function - SoftMax函数15)Attention Weight - 注意力权重16)Global Attention - 全局注意力17)Local Attention - 局部注意力18)Positional Encoding - 位置编码19)Encoder-Decoder Attention - 编码器-解码器注意力20)Cross-Modal Attention - 跨模态注意力•Generative Adversarial Network (GAN) - 
生成对抗网络1)Generative Adversarial Network (GAN) - 生成对抗网络2)Generator - 生成器3)Discriminator - 判别器4)Adversarial Training - 对抗训练5)Minimax Game - 极小极大博弈6)Nash Equilibrium - 纳什均衡7)Mode Collapse - 模式崩溃8)Training Stability - 训练稳定性9)Loss Function - 损失函数10)Discriminative Loss - 判别损失11)Generative Loss - 生成损失12)Wasserstein GAN (WGAN) - Wasserstein GAN(WGAN)13)Deep Convolutional GAN (DCGAN) - 深度卷积生成对抗网络(DCGAN)14)Conditional GAN (cGAN) - 条件生成对抗网络(cGAN)15)StyleGAN - 风格生成对抗网络16)CycleGAN - 循环生成对抗网络17)Progressive Growing GAN (PGGAN) - 渐进式增长生成对抗网络(PGGAN)18)Self-Attention GAN (SAGAN) - 自注意力生成对抗网络(SAGAN)19)BigGAN - 大规模生成对抗网络20)Adversarial Examples - 对抗样本•Encoder-Decoder - 编码器-解码器1)Encoder-Decoder Architecture - 编码器-解码器架构2)Encoder - 编码器3)Decoder - 解码器4)Sequence-to-Sequence Model (Seq2Seq) - 序列到序列模型5)State Vector - 状态向量6)Context Vector - 上下文向量7)Hidden State - 隐藏状态8)Attention Mechanism - 注意力机制9)Teacher Forcing - 强制教师法10)Beam Search - 束搜索11)Recurrent Neural Network (RNN) - 循环神经网络12)Long Short-Term Memory (LSTM) - 长短期记忆网络13)Gated Recurrent Unit (GRU) - 门控循环单元14)Bidirectional Encoder - 双向编码器15)Greedy Decoding - 贪婪解码16)Masking - 遮盖17)Dropout - 随机失活18)Embedding Layer - 嵌入层19)Cross-Entropy Loss - 交叉熵损失20)Tokenization - 令牌化•Transfer Learning - 迁移学习1)Transfer Learning - 迁移学习2)Source Domain - 源领域3)Target Domain - 目标领域4)Fine-Tuning - 微调5)Domain Adaptation - 领域自适应6)Pre-Trained Model - 预训练模型7)Feature Extraction - 特征提取8)Knowledge Transfer - 知识迁移9)Unsupervised Domain Adaptation - 无监督领域自适应10)Semi-Supervised Domain Adaptation - 半监督领域自适应11)Multi-Task Learning - 多任务学习12)Data Augmentation - 数据增强13)Task Transfer - 任务迁移14)Model Agnostic Meta-Learning (MAML) - 与模型无关的元学习(MAML)15)One-Shot Learning - 单样本学习16)Zero-Shot Learning - 零样本学习17)Few-Shot Learning - 少样本学习18)Knowledge Distillation - 知识蒸馏19)Representation Learning - 表征学习20)Adversarial Transfer Learning - 对抗迁移学习•Pre-trained Models - 预训练模型1)Pre-trained Model - 预训练模型2)Transfer Learning - 迁移学习3)Fine-Tuning - 微调4)Knowledge Transfer - 知识迁移5)Domain Adaptation - 领域自适应6)Feature Extraction - 特征提取7)Representation Learning - 表征学习8)Language Model - 语言模型9)Bidirectional Encoder Representations from Transformers (BERT) - 双向编码器结构转换器10)Generative Pre-trained Transformer (GPT) - 生成式预训练转换器11)Transformer-based Models - 基于转换器的模型12)Masked Language Model (MLM) - 掩蔽语言模型13)Cloze Task - 填空任务14)Tokenization - 令牌化15)Word Embeddings - 词嵌入16)Sentence Embeddings - 句子嵌入17)Contextual Embeddings - 上下文嵌入18)Self-Supervised Learning - 自监督学习19)Large-Scale Pre-trained Models - 大规模预训练模型•Loss Function - 损失函数1)Loss Function - 损失函数2)Mean Squared Error (MSE) - 均方误差3)Mean Absolute Error (MAE) - 平均绝对误差4)Cross-Entropy Loss - 交叉熵损失5)Binary Cross-Entropy Loss - 二元交叉熵损失6)Categorical Cross-Entropy Loss - 分类交叉熵损失7)Hinge Loss - 合页损失8)Huber Loss - Huber损失9)Wasserstein Distance - Wasserstein距离10)Triplet Loss - 三元组损失11)Contrastive Loss - 对比损失12)Dice Loss - Dice损失13)Focal Loss - 焦点损失14)GAN Loss - GAN损失15)Adversarial Loss - 对抗损失16)L1 Loss - L1损失17)L2 Loss - L2损失18)Huber Loss - Huber损失19)Quantile Loss - 分位数损失•Activation Function - 激活函数1)Activation Function - 激活函数2)Sigmoid Function - Sigmoid函数3)Hyperbolic Tangent Function (Tanh) - 双曲正切函数4)Rectified Linear Unit (ReLU) - 矩形线性单元5)Parametric ReLU (PReLU) - 参数化ReLU6)Exponential Linear Unit (ELU) - 指数线性单元7)Swish Function - Swish函数8)Softplus Function - Softplus函数9)Softmax Function - Softmax函数10)Hard Tanh Function - 硬双曲正切函数11)Softsign Function - Softsign函数12)GELU (Gaussian Error Linear Unit) - GELU(高斯误差线性单元)13)Mish Function - Mish函数14)CELU (Continuous Exponential Linear Unit) - 
CELU(连续指数线性单元)15)Bent Identity Function - 弯曲恒等函数16)Gaussian Error Linear Units (GELUs) - 高斯误差线性单元17)Adaptive Piecewise Linear (APL) - 自适应分段线性函数18)Radial Basis Function (RBF) - 径向基函数•Backpropagation - 反向传播1)Backpropagation - 反向传播2)Gradient Descent - 梯度下降3)Partial Derivative - 偏导数4)Chain Rule - 链式法则5)Forward Pass - 前向传播6)Backward Pass - 反向传播7)Computational Graph - 计算图8)Neural Network - 神经网络9)Loss Function - 损失函数10)Gradient Calculation - 梯度计算11)Weight Update - 权重更新12)Activation Function - 激活函数13)Optimizer - 优化器14)Learning Rate - 学习率15)Mini-Batch Gradient Descent - 小批量梯度下降16)Stochastic Gradient Descent (SGD) - 随机梯度下降17)Batch Gradient Descent - 批量梯度下降18)Momentum - 动量19)Adam Optimizer - Adam优化器20)Learning Rate Decay - 学习率衰减•Gradient Descent - 梯度下降1)Gradient Descent - 梯度下降2)Stochastic Gradient Descent (SGD) - 随机梯度下降3)Mini-Batch Gradient Descent - 小批量梯度下降4)Batch Gradient Descent - 批量梯度下降5)Learning Rate - 学习率6)Momentum - 动量7)Adaptive Moment Estimation (Adam) - 自适应矩估计8)RMSprop - 均方根传播9)Learning Rate Schedule - 学习率调度10)Convergence - 收敛11)Divergence - 发散12)Adagrad - 自适应学习速率方法13)Adadelta - 自适应增量学习率方法14)Adamax - 自适应矩估计的扩展版本15)Nadam - Nesterov Accelerated Adaptive Moment Estimation16)Learning Rate Decay - 学习率衰减17)Step Size - 步长18)Conjugate Gradient Descent - 共轭梯度下降19)Line Search - 线搜索20)Newton's Method - 牛顿法•Learning Rate - 学习率1)Learning Rate - 学习率2)Adaptive Learning Rate - 自适应学习率3)Learning Rate Decay - 学习率衰减4)Initial Learning Rate - 初始学习率5)Step Size - 步长6)Momentum - 动量7)Exponential Decay - 指数衰减8)Annealing - 退火9)Cyclical Learning Rate - 循环学习率10)Learning Rate Schedule - 学习率调度11)Warm-up - 预热12)Learning Rate Policy - 学习率策略13)Learning Rate Annealing - 学习率退火14)Cosine Annealing - 余弦退火15)Gradient Clipping - 梯度裁剪16)Adapting Learning Rate - 适应学习率17)Learning Rate Multiplier - 学习率倍增器18)Learning Rate Reduction - 学习率降低19)Learning Rate Update - 学习率更新20)Scheduled Learning Rate - 定期学习率•Batch Size - 批量大小1)Batch Size - 批量大小2)Mini-Batch - 小批量3)Batch Gradient Descent - 批量梯度下降4)Stochastic Gradient Descent (SGD) - 随机梯度下降5)Mini-Batch Gradient Descent - 小批量梯度下降6)Online Learning - 在线学习7)Full-Batch - 全批量8)Data Batch - 数据批次9)Training Batch - 训练批次10)Batch Normalization - 批量归一化11)Batch-wise Optimization - 批量优化12)Batch Processing - 批量处理13)Batch Sampling - 批量采样14)Adaptive Batch Size - 自适应批量大小15)Batch Splitting - 批量分割16)Dynamic Batch Size - 动态批量大小17)Fixed Batch Size - 固定批量大小18)Batch-wise Inference - 批量推理19)Batch-wise Training - 批量训练20)Batch Shuffling - 批量洗牌•Epoch - 训练周期1)Training Epoch - 训练周期2)Epoch Size - 周期大小3)Early Stopping - 提前停止4)Validation Set - 验证集5)Training Set - 训练集6)Test Set - 测试集7)Overfitting - 过拟合8)Underfitting - 欠拟合9)Model Evaluation - 模型评估10)Model Selection - 模型选择11)Hyperparameter Tuning - 超参数调优12)Cross-Validation - 交叉验证13)K-fold Cross-Validation - K折交叉验证14)Stratified Cross-Validation - 分层交叉验证15)Leave-One-Out Cross-Validation (LOOCV) - 留一法交叉验证16)Grid Search - 网格搜索17)Random Search - 随机搜索18)Model Complexity - 模型复杂度19)Learning Curve - 学习曲线20)Convergence - 收敛3.Machine Learning Techniques and Algorithms (机器学习技术与算法)•Decision Tree - 决策树1)Decision Tree - 决策树2)Node - 节点3)Root Node - 根节点4)Leaf Node - 叶节点5)Internal Node - 内部节点6)Splitting Criterion - 分裂准则7)Gini Impurity - 基尼不纯度8)Entropy - 熵9)Information Gain - 信息增益10)Gain Ratio - 增益率11)Pruning - 剪枝12)Recursive Partitioning - 递归分割13)CART (Classification and Regression Trees) - 分类回归树14)ID3 (Iterative Dichotomiser 3) - 迭代二叉树315)C4.5 (successor of ID3) - C4.5(ID3的后继者)16)C5.0 (successor of C4.5) - C5.0(C4.5的后继者)17)Split Point - 分裂点18)Decision Boundary - 决策边界19)Pruned Tree - 
剪枝后的树20)Decision Tree Ensemble - 决策树集成•Random Forest - 随机森林1)Random Forest - 随机森林2)Ensemble Learning - 集成学习3)Bootstrap Sampling - 自助采样4)Bagging (Bootstrap Aggregating) - 装袋法5)Out-of-Bag (OOB) Error - 袋外误差6)Feature Subset - 特征子集7)Decision Tree - 决策树8)Base Estimator - 基础估计器9)Tree Depth - 树深度10)Randomization - 随机化11)Majority Voting - 多数投票12)Feature Importance - 特征重要性13)OOB Score - 袋外得分14)Forest Size - 森林大小15)Max Features - 最大特征数16)Min Samples Split - 最小分裂样本数17)Min Samples Leaf - 最小叶节点样本数18)Gini Impurity - 基尼不纯度19)Entropy - 熵20)Variable Importance - 变量重要性•Support Vector Machine (SVM) - 支持向量机1)Support Vector Machine (SVM) - 支持向量机2)Hyperplane - 超平面3)Kernel Trick - 核技巧4)Kernel Function - 核函数5)Margin - 间隔6)Support Vectors - 支持向量7)Decision Boundary - 决策边界8)Maximum Margin Classifier - 最大间隔分类器9)Soft Margin Classifier - 软间隔分类器10) C Parameter - C参数11)Radial Basis Function (RBF) Kernel - 径向基函数核12)Polynomial Kernel - 多项式核13)Linear Kernel - 线性核14)Quadratic Kernel - 二次核15)Gaussian Kernel - 高斯核16)Regularization - 正则化17)Dual Problem - 对偶问题18)Primal Problem - 原始问题19)Kernelized SVM - 核化支持向量机20)Multiclass SVM - 多类支持向量机•K-Nearest Neighbors (KNN) - K-最近邻1)K-Nearest Neighbors (KNN) - K-最近邻2)Nearest Neighbor - 最近邻3)Distance Metric - 距离度量4)Euclidean Distance - 欧氏距离5)Manhattan Distance - 曼哈顿距离6)Minkowski Distance - 闵可夫斯基距离7)Cosine Similarity - 余弦相似度8)K Value - K值9)Majority Voting - 多数投票10)Weighted KNN - 加权KNN11)Radius Neighbors - 半径邻居12)Ball Tree - 球树13)KD Tree - KD树14)Locality-Sensitive Hashing (LSH) - 局部敏感哈希15)Curse of Dimensionality - 维度灾难16)Class Label - 类标签17)Training Set - 训练集18)Test Set - 测试集19)Validation Set - 验证集20)Cross-Validation - 交叉验证•Naive Bayes - 朴素贝叶斯1)Naive Bayes - 朴素贝叶斯2)Bayes' Theorem - 贝叶斯定理3)Prior Probability - 先验概率4)Posterior Probability - 后验概率5)Likelihood - 似然6)Class Conditional Probability - 类条件概率7)Feature Independence Assumption - 特征独立假设8)Multinomial Naive Bayes - 多项式朴素贝叶斯9)Gaussian Naive Bayes - 高斯朴素贝叶斯10)Bernoulli Naive Bayes - 伯努利朴素贝叶斯11)Laplace Smoothing - 拉普拉斯平滑12)Add-One Smoothing - 加一平滑13)Maximum A Posteriori (MAP) - 最大后验概率14)Maximum Likelihood Estimation (MLE) - 最大似然估计15)Classification - 分类16)Feature Vectors - 特征向量17)Training Set - 训练集18)Test Set - 测试集19)Class Label - 类标签20)Confusion Matrix - 混淆矩阵•Clustering - 聚类1)Clustering - 聚类2)Centroid - 质心3)Cluster Analysis - 聚类分析4)Partitioning Clustering - 划分式聚类5)Hierarchical Clustering - 层次聚类6)Density-Based Clustering - 基于密度的聚类7)K-Means Clustering - K均值聚类8)K-Medoids Clustering - K中心点聚类9)DBSCAN (Density-Based Spatial Clustering of Applications with Noise) - 基于密度的空间聚类算法10)Agglomerative Clustering - 聚合式聚类11)Dendrogram - 系统树图12)Silhouette Score - 轮廓系数13)Elbow Method - 肘部法则14)Clustering Validation - 聚类验证15)Intra-cluster Distance - 类内距离16)Inter-cluster Distance - 类间距离17)Cluster Cohesion - 类内连贯性18)Cluster Separation - 类间分离度19)Cluster Assignment - 聚类分配20)Cluster Label - 聚类标签•K-Means - K-均值1)K-Means - K-均值2)Centroid - 质心3)Cluster - 聚类4)Cluster Center - 聚类中心5)Cluster Assignment - 聚类分配6)Cluster Analysis - 聚类分析7)K Value - K值8)Elbow Method - 肘部法则9)Inertia - 惯性10)Silhouette Score - 轮廓系数11)Convergence - 收敛12)Initialization - 初始化13)Euclidean Distance - 欧氏距离14)Manhattan Distance - 曼哈顿距离15)Distance Metric - 距离度量16)Cluster Radius - 聚类半径17)Within-Cluster Variation - 类内变异18)Cluster Quality - 聚类质量19)Clustering Algorithm - 聚类算法20)Clustering Validation - 聚类验证•Dimensionality Reduction - 降维1)Dimensionality Reduction - 降维2)Feature Extraction - 特征提取3)Feature Selection - 特征选择4)Principal Component Analysis (PCA) - 主成分分析5)Singular Value Decomposition (SVD) - 奇异值分解6)Linear 
Discriminant Analysis (LDA) - 线性判别分析7)t-Distributed Stochastic Neighbor Embedding (t-SNE) - t-分布随机邻域嵌入8)Autoencoder - 自编码器9)Manifold Learning - 流形学习10)Locally Linear Embedding (LLE) - 局部线性嵌入11)Isomap - 等度量映射12)Uniform Manifold Approximation and Projection (UMAP) - 均匀流形逼近与投影13)Kernel PCA - 核主成分分析14)Non-negative Matrix Factorization (NMF) - 非负矩阵分解15)Independent Component Analysis (ICA) - 独立成分分析16)Variational Autoencoder (VAE) - 变分自编码器17)Sparse Coding - 稀疏编码18)Random Projection - 随机投影19)Neighborhood Preserving Embedding (NPE) - 保持邻域结构的嵌入20)Curvilinear Component Analysis (CCA) - 曲线成分分析•Principal Component Analysis (PCA) - 主成分分析1)Principal Component Analysis (PCA) - 主成分分析2)Eigenvector - 特征向量3)Eigenvalue - 特征值4)Covariance Matrix - 协方差矩阵。
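The optimizer vocabulary grouped above (gradient descent, stochastic gradient descent, momentum, learning rate decay) compresses into a few lines of code. As a hedged illustration only, with arbitrary constants and a one-dimensional quadratic standing in for a real loss:

```python
# Minimal illustration of gradient descent with momentum and exponential
# learning-rate decay on f(w) = (w - 3)^2; all constants are arbitrary.

def grad(w):
    return 2.0 * (w - 3.0)          # derivative of (w - 3)^2

w, velocity = 0.0, 0.0
lr0, decay, beta = 0.1, 0.99, 0.9   # initial LR, decay factor, momentum

for step in range(100):
    lr = lr0 * decay**step          # exponential learning-rate decay
    velocity = beta * velocity - lr * grad(w)
    w += velocity                   # momentum update

print(round(w, 4))                  # converges toward the minimum at w = 3
```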
人工智能专业重要词汇表1、A开头的词汇:Artificial General Intelligence/AGI通用人工智能Artificial Intelligence/AI人工智能Association analysis关联分析Attention mechanism注意力机制Attribute conditional independence assumption属性条件独立性假设Attribute space属性空间Attribute value属性值Autoencoder自编码器Automatic speech recognition自动语音识别Automatic summarization自动摘要Average gradient平均梯度Average-Pooling平均池化Accumulated error backpropagation累积误差逆传播Activation Function激活函数Adaptive Resonance Theory/ART自适应谐振理论Addictive model加性学习Adversarial Networks对抗网络Affine Layer仿射层Affinity matrix亲和矩阵Agent代理/ 智能体Algorithm算法Alpha-beta pruningα-β剪枝Anomaly detection异常检测Approximation近似Area Under ROC Curve/AUC R oc 曲线下面积2、B开头的词汇Backpropagation Through Time通过时间的反向传播Backpropagation/BP反向传播Base learner基学习器Base learning algorithm基学习算法Batch Normalization/BN批量归一化Bayes decision rule贝叶斯判定准则Bayes Model Averaging/BMA贝叶斯模型平均Bayes optimal classifier贝叶斯最优分类器Bayesian decision theory贝叶斯决策论Bayesian network贝叶斯网络Between-class scatter matrix类间散度矩阵Bias偏置/ 偏差Bias-variance decomposition偏差-方差分解Bias-Variance Dilemma偏差–方差困境Bi-directional Long-Short Term Memory/Bi-LSTM双向长短期记忆Binary classification二分类Binomial test二项检验Bi-partition二分法Boltzmann machine玻尔兹曼机Bootstrap sampling自助采样法/可重复采样/有放回采样Bootstrapping自助法Break-Event Point/BEP平衡点3、C开头的词汇Calibration校准Cascade-Correlation级联相关Categorical attribute离散属性Class-conditional probability类条件概率Classification and regression tree/CART分类与回归树Classifier分类器Class-imbalance类别不平衡Closed -form闭式Cluster簇/类/集群Cluster analysis聚类分析Clustering聚类Clustering ensemble聚类集成Co-adapting共适应Coding matrix编码矩阵COLT国际学习理论会议Committee-based learning基于委员会的学习Competitive learning竞争型学习Component learner组件学习器Comprehensibility可解释性Computation Cost计算成本Computational Linguistics计算语言学Computer vision计算机视觉Concept drift概念漂移Concept Learning System /CLS概念学习系统Conditional entropy条件熵Conditional mutual information条件互信息Conditional Probability Table/CPT条件概率表Conditional random field/CRF条件随机场Conditional risk条件风险Confidence置信度Confusion matrix混淆矩阵Connection weight连接权Connectionism连结主义Consistency一致性/相合性Contingency table列联表Continuous attribute连续属性Convergence收敛Conversational agent会话智能体Convex quadratic programming凸二次规划Convexity凸性Convolutional neural network/CNN卷积神经网络Co-occurrence同现Correlation coefficient相关系数Cosine similarity余弦相似度Cost curve成本曲线Cost Function成本函数Cost matrix成本矩阵Cost-sensitive成本敏感Cross entropy交叉熵Cross validation交叉验证Crowdsourcing众包Curse of dimensionality维数灾难Cut point截断点Cutting plane algorithm割平面法4、D开头的词汇Data mining数据挖掘Data set数据集Decision Boundary决策边界Decision stump决策树桩Decision tree决策树/判定树Deduction演绎Deep Belief Network深度信念网络Deep Convolutional Generative Adversarial Network/DCGAN深度卷积生成对抗网络Deep learning深度学习Deep neural network/DNN深度神经网络Deep Q-Learning深度Q 学习Deep Q-Network深度Q 网络Density estimation密度估计Density-based clustering密度聚类Differentiable neural computer可微分神经计算机Dimensionality reduction algorithm降维算法Directed edge有向边Disagreement measure不合度量Discriminative model判别模型Discriminator判别器Distance measure距离度量Distance metric learning距离度量学习Distribution分布Divergence散度Diversity measure多样性度量/差异性度量Domain adaption领域自适应Downsampling下采样D-separation (Directed separation)有向分离Dual problem对偶问题Dummy node哑结点Dynamic Fusion动态融合Dynamic programming动态规划5、E开头的词汇Eigenvalue decomposition特征值分解Embedding嵌入Emotional analysis情绪分析Empirical conditional entropy经验条件熵Empirical entropy经验熵Empirical error经验误差Empirical risk经验风险End-to-End端到端Energy-based model基于能量的模型Ensemble learning集成学习Ensemble pruning集成修剪Error Correcting Output Codes/ECOC纠错输出码Error rate错误率Error-ambiguity decomposition误差-分歧分解Euclidean distance欧氏距离Evolutionary computation演化计算Expectation-Maximization期望最大化Expected 
loss期望损失Exploding Gradient Problem梯度爆炸问题Exponential loss function指数损失函数Extreme Learning Machine/ELM超限学习机6、F开头的词汇Factorization因子分解False negative假负类False positive假正类False Positive Rate/FPR假正例率Feature engineering特征工程Feature selection特征选择Feature vector特征向量Featured Learning特征学习Feedforward Neural Networks/FNN前馈神经网络Fine-tuning微调Flipping output翻转法Fluctuation震荡Forward stagewise algorithm前向分步算法Frequentist频率主义学派Full-rank matrix满秩矩阵Functional neuron功能神经元7、G开头的词汇Gain ratio增益率Game theory博弈论Gaussian kernel function高斯核函数Gaussian Mixture Model高斯混合模型General Problem Solving通用问题求解Generalization泛化Generalization error泛化误差Generalization error bound泛化误差上界Generalized Lagrange function广义拉格朗日函数Generalized linear model广义线性模型Generalized Rayleigh quotient广义瑞利商Generative Adversarial Networks/GAN生成对抗网络Generative Model生成模型Generator生成器Genetic Algorithm/GA遗传算法Gibbs sampling吉布斯采样Gini index基尼指数Global minimum全局最小Global Optimization全局优化Gradient boosting梯度提升Gradient Descent梯度下降Graph theory图论Ground-truth真相/真实8、H开头的词汇Hard margin硬间隔Hard voting硬投票Harmonic mean调和平均Hesse matrix海塞矩阵Hidden dynamic model隐动态模型Hidden layer隐藏层Hidden Markov Model/HMM隐马尔可夫模型Hierarchical clustering层次聚类Hilbert space希尔伯特空间Hinge loss function合页损失函数Hold-out留出法Homogeneous同质Hybrid computing混合计算Hyperparameter超参数Hypothesis假设Hypothesis test假设验证9、I开头的词汇ICML国际机器学习会议Improved iterative scaling/IIS改进的迭代尺度法Incremental learning增量学习Independent and identically distributed/i.i.d.独立同分布Independent Component Analysis/ICA独立成分分析Indicator function指示函数Individual learner个体学习器Induction归纳Inductive bias归纳偏好Inductive learning归纳学习Inductive Logic Programming/ILP归纳逻辑程序设计Information entropy信息熵Information gain信息增益Input layer输入层Insensitive loss不敏感损失Inter-cluster similarity簇间相似度International Conference for Machine Learning/ICML国际机器学习大会Intra-cluster similarity簇内相似度Intrinsic value固有值Isometric Mapping/Isomap等度量映射Isotonic regression等分回归Iterative Dichotomiser迭代二分器10、K开头的词汇Kernel method核方法Kernel trick核技巧Kernelized Linear Discriminant Analysis/KLDA核线性判别分析K-fold cross validation k 折交叉验证/k 倍交叉验证K-Means Clustering K –均值聚类K-Nearest Neighbours Algorithm/KNN K近邻算法Knowledge base知识库Knowledge Representation知识表征11、L开头的词汇Label space标记空间Lagrange duality拉格朗日对偶性Lagrange multiplier拉格朗日乘子Laplace smoothing拉普拉斯平滑Laplacian correction拉普拉斯修正Latent Dirichlet Allocation隐狄利克雷分布Latent semantic analysis潜在语义分析Latent variable隐变量Lazy learning懒惰学习Learner学习器Learning by analogy类比学习Learning rate学习率Learning Vector Quantization/LVQ学习向量量化Least squares regression tree最小二乘回归树Leave-One-Out/LOO留一法linear chain conditional random field线性链条件随机场Linear Discriminant Analysis/LDA线性判别分析Linear model线性模型Linear Regression线性回归Link function联系函数Local Markov property局部马尔可夫性Local minimum局部最小Log likelihood对数似然Log odds/logit对数几率Logistic Regression Logistic 回归Log-likelihood对数似然Log-linear regression对数线性回归Long-Short Term Memory/LSTM长短期记忆Loss function损失函数12、M开头的词汇Machine translation/MT机器翻译Macron-P宏查准率Macron-R宏查全率Majority voting绝对多数投票法Manifold assumption流形假设Manifold learning流形学习Margin theory间隔理论Marginal distribution边际分布Marginal independence边际独立性Marginalization边际化Markov Chain Monte Carlo/MCMC马尔可夫链蒙特卡罗方法Markov Random Field马尔可夫随机场Maximal clique最大团Maximum Likelihood Estimation/MLE极大似然估计/极大似然法Maximum margin最大间隔Maximum weighted spanning tree最大带权生成树Max-Pooling最大池化Mean squared error均方误差Meta-learner元学习器Metric learning度量学习Micro-P微查准率Micro-R微查全率Minimal Description Length/MDL最小描述长度Minimax game极小极大博弈Misclassification cost误分类成本Mixture of experts混合专家Momentum动量Moral graph道德图/端正图Multi-class classification多分类Multi-document summarization多文档摘要Multi-layer feedforward neural 
networks多层前馈神经网络Multilayer Perceptron/MLP多层感知器Multimodal learning多模态学习Multiple Dimensional Scaling多维缩放Multiple linear regression多元线性回归Multi-response Linear Regression /MLR多响应线性回归Mutual information互信息13、N开头的词汇Naive bayes朴素贝叶斯Naive Bayes Classifier朴素贝叶斯分类器Named entity recognition命名实体识别Nash equilibrium纳什均衡Natural language generation/NLG自然语言生成Natural language processing自然语言处理Negative class负类Negative correlation负相关法Negative Log Likelihood负对数似然Neighbourhood Component Analysis/NCA近邻成分分析Neural Machine Translation神经机器翻译Neural Turing Machine神经图灵机Newton method牛顿法NIPS国际神经信息处理系统会议No Free Lunch Theorem/NFL没有免费的午餐定理Noise-contrastive estimation噪音对比估计Nominal attribute列名属性Non-convex optimization非凸优化Nonlinear model非线性模型Non-metric distance非度量距离Non-negative matrix factorization非负矩阵分解Non-ordinal attribute无序属性Non-Saturating Game非饱和博弈Norm范数Normalization归一化Nuclear norm核范数Numerical attribute数值属性14、O开头的词汇Objective function目标函数Oblique decision tree斜决策树Occam’s razor奥卡姆剃刀Odds几率Off-Policy离策略One shot learning一次性学习One-Dependent Estimator/ODE独依赖估计On-Policy在策略Ordinal attribute有序属性Out-of-bag estimate包外估计Output layer输出层Output smearing输出调制法Overfitting过拟合/过配Oversampling过采样15、P开头的词汇Paired t-test成对t 检验Pairwise成对型Pairwise Markov property成对马尔可夫性Parameter参数Parameter estimation参数估计Parameter tuning调参Parse tree解析树Particle Swarm Optimization/PSO粒子群优化算法Part-of-speech tagging词性标注Perceptron感知机Performance measure性能度量Plug and Play Generative Network即插即用生成网络Plurality voting相对多数投票法Polarity detection极性检测Polynomial kernel function多项式核函数Pooling池化Positive class正类Positive definite matrix正定矩阵Post-hoc test后续检验Post-pruning后剪枝potential function势函数Precision查准率/准确率Prepruning预剪枝Principal component analysis/PCA主成分分析Principle of multiple explanations多释原则Prior先验Probability Graphical Model概率图模型Proximal Gradient Descent/PGD近端梯度下降Pruning剪枝Pseudo-label伪标记16、Q开头的词汇Quantized Neural Network量子化神经网络Quantum computer量子计算机Quantum Computing量子计算Quasi Newton method拟牛顿法17、R开头的词汇Radial Basis Function/RBF径向基函数Random Forest Algorithm随机森林算法Random walk随机漫步Recall查全率/召回率Receiver Operating Characteristic/ROC受试者工作特征Rectified Linear Unit/ReLU线性修正单元Recurrent Neural Network循环神经网络Recursive neural network递归神经网络Reference model参考模型Regression回归Regularization正则化Reinforcement learning/RL强化学习Representation learning表征学习Representer theorem表示定理reproducing kernel Hilbert space/RKHS再生核希尔伯特空间Re-sampling重采样法Rescaling再缩放Residual Mapping残差映射Residual Network残差网络Restricted Boltzmann Machine/RBM受限玻尔兹曼机Restricted Isometry Property/RIP限定等距性Re-weighting重赋权法Robustness稳健性/鲁棒性Root node根结点Rule Engine规则引擎Rule learning规则学习18、S开头的词汇Saddle point鞍点Sample space样本空间Sampling采样Score function评分函数Self-Driving自动驾驶Self-Organizing Map/SOM自组织映射Semi-naive Bayes classifiers半朴素贝叶斯分类器Semi-Supervised Learning半监督学习semi-Supervised Support Vector Machine半监督支持向量机Sentiment analysis情感分析Separating hyperplane分离超平面Sigmoid function Sigmoid 函数Similarity measure相似度度量Simulated annealing模拟退火Simultaneous localization and mapping同步定位与地图构建Singular Value Decomposition奇异值分解Slack variables松弛变量Smoothing平滑Soft margin软间隔Soft margin maximization软间隔最大化Soft voting软投票Sparse representation稀疏表征Sparsity稀疏性Specialization特化Spectral Clustering谱聚类Speech Recognition语音识别Splitting variable切分变量Squashing function挤压函数Stability-plasticity dilemma可塑性-稳定性困境Statistical learning统计学习Status feature function状态特征函Stochastic gradient descent随机梯度下降Stratified sampling分层采样Structural risk结构风险Structural risk minimization/SRM结构风险最小化Subspace子空间Supervised learning监督学习/有导师学习support vector expansion支持向量展式Support Vector Machine/SVM支持向量机Surrogat loss替代损失Surrogate function替代函数Symbolic 
learning符号学习Symbolism符号主义Synset同义词集19、T开头的词汇T-Distribution Stochastic Neighbour Embedding/t-SNE T–分布随机近邻嵌入Tensor张量Tensor Processing Units/TPU张量处理单元The least square method最小二乘法Threshold阈值Threshold logic unit阈值逻辑单元Threshold-moving阈值移动Time Step时间步骤Tokenization标记化Training error训练误差Training instance训练示例/训练例Transductive learning直推学习Transfer learning迁移学习Treebank树库Tria-by-error试错法True negative真负类True positive真正类True Positive Rate/TPR真正例率Turing Machine图灵机Twice-learning二次学习20、U开头的词汇Underfitting欠拟合/欠配Undersampling欠采样Understandability可理解性Unequal cost非均等代价Unit-step function单位阶跃函数Univariate decision tree单变量决策树Unsupervised learning无监督学习/无导师学习Unsupervised layer-wise training无监督逐层训练Upsampling上采样21、V开头的词汇Vanishing Gradient Problem梯度消失问题Variational inference变分推断VC Theory VC维理论Version space版本空间Viterbi algorithm维特比算法Von Neumann architecture冯·诺伊曼架构22、W开头的词汇Wasserstein GAN/WGAN Wasserstein生成对抗网络Weak learner弱学习器Weight权重Weight sharing权共享Weighted voting加权投票法Within-class scatter matrix类内散度矩阵Word embedding词嵌入Word sense disambiguation词义消歧23、Z开头的词汇Zero-data learning零数据学习Zero-shot learning零次学习。
统计学专业英语词汇AAbsolute deviation,绝对离差Absolute number,绝对数Absolute residuals,绝对残差Acceleration array,加速度立体阵Acceleration in an arbitrary direction,任意方向上的加速度Acceleration normal,法向加速度Acceleration space dimension,加速度空间的维数Acceleration tangential,切向加速度Acceleration vector,加速度向量Acceptable hypothesis,可接受假设Accumulation,累积Accuracy,准确度Actual frequency,实际频数Adaptive estimator,自适应估计量Addition,相加Addition theorem,加法定理Additivity,可加性Adjusted rate,调整率Adjusted value,校正值Admissible error,容许误差Aggregation,聚集性Alternative hypothesis,备择假设Among groups,组间Amounts,总量Analysis of correlation,相关分析Analysis of covariance,协方差分析Analysis of regression,回归分析Analysis of time series,时间序列分析Analysis of variance,方差分析Angular transformation,角转换ANOVA(analysis of variance),方差分析ANOVA Models,方差分析模型Arcing,弧/弧旋Arcsine transformation,反正弦变换Area under the curve,曲线面积AREG,评估从一个时间点到下一个时间点回归相关时的误差ARIMA,季节和非季节性单变量模型的极大似然估计Arithmetic grid paper,算术格纸Arithmetic mean,算术平均数Arrhenius relation,艾恩尼斯关系Assessing fit,拟合的评估Associative laws,结合律Asymmetric distribution,非对称分布Asymptotic bias,渐近偏倚Asymptotic efficiency,渐近效率Asymptotic variance,渐近方差Attributable risk,归因危险度Attribute data,属性资料Attribution,属性Autocorrelation,自相关Autocorrelation of residuals,残差的自相关Average,平均数Average confidence interval length,平均置信区间长度Average growth rate,平均增长率BBar chart,条形图Bar graph,条形图Base period,基期Bayes theorem, 贝叶斯定理Bell-shaped curve,钟形曲线Bernoullidistribution,伯努力分布Best-trim estimator,最好切尾估计量Bias,偏性Binary logistic regression,二元逻辑斯蒂回归Binomial distribution,二项分布Bisquare,双平方Bivariate Correlate,二变量相关Bivariate normal distribution,双变量正态分布Bivariate normal population,双变量正态总体Biweight interval,双权区间Biweight M-estimator,双权M 估计量Block,区组/配伍组BMDP(Biomedical computer programs),BMDP 统计软件包Box plots,箱线图/箱尾图Break down bound,崩溃界/崩溃点CCanonical correlation,典型相关Caption,纵标目Case-control study,病例对照研究Categorical variable,分类变量Catenary,悬链线Cauchy distribution,柯西分布Cause-and-effect relationship,因果关系Cell,单元Censoring,终检Center of symmetry,对称中心Centering and scaling,中心化和定标Central tendency,集中趋势Central value,中心值CHAID-χ2AutomaticInteractionDetector,卡方自动交互检测Chance,机遇Chance error,随机误差Chance variable,随机变量Characteristic equation,特征方程Characteristic root,特征根Characteristic vector,特征向量Chebshev criterion of fit,拟合的切比雪夫准则Chernoff faces,切尔诺夫脸谱图Chi-square test,卡方检验/χ2 检验Choleskey decomposition,乔洛斯基分解Circle chart,圆图Class interval,组距Class mid-value,组中值Class upper limit,组上限Classified variable,分类变量Cluster analysis,聚类分析Cluster sampling,整群抽样Code,代码Coded data,编码数据Coding,编码Coefficient of contingency,列联系数Coefficientof determination,决定系数Coefficient ofmultiple correlation,多重相关系数Coefficient ofpartial correlation,偏相关系数Coefficient of production-moment correlation,积差相关系数Coefficient of rank correlation,等级相关系数Coefficient of regression,回归系数Coefficient of skewness,偏度系数Coefficient of variation,变异系数Cohort study,队列研究Column,列Column effect,列效应Column factor,列因素Combination pool,合并Combinative table,组合表Common factor,共性因子Common regression coefficient,公共回归系数Common value,共同值Common variance,公共方差Common variation,公共变异Communality variance,共性方差Comparability,可比性Comparison of bathes,批比较Comparison value,比较值Compartment model,分部模型Compassion,伸缩Complement of an event,补事件Complete association,完全正相关Complete dissociation,完全不相关Complete statistics,完备统计量Completely randomized design,完全随机化设计Composite event,联合事件/复合事件Concavity,凹性Conditional expectation,条件期望Conditional likelihood,条件似然Conditional probability,条件概率Conditionally linear,依条件线性Confidence interval,置信区间Confidence limit,置信限Confidence lower limit,置信下限Confidence upper limit,置信上限Confirmatory Factor Analysis,验证性因子分析Confirmatory research,证实性实验研究Confounding 
factor,混杂因素Conjoint,联合分析Consistency,相合性Consistency check,一致性检验Consistent asymptotically normal estimate,相合渐近正态估计Consistent estimate,相合估计Constrained nonlinear regression,受约束非线性回归Constraint,约束Contaminated distribution,污染分布Contaminated Gausssian,污染高斯分布Contaminated normal distribution,污染正态分布Contamination,污染Contamination model,污染模型Contingency table,列联表Contour,边界线Contribution rate,贡献率Control,对照Controlled experiments,对照实验Conventional depth,常规深度Convolution,卷积Corrected factor,校正因子Corrected mean,校正均值Correction coefficient,校正系数Correctness,正确性Correlation coefficient,相关系数Correlation index,相关指数Correspondence, 对应Counting,计数Counts,计数/频数Covariance,协方差Covariant,共变Cox Regression, Cox 回归Criteria for fitting,拟合准则Criteria of least squares,最小二乘准则Critical ratio,临界比Critical region,拒绝域Critical value,临界值Cross-over design,交叉设计Cross-section analysis,横断面分析Cross-section survey,横断面调查Cross tabs,交叉表Cross-tabulation table,复合表Cube root,立方根Cumulative distribution function,累计分布函数Cumulative probability,累计概率Curvature,曲率/弯曲Curve fit,曲线拟和Curve fitting,曲线拟合Curvilinear regression,曲线回归Curvilinear relation,曲线关系Cut-and-try method,尝试法Cycle,周期Cyclist,周期性DD test, D 检验Data acquisition,资料收集Databank,数据库Data capacity,数据容量Data deficiencies,数据缺乏Data handling,数据处理Data manipulation,数据处理Data processing,数据处理Data reduction,数据缩减Data set,数据集Data sources,数据来源Data transformation,数据变换Data validity,数据有效性Data-in,数据输入Data-out,数据输出Dead time,停滞期Degree of freedom,自由度Degree of precision,精密度Degree of reliability,可靠性程度Degression,递减Density function,密度函数Density of datapoints,数据点的密度Dependent variable,应变量/依变量/因变量Depth,深度Derivative matrix,导数矩阵Derivative-free methods,无导数方法Design,设计Determinacy,确定性Determinant,行列式Determinant,决定因素Deviation,离差Deviation from average,离均差Diagnostic plot,诊断图Dichotomousvariable,二分变量Differentialequation,微分方程Directstandardization,直接标准化法Discrete variable,离散型变量Discriminant,判断Discriminant analysis,判别分析Discriminant coefficient,判别系数Discriminant function,判别值Dispersion,散布/分散度Disproportional,不成比例的Disproportionate sub-class numbers,不成比例次级组含量Distribution free,分布无关性/免分布Distribution shape,分布形状Distribution-free method,任意分布法Distributive laws,分配律Disturbance,随机扰动项Dose response curve,剂量反应曲线Double blind method,双盲法Doubleblind rial,双盲试验Double exponential distribution,双指数分布Double logarithmic,双对数Downward rank,降秩Dual-space plot,对偶空间图DUD,无导数方法Duncan's new multiple range method,新复极差法/Duncan 新法EEffect, 实验效应Eigen value,特征值Eigen vector,特征向量Ellipse,椭圆Empirical distribution,经验分布Empirical probability,经验概率单位Enumeration data,计数资料Equal sun-class number,相等次级组含量Equally likely,等可能Equal variance,同变性Error,误差/错误Error of estimate,估计误差Error type I,第一类错误Error type II,第二类错误Estimand,被估量Estimated error mean squares,估计误差均方Estimated error sum of squares,估计误差平方和Euclidean distance,欧式距离Event,事件Exceptional data point,异常数据点Expectation plane,期望平面Expectation surface,期望曲面Expected values,期望值Experiment,实验Experimental sampling,试验抽样Experimental unit,试验单位Explanatory variable,说明变量/解释变量Exploratory data analysis,探索性数据分析Explore Summarize,探索-摘要Exponential curve,指数曲线Exponential growth,指数式增长Exsooth,指数平滑方法Extended fit,扩充拟合Extra parameter,附加参数Extra polation,外推法Extreme observation,末端观测值Extremes,极端值/极值FF distribution, F 分布F test, F 检验Factor,因素/因子Factor analysis,因子分析Factor score,因子得分Factorial,阶乘Factorial design,析因试验设计False negative,假阴性False negative error,假阴性错误Family of distributions,分布族Family of estimators,估计量族Fanning,扇面Fatality rate,病死率Field investigation,现场调查Field survey,现场调查Finitepopulation,有限总体Finite-sample, 有限样本Firstderivative,一阶导数First principal component,第一主成分First quartile,第一四分位数Fisher information,费雪信息量Fitted 
value,拟合值Fitting a curve,曲线拟合Fixed base,定基Fluctuation,随机起伏Forecast,预测Four fold table,四格表Fourth, 四分点Fraction blow,左侧比率Fractional error,相对误差Frequency,频率Frequency polygon,频数多边图Frontier point,界限点Function relationship,泛函关系GGamma distribution,伽玛分布Gauss increment,高斯增量Gaussian distribution,高斯分布/正态分布Gauss-Newton increment,高斯-牛顿增量General census,全面普查GENLOG(Generalized liner models),广义线性模型Geometric mean,几何平均数Gini's mean difference,基尼均差GLM(General liner models),通用线性模型Goodness of fit,拟和优度/配合度Gradientof determinant,行列式的梯度Graeco-Latin square,希腊拉丁方Grand mean,总均值Gross errors,重大错误Gross-error sensitivity,大错敏感度Group averages,分组平均Grouped data,分组资料Guessed mean,假定平均数HHalf-life,半衰期Hampel M-estimators,汉佩尔M 估计量Happenstance,偶然事件Harmonic mean,调和均数Hazard function,风险均数Hazard rate,风险率Heading,标目Heavy-tailed distribution,重尾分布Hessian array,海森立体阵Heterogeneity,不同质Heterogeneity of variance,方差不齐Hierarchical classification,组内分组Hierarchical clustering method,系统聚类法High-leverage point,高杠杆率点HILOGLINEAR,多维列联表的层次对数线性模型Hinge,折叶点Histogram,直方图Historical cohort study,历史性队列研究Holes,空洞HOMALS,多重响应分析Homogeneity of variance,方差齐性Homogeneity test,齐性检验Huber M-estimators,休伯M 估计量Hyperbola,双曲线Hypothesis testing,假设检验Hypothetical universe,假设总体IImpossible event,不可能事件Independence,独立性Independent variable,自变量Index,指标/指数Indirect standardization,间接标准化法Individual,个体Inference band, 推断带Infinite population,无限总体Infinitely great, 无穷大Infinitely small,无穷小Influence curve,影响曲线Information capacity,信息容量Initial condition,初始条件Initial estimate,初始估计值Initial level,最初水平Interaction,交互作用Interaction terms,交互作用项Intercept,截距Interpolation,内插法Inter quartile range,四分位距Interval estimation,区间估计Intervals of equal probability,等概率区间Intrinsic curvature,固有曲率Invariance, 不变性Inverse matrix,逆矩阵Inverse probability,逆概率Inverse sine transformation,反正弦变换Iteration,迭代JJacobian determinant,雅可比行列式Joint distribution function,联合分布函数Joint probability,联合概率Joint probability distribution,联合概率分布KK means method,逐步聚类法Kaplan-Meier,评估事件的时间长度Kaplan-Merier chart, Kaplan-Merier 图Kendall's rank correlation, Kendall 等级相关Kinetic,动力学Kolmogorov-Smirnove test,柯尔莫哥洛夫-斯米尔诺夫检验Kruskal and Wallis test, Kruskal 及Wallis 检验/多样本的秩和检验/H 检验Kurtosis,峰度LLack of fit,失拟Ladder of powers,幂阶梯Lag,滞后Large sample,大样本Large sample test,大样本检验Latin square,拉丁方Latin square design,拉丁方设计Leakage,泄漏Least favorable configuration,最不利构形Least favorable distribution,最不利分布Least significant difference,最小显著差法Least square method,最小二乘法Least-absolute-residuals estimates,最小绝对残差估计Least-absolute-residuals fit,最小绝对残差拟合Least-absolute-residuals line,最小绝对残差线Legend,图例L-estimator,L 估计量L-estimator of location,位置L 估计量L-estimator of scale,尺度L 估计量Level,水平Life expectance,预期期望寿命Life table,寿命表Life table method,生命表法Light-taile distribution,轻尾分布Likelihood function,似然函数Likelihood ratio,似然比Line graph,线图Linear correlation,直线相关Linear equation,线性方程Linear programming,线性规划Linear regression,直线回归/线性回归Linear trend,线性趋势Loading,载荷Location and scale equi variance,位置尺度同变性Location equi variance,位置同变性Location invariance,位置不变性Location scale family,位置尺度族Log rank test,时序检验Logarithmic curve,对数曲线Logarithmic normal distribution,对数正态分布Logarithmic scale,对数尺度Logarithmic transformation,对数变换Logic check,逻辑检查Logistic distribution,逻辑斯蒂分布Logit transformation, Logit 转换LOGLINEAR,多维列联表通用模型Lognormal distribution,对数正态分布Lost function,损失函数Low correlation,低度相关Lower limit,下限Lowest-attained variance,最小可达方差LSD,最小显著差法的简称Lurking variable,潜在变量MMain effect,主效应Major heading,主辞标目Marginal density function,边缘密度函数Marginal probability,边缘概率Marginal probability distribution,边缘概率分布Matched data,配对资料Matched distribution,匹配过分布Matching of 
distribution,分布的匹配Matching of transformation,变换的匹配Mathematical expectation,数学期望Mathematical model,数学模型MaximumL-estimator,极大L 估计量Maximumlikelihood method,最大似然法Mean,均数Mean squares between groups,组间均方Mean squares within group,组内均方Means (Compare means),均值-均值比较Median,中位数Median effective dose,半数效量Median lethal dose,半数致死量Median polish,中位数平滑Median test,中位数检验Minimal sufficient statistic,最小充分统计量Minimum distance estimation,最小距离估计Minimum effective dose,最小有效量Minimum lethal dose,最小致死量Minimum variance estimator,最小方差估计量MINITAB,统计软件包Minor heading,宾词标目Missing data,缺失值Model specification,模型的确定Modeling Statistics ,模型统计Models for outliers,离群值模型Modifying the model,模型的修正Modulus of continuity,连续性模Morbidity,发病率Most favorable configuration,最有利构形Multidimensional Scaling (ASCAL),多维尺度/多维标度Multinomial Logistic Regression ,多项逻辑斯蒂回归Multiple comparison,多重比较Multiple correlation ,复相关Multiple covariance,多元协方差Multiple linear regression,多元线性回归Multiple response ,多重选项Multiple solutions,多解Multiplication theorem,乘法定理Multiresponse,多元响应Multi-stage sampling,多阶段抽样Multivariate T distribution,多元T 分布Mutual exclusive,互不相容Mutual independence,互相独立NNatural boundary,自然边界Natural dead,自然死亡Natural zero,自然零Negative correlation,负相关Negative linear correlation,负线性相关Negatively skewed,负偏Newman-Keuls method, q 检验NK method, q 检验No statistical significance,无统计意义Nominal variable,名义变量Nonconstancy of variability,变异的非定常性Nonlinear regression,非线性相关Nonparametric statistics,非参数统计Nonparametric test,非参数检验Normal deviate,正态离差Normal distribution,正态分布Normal equation,正规方程组Normal ranges,正常范围Normal value,正常值Nuisance parameter,多余参数/讨厌参数Null hypothesis,无效假设Numerical variable,数值变量OObjective function,目标函数Observation unit,观察单位Observed value, 观察值One sided test,单侧检验One-way analysis of variance,单因素方差分析One way ANOVA ,单因素方差分析Open sequential trial,开放型序贯设计Optrim, 优切尾Optrim efficiency,优切尾效率Order statistics,顺序统计量Ordered categories,有序分类Ordinal logistic regression ,序数逻辑斯蒂回归Ordinal variable,有序变量Orthogonal basis,正交基Orthogonal design,正交试验设计Orthogonality conditions,正交条件ORTHOPLAN,正交设计Outlier cutoffs,离群值截断点Outliers,极端值OVERALS ,多组变量的非线性正规相关Overshoot,迭代过度PPaired design,配对设计Paired sample,配对样本Pairwise slopes,成对斜率Parabola,抛物线Parallel tests,平行试验Parameter,参数Parametric statistics,参数统计Parametric test,参数检验Partial correlation,偏相关Partial regression,偏回归Partial sorting,偏排序Partials residuals,偏残差Pattern,模式Pearson curves,皮尔逊曲线Peeling,退层Percent bar graph,百分条形图Percentage, 百分比Percentile, 百分位数Percentile curves,百分位曲线Periodicity,周期性Permutation,排列P-estimator,P 估计量Pie graph,饼图Pitman estimator,皮特曼估计量Pivot,枢轴量Planar,平坦Planar assumption,平面的假设PLANCARDS,生成试验的计划卡Point estimation,点估计Poisson distribution,泊松分布Polishing,平滑Polled standard deviation,合并标准差Polled variance,合并方差Polygon,多边图Polynomial,多项式Polynomial curve,多项式曲线Population,总体Population attributable risk,人群归因危险度Positive correlation,正相关Positively skewed,正偏Posterior distribution,后验分布Power of a test,检验效能Precision,精密度Predicted value,预测值Preliminary analysis,预备性分析Principalcomponent analysis,主成分分析Priordistribution,先验分布Prior probability,先验概率Probabilistic model,概率模型probability,概率Probability density,概率密度Product moment,乘积矩/协方差Profile trace,截面迹图Proportion,比/构成比Proportion allocation in stratified random sampling,按比例分层随机抽样Proportionate,成比例Proportionate sub-class numbers,成比例次级组含量Prospective study,前瞻性调查Proximities, 亲近性Pseudo F test,近似F 检验Pseudo model,近似模型Pseudo sigma,伪标准差Purposive sampling,有目的抽样QQR decomposition, QR 分解Quadratic approximation,二次近似Qualitative classification,属性分类Qualitative method,定性方法Quantile-quantile plot,分位数-分位数图/Q-Q 图Quantitative analysis,定量分析Quartile,四分位数Quick 
Cluster,快速聚类RRadix sort,基数排序Random allocation,随机化分组Random blocks design,随机区组设计Random event,随机事件Randomization,随机化Range,极差/全距Rank correlation,等级相关Rank sum test,秩和检验Rank test,秩检验Ranked data,等级资料Rate,比率Ratio,比例Raw data,原始资料Rawresidual,原始残差Rayleigh's test,雷氏检验Rayleigh's Z,雷氏Z 值Reciprocal,倒数Reciprocal transformation,倒数变换Recording,记录Redescending estimators,回降估计量Reducing dimensions,降维Re-expression,重新表达Reference set,标准组Regionof acceptance,接受域Regression coefficient,回归系数Regression sum of square,回归平方和Rejection point,拒绝点Relative dispersion,相对离散度Relative number,相对数Reliability,可靠性Reparametrization,重新设置参数Replication,重复Report Summaries,报告摘要Residual sum of square,剩余平方和Resistance,耐抗性Resistant line,耐抗线Resistant technique,耐抗技术R-estimator of location,位置R 估计量R-estimator of scale,尺度R 估计量Retrospective study,回顾性调查Ridge trace,岭迹Ridit analysis , Ridit 分析Rotation, 旋转Rounding,舍入Row,行Row effects,行效应Row factor,行因素RXC table, RXC 表SSample,样本Sample regression coefficient,样本回归系数Sample size,样本量Sample standard deviation,样本标准差Sampling error,抽样误差SAS(Statistical analysis system ),SAS 统计软件包Scale,尺度/量表Scatter diagram,散点图Schematic plot,示意图/简图Score test,计分检验Screening,筛检SEASON, 季节分析Second derivative,二阶导数Second principal component,第二主成分SEM (Structural equation modeling),结构化方程模型Semi-logarithmic graph,半对数图Semi-logarithmic paper,半对数格纸Sensitivity curve,敏感度曲线Sequential analysis,贯序分析Sequential data set,顺序数据集Sequential design,贯序设计Sequential method,贯序法Sequential test,贯序检验法Serial tests,系列试验Short-cut method,简捷法Sigmoid curve, S 形曲线Sign function,正负号函数Sign test,符号检验Signed rank,符号秩Significance test,显著性检验Significant figure,有效数字Simple cluster sampling,简单整群抽样Simple correlation,简单相关Simple random sampling,简单随机抽样Simple regression,简单回归simple table,简单表Sine estimator,正弦Single-valued estimate, 单值估计Singular matrix, 奇异矩阵Skeweddistribution, 偏斜分布Skewness,偏度Slash distribution, 斜线分布Slope, 斜率Smirnov test, 斯米尔诺夫检验Source of variation, 变异来源Spearman rank correlation, 斯皮尔曼等级相关Specific factor, 特殊因子Specific factor variance, 特殊因子方差Spectra , 频谱Spherical distribution, 球型正态分布Spread, 展布SPSS(Statistical package for the social science), SPSS 统计软件包Spurious correlation, 假性相关Square root transformation, 平方根变换Stabilizing variance, 稳定方差Standard deviation, 标准差Standard error, 标准误Standard error of difference, 差别的标准误Standard error of estimate, 标准估计误差Standard error of rate, 率的标准误Standard normal distribution, 标准正态分布Standardization, 标准化Starting value, 起始值Statistic, 统计量Statistical control, 统计控制Statistical graph, 统计图Statistical inference, 统计推断Statistical table, 统计表Steepest descent, 最速下降法Stem and leaf display, 茎叶图Step factor, 步长因子Stepwiseregression, 逐步回归Storage,存Strata, 层(复数)Stratified sampling, 分层抽样Stratified sampling, 分层抽样Strength, 强度Stringency, 严密性Structural relationship, 结构关系Studentized residual, 学生化残差/t 化残差Sub-class numbers, 次级组含量Subdividing, 分割Sufficient statistic, 充分统计量Sum of products, 积和Sum of squares, 离差平方和Sum of squares about regression, 回归平方和Sum of squares between groups, 组间平方和Sum of squares of partial regression, 偏回归平方和Sure event, 必然事件Survey, 调查Survival,生存分析Survival rate,生存率Suspended root gram, 悬吊根图Symmetry, 对称Systematic error, 系统误差Systematic sampling, 系统抽样Tags, 标签Tail area, 尾部面积Tail length, 尾长Tail weight, 尾重Tangent line, 切线Target distribution, 目标分布Taylor series, 泰勒级数Test(检验)Test of linearity, 线性检验Tendency of dispersion, 离散趋势Testing of hypotheses, 假设检验Theoretical frequency, 理论频数Timeseries, 时间序列Tolerance interval, 容忍区间Tolerance lower limit, 容忍下限Tolerance upper limit, 容忍上限Torsion, 扰率Total sum of square, 总平方和Total variation, 总变异Transformation, 转换Treatment, 处理Trend, 趋势Trend of 
percentage, 百分比趋势Trial, 试验Trial and error method, 试错法Tuning constant, 细调常数Twosided test, 双向检验Two-stage least squares, 二阶最小平方Two-stage sampling, 二阶段抽样Two-tailed test, 双侧检验Two-way analysis of variance, 双因素方差分析Two-way table, 双向表Type I error, 一类错误/α 错误TypeII error, 二类错误/β 错误UMVU, 方差一致最小无偏估计简称Unbiasedestimate, 无偏估计Unconstrained nonlinear regression , 无约束非线性回归Unequal subclass number, 不等次级组含量Ungrouped data, 不分组资料Uniform coordinate, 均匀坐标Uniform distribution, 均匀分布Uniformly minimum variance unbiased estimate, 方差一致最小无偏估计Unit, 单元Unordered categories, 无序分类Unweightedleast squares, 未加权最小平方法Upper limit,上限Upward rank, 升秩Vague concept, 模糊概念Validity, 有效性VARCOMP (Variance component estimation), 方差元素估计Variability, 变异性Variable, 变量Variance, 方差Variation, 变异Varimax orthogonal rotation, 方差最大正交旋转Volume of distribution, 容积W test, W 检验Weibull distribution, 威布尔分布Weight, 权数Weighted Chi-square test, 加权卡方检验/Cochran 检验Weighted linear regression method, 加权直线回归Weighted mean, 加权平均数Weighted mean square, 加权平均方差Weighted sum of square, 加权平方和Weighting coefficient, 权重系数Weighting method, 加权法W-estimation, W 估计量W-estimation of location, 位置W 估计量Width, 宽度Wilcoxon paired test, 威斯康星配对法/配对符号秩和检验Wild point, 野点/狂点Wild value, 野值/狂值Winsorized mean, 缩尾均值Withdraw, 失访Youden's index, 尤登指数Z test, Z 检验Zero correlation, 零相关Z-transformation, Z 变换。
School of Electrical and Information Engineering
Foreign Literature Translation
English title: Data mining - clustering
Translated title: 数据挖掘—聚类分析
Major: Automation
Name: ****
Class and student ID: ****
Supervisor: ******
Source: Data Mining, by Ian H. Witten and Eibe Frank
April 26, 2010

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:
● Set of like elements. Elements from different clusters are not alike.
● The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.
A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.
As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.
Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.
When clustering is applied to a real-world database, many interesting problems occur:
● Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
● Dynamic data in the database implies that cluster membership may change over time.
● Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.
● There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
● Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.
We can then summarize some basic features of clustering (as opposed to classification):
● The (best) number of clusters is not known.
● There may not be any a priori knowledge concerning the clusters.
● Cluster results are dynamic.
The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster, K_j, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss of generality, we will view that the result of solving a clustering problem is that a set of clusters is created: K = {K_1, K_2, ..., K_k}.
DEFINITION 5.1. Given a database D = {t_1, t_2, ..., t_n} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {1, ..., k} where each t_i is assigned to one cluster K_j, 1 ≤ j ≤ k. A cluster K_j contains precisely those tuples mapped to it; that is, K_j = {t_i | f(t_i) = K_j, 1 ≤ i ≤ n, and t_i ∈ D}.
A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures, which can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional classification supervised learning algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.
The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive.
"Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures.
We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES

There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(t_i, t_l), defined between any two tuples t_i, t_l ∈ D. This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.
A distance measure, dis(t_i, t_j), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster K_j, ∀ t_jl, t_jm ∈ K_j and t_i ∉ K_j, dis(t_jl, t_jm) ≤ dis(t_jl, t_i).
Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. The cluster can then be described by using several characteristic values. Given a cluster K_m of N points {t_m1, t_m2, ..., t_mN}, we make the following definitions [ZRL96]: the centroid C_m = (1/N) Σ t_mi, the radius R_m = sqrt((1/N) Σ (t_mi − C_m)^2), and the diameter D_m = sqrt((1/(N(N−1))) Σ_i Σ_j (t_mi − t_mj)^2). Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is based on the distances between pairs of points in the cluster. We use the notation M_m to indicate the medoid for cluster K_m.
Many clustering algorithms require that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters K_i and K_j, there are several standard alternatives to calculate the distance between clusters. A representative list is:
● Single link: smallest distance between an element in one cluster and an element in the other. We thus have dis(K_i, K_j) = min(dis(t_il, t_jm)) ∀ t_il ∈ K_i ∉ K_j and ∀ t_jm ∈ K_j ∉ K_i.
● Complete link: largest distance between an element in one cluster and an element in the other. We thus have dis(K_i, K_j) = max(dis(t_il, t_jm)) ∀ t_il ∈ K_i ∉ K_j and ∀ t_jm ∈ K_j ∉ K_i.
● Average: average distance between an element in one cluster and an element in the other. We thus have dis(K_i, K_j) = mean(dis(t_il, t_jm)) ∀ t_il ∈ K_i ∉ K_j and ∀ t_jm ∈ K_j ∉ K_i.
● Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance between the centroids. We thus have dis(K_i, K_j) = dis(C_i, C_j), where C_i is the centroid for K_i and similarly for C_j.
● Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: dis(K_i, K_j) = dis(M_i, M_j).
(A short R sketch implementing these measures follows this excerpt.)

5.3 OUTLIERS

As mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier.
Some clustering techniques do not perform well with the presence of outliers. This problem is illustrated in Figure 5.3. Here if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer together than the outlier. This problem is complicated by the fact that many clustering algorithms actually have as input the number of desired clusters to be found.
Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water level values occur very infrequently, and when compared with the normal water level values they may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there would be no data that showed that floods ever actually occurred.
Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or treat these values differently. Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very realistic for real-world data because real-world data values may not follow well-defined data distributions. Also, most of these tests assume a single attribute value, and many attributes are involved in real-world datasets. Alternative detection techniques may be based on distance measures.
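The five inter-cluster measures listed in Section 5.2 reduce to a few lines of code. The following R sketch (not part of the original text) computes single link, complete link, average, and centroid distances between two clusters of numeric tuples; the medoid variant would simply substitute a stored medoid row for the column means.

# Illustrative helper for the inter-cluster distances of Section 5.2.
# Ki and Kj are numeric matrices whose rows are the tuples t_il and t_jm.
clusterDist <- function(Ki, Kj, type = c("single", "complete", "average", "centroid")) {
  type <- match.arg(type)
  if (type == "centroid") {
    # dis(Ki, Kj) = dis(Ci, Cj), the distance between the two centroids
    return(sqrt(sum((colMeans(Ki) - colMeans(Kj))^2)))
  }
  # all pairwise Euclidean distances dis(t_il, t_jm) across the two clusters
  pair <- outer(seq_len(nrow(Ki)), seq_len(nrow(Kj)),
                Vectorize(function(l, m) sqrt(sum((Ki[l, ] - Kj[m, ])^2))))
  switch(type,
         single   = min(pair),   # smallest cross-cluster distance
         complete = max(pair),   # largest cross-cluster distance
         average  = mean(pair))  # mean cross-cluster distance
}
set.seed(1)
Ki <- matrix(rnorm(10), ncol = 2)
Kj <- matrix(rnorm(10, mean = 3), ncol = 2)
sapply(c("single", "complete", "average", "centroid"),
       function(ty) clusterDist(Ki, Kj, ty))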
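In the same spirit, the distance-based outlier detection mentioned at the end of Section 5.3 can be sketched as follows. This is a deliberately crude illustration (flag points unusually far from the data centroid), not one of the discordancy tests the text refers to.

# Flag rows whose distance to the data centroid exceeds the mean such
# distance by k standard deviations: a simple distance-based outlier test.
flagOutliers <- function(X, k = 3) {
  centroid <- colMeans(X)
  d <- sqrt(rowSums(sweep(X, 2, centroid)^2))  # distance of each row to the centroid
  d > mean(d) + k * sd(d)
}
set.seed(1)
X <- rbind(matrix(rnorm(40), ncol = 2), c(10, 10))  # 20 inliers plus one planted outlier
which(flagOutliers(X, k = 2))                       # should flag row 21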
Package‘kamila’October13,2022Type PackageVersion0.1.2Date2020-03-10Author Alexander Foss[aut,cre],Marianthi Markatou[aut]Maintainer Alexander Foss<************************>Title Methods for Clustering Mixed-Type DataDescription Implements methods for clustering mixed-type data,specifically combinations of continuous and nominal data.Special attentionis paid to the often-overlooked problem of equitably balancing thecontribution of the continuous and categorical variables.This packageimplements KAMILA clustering,a novel method for clusteringmixed-type data in the spirit of k-means clustering.It does not requiredummy coding of variables,and is efficient enough to scale to rather largedata sets.Also implemented is Modha-Spangler clustering,which uses abrute-force strategy to maximize the cluster separation simultaneously in thecontinuous and categorical variables.For more information,see Foss,Markatou, Ray,&Heching(2016)<doi:10.1007/s10994-016-5575-7>and Foss&Markatou (2018)<doi:10.18637/jss.v083.i13>.Depends R(>=3.0.0)License GPL-3|file LICENSEURL https:///ahfoss/kamilaBugReports https:///ahfoss/kamila/issuesImports stats,abind,KernSmooth,gtools,Rcpp,plyrLinkingTo RcppSuggests testthat,clustMD,ggplot2,HmiscRoxygenNote7.1.0NeedsCompilation yesRepository CRANDate/Publication2020-03-1307:20:02UTC1R topics documented:kamila-package (2)classifyKamila (4)dptmCpp (5)dummyCodeFactorDf (5)genMixedData (6)gmsClust (7)kamila (9)wkmeans (12)Index14 kamila-package Clustering for mixed continuous and categorical data setsDescriptionA collection of methods for clustering mixed type data,including KAMILA(KAy-means for MIxedLArge data)and aflexible implementation of Modha-Spangler clusteringDetailsPackage:kamilaType:PackageVersion:0.1.0Date:2015-10-06License:GPL-3Author(s)Alex Foss and Marianthi MarkatouMaintainer:Alex Foss<************************>ReferencesAH Foss,M Markatou,B Ray,and A Heching(in press).A semiparametric method for clustering mixed data.Machine Learning,DOI:10.1007/s10994-016-5575-7.DS Modha and S Spangler(2003).Feature weighting in k-means clustering.Machine Learning 52(3),217-237.Examples##Not run:#import and format a mixed-type data setdata(Byar,package= clustMD )Byar$logSpap<-log(Byar$Serum.prostatic.acid.phosphatase)conInd<-c(5,6,8:10,16)conVars<-Byar[,conInd]conVars<-data.frame(scale(conVars))catVarsFac<-Byar[,-c(1:2,conInd,11,14,15)]catVarsFac[]<-lapply(catVarsFac,factor)catVarsDum<-dummyCodeFactorDf(catVarsFac)#Modha-Spangler clustering with kmeans default Hartigan-Wong algorithm gmsResHw<-gmsClust(conVars,catVarsDum,nclust=3)#Modha-Spangler clustering with kmeans Forgy-Lloyd algorithm#NOTE searchDensity should be>=10for optimal performance:#this is just a syntax demogmsResLloyd<-gmsClust(conVars,catVarsDum,nclust=3,algorithm="Lloyd",searchDensity=3)#KAMILA clusteringkamRes<-kamila(conVars,catVarsFac,numClust=3,numInit=10)#Plot resultsternarySurvival<-factor(Byar$SurvStat)levels(ternarySurvival)<-c( Alive , DeadProst , DeadOther )[c(1,2,rep(3,8))] plottingData<-cbind(conVars,catVarsFac,KamilaCluster=factor(kamRes$finalMemb),MSCluster=factor(gmsResHw$results$cluster))plottingData$Bone.metastases<-ifelse(plottingData$Bone.metastases== 1 ,yes= Yes ,no= No )#Plot Modha-Spangler/Hartigan-Wong resultsmsPlot<-ggplot(plottingData,aes(x=logSpap,y=Index.of.tumour.stage.and.histolic.grade,color=ternarySurvival,shape=MSCluster))plotOpts<-function(pl)(pl+geom_point()+scale_shape_manual(values=c(2,3,7))+geom_jitter())plotOpts(msPlot)#Plot KAMILA 
resultskamPlot<-ggplot(plottingData,aes(x=logSpap,y=Index.of.tumour.stage.and.histolic.grade,color=ternarySurvival,4classifyKamila shape=KamilaCluster))plotOpts(kamPlot)##End(Not run)classifyKamila Classify new data into existing KAMILA clustersDescriptionA function that classifies a new data set into existing KAMILA clusters using the output object fromthe kamila function.UsageclassifyKamila(obj,newData)Argumentsobj An output object from the kamila function.newData A list of length2,withfirst element a data frame of continuous variables,andsecond element a data frame of categorical factors.DetailsA function that takes obj,the output from the kamila function,and newData,a list of length2,where thefirst element is a data frame of continuous variables,and the second element is a data frame of categorical factors.Both data frames must have the same format as the original data used to construct the kamila clustering.ValueAn integer vector denoting cluster assignments of the new data points.ReferencesFoss A,Markatou M;kamila:Clustering Mixed-Type Data in R and Hadoop.Journal of Statistical Software,83(13).2018.doi:10.18637/jss.v083.i13Examples#Generate toy data setset.seed(1234)dat1<-genMixedData(400,nConVar=2,nCatVar=2,nCatLevels=4,nConWithErr=2,nCatWithErr=2,popProportions=c(.5,.5),conErrLev=0.2,catErrLev=0.2)#Partition the data into training/test settrainingIds<-sample(nrow(dat1$conVars),size=300,replace=FALSE)catTrain<-data.frame(apply(dat1$catVars[trainingIds,],2,factor),stringsAsFactors=TRUE)dptmCpp5 conTrain<-data.frame(scale(dat1$conVars)[trainingIds,],stringsAsFactors=TRUE)catTest<-data.frame(apply(dat1$catVars[-trainingIds,],2,factor),stringsAsFactors=TRUE) conTest<-data.frame(scale(dat1$conVars)[-trainingIds,],stringsAsFactors=TRUE)#Run the kamila clustering procedure on the training setkamilaObj<-kamila(conTrain,catTrain,numClust=2,numInit=10)table(dat1$trueID[trainingIds],kamilaObj$finalMemb)#Predict membership in the test data setkamilaPred<-classifyKamila(kamilaObj,list(conTest,catTest))table(dat1$trueID[-trainingIds],kamilaPred)dptmCpp Calculate distances from a set of points to a set of centroidsDescriptionA function that calculates a NxM matrix of distances between a NxP set of points and a MxP set ofpoints.UsagedptmCpp(pts,myMeans,wgts)Argumentspts A matrix of pointsmyMeans A matrix of centroids,must have same ncol as ptswgts A Px1vector of variable weightsValueA MxP matrix of distancesdummyCodeFactorDf Dummy coding of a data frame of factor variablesDescriptionGiven a data frame of factor variables,this function returns a numeric matrix of0–1dummy-coded variables.UsagedummyCodeFactorDf(dat)Argumentsdat A data frame of factor variables6genMixedDataValueA numeric matrix of0–1dummy coded variablesExamplesdd<-data.frame(a=factor(1:8),b=factor(letters[1:8]),stringsAsFactors=TRUE)dummyCodeFactorDf(dd)genMixedData Generate simulated mixed-type data with cluster structure.DescriptionThis function simulates mixed-type data sets with a latent cluster structure,with continuous and nominal variables.UsagegenMixedData(sampSize,nConVar,nCatVar,nCatLevels,nConWithErr,nCatWithErr,popProportions,conErrLev,catErrLev)ArgumentssampSize Integer:Size of the simulated data set.nConVar The number of continuous variables.nCatVar The number of categorical variables.nCatLevels Integer:The number of categories per categorical variables.Currently must bea multiple of the number of populations specified in popProportions.nConWithErr Integer:The number of continuous variables with error.nCatWithErr 
Integer:The number of categorical variables with error.popProportions A vector of scalars that sums to one.The length gives the number of populations (clusters),with values denoting the prior probability of observing a memberof the corresponding population.NOTE:currently only two populations aresupported.conErrLev A scalar between0.01and1denoting the univariate overlap between clusters on the continuous variables specified to have error.catErrLev Univariate overlap level for the categorical variables with error.DetailsThis function simulates mixed-type data sets with a latent cluster structure.Continuous variables follow a normal mixture model,and categorical variables follow a multinomial mixture model.Overlap of the continuous and categorical variables(i.e.how clear the cluster structure is)can be manipulated by the user.Overlap between two clusters is the area of the overlapping region de-fined by their densities(or,for categorical variables,the summed height of overlapping segments defined by their point masses).The default overlap level is0.01(i.e.almost perfect separation).A user-specified number of continuous and categorical variables can be specified to be"error vari-ables"with arbitrary overlap within0.01and1.00(where1.00corresponds to complete overlap).NOTE:Currently,only two populations(clusters)are supported.While exact control of overlap between two clusters is straightforward,controlling the overlap between the K choose2pairwise combinations of clusters is a more difficult task.ValueA list with the following elements:trueID Integer vector giving population(cluster)membership of each observation trueMus Mean parameters used for population(cluster)centers in the continuous vari-ablesconVars The continuous variableserrVariance Variance parameter used for continuous error distributionpopProbsNoErr Multinomial probability vectors for categorical variables without measurement errorpopProbsWithErrMultinomial probability vectors for categorical variables with measurement er-rorcatVars The categorical variablesExamplesdat<-genMixedData(100,2,2,nCatLevels=4,nConWithErr=1,nCatWithErr=1,popProportions=c(0.3,0.7),conErrLev=0.3,catErrLev=0.2)with(dat,plot(conVars,col=trueID))with(dat,table(data.frame(catVars[,1:2],trueID,stringsAsFactors=TRUE)))gmsClust A general implementation of Modha-Spangler clustering for mixed-type data.DescriptionModha-Spangler clustering estimates the optimal weighting for continuous vs categorical variables using a brute-force search strategy.UsagegmsClust(conData,catData,nclust,searchDensity=10,clustFun=wkmeans,conDist=squaredEuc,catDist=squaredEuc,...)ArgumentsconData A data frame of continuous variables.catData A data frame of categorical variables;the allowable variable types depend on the specific clustering function used.nclust An integer specifying the number of clusters.searchDensity An integer determining the number of distinct cluster weightings evaluated in the brute-force search.clustFun The clustering function to be applied.conDist The continuous distance function used to construct the objective function.catDist The categorical distance function used to construct the objective function....Arguments to be passed to the clustFun.DetailsModha-Spangler clustering uses a brute-force search strategy to estimate the optimal weighting for continuous vs categorical variables.This implementation admits an arbitrary clustering function and arbitrary objective functions for continuous and categorical variables.The input parameter clustFun must be a function 
accepting inputs(conData,catData,conWeight, nclust,...)and returning a list containing(at least)the elements cluster,conCenters,and catCenters.The list element"cluster"contains cluster memberships denoted by the integers1:nclust.The list elements"conCenters"and"catCenters"must be data frames whose rows denote cluster centroids.The function clustFun must allow nclust=1,in which case$centers returns a data frame with a single row.Input parameters conDist and catDist are functions that must each take two data frame rows as input and return a scalar distance measure.ValueA list containing the following results objects:results A results object corresponding to the base clustering algorithmobjFun A numeric vector of length searchDensity containing the values of the objec-tive function for each weight usedQcon A numeric vector of length searchDensity containing the values of the contin-uous component of the objective functionQcon A numeric vector of length searchDensity containing the values of the cate-gorical component of the objective functionbestInd The index of the most successful runweights A numeric vector of length searchDensity containing the continuous weightsusedReferencesFoss A,Markatou M;kamila:Clustering Mixed-Type Data in R and Hadoop.Journal of Statistical Software,83(13).2018.doi:10.18637/jss.v083.i13Modha DS,Spangler WS;Feature Weighting in k-Means Clustering.Machine Learning,52(3).2003.doi:10.1023/a:1024016609528Examples##Not run:#Generate toy data set with poor quality categorical variables and good#quality continuous variables.set.seed(1)dat<-genMixedData(200,nConVar=2,nCatVar=2,nCatLevels=4,nConWithErr=2,nCatWithErr=2,popProportions=c(.5,.5),conErrLev=0.3,catErrLev=0.8)catDf<-dummyCodeFactorDf(data.frame(apply(dat$catVars,2,factor),stringsAsFactors=TRUE)) conDf<-data.frame(scale(dat$conVars),stringsAsFactors=TRUE)msRes<-gmsClust(conDf,catDf,nclust=2)table(msRes$results$cluster,dat$trueID)##End(Not run)kamila KAMILA clustering of mixed-type data.DescriptionKAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.Usagekamila(conVar,catFactor,numClust,numInit,conWeights=rep(1,ncol(conVar)),catWeights=rep(1,ncol(catFactor)),maxIter=25,conInitMethod="runif",catBw=0.025,verbose=FALSE,calcNumClust="none",numPredStrCvRun=10,predStrThresh=0.8)ArgumentsconVar A data frame of continuous variables.catFactor A data frame of factors.numClust The number of clusters returned by the algorithm.numInit The number of initializations used.conWeights A vector of continuous weights for the continuous variables.catWeights A vector of continuous weights for the categorical variables.maxIter The maximum number of iterations in each run.conInitMethod Character:The method used to initialize each run.catBw The bandwidth used for the categorical kernel.verbose Logical:Whether detailed results should be printed and returned.calcNumClust Character:Method for selecting the number of clusters.numPredStrCvRunNumeric:Number of CV runs for prediction strength method.Ignored unlesscalcNumClust==’ps’predStrThresh Numeric:Threshold for prediction strength method.Ignored unless calcNum-Clust==’ps’DetailsKAMILA(KAy-means for MIxed LArge data sets)is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables.It uses a kernel density estima-tion technique toflexibly model spherical clusters in the continuous domain,and uses a multinomial model in the categorical domain.Weighting scheme:If no 
kamila  KAMILA clustering of mixed-type data.

Description

KAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.

Usage

kamila(conVar, catFactor, numClust, numInit, conWeights = rep(1, ncol(conVar)),
  catWeights = rep(1, ncol(catFactor)), maxIter = 25, conInitMethod = "runif",
  catBw = 0.025, verbose = FALSE, calcNumClust = "none",
  numPredStrCvRun = 10, predStrThresh = 0.8)

Arguments

conVar  A data frame of continuous variables.
catFactor  A data frame of factors.
numClust  The number of clusters returned by the algorithm.
numInit  The number of initializations used.
conWeights  A vector of continuous weights for the continuous variables.
catWeights  A vector of continuous weights for the categorical variables.
maxIter  The maximum number of iterations in each run.
conInitMethod  Character: The method used to initialize each run.
catBw  The bandwidth used for the categorical kernel.
verbose  Logical: Whether detailed results should be printed and returned.
calcNumClust  Character: Method for selecting the number of clusters.
numPredStrCvRun  Numeric: Number of CV runs for the prediction strength method. Ignored unless calcNumClust == 'ps'.
predStrThresh  Numeric: Threshold for the prediction strength method. Ignored unless calcNumClust == 'ps'.

Details

KAMILA (KAy-means for MIxed LArge data sets) is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables. It uses a kernel density estimation technique to flexibly model spherical clusters in the continuous domain, and uses a multinomial model in the categorical domain.

Weighting scheme: If no weights are desired, set all weights to 1 (the default setting). Let a_1, ..., a_p denote the weights for p continuous variables. Let b_1, ..., b_q denote the weights for q categorical variables. Currently, continuous weights are applied during the calculation of Euclidean distance, as:

    dist(x, m) = sqrt( sum_i a_i * (x_i - m_i)^2 )

Categorical weights are applied to the log-likelihoods obtained by the level probabilities given cluster membership, as:

    catLogLik_k = sum_j b_j * log( P(x_j | cluster k) )

The total log-likelihood for the kth cluster is obtained by weighting the single continuous log-likelihood by the mean of all continuous weights, plus catLogLik_k:

    logLik_k = mean(a) * conLogLik_k + catLogLik_k

Note that weights between 0 and 1 are admissible; weights equal to zero completely remove a variable's influence on the clustering; weights equal to 1 leave a variable's contribution unchanged. Weights between 0 and 1 may not be comparable across continuous and categorical variables.

Estimating the number of clusters: Default is no estimation method. Setting calcNumClust to 'ps' uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; PS tends to give a smaller number than, say, BIC-based methods for large sample sizes. The user must specify the number of cross-validation runs and the threshold for determining the number of clusters. The smaller the threshold, the larger the number of clusters selected.

Value

A list with the following results objects:

finalMemb  A numeric vector with cluster assignment indicated by integer.
numIter
finalLogLik  The pseudo log-likelihood of the returned clustering.
finalObj
finalCenters
finalProbs
input  Object with the given input parameter values.
nClust  An object describing the results of selecting the number of clusters; empty if calcNumClust == 'none'.
verbose  An optionally returned object with more detailed information.

References

Foss A, Markatou M; kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). 2018. doi:10.18637/jss.v083.i13

Examples

# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)
kamRes <- kamila(conDf, catDf, numClust = 2, numInit = 10)
table(kamRes$finalMemb, dat$trueID)
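As a sketch of the weighting scheme in use (the weight values here are hypothetical, chosen only for illustration; conDf and catDf are as constructed in the example above), a noisy categorical variable can be down-weighted as follows:

# Down-weight the second categorical variable relative to the others
kamResW <- kamila(conDf, catDf, numClust = 2, numInit = 10,
  conWeights = c(1, 1), catWeights = c(1, 0.5))
table(kamResW$finalMemb, dat$trueID)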
wkmeans  Weighted k-means for mixed-type data

Description

Weighted k-means for mixed continuous and categorical variables. A user-specified weight conWeight controls the relative contribution of the variable types to the cluster solution.

Usage

wkmeans(conData, catData, conWeight, nclust, ...)

Arguments

conData  The continuous variables. Must be coercible to a data frame.
catData  The categorical variables, either as factors or dummy-coded variables. Must be coercible to a data frame.
conWeight  The continuous weight; must be between 0 and 1. The categorical weight is 1 - conWeight.
nclust  The number of clusters.
...  Optional arguments passed to kmeans.

Details

A simple adaptation of stats::kmeans to mixed-type data. Continuous variables are multiplied by the input parameter conWeight, and categorical variables are multiplied by 1 - conWeight. If factor variables are input to catData, they are transformed to 0-1 dummy-coded variables with the function dummyCodeFactorDf.

Value

A stats::kmeans results object, with additional slots conCenters and catCenters giving the actual centers adjusted for the weighting process.

See Also

dummyCodeFactorDf, kmeans

Examples

# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

# A clustering that emphasizes the continuous variables
r1 <- with(dat, wkmeans(conDf, catDf, 0.9, 2))
table(r1$cluster, dat$trueID)

# A clustering that emphasizes the categorical variables; note argument
# passed to the underlying stats::kmeans function
r2 <- with(dat, wkmeans(conDf, catDf, 0.1, 2, nstart = 4))
table(r2$cluster, dat$trueID)

Index

classifyKamila, dptmCpp, dummyCodeFactorDf, genMixedData, gmsClust, kamila, kamila-package, kmeans, wkmeans
统计词汇英汉对照Absolute deviation, 绝对离差Absolute number, 绝对数Absolute residuals, 绝对残差Acceptable hypothesis, 可接受假设Accumulation, 累积ccuracy, 准确度Actual frequency, 实际频数Addition, 相加Additivity, 可加性Adjusted rate, 调整率Adjusted value, 校正值Admissible error, 容许误差Aggregation, 聚集性Alternative hypothesis, 备择假设Among groups, 组间Amounts, 总量Analysis of correlation, 相关分析Analysis of covariance, 协方差分析Analysis of regression, 回归分析Analysis of time series, 时间序列分析Analysis of variance, 方差分析ANOVA (analysis of variance), 方差分析ANOVA Models, 方差分析模型Arcing, 弧/弧旋Arcsine transformation, 反正弦变换Area under the curve, 曲线面积AREG , 评估从一个时间点到下一个时间点回归相关时的误差Arithmetic mean, 算术平均数rrhenius relation, 艾恩尼斯关系Assessing fit, 拟合的评估Asymmetric distribution, 非对称分布Asymptotic efficiency, 渐近效率Asymptotic variance, 渐近方差Attributable risk, 归因危险度Attribute data, 属性资料Attribution, 属性Autocorrelation, 自相关Autocorrelation of residuals, 残差的自相关Average, 平均数Average confidence interval length, 平均置信区间长度Average growth rate, 平均增长率Bar chart, 条形图Bar graph, 条形图Base period, 基期Bayes' theorem , Bayes定理Bell-shaped curve, 钟形曲线Bernoulli distribution, 伯努力分布Best-trim estimator, 最好切尾估计量Bias, 偏性Binary logistic regression, 二元逻辑斯蒂回归Binomial distribution, 二项分布Bisquare, 双平方Bivariate Correlate, 二变量相关Bivariate normal distribution, 双变量正态分布Bivariate normal population, 双变量正态总体Biweight M-estimator, 双权M估计量BMDP(Biomedical computer programs), BMDP统计软件包Boxplots, 箱线图/箱尾图Canonical correlation, 典型相关Caption, 纵标目Case-control study, 病例对照研究Categorical variable, 分类变量Catenary, 悬链线Cauchy distribution, 柯西分布Cause-and-effect relationship, 因果关系Cell, 单元Censoring, 终检Center of symmetry, 对称中心Centering and scaling, 中心化和定标Central tendency, 集中趋势Central value, 中心值CHAID -χ2 Automatic Interaction Detector, 卡方自动交互检测Chance, 机遇Chance error, 随机误差Chance variable, 随机变量Characteristic equation, 特征方程Characteristic root, 特征根Characteristic vector, 特征向量Chebshev criterion of fit, 拟合的切比雪夫准则Chernoff faces, 切尔诺夫脸谱图Chi-square test, 卡方检验/χ2检验Choleskey decomposition, 乔洛斯基分解Circle chart, 圆图Class interval, 组距Class mid-value, 组中值Class upper limit, 组上限Classified variable, 分类变量Cluster analysis, 聚类分析Cluster sampling, 整群抽样Coefficient of contingency, 列联系数Coefficient of determination, 决定系数Coefficient of multiple correlation, 多重相关系数Coefficient of partial correlation, 偏相关系数Coefficient of production-moment correlation, 积差相关系数Coefficient of rank correlation, 等级相关系数Coefficient of regression, 回归系数Coefficient of skewness, 偏度系数Coefficient of variation, 变异系数Cohort study, 队列研究Column, 列Column effect, 列效应Column factor, 列因素Combination pool, 合并Combinative table, 组合表Common factor, 共性因子Common regression coefficient, 公共回归系数Common value, 共同值Common variance, 公共方差Common variation, 公共变异Communality variance, 共性方差Comparability, 可比性Comparison of bathes, 批比较Comparison value, 比较值Compartment model, 分部模型Compassion, 伸缩Complement of an event, 补事件Complete association, 完全正相关Complete dissociation, 完全不相关Complete statistics, 完备统计量Completely randomized design, 完全随机化设计Composite event, 联合事件Composite events, 复合事件Concavity, 凹性Conditional expectation, 条件期望Conditional likelihood, 条件似然Conditional probability, 条件概率Conditionally linear, 依条件线性Confidence interval, 置信区间Confidence limit, 置信限Confidence lower limit, 置信下限Confidence upper limit, 置信上限Confirmatory Factor Analysis , 验证性因子分析Confounding factor, 混杂因素Conjoint, 联合分析Consistency, 相合性Consistency check, 一致性检验Consistent asymptotically normal estimate, 相合渐近正态估计Consistent estimate, 相合估计Constrained nonlinear regression, 受约束非线性回归Contour, 边界线Contribution rate, 贡献率Control, 对照Controlled experiments, 对照实验Conventional depth, 常规深度Corrected factor, 校正因子Corrected mean, 
校正均值Correction coefficient, 校正系数Correctness, 正确性Correlation coefficient, 相关系数Correlation index, 相关指数Counting, 计数Counts, 计数/频数Covariance, 协方差Cox Regression, Cox回归Criteria for fitting, 拟合准则Criteria of least squares, 最小二乘准则Critical ratio, 临界比Critical region, 拒绝域Critical value, 临界值Cumulative distribution function, 分布函数D test, D检验Data acquisition, 资料收集Data bank, 数据库Data capacity, 数据容量Data deficiencies, 数据缺乏Data handling, 数据处理Data manipulation, 数据处理Data processing, 数据处理Data set, 数据集Data sources, 数据来源Data transformation, 数据变换Data validity, 数据有效性Data-in, 数据输入Data-out, 数据输出Degree of freedom, 自由度Degree of reliability, 可靠性程度Density function, 密度函数Density of data points, 数据点的密度Dependent variable, 应变量/依变量/因变量Dependent variable, 因变量Depth, 深度Derivative matrix, 导数矩阵Derivative-free methods, 无导数方法Design, 设计Determinacy, 确定性Determinant, 行列式Determinant, 决定因素Deviation, 离差Deviation from average, 离均差Dichotomous variable, 二分变量Differential equation, 微分方程Direct standardization, 直接标准化法Discrete variable, 离散型变量DISCRIMINANT, 判断Discriminant analysis, 判别分析Discriminant coefficient, 判别系数Discriminant function, 判别值Dispersion, 散布/分散度Downward rank, 降秩Effect, 实验效应Eigenvalue, 特征值Eigenvector, 特征向量Ellipse, 椭圆Empirical distribution, 经验分布Empirical probability, 经验概率单位Enumeration data, 计数资料Equally likely, 等可能Equivariance, 同变性Error, 误差/错误Error of estimate, 估计误差Error type I, 第一类错误Error type II, 第二类错误Estimated error mean squares, 估计误差均方Estimated error sum of squares, 估计误差平方和Euclidean distance, 欧式距离Event, 事件Event, 事件Exceptional data point, 异常数据点Expected values, 期望值Experiment, 实验Experimental sampling, 试验抽样Experimental unit, 试验单位Explanatory variable, 说明变量Exploratory data analysis, 探索性数据分析Explore Summarize, 探索-摘要Exponential curve, 指数曲线Exponential growth, 指数式增长EXSMOOTH, 指数平滑方法Extended fit, 扩充拟合Extra parameter, 附加参数Extrapolation, 外推法Extreme observation, 末端观测值Extremes, 极端值/极值 F distribution, F分布F test, F检验Factor, 因素/因子Factor analysis, 因子分析Factor Analysis, 因子分析Factor score, 因子得分Family of distributions, 分布族Field investigation, 现场调查Field survey, 现场调查Finite population, 有限总体Finite-sample, 有限样本First derivative, 一阶导数First principal component, 第一主成分First quartile, 第一四分位数Fitted value, 拟合值Fitting a curve, 曲线拟合Fixed base, 定基Fluctuation, 随机起伏Forecast, 预测Four fold table, 四格表Fourth, 四分点Fraction blow, 左侧比率Fractional error, 相对误差Frequency, 频率Frequency polygon, 频数多边图Frontier point, 界限点Function relationship, 泛函关系Gamma distribution, 伽玛分布Gauss increment, 高斯增量Gaussian distribution, 高斯分布/正态分布General census, 全面普查GENLOG (Generalized liner models), 广义线性模型Geometric mean, 几何平均数GLM (General liner models), 通用线性模型Goodness of fit, 拟和优度/配合度Gradient of determinant, 行列式的梯度Grand mean, 总均值Group averages, 分组平均Grouped data, 分组资料Guessed mean, 假定平均数Half-life, 半衰期Happenstance, 偶然事件Harmonic mean, 调和均数Hazard function, 风险均数Hazard rate, 风险率Heading, 标目Heavy-tailed distribution, 重尾分布Heterogeneity of variance, 方差不齐Hierarchical classification, 组内分组Hierarchical clustering method, 系统聚类法HILOGLINEAR, 多维列联表的层次对数线性模型Hinge, 折叶点Histogram, 直方图HOMALS, 多重响应分析Homogeneity of variance, 方差齐性Homogeneity test, 齐性检验Huber M-estimators, 休伯M估计量Hyperbola, 双曲线Hypothesis testing, 假设检验Hypothetical universe, 假设总体Impossible event, 不可能事件Independence, 独立性Independent variable, 自变量Index, 指标/指数Indirect standardization, 间接标准化法Individual, 个体Inference band, 推断带Infinite population, 无限总体Infinitely great, 无穷大Infinitely small, 无穷小Influence curve, 影响曲线Information capacity, 信息容量Initial condition, 初始条件Initial estimate, 初始估计值Initial level, 最初水平Interaction, 交互作用Interaction terms, 交互作用项Intercept, 截距Interpolation, 内插法Interquartile range, 
四分位距Interval estimation, 区间估计Intervals of equal probability, 等概率区间Intrinsic curvature, 固有曲率Invariance, 不变性Inverse matrix, 逆矩阵Inverse probability, 逆概率Inverse sine transformation, 反正弦变换Iteration, 迭代Jacobian determinant, 雅可比行列式Joint distribution function, 分布函数Joint probability, 联合概率Joint probability distribution, 联合概率分布K means method, 逐步聚类法Kaplan-Merier chart, Kaplan-Merier图Kendall's rank correlation, Kendall等级相关Kolmogorov-Smirnove test, 柯尔莫哥洛夫-斯米尔诺夫检验Kruskal and Wallis test, Kruskal及Wallis检验/多样本的秩和检验/H检验Kurtosis, 峰度Lack of fit, 失拟Ladder of powers, 幂阶梯Large sample, 大样本Large sample test, 大样本检验Latin square, 拉丁方Latin square design, 拉丁方设计Least favorable configuration, 最不利构形Least favorable distribution, 最不利分布Least significant difference, 最小显著差法Least square method, 最小二乘法Least-absolute-residuals estimates, 最小绝对残差估计Least-absolute-residuals fit, 最小绝对残差拟合Least-absolute-residuals line, 最小绝对残差线Legend, 图例L-estimator, L估计量L-estimator of location, 位置L估计量L-estimator of scale, 尺度L估计量Level, 水平Life table, 寿命表Life table method, 生命表法Light-tailed distribution, 轻尾分布Likelihood function, 似然函数Likelihood ratio, 似然比line graph, 线图Linear correlation, 直线相关Linear equation, 线性方程Linear programming, 线性规划Linear regression, 直线回归Linear Regression, 线性回归Linear trend, 线性趋势Loading, 载荷Location and scale equivariance, 位置尺度同变性Location equivariance, 位置同变性Location invariance, 位置不变性Location scale family, 位置尺度族Log rank test, 时序检验Logarithmic curve, 对数曲线Logarithmic normal distribution, 对数正态分布Logarithmic scale, 对数尺度Logarithmic transformation, 对数变换Logic check, 逻辑检查Logistic distribution, 逻辑斯特分布Logit transformation, Logit转换LOGLINEAR, 多维列联表通用模型Lognormal distribution, 对数正态分布Lost function, 损失函数Low correlation, 低度相关Lower limit, 下限Lowest-attained variance, 最小可达方差LSD, 最小显著差法的简称Lurking variable, 潜在变量Main effect, 主效应Marginal density function, 边缘密度函数Marginal probability, 边缘概率Marginal probability distribution, 边缘概率分布Matching of transformation, 变换的匹配Mathematical expectation, 数学期望Mathematical model, 数学模型Maximum L-estimator, 极大极小L 估计量Maximum likelihood method, 最大似然法Mean, 均数Mean squares between groups, 组间均方Mean squares within group, 组内均方Means (Compare means), 均值-均值比较Median, 中位数Median effective dose, 半数效量Median polish, 中位数平滑Median test, 中位数检验Minimal sufficient statistic, 最小充分统计量Minimum distance estimation, 最小距离估计Minimum variance estimator, 最小方差估计量MINITAB, 统计软件包Missing data, 缺失值Model specification, 模型的确定Modeling Statistics , 模型统计Models for outliers, 离群值模型Modifying the model, 模型的修正Most favorable configuration, 最有利构形Multidimensional Scaling (ASCAL), 多维尺度/多维标度Multinomial Logistic Regression , 多项逻辑斯蒂回归Multiple comparison, 多重比较Multiple correlation , 复相关Multiple covariance, 多元协方差Multiple linear regression, 多元线性回归Multiple response , 多重选项Multiple solutions, 多解Multiplication theorem, 乘法定理Multiresponse, 多元响应Multi-stage sampling, 多阶段抽样Multivariate T distribution, 多元T分布Mutual exclusive, 互不相容Mutual independence, 互相独立Negative correlation, 负相关Negative linear correlation, 负线性相关Negatively skewed, 负偏Newman-Keuls method, q检验NK method, q检验No statistical significance, 无统计意义Nominal variable, 名义变量Nonlinear regression, 非线性相关Nonparametric statistics, 非参数统计Nonparametric test, 非参数检验Normal deviate, 正态离差Normal distribution, 正态分布Normal ranges, 正常范围Normal value, 正常值Nuisance parameter, 多余参数/讨厌参数Null hypothesis, 无效假设Numerical variable, 数值变量Objective function, 目标函数Observation unit, 观察单位Observed value, 观察值One sided test, 单侧检验One-way analysis of variance, 单因素方差分析Oneway ANOVA , 单因素方差分析Order statistics, 顺序统计量Ordered categories, 有序分类Ordinal logistic regression , 序数逻辑斯蒂回归Ordinal variable, 
有序变量Orthogonal basis, 正交基Orthogonal design, 正交试验设计Orthogonality conditions, 正交条件ORTHOPLAN, 正交设计Outlier cutoffs, 离群值截断点Outliers, 极端值OVERALS , 多组变量的非线性正规相关Paired design, 配对设计Paired sample, 配对样本Parallel tests, 平行试验Parameter, 参数Parametric statistics, 参数统计Parametric test, 参数检验Partial correlation, 偏相关Partial regression, 偏回归Pearson curves, 皮尔逊曲线Percent bar graph, 百分条形图Percentage, 百分比Percentile, 百分位数Percentile curves, 百分位曲线Periodicity, 周期性Permutation, 排列P-estimator, P估计量Pie graph, 饼图Pitman estimator, 皮特曼估计量Point estimation, 点估计Poisson distribution, 泊松分布Population, 总体Positive correlation, 正相关Positively skewed, 正偏Posterior distribution, 后验分布Power of a test, 检验效能Precision, 精密度Predicted value, 预测值Principal component analysis, 主成分分析Prior distribution, 先验分布Prior probability, 先验概率Probabilistic model, 概率模型probability, 概率Probability density, 概率密度Product moment, 乘积矩/协方差Profile trace, 截面迹图Proportion, 比/构成比Proportion allocation in stratified random sampling, 按比例分层随机抽样Proportionate sub-class numbers, 成比例次级组含量Pseudo F test, 近似F检验Pseudo model, 近似模型Pseudosigma, 伪标准差Purposive sampling, 有目的抽样QR decomposition, QR分解Quadratic approximation, 二次近似Qualitative classification, 属性分类Qualitative method, 定性方法Quantile-quantile plot, 分位数-分位数图/Q-Q图Quantitative analysis, 定量分析Quartile, 四分位数Quick Cluster, 快速聚类Radix sort, 基数排序Random allocation, 随机化分组Random blocks design, 随机区组设计Random event, 随机事件Randomization, 随机化Range, 极差/全距Rank correlation, 等级相关Rank sum test, 秩和检验Rank test, 秩检验Ranked data, 等级资料Rate, 比率Ratio, 比例Raw data, 原始资料Raw residual, 原始残差Reciprocal, 倒数Reducing dimensions, 降维Region of acceptance, 接受域Regression coefficient, 回归系数Regression sum of square, 回归平方和Relative dispersion, 相对离散度Relative number, 相对数Reliability, 可靠性Reparametrization, 重新设置参数Replication, 重复Report Summaries, 报告摘要Residual sum of square, 剩余平方和Resistance, 耐抗性R-estimator of location, 位置R估计量R-estimator of scale, 尺度R估计量Retrospective study, 回顾性调查Rotation, 旋转Row, 行Row factor, 行因素Sample, 样本Sample regression coefficient, 样本回归系数Sample size, 样本量Sample standard deviation, 样本标准差Sampling error, 抽样误差SAS(Statistical analysis system ), SAS统计软件包Scale, 尺度/量表Scatter diagram, 散点图Schematic plot, 示意图/简图Second derivative, 二阶导数Second principal component, 第二主成分SEM (Structural equation modeling), 结构化方程模型Sequential analysis, 贯序分析Sequential data set, 顺序数据集Sequential design, 贯序设计Sequential method, 贯序法Sequential test, 贯序检验法Sigmoid curve, S形曲线Sign test, 符号检验Signed rank, 符号秩Significance test, 显著性检验Significant figure, 有效数字Simple cluster sampling, 简单整群抽样Simple correlation, 简单相关Simple random sampling, 简单随机抽样Simple regression, 简单回归simple table, 简单表Single-valued estimate, 单值估计Singular matrix, 奇异矩阵Skewed distribution, 偏斜分布Skewness, 偏度Slash distribution, 斜线分布Smirnov test, 斯米尔诺夫检验Spearman rank correlation, 斯皮尔曼等级相关Specific factor, 特殊因子Specific factor variance, 特殊因子方差Spherical distribution, 球型正态分布SPSS(Statistical package for the social science), SPSS统计软件包Standard deviation, 标准差Standard error, 标准误Standard error of difference, 差别的标准误Standard error of estimate, 标准估计误差Standard error of rate, 率的标准误Standard normal distribution, 标准正态分布Standardization, 标准化Starting value, 起始值Statistic, 统计量Statistical control, 统计控制Statistical graph, 统计图Statistical inference, 统计推断Statistical table, 统计表Steepest descent, 最速下降法Stem and leaf display, 茎叶图Step factor, 步长因子Stepwise regression, 逐步回归Storage, 存Strata, 层(复数)Stratified sampling, 分层抽样Stratified sampling, 分层抽样Studentized residual, 学生化残差/t化残差Sufficient statistic, 充分统计量Sum of products, 积和Sum of squares, 离差平方和Sum of squares about regression, 回归平方和Sum of squares between groups, 
组间平方和Sum of squares of partial regression, 偏回归平方和Sure event, 必然事件Survey, 调查Survival, 生存分析Survival rate, 生存率Symmetry, 对称Systematic error, 系统误差Systematic sampling, 系统抽样Tags, 标签Tail area, 尾部面积Tail length, 尾长Tail weight, 尾重Target distribution, 目标分布Taylor series, 泰勒级数Tendency of dispersion, 离散趋势Testing of hypotheses, 假设检验Theoretical frequency, 理论频数Time series, 时间序列Tolerance interval, 容忍区间Total sum of square, 总平方和Total variation, 总变异Transformation, 转换Treatment, 处理Trend, 趋势Trend of percentage, 百分比趋势Trial, 试验Trial and error method, 试错法Two sided test, 双向检验Two-stage least squares, 二阶最小平方Two-tailed test, 双侧检验Two-way analysis of variance, 双因素方差分析Type I error, 一类错误/α错误Type II error, 二类错误/β错误UMVU, 方差一致最小无偏估计简称Unbiased estimate, 无偏估计Unconstrained nonlinear regression , 无约束非线性回归Unequal subclass number, 不等次级组含量Ungrouped data, 不分组资料Uniform coordinate, 均匀坐标Uniform distribution, 均匀分布Uniformly minimum variance unbiased estimate, 方差一致最小无偏估计Upper limit, 上限Upward rank, 升秩Validity, 有效性VARCOMP (Variance component estimation), 方差元素估计Variability, 变异性Variable, 变量Variance, 方差Variation, 变异Varimax orthogonal rotation, 方差最大正交旋转W test, W检验Weibull distribution, 威布尔分布Weight, 权数Weighted Chi-square test, 加权卡方检验/Cochran检验Weighted linear regression method, 加权直线回归Weighted mean, 加权平均数Weighted mean square, 加权平均方差Weighted sum of square, 加权平方和Weighting coefficient, 权重系数Weighting method, 加权法W-estimation, W估计量W-estimation of location, 位置W估计量Width, 宽度Wilcoxon paired test, 威斯康星配对法/配对符号秩和检验Z test, Z检验Zero correlation, 零相关Z-transformation, Z变换。
Chapter 7
Hierarchical cluster analysis

In Part 2 (Chapters 4 to 6) we defined several different ways of measuring distance (or dissimilarity as the case may be) between the rows or between the columns of the data matrix, depending on the measurement scale of the observations. As we remarked before, this process often generates tables of distances with even more numbers than the original data, but we will show now how this in fact simplifies our understanding of the data. Distances between objects can be visualized in many simple and evocative ways. In this chapter we shall consider a graphical representation of a matrix of distances which is perhaps the easiest to understand – a dendrogram, or tree – where the objects are joined together in a hierarchical fashion from the closest, that is most similar, to the furthest apart, that is the most different. The method of hierarchical cluster analysis is best explained by describing the algorithm, or set of instructions, which creates the dendrogram results. In this chapter we demonstrate hierarchical clustering on a small example and then list the different variants of the method that are possible.

Contents
The algorithm for hierarchical clustering
Cutting the tree
Maximum, minimum and average clustering
Validity of the clusters
Clustering correlations
Clustering a larger data set

The algorithm for hierarchical clustering
As an example we shall consider again the small data set in Exhibit 5.6: seven samples on which 10 species are indicated as being present or absent. In Chapter 5 we discussed two of the many dissimilarity coefficients that are possible to define between the samples: the first based on the matching coefficient and the second based on the Jaccard index. The latter index counts the number of 'mismatches' between two samples after eliminating the species that do not occur in either of the pair. Exhibit 7.1 shows the complete table of inter-sample dissimilarities based on the Jaccard index.

Exhibit 7.1 Dissimilarities, based on the Jaccard index, between all pairs of seven samples in Exhibit 5.6. For example, between the first two samples, A and B, there are 8 species that occur in one or the other, of which 4 are matched and 4 are mismatched – the proportion of mismatches is 4/8 = 0.5. Both the lower and upper triangles of this symmetric dissimilarity matrix are shown here (the lower triangle is outlined as in previous tables of this type).

samples     A        B        C        D        E        F        G
   A      0        0.5000   0.4286   1.0000   0.2500   0.6250   0.3750
   B      0.5000   0        0.7143   0.8333   0.6667   0.2000   0.7778
   C      0.4286   0.7143   0        1.0000   0.4286   0.6667   0.3333
   D      1.0000   0.8333   1.0000   0        1.0000   0.8000   0.8571
   E      0.2500   0.6667   0.4286   1.0000   0        0.7778   0.3750
   F      0.6250   0.2000   0.6667   0.8000   0.7778   0        0.7500
   G      0.3750   0.7778   0.3333   0.8571   0.3750   0.7500   0

The first step in the hierarchical clustering process is to look for the pair of samples that are the most similar, that is are the closest in the sense of having the lowest dissimilarity – this is the pair B and F, with dissimilarity equal to 0.2000. These two samples are then joined at a level of 0.2000 in the first step of the dendrogram, or clustering tree (see the first diagram of Exhibit 7.3, and the vertical scale of 0 to 1 which calibrates the level of clustering). The point at which they are joined is called a node.
We are basically going to keep repeating this step, but the only problem is how to calculate the dissimilarity between the merged pair (B,F) and the other samples. This decision determines what type of hierarchical clustering we intend to perform, and there are several choices.
For the moment, we choose one of the most popular ones, called the maximum, or complete linkage, method: the dissimilarity between the merged pair and the others will be the maximum of the pair of dissimilarities in each case. For example, the dissimilarity between B and A is 0.5000, while the dissimilarity between F and A is 0.6250. Hence we choose the maximum of the two, 0.6250, to quantify the dissimilarity between (B,F) and A. Continuing in this way we obtain a new dissimilarity matrix, Exhibit 7.2.

Exhibit 7.2 Dissimilarities calculated after B and F are merged, using the 'maximum' method to recompute the values in the row and column labelled (B,F).

samples     A        (B,F)    C        D        E        G
   A      0        0.6250   0.4286   1.0000   0.2500   0.3750
 (B,F)    0.6250   0        0.7143   0.8333   0.7778   0.7778
   C      0.4286   0.7143   0        1.0000   0.4286   0.3333
   D      1.0000   0.8333   1.0000   0        1.0000   0.8571
   E      0.2500   0.7778   0.4286   1.0000   0        0.3750
   G      0.3750   0.7778   0.3333   0.8571   0.3750   0

Exhibit 7.3 First two steps of hierarchical clustering of Exhibit 7.1, using the 'maximum' (or 'complete linkage') method. [Two dendrogram panels on a vertical scale from 0 to 1: first B and F are joined; then A and E are joined as well.]

The process is now repeated: find the smallest dissimilarity in Exhibit 7.2, which is 0.2500 for samples A and E, and then cluster these at a level of 0.25, as shown in the second figure of Exhibit 7.3. Then recompute the dissimilarities between the merged pair (A,E) and the rest to obtain Exhibit 7.4. For example, the dissimilarity between (A,E) and (B,F) is the maximum of 0.6250 (A to (B,F)) and 0.7778 (E to (B,F)).

Exhibit 7.4 Dissimilarities calculated after A and E are merged, using the 'maximum' method to recompute the values in the row and column labelled (A,E).

samples    (A,E)    (B,F)    C        D        G
 (A,E)    0        0.7778   0.4286   1.0000   0.3750
 (B,F)    0.7778   0        0.7143   0.8333   0.7778
   C      0.4286   0.7143   0        1.0000   0.3333
   D      1.0000   0.8333   1.0000   0        0.8571
   G      0.3750   0.7778   0.3333   0.8571   0

In the next step the lowest dissimilarity in Exhibit 7.4 is 0.3333, for C and G – these are merged, as shown in the first diagram of Exhibit 7.6, to obtain Exhibit 7.5. Now the smallest dissimilarity is 0.4286, between the pair (A,E) and (C,G), and they are shown merged in the second diagram of Exhibit 7.6. Exhibit 7.7 shows the last two dissimilarity matrices in this process, and Exhibit 7.8 the final two steps of the construction of the dendrogram, also called a binary tree because at each step two objects (or clusters of objects) are merged. Because there are 7 objects to be clustered, there are 6 steps in the sequential process (i.e., one less) to arrive at the final tree where all objects are in a single cluster. For botanists that may be reading this: this is an upside-down tree, of course!

Exhibit 7.5 Dissimilarities calculated after C and G are merged, using the 'maximum' method to recompute the values in the row and column labelled (C,G).

samples    (A,E)    (B,F)    (C,G)    D
 (A,E)    0        0.7778   0.4286   1.0000
 (B,F)    0.7778   0        0.7778   0.8333
 (C,G)    0.4286   0.7778   0        1.0000
   D      1.0000   0.8333   1.0000   0

Exhibit 7.6 The third and fourth steps of hierarchical clustering of Exhibit 7.1, using the 'maximum' (or 'complete linkage') method. The point at which objects (or clusters of objects) are joined is called a node. [Two dendrogram panels: first C and G are joined; then (A,E) and (C,G) are joined.]

Exhibit 7.7 The last two dissimilarity matrices in the process: after (A,E) and (C,G) are merged, and then after (A,E,C,G) and (B,F) are merged.

samples       (A,E,C,G)   (B,F)    D
(A,E,C,G)     0           0.7778   1.0000
 (B,F)        0.7778      0        0.8333
   D          1.0000      0.8333   0

samples          (A,E,C,G,B,F)   D
(A,E,C,G,B,F)    0               1.0000
   D             1.0000          0

Exhibit 7.8 The fifth and sixth steps of hierarchical clustering of Exhibit 7.1, using the 'maximum' (or 'complete linkage') method.
The dendrogram on the right is the final result of the cluster analysis. In the clustering of n objects, there are n–1 nodes (i.e. 6 nodes in this case). [Two dendrogram panels: first (A,E,C,G) and (B,F) are joined; then D is joined last, completing the tree.]

Cutting the tree
The final dendrogram on the right of Exhibit 7.8 is a compact visualization of the dissimilarity matrix in Exhibit 7.1, based on the presence-absence data of Exhibit 5.6. Interpretation of the structure of the data is made much easier now – we can see that there are three pairs of samples that are fairly close, two of these pairs ((A,E) and (C,G)) are in turn close to each other, while the single sample D separates itself entirely from all the others. Because we used the 'maximum' method, all samples clustered below a particular level of dissimilarity will have inter-sample dissimilarities less than that level. For example, 0.5 is the point at which samples are exactly as similar to one another as they are dissimilar, so if we look at the clusters of samples below 0.5 – i.e., (B,F), (A,E,C,G) and (D) – then within each cluster the samples have more than 50% similarity, in other words more than 50% co-presences of species. The level of 0.5 also happens to coincide in the final dendrogram with a large jump in the clustering levels: the node where (A,E) and (C,G) are clustered is at a level of 0.4286, while the next node, where (B,F) is merged, is at a level of 0.7778. This is thus a very convenient level to cut the tree. If the branches are cut at 0.5, we are left with the three clusters of samples (B,F), (A,E,C,G) and (D), which can be labelled types 1, 2 and 3 respectively. In other words, we have created a categorical variable, with three categories, and the samples are categorized as follows:

sample:   A  B  C  D  E  F  G
type:     2  1  2  3  2  1  2

Checking back to Chapter 2, this is exactly the objective which we described in the lower right hand corner of the multivariate analysis scheme (Exhibit 2.2) – to reveal a categorical variable which underlies the structure of a data set.
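Readers who wish to follow along in R (whose hclust function is used later in this chapter) can reproduce this analysis with the following minimal sketch, entering the Jaccard dissimilarities of Exhibit 7.1 directly:

# Reproduce the complete linkage analysis above (values from Exhibit 7.1)
d <- matrix(c(
  0.0000, 0.5000, 0.4286, 1.0000, 0.2500, 0.6250, 0.3750,
  0.5000, 0.0000, 0.7143, 0.8333, 0.6667, 0.2000, 0.7778,
  0.4286, 0.7143, 0.0000, 1.0000, 0.4286, 0.6667, 0.3333,
  1.0000, 0.8333, 1.0000, 0.0000, 1.0000, 0.8000, 0.8571,
  0.2500, 0.6667, 0.4286, 1.0000, 0.0000, 0.7778, 0.3750,
  0.6250, 0.2000, 0.6667, 0.8000, 0.7778, 0.0000, 0.7500,
  0.3750, 0.7778, 0.3333, 0.8571, 0.3750, 0.7500, 0.0000),
  nrow = 7, byrow = TRUE,
  dimnames = list(LETTERS[1:7], LETTERS[1:7]))
hc <- hclust(as.dist(d), method = "complete")  # the 'maximum' method
plot(hc)              # the dendrogram of Exhibit 7.8
cutree(hc, h = 0.5)   # cut at 0.5: clusters (B,F), (A,E,C,G) and (D)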
Maximum, minimum and average clustering
The crucial choice when deciding on a cluster analysis algorithm is how to quantify dissimilarities between two clusters. The algorithm described above was characterized by the fact that at each step, when updating the matrix of dissimilarities, the maximum of the between-cluster dissimilarities was chosen. This is also known as complete linkage cluster analysis, because a cluster is formed when all the dissimilarities ('links') between pairs of objects in the cluster are less than a particular level. There are several alternatives to complete linkage as a clustering criterion, and we discuss only two of these: minimum and average clustering.
The 'minimum' method goes to the other extreme and forms a cluster when only one pair of dissimilarities (not all) is less than a particular level – this is known as single linkage cluster analysis. So at every updating step we choose the minimum of the two distances, and two clusters of objects can be merged when there is a single close link between them, irrespective of the other inter-object distances. In general, this is not a suitable choice for most applications, because it can lead to clusters that are quite heterogeneous internally, and the usual object of clustering is to obtain homogeneous clusters.
The 'average' method is an attractive compromise where dissimilarities are averaged at each step, hence the name average linkage cluster analysis. For example, in Exhibit 7.1 the first step of all types of cluster analysis would merge B and F. But then calculating the dissimilarity between A, for example, and (B,F) is where the methods distinguish themselves. The dissimilarity between A and B is 0.5000, and between A and F it is 0.6250. Complete linkage chooses the maximum: 0.6250; single linkage chooses the minimum: 0.5000; while average linkage chooses the average: (0.5000 + 0.6250)/2 = 0.5625.

Validity of the clusters
If a cluster analysis is performed on a data matrix, a set of clusters can always be obtained, even if there is no actual grouping of the objects, in this case the samples. So how can we evaluate whether the three clusters in this example are not just any old three groups which we would have obtained on random data with no structure? There is a vast literature on the validity of clusters (we give some references in the Bibliography, Appendix E) and here we shall explain one approach based on permutation testing. In our example, the three clusters were formed so that internally, in each cluster formed by more than one sample, the between-sample dissimilarities were all less than 0.5000. In fact, if we look at the result in the right hand picture of Exhibit 7.8, the cutpoint for three clusters can be brought down to the level of 0.4286, where (A,E) and (C,G) joined together. As in all statistical considerations of significance, we ask whether this is an unusual result or whether it could have arisen merely by chance. To answer this question we need an idea of what might have happened in chance results, so that we can judge our actual finding. This so-called "null distribution" can be generated through permuting the data in some reasonable way, evaluating the statistic of interest, and doing this many times (or for all permutations if this is feasible computationally) to obtain a distribution of the statistic. The statistic of interest could be the value at which the three clusters are formed, but we need to choose carefully how we perform the permutations, and this depends on how the data were collected. We consider two possible assumptions, and show how different the results can be.
The first assumption is that the column totals of Exhibit 5.6 are fixed; that is, that the 10 species are present, respectively, 3 times in the 7 samples, 6 times, 4 times, 3 times and so on. Then the permutation involved would be to simply randomly shuffle the zeros and ones in each column to obtain a new presence-absence matrix with exactly the same column totals as before. Performing the complete linkage hierarchical clustering on this matrix leads to a value at which the three-cluster solution is achieved, and this becomes one observation of the null permutation distribution. We did this 9999 times, and along with our actual observed value of 0.4286, the 10000 values are graphed in Exhibit 7.9 (we show it as a horizontal bar chart because only 15 different values of this statistic were observed, shown here with their frequencies). The value we actually observed is one of the smallest – the number of permuted matrices that generates this value or a lower value is 26 out of 10000, so that in this sense our data are very unusual, and the 'significance' of the three-cluster solution can be quantified with a p-value of 0.0026.
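A sketch of this first permutation scheme in R, assuming pa holds the 7 x 10 presence-absence matrix of Exhibit 5.6 and jaccardDist is a helper (hypothetical here) returning the matrix of Jaccard dissimilarities between the rows of its argument:

set.seed(1)
levels3 <- replicate(9999, {
  perm <- apply(pa, 2, sample)               # shuffle the 0/1s within each column
  hc   <- hclust(as.dist(jaccardDist(perm)), method = "complete")
  hc$height[length(hc$height) - 2]           # level at which 3 clusters form
})
mean(c(levels3, 0.4286) <= 0.4286)           # estimated p-value, observed value included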
The other 9974 randompermutations all lead to generally higher inter-sample dissimilarities such that the level at which three-cluster solutions are obtained is 0.4444 or higher (0.4444 corresponds to 4 mistmatches out of 9.Exhibit 7.9 Bar chart of the 10000 values of the three-cluster solutions obtained by permuting the columns of the presence-absence data, including the value we observed in the original unpermuted data matrix.The second and alternative possible assumption for the computation of the null distribution could be that the column margins are not fixed, but random; in other words, we relax the fact that there were exactly 3 samples that had species sp1, for example, and assume a binomial distribution for each column, using the observed proportion (3 out of 7 forspecies sp1) and the number of samples (7) as the binomial parameters. Thus there can be 0 up to 7 presences in each column, according to the binomial probabilities for eachspecies. This gives a much wider range of possibilities for the null distribution, and leads us to a different conclusion about our three observed clusters. The permutation distributionlevel frequency0.800020.7778350.75003630.714313600.70001890.666729670.625021990.60008220.571413810.55552070.50004410.444480.4286230.400020.37501is now shown in Exhibit 7.10, and now our observed value of 0.4286 does not look sounusual, since 917 out of 10000 values in the distribution are less than or equal to it, giving an estimated P -value of 0.0917.Exhibit 7.10 Bar chart of the 10000 values of the three-cluster solutionsobtained by generating binomial data in each column of the presence-absence matrix, according to the probability of presence of each species.So, as in many situations in statistics, the result and decision depends on the initialassumptions. Could we have observed the presence of species s1 less or more than 3 times in the 7 samples (and so on for the other species)? In other words, according to thebinomial distribution with n = 7, and p = 3/7, the probabilities of observing k presences of species sp1 (k = 0, 1, …, 7) are:0 1 2 3 4 5 6 7 0.020 0.104 0.235 0.294 0.220 0.099 0.025 0.003If this assumption (and similar ones for the other nine species) is realistic, then the cluster significance is 0.0917. However, if the first assumption is adopted (that is, the probability of observing 3 presences for species s1 is 1 and 0 for other possibilities), then the significance is 0.0028. Our feeling is that perhaps the binomial assumption is more realistic, in which case our cluster solution could be observed in just over 9% of random cases – this gives us an idea of the validity of our results and whether we are dealing with real clusters or not. The value of 9% is a measure of ‘clusteredness’ of our samples in terms of the Jaccard index: the lower this measure, the more they are clustered, and the hoihger the measure, the more the samples lie in a continuum. Lack of evidence oflevel frequency0.875020.857150.8333230.8000500.7778280.75002010.71434850.7000210.666712980.625011710.60008950.571419600.55554680.500022990.44441770.42865670.40001620.37501070.3333640.300010.2857120.250030.20001‘clusteredness’ does not mean that the clustering is not useful: we might want to divide up the space of the data into separate regions, even though the borderlines between them are ‘fuzzy’. 
And speaking of 'fuzzy', there is an alternative form of cluster analysis (fuzzy cluster analysis, not treated specifically in this book) where samples are classified fuzzily into clusters, rather than strictly into one group or another – this idea is similar to the fuzzy coding we described in Chapter 3.

Clustering correlations on variables
Just as we clustered samples, so we can cluster variables in terms of their correlations, or of distances based on their correlations as described in Chapter 6. The dissimilarity based on the Jaccard index can also be used to measure similarity between species – the index counts the number of samples that have both species of the pair, relative to the number of samples that have at least one of the pair, and the dissimilarity is 1 minus this index. Exhibit 7.11 shows the cluster analyses based on these two alternatives, for the columns of Exhibit 5.6, using this time the graphical output of the R function hclust for hierarchical clustering. The fact that these two trees are so different is no surprise: the first one, based on the correlation coefficient, takes into account the co-absences, which strengthens the correlation, while the second does not. Both have the pairs (sp2,sp5) and (sp3,sp8) at zero dissimilarity because these are identically present and absent across the samples. Species sp1 and sp7 are close in terms of correlation, due to co-absences – sp7 only occurs in one sample, sample E, which also has sp1, a species which is absent in four other samples. Notice in Exhibit 7.11(b) how species sp10 and sp1 both join the cluster (sp2,sp5) at the same level (0.5).

Exhibit 7.11 Complete linkage cluster analyses of (a) 1–r (1 minus the correlation coefficient between species); (b) the Jaccard dissimilarity between species (1 minus the Jaccard similarity index). The R function hclust, which calculates the dendrograms, places the object (species) labels at a constant distance below its clustering level. [Two dendrograms: (a) on a height scale from 0 to 2; (b) on a height scale from 0 to 1.]
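The first of these dendrograms can be sketched in R in a couple of lines, again assuming pa is the presence-absence matrix of Exhibit 5.6 with species as columns:

# Cluster the species (columns) on 1 - r, the correlation-based dissimilarity
hc_cor <- hclust(as.dist(1 - cor(pa)), method = "complete")
plot(hc_cor)   # compare with Exhibit 7.11(a)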
Clustering a larger data set
The more objects there are to cluster, the more complex the result becomes. In Exhibit 4.5 we showed part of the matrix of standardized Euclidean distances between the 30 sites of Exhibit 1.1, and Exhibit 7.12 shows the hierarchical clustering of this distance matrix, using complete linkage. There are two obvious places where we can cut the tree: at about level 3.4, which gives four clusters, or at about 2.7, which gives six clusters.

Exhibit 7.12 Complete linkage cluster analysis of the standardized Euclidean distances of Exhibit 4.5. [Dendrogram of the 30 sites, with cutpoints indicated for the 4-class and 6-class solutions.]

To get an idea of the 'clusteredness' of these data, we performed a permutation test similar to the one described above, where the data are randomly permuted within their columns and the cluster analysis repeated each time to obtain 6 clusters. The permutation distribution of levels at which 6 clusters are formed is shown in Exhibit 7.13 – the observed value in Exhibit 7.12 (i.e., where (s2,s14) joins (s25,s23,s30,s12,s16,s27)) is 2.357, which is clearly not an unusual value. The estimated p-value, according to the proportion of the distribution to the left of 2.357 in Exhibit 7.13, is p = 0.3388, so we conclude that these samples do not have a non-random cluster structure – they form more of a continuum, which will be the subject of Chapter 9.

Exhibit 7.13 Estimated permutation distribution for the level at which 6 clusters are formed in the cluster analysis of Exhibit 7.12, showing the value actually observed. Of the 10000 permutations, including the observed value, 3388 are less than or equal to the observed value, giving an estimated p-value for clusteredness of 0.3388. [Histogram of the 6-cluster levels, roughly from 2.0 to 3.0, with the observed value marked.]

SUMMARY: Hierarchical cluster analysis
1. Hierarchical cluster analysis of n objects is defined by a stepwise algorithm which merges two objects at each step, the two which have the least dissimilarity.
2. Dissimilarities between clusters of objects can be defined in several ways; for example, the maximum dissimilarity (complete linkage), minimum dissimilarity (single linkage) or average dissimilarity (average linkage).
3. Either rows or columns of a matrix can be clustered – in each case we choose the appropriate dissimilarity measure that we prefer.
4. The result of a cluster analysis is a binary tree, or dendrogram, with n – 1 nodes. The branches of this tree are cut at a level where there is a lot of 'space' to cut them, that is where the jump in levels of two consecutive nodes is large.
5. A permutation test is possible to validate the chosen number of clusters, that is to see if there really is a non-random tendency for the objects to group together.
07SM101

1. Introduction
In this document, we will discuss the key features and functionalities of the 07SM101 model. The 07SM101 is a state-of-the-art machine learning algorithm developed for solving complex problems in various domains. It has been widely used in industries such as finance, healthcare, and e-commerce to gain insights and make data-driven decisions. In this document, we will provide an overview of the model, explain its core concepts, and discuss its advantages and limitations.

2. Model Architecture
The 07SM101 model is based on a deep neural network architecture, specifically designed to handle large-scale datasets and complex patterns. It consists of multiple layers of artificial neurons, each responsible for processing and transforming the input data. The architecture includes input layers, hidden layers, and an output layer. The input layer receives the raw data, which is then processed by the hidden layers to extract relevant features. Finally, the output layer produces the desired prediction or classification result.
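As a generic illustration of such an architecture (this is a sketch of an ordinary feedforward pass in R, not the actual 07SM101 implementation, which is not publicly specified), consider:

# Generic single-hidden-layer feedforward pass (illustrative only)
relu <- function(z) pmax(z, 0)
forward <- function(x, W1, b1, W2, b2) {
  h <- relu(W1 %*% x + b1)   # hidden layer extracts features
  W2 %*% h + b2              # output layer produces the prediction
}
set.seed(1)
W1 <- matrix(rnorm(3 * 4), 3, 4); b1 <- rnorm(3)   # 4 inputs -> 3 hidden units
W2 <- matrix(rnorm(1 * 3), 1, 3); b2 <- rnorm(1)   # 3 hidden units -> 1 output
forward(rnorm(4), W1, b1, W2, b2)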
3. Key Features

3.1. Flexibility
The 07SM101 model is highly flexible and can be applied to a wide range of problem domains. It can handle both numerical and categorical data, making it suitable for various types of machine learning tasks such as regression, classification, and clustering. The model can also handle both structured and unstructured data, allowing for the analysis of text, images, and other complex data types.

3.2. Scalability
The 07SM101 model is designed to handle large-scale datasets efficiently. It can effectively process and analyze millions of data points, making it suitable for big data applications. The model's scalability is achieved through the use of parallel processing and distributed computing techniques, which allow for the efficient utilization of computational resources.

3.3. High Accuracy
One of the key advantages of the 07SM101 model is its high accuracy in prediction and classification tasks. The model is trained on large datasets using advanced optimization algorithms, allowing it to learn complex patterns and make precise predictions. This high accuracy makes the model ideal for applications where accurate predictions are critical, such as fraud detection and disease diagnosis.

3.4. Interpretability
Unlike some black-box machine learning models, the 07SM101 model provides interpretability, allowing users to understand the reasoning behind its predictions. The model can provide feature importance scores, indicating the most significant variables that contribute to the prediction. This interpretability feature is particularly useful in domains where transparency and explainability are important, such as healthcare and finance.

4. Limitations
While the 07SM101 model has many advantages, it also has certain limitations that should be taken into account:

4.1. Computational Resource Requirements
Due to its architecture and complexity, the 07SM101 model requires significant computational resources, including memory and processing power. This can be a limitation for organizations with limited resources or for applications where real-time predictions are required.

4.2. Training Data Requirements
To achieve high accuracy, the 07SM101 model requires a large amount of training data. The availability of such data can be a challenge in some domains or for small organizations. Additionally, the quality and diversity of the training data can significantly impact the model's performance.

4.3. Interpretability Trade-off
While the 07SM101 model provides interpretability, there is often a trade-off between interpretability and model complexity. Highly interpretable models may not have the same level of accuracy as more complex black-box models. Therefore, users need to carefully consider the trade-off between interpretability and accuracy based on the specific requirements of their application.

5. Conclusion
The 07SM101 model is a versatile and powerful machine learning algorithm that has proven to be effective in solving complex problems in various domains. Its flexibility, scalability, and high accuracy make it suitable for a wide range of applications. However, organizations should consider its computational resource requirements, training data requirements, and interpretability trade-offs before adopting the model. Overall, the 07SM101 model provides a valuable tool for data analysis and decision-making in today's data-driven world.
心理学专业英语词汇(C1)心理学专业英语词汇(C1)心理学专业英语词汇(C1)c factor c 因素c fibre c 纤维c light source c 光源c reaction c 反应c score c 分数c value c 值ca 实足年龄cacesthesia 感觉异常cachectic 恶病质的cachexia 恶病质cachexia 衰弱期cachinnate 狂笑cachinnation 痴笑cacodemonomania 魔附妄想cacoepy 发音不准cacoethes operandi 手术癖cacogenesis 畸形cacogenesis 劣生cacogenic 种族退化的cacogenics 劣生学cacogenics 种族退化学cacogeusia 恶味cacography 拼写错误cacology 发音不准cacoplastic 成形不良的cacoplastic 构造异常的cacosmia 恶臭cacothenics 种族衰退cacothymia 心情恶劣cacotrophia 营养不良cacotrophy 营养不良cad 计算机辅助设计caddish 缺乏教养的cadger 乞丐caducity 暂时性caecitas 盲caecum vestibulare 前庭盲端cafeteria feeding 自选食cafeteria feeding 自助式喂儿法caffeine 咖啡因caffeinism 咖啡因中毒cage 笼cage apparatus 笼状仪器cai 计算机辅助教学cain complex 凯因情结cain complex 兄弟敌对情结cainotophobia 新事恐怖症cain levine social competency scale 凯莱二氏社会能力量表cairophobia 临场恐怖症caitiff 卑鄙的人cajal method 卡捷法cake of custom 反变现象cal 计算机辅助学习calcarine fissure 距状裂calcium pump 钙棒calculate 计算calculated error 计算误差calculated value 计算值calculating card 计算卡片calculating chart 计算图calculating data 计算数据calculating inspection 计算检查calculating machine 计算机calculating sorting machine 分类计算机calculation 计算calculation chart 计算图calculative skill 计算技能calculative strategy 计算策略calculus 演算calculus of observation 观测演算calculus of proposition 命题演算calf love 雏恋calibration 刻度calibration 校准calibration instrument 校准仪器california achievement tests 加州成就测验california infant scale 加州婴儿量表california occupational preference survey 加州职业偏好调查表california personality inventory 加州人格量表california psychological inventory 加州心理测验量表california test of mental maturity 加州心理成熟测验california test of personality 加州人格测验caliper 测径器call 喊叫callback 复查calligraphy 笔迹calling 感召calling sequence 引入序列callipedia 美婴欲callomania 美貌狂callosal apraxia 胼胝体运用不能callosum 胼胝体calm type 沉静型calmative 镇静的caloric nystagmus 温度性眼震calorie 卡(热量单位)caloriemeter 热量计calque 仿照calvaria 颅盖calvarium 颅盖cal大卡cam 计算机辅助制造camaraderie 同志感camera lucida 投影描图器camera obscure 暗箱camouflage 伪装campcampbell chart 坎贝尔图campimeter 视野测量器canadian psychological association 加拿大心理学会canalis facialis 面神经管canalis semicircularis anterior 前半规管canalis semicircularis anterior 上半规管canalis semicircularis posterior 后半规管canalis semicircularis posterior 下半规管canalization 定型化canalization 疏引作用cancel 删去cancellation 废除cancellation 划消cancellation ability 消字能力cancellation method 划消法cancellation test 划消测验cancer 癌症cancer cells 癌细胞cancerogenic 致癌的cancerophobia 癌症恐怖症candela 坎德拉candela per square foot 坎每平方英尺candela per square meter 坎每平方米candid 公正的candle 烛candle power 烛光candle problem 腊烛问题candor 坦率canine madness 狂犬病cannabiomania 大麻癖cannabis 大麻cannabism 大麻依赖cannabism 大麻中毒canned data 存储的信息cannibalism 同类相残canniness 精明cannon emergency function 嘉农应急机能cannon theory of emotion 嘉龙情绪理论cannon s theory 嘉农氏理论cannon bard theory of emotioncanon 准则canonic form 典型形式canonical analysis 双复式分析canonical correlation 双复式相关canonical correlation analysis 典型相关分析canonical correlation coefficient 典型相关系数canonical order 双复式次序canonist 经院派canvasser 推销员capcapability 能力capacitation 获得能力capacity 能力capacity 容量capacity of bearing legal responsibility 法律责任能力capacity of resisting disturbance 抗干扰能力capacity test 能量测验capgras syndrome 卡氏综合症capillarectasia 毛细管扩张capillaritis 毛细管炎capillary 毛细管caprice 任性caprice 突发capsicism 辣椒癖capsula auditus 耳软骨囊capsula externa 外囊capsula interna 内囊capsule 囊caption 题目captious 吹毛求疵的captivate 迷住captivation 吸引力captivator 有吸引力的人captive 被控制的capture 引起注意captured audience 被吸引的听众carbachol 碳酰胆碱carbamate 氨基甲酸酯carbamic acid 氨基甲酸carbon dioxide 二氧化碳carbon monoxide 一氧化碳carbon monoxide poisoning 一氧化碳中毒carcinogen 致癌物carcinoma 癌carcinomatosis 癌病carcinophobia 恐癌症card 卡片card 
Abstract

Clustering algorithms partition a sample set into groups and thereby uncover meaningful subpopulations. As an efficient data-analysis tool, clustering has therefore long been a hot research topic among scholars both in China and abroad.

The K-Modes clustering algorithm proposed by Huang extends K-Means with an attribute-matching dissimilarity measure, making cluster analysis applicable to unordered categorical data. However, its 0-1 simple-matching dissimilarity weakens the similarity between values of the same attribute within a cluster and ignores the differing importance of different attributes. Moreover, a cluster center (mode) that keeps only a single value per attribute overlooks the fact that an attribute may be better represented by a combination of several values, and the algorithm is highly sensitive to the choice of initial centers. All of these issues can degrade clustering quality on categorical data. In addition, with the explosive growth of data, traditional serial algorithms can hardly process very large, very high-dimensional data sets within acceptable time. Spark, a recent big-data platform, excels at large-scale analysis tasks, but its machine-learning library lacks a clustering algorithm for categorical data, so the platform cannot currently be exploited to cluster massive categorical data sets.

To address these problems, this thesis first proposes a MAV-K-Modes clustering algorithm that combines a pre-clustering-based initialization of multi-attribute-value cluster centers (modes) with a dissimilarity measure defined on such multi-attribute-value modes. Second, building on the improved MAV-K-Modes algorithm, the thesis parallelizes it on the Spark platform, with separate designs for static and for incremental data sets. The main contributions are the following:

(1) To improve accuracy, a MAV-K-Modes algorithm based on multi-attribute-value modes is proposed, which markedly raises the accuracy of clustering unordered categorical data. Its pre-clustering-based initialization of multi-attribute-value modes reduces the algorithm's tendency to get trapped in local optima. Its dissimilarity measure based on multi-attribute-value modes remedies the shortcomings of the traditional K-Modes 0-1 simple-matching measure, effectively preventing important attribute values from being lost during clustering and strengthening the similarity between values of the same attribute within a cluster. Information-entropy theory is used to weight the attributes, sharpening the differentiation between attributes.
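A minimal sketch of the measures just described, assuming a multi-attribute-value mode stored as per-attribute value frequencies and entropy-derived attribute weights. The thesis's exact formulas are not given in this abstract, so the definitions below (entropy_weights, mav_mode, mav_dissimilarity) are illustrative assumptions, including the convention that higher-entropy attributes get larger weight:

```python
import math
from collections import Counter

def entropy_weights(rows):
    """Illustrative per-attribute weights from information entropy
    (normalized so the weights sum to 1)."""
    Q, n = len(rows[0]), len(rows)
    ents = []
    for q in range(Q):
        counts = Counter(row[q] for row in rows)
        ents.append(-sum((c / n) * math.log(c / n) for c in counts.values()))
    total = sum(ents) or 1.0
    return [e / total for e in ents]

def mav_mode(cluster_rows):
    """Multi-attribute-value mode: per attribute, keep the relative
    frequency of every value instead of only the single most frequent one."""
    Q, n = len(cluster_rows[0]), len(cluster_rows)
    return [{v: c / n for v, c in Counter(r[q] for r in cluster_rows).items()}
            for q in range(Q)]

def mav_dissimilarity(row, mode, weights):
    """Weighted dissimilarity to a multi-attribute-value mode:
    1 minus the frequency of the row's value in the mode, per attribute."""
    return sum(w * (1.0 - freqs.get(v, 0.0))
               for v, freqs, w in zip(row, mode, weights))
```

Under this scheme a value that occurs in, say, 40% of a cluster still contributes to the center instead of being discarded whenever it is not the single most frequent one, which is the drawback of single-value modes described above.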
Data Science and Big Data

Today, massive amounts of data are generated continuously: the data produced in the past 10 minutes already exceed everything humanity produced before 2003. Every kind of human activity keeps generating streams of event data, and together these form a web, the Internet of Events. Its data come mainly from four sources:

1. Internet of Content, e.g. web page data
2. Internet of People, generated through people's relationships on social networks
3. Internet of Things, data from connected devices
4. Internet of Places, geographic location data

The exponential growth of data recalls Moore's Law: the number of transistors on a chip doubles every two years, so over the past 40 years the count has grown by a factor of 2^20 = 1,048,576. This growth is staggering. Forty years ago the flight from Amsterdam to New York took 7 hours; had aircraft speed followed Moore's Law, the trip would now take about 0.024 seconds (alas, aviation did not develop that fast). Today the concern is no longer how to generate data, but how to discover value in massive amounts of it.
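A quick check of the arithmetic behind the flight example:

```python
# Moore's Law over 40 years = 20 doublings
seconds = 7 * 3600       # the 7-hour Amsterdam-New York flight, in seconds
factor = 2 ** 20         # 1,048,576
print(seconds / factor)  # ~0.024 seconds
```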
In big data, four V's are usually emphasized:

1. Volume: massive amounts of data
2. Velocity: the data keep changing
3. Variety: many kinds of data (text, images, audio, video, ...)
4. Veracity: the trustworthiness of the data

Data science, in turn, asks four questions:

1. What happened?
2. Why did it happen?
3. What will happen? (prediction)
4. What is the best that can happen?

This course focuses on process-based data, using event data to improve processes.
Clustering Large Categorical Data Set via Genetic Algorithms
Algoritmi genetici per il raggruppamento di dati qualitativi

Raffaella Piccarreta
Istituto di Metodi Quantitativi, Università Bocconi, Milano
raffaella.piccarreta@uni-bocconi.it

Abstract: We illustrate two extensions, proposed by Huang (1998) and by Kauffman and Rousseeuw (1990), of the well-known moving-centers (K-means) clustering algorithm to the case of categorical variables. We then propose a new clustering methodology based on the logic of genetic algorithms. The application of these algorithms to a real data set and the analysis of the results obtained show, in this case as in the quantitative one, that their use is worthwhile.

Keywords: clustering, categorical variables, genetic algorithms, K-means algorithm

1. Introduction

Cluster analysis is one of the most important operations when dealing with large data sets: partitioning the objects into homogeneous clusters, which can then be analyzed separately if desired, considerably simplifies the analysis. The aim of this work is to present techniques for clustering large data sets when measurements are available on Q categorical variables, X_1, X_2, ..., X_Q, each describing a characteristic.

Clustering can be performed with either hierarchical or non-hierarchical procedures. Hierarchical procedures require, as a preliminary step, the definition of a dissimilarity matrix D = [d_ij], where d_ij measures the dissimilarity between the i-th and the j-th object. For categorical data it is well known (see e.g. Kauffman and Rousseeuw, 1990) that there are several proper ways to measure the dissimilarity between two objects. When the number of objects to be clustered is very high, hierarchical procedures are inefficient, and non-hierarchical procedures are preferred, such as the well-known K-means algorithm (MacQueen, 1967), which processes large data sets efficiently but often converges to a local optimum (MacQueen, 1967). To improve on the "classical" version of the algorithm, modern optimization techniques can be used, such as genetic algorithms (Murthy and Chowdhury, 1996; Paterlini et al., 2001).

A first way to apply the K-means algorithm to cluster N objects on the basis of Q categorical variables is to convert multiple-category attributes into binary attributes, each measuring the presence (1) or absence (0) of a certain characteristic. This procedure becomes inefficient when the number of levels of the variables involved is very high.

In this work we therefore refer to two proposals, the first introduced by Kauffman and Rousseeuw (1990) and the second by Huang (1998), both of which extend the K-means algorithm to the categorical case while avoiding the transformation of the categorical problem into a quantitative one. More precisely, our aim is to improve the results obtained with these techniques by means of genetic algorithms.

The paper is organized as follows: Section 2 describes the two techniques mentioned above and presents the genetic approach; Section 3 illustrates an application to real data.
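To make the dissimilarity matrix D = [d_ij] concrete for categorical records, here is a small sketch using the simple matching coefficient discussed in Section 2; the function names are ours, and the dense N x N matrix is exactly what becomes infeasible for large N:

```python
def simple_matching(x, y):
    """Number of attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(x, y))

def dissimilarity_matrix(rows):
    """Dense N x N matrix D = [d_ij]; feasible only for moderate N,
    which is why non-hierarchical methods are preferred for large data."""
    n = len(rows)
    return [[simple_matching(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]

# Example: three records on Q = 3 categorical variables
rows = [("red", "small", "metal"),
        ("red", "large", "metal"),
        ("blue", "large", "wood")]
print(dissimilarity_matrix(rows))  # [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
```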
2. K-means algorithms for categorical data

It is worth recalling that the K-means algorithm consists of the following steps: (1) select the initial K seeds; (2) compute the dissimilarity between each observation and the seeds; (3) allocate each observation to the nearest seed; (4) synthesize the observations in each cluster: the K means of the obtained clusters become the new seeds. Steps (2)-(4) are repeated until the algorithm converges.

The main features of Huang's (1998) algorithm (henceforth HA) concern steps (2) and (4). The dissimilarity measure is the well-known simple matching coefficient: the distance between the i-th object, characterized by $\mathbf{x}_i = (x_{i1}, \dots, x_{iQ})$, and the j-th object, characterized by $\mathbf{x}_j = (x_{j1}, \dots, x_{jQ})$, is

$d_{ij} = d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{q=1}^{Q} \delta(x_{iq}, x_{jq})$, with $\delta(x_{iq}, x_{jq}) = 0$ if $x_{iq} = x_{jq}$ and 1 otherwise.

The second modification concerns the synthesis of the observations in each cluster: instead of the means, the modes of the clusters are considered, so each variable is synthesized by its most frequent category (note that a set of data can have more than one mode). An important consideration applies to step (1), the choice of the initial seeds: in the absence of prior information, K observations are randomly selected as seeds. The algorithm evidently depends on this choice, and the dependence can be more critical than in the quantitative case, the mode being in a sense less flexible than the mean as a synthesis of a variable. Huang actually proposes a different method for selecting the initial seeds: the most frequent categories of the different variables are assigned equally across the initial seeds, yielding a set of artificial representative seeds; then the object most similar to each seed is identified and substituted for the seed itself. At least in our opinion, this procedure (which cannot be performed automatically) is very hard to implement when many variables are taken into account.

The partitioning around medoids algorithm (henceforth KR1) introduced by Kauffman and Rousseeuw (1990) is based on the search for K representative objects, called medoids, among the objects of the data set. KR1 proceeds as follows: in a first phase, K representative objects are found; then K clusters are formed by assigning each observation to the nearest medoid, the dissimilarity between two objects again being measured by the simple matching coefficient. The main problem to be addressed is clearly the selection of the medoids. Kauffman and Rousseeuw (1990) propose to this end an algorithm that is feasible when the number of observations is limited. When the number of objects to be clustered is high, as in the case considered here, they propose a modification (henceforth KR2): a sample of cases is first drawn and clustered into K groups using the K-medoids method, and the objects not in the sample are then assigned to the nearest medoid (obtained on the basis of the sample). This procedure is repeated t times (the authors suggest t = 5) and the best clustering is chosen, goodness being evaluated by the total within-cluster dissimilarity. Results in this case evidently depend strongly on the drawn sub-samples.
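A minimal sketch of HA's iteration under these definitions, assuming random initial seeds (the helper names are ours, not Huang's):

```python
import random
from collections import Counter

def match_dist(x, y):
    """Simple matching coefficient: number of attributes that disagree."""
    return sum(a != b for a, b in zip(x, y))

def k_modes_ha(rows, K, max_iter=100, seed=0):
    """Huang-style K-Modes: steps (2)-(4) repeated until assignments stabilize."""
    rng = random.Random(seed)
    Q = len(rows[0])
    centers = [list(r) for r in rng.sample(rows, K)]  # step (1): random seeds
    labels = None
    for _ in range(max_iter):
        # Steps (2)-(3): assign every object to its nearest center.
        new_labels = [min(range(K), key=lambda k: match_dist(r, centers[k]))
                      for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Step (4): the per-attribute mode of each cluster becomes the new center.
        for k in range(K):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centers[k] = [Counter(m[q] for m in members).most_common(1)[0][0]
                              for q in range(Q)]
    return labels, centers
```

Running it twice with different values of `seed` will generally give different partitions, which is precisely the sensitivity to initial seeds discussed above.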
Our aim in this work is to analyze the performance of genetic approaches (Mitchell, 1996) in improving the results obtained with the algorithms described above. Genetic algorithms are used to solve optimization problems. At the first step of the algorithm, an artificial population of candidate states is created; each candidate state represents a possible solution to the optimization problem: its characteristics as a solver are embodied in its genetic code, and its performance on the problem is evaluated by means of the so-called fitness function. The population is forced to evolve according to genetic operators (such as selection, crossover, mutation). More precisely, some states of the current population are selected with probability proportional to their fitness and exposed to random mutations; new individuals are thus created, worse solutions are suppressed, and the current population evolves to a new generation possibly made of better individuals (solutions).

A crucial point in defining a genetic algorithm is the choice of the genetic code to attribute to a possible solution. In this work we adopt two coding schemes. In the first algorithm, GA1, the artificial individual is a string of K cells, each containing the identification number of an observation selected to be an initial seed in HA or a medoid in KR1 and KR2. In the second algorithm, GA2, the individual is a string of Q × K cells; each group of Q cells represents an initial seed in HA, the q-th cell containing one of the (randomly selected) categories of the q-th variable. GA2 was defined by adapting to the categorical case the algorithm recently proposed by Paterlini et al. (2001) to improve the performance of K-means in the quantitative case. Both algorithms use the same genetic operators. As fitness function we consider the total within-cluster dissimilarity

$W = \sum_{k=1}^{K} \sum_{i=1}^{n_k} d(\mathbf{x}_{ik}, \mathbf{s}_k)$,

where $\mathbf{s}_k$ denotes the representative individual of the k-th cluster (a medoid or a vector of modes).
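A minimal sketch of GA1 under the coding just described, where a chromosome is a string of K observation indices; the selection, crossover, and mutation details below are illustrative assumptions (e.g. truncation selection rather than fitness-proportional selection), not the paper's exact operators:

```python
import random

def match_dist(x, y):
    return sum(a != b for a, b in zip(x, y))

def fitness_W(chrom, rows):
    """Total within-cluster dissimilarity W for the medoids indexed by chrom."""
    medoids = [rows[i] for i in chrom]
    return sum(min(match_dist(r, m) for m in medoids) for r in rows)

def ga1(rows, K, pop_size=30, gens=50, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    pop = [rng.sample(range(n), K) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda c: fitness_W(c, rows))  # lower W = fitter
        parents = pop[: pop_size // 2]              # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, K)               # one-point crossover,
            child = (a[:cut] + [g for g in b if g not in a[:cut]])[:K]
            if rng.random() < p_mut:                # mutate to an unused index
                child[rng.randrange(K)] = rng.choice(
                    [i for i in range(n) if i not in child])
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda c: fitness_W(c, rows))
```

The returned chromosome supplies the K initial seeds for HA, or the K medoids for KR1/KR2; the same loop serves GA2 by swapping the coding and fitness evaluation.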
3. Empirical results

In this section we illustrate the results obtained by applying the algorithms described in the previous section to a real data set containing various information about 1038 investors in the Italian market (portfolio choice, satisfaction with the performance of the chosen investment, socio-demographic information, and so on). The variables considered are Q = 23, and some (for example, profession) have a high number of categories. Table 1 presents the results for K = 10. In the table, W is the total within-cluster dissimilarity, while

$T = \sum_{i=1}^{N} d(\mathbf{x}_i, \mathbf{s})$,

where N is the sample size and $\mathbf{s}$ denotes the vector of marginal modes in HA and the "central" observation (the one most similar to all the others) in KR.

As the table shows, the application of genetic algorithms improves the performance of the original algorithms. Regarding the algorithms of Kauffman and Rousseeuw, note that KR1 has, as expected, the best performance, while KR2 performs worse. The genetic algorithm improves on the results of KR2 and thus constitutes a valid alternative when the sample size is so high that KR1 cannot be used.

Table 1. Fitness values for different algorithms

Method         W      1 − W/T
HA             5352   0.45988
GA1 for HA     5038   0.49157
GA2 for HA     5024   0.49299
KR1            5124   0.50363
KR2            5226   0.49375
GA1 for KR2    5134   0.50266

The results illustrated show the good performance of genetic algorithms even when applied to clustering a large categorical data set by means of a properly defined extension of the K-means algorithm.

In the example presented, the number of clusters is considered fixed, but the genetic algorithm can obviously be extended to determine simultaneously the best number of clusters and the best clustering. As a second important point, the genetic algorithm can also be modified to automatically select either a subset of variables for the clustering problem (a very important issue in data mining) or, for categorical variables with a large number of levels, to determine simultaneously the subsets of variables and of categories to retain in the analysis. This can be very important for identifying the information that is actually relevant in determining the clusters.

References

Baragona R., Calzini C., Battaglia F. (2001) Genetic algorithms and clustering: an application to Fisher's Iris data, in S. Borra et al. (eds.), Advances in Classification and Data Analysis, Springer-Verlag, Berlin/Heidelberg.
Falkenauer E. (1998) Genetic Algorithms and Grouping Problems, Wiley & Sons, Chichester.
Huang Z. (1998) A fast clustering algorithm to cluster very large categorical data sets in data mining.
Kauffman L., Rousseeuw P.J. (1990) Finding Groups in Data, Wiley & Sons, New York.
MacQueen J.B. (1967) Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
Mitchell M. (1996) An Introduction to Genetic Algorithms, MIT Press, Cambridge, Mass.
Murthy C.A., Chowdhury N. (1996) In search of optimal clusters using genetic algorithms, Pattern Recognition Letters, 17, 825-832.
Paterlini S., Favaro S., Minerva T. (2001) Genetic approaches for data clustering using partial information, Book of Short Papers, Cladag 2001, Palermo, 33-36.