Mining comparable bilingual text corpora for cross-language information integration
- Format: PDF
- Size: 186.88 KB
- Pages: 6
1R 算法名称Activation function 激励函数Adaptive classifier combination (ACC) 自适应分类器组合Adaptive 自适应Additive 可累加的Affinity analysis 亲和力分析Affinity 亲和力Agglomerative clustering 凝聚聚类Agglomerative 凝聚的Aggregate proximity relationship 整体接近关系Aggregate proximity 整体接近Aggregation hierarchy 聚合层次AGNES 算法名称AIRMA 集成的自回归移动平均Algorithm 算法Alleles 等位基因Alternative hypothesis 备择假设Approximation 近似Apriori 算法名称AprioriAll 算法名称Apriori-Gen 算法名称ARGen 算法名称ARMA 自回归移动平均Artificial intelligence (AI) 人工智能Artificial neural networks (ANN) 人工神经网络Association rule problem 关联规则问题Association rule/ Association rules 关联规则Association 关联Attribute-oriented induction 面向属性的归纳Authoritative 权威的Authority权威Autocorrelation coefficient 自相关系数Autocorrelation 自相关Autoregression 自回归Auto-regressive integrated moving average 集成的自回归移动平均Average link 平均连接Average 平均Averagelink 平均连接Backlink 后向链接back-percolation 回滤Backpropagation 反向传播backward crawling 后向爬行backward traversal 后向访问BANG 算法名称Batch gradient descent 批量梯度下降Batch 批量的Bayes Rule 贝叶斯规则Bayes Theorem 贝叶斯定理Bayes 贝叶斯Bayesian classification 贝叶斯分类BEA 算法名称Bias 偏差Binary search tree 二叉搜索树Bipolar activation function 双极激励函数Bipolar 双极BIRCH 算法名称Bitmap index 位图索引Bivariate regression 二元回归Bond Energy algorithm 能量约束算法Boosting 提升Border point 边界点Box plot 箱线图Boyer-Moore (BM) 算法名称Broker 代理b-tree b树B-tree B-树C4.5 算法名称C5 算法名称C5.0 算法名称CACTUS 算法名称Calendric association rule 日历关联规则Candidate 候选CARMA 算法名称CART 算法名称Categorical 类别的CCPD 算法名称CDA 算法名称Cell 单元格Center 中心Centroid 质心CF tree 聚类特征树CHAID 算法名称CHAMELEON 算法名称Characterization 特征化Chi squared automatic interaction detection 卡方自动交互检测Chi squared statistic 卡方统计量Chi squared 卡方Children 孩子Chromosome 染色体CLARA 算法名称CLARANS 算法名称Class 类别Classification and regression trees 分类和回归树Classification rule 分类规则Classification tree 分类树Classification 分类Click 点击Clickstream 点击流Clique 团Cluster mean 聚类均值Clustering feature 聚类特征Clustering problem 聚类问题,备选定义Clustering 聚类Clusters 簇Collaborative filtering 协同过滤Combination of multiple classifiers (CMC) 多分类器组合Competitive layer 竞争层Complete link 
全连接Compression 压缩Concept hierarchy 概念层次Concept 概念Confidence interval 置信区间Confidence 置信度Confusion matrix 混淆矩阵Connected component 连通分量Contains 包含Context Focused Crawler (CFC) 上下文专用爬虫Context graph 上下文图Context layer 上下文层Contiguous subsequence 邻接子序列Contingency table 列联表Continuous data 连续型数据Convex hull 凸包Conviction 信任度Core points 核心点Correlation coefficient r 相关系数rCorrelation pattern 相关模式Correlation rule 相关规则Correlation 相关Correlogram 相关图Cosine 余弦Count distribution 计数分配Covariance 协方差Covering 覆盖Crawler 爬虫CRH 算法名称Cross 交叉Crossover 杂交CURE 算法名称Customer-sequence 客户序列Cyclic association rules 循环关联规则Data bubbles 数据气泡Data distribution 数据分配Data mart 数据集市Data Mining Query Language (DMQL) 数据挖掘查询语言Data mining 数据挖掘Data model 数据模型Data parallelism 数据并行Data scrubbing 数据清洗Data staging 数据升级Data warehouse 数据仓库Database (DB) 数据库Database Management System 数据库管理系统Database segmentation 数据库分割DBCLASD 算法名称DBMS 数据库管理系统DBSCAN 算法名称DDA算法名称,数据分配算法Decision support systems (DSS) 决策支持系统Decision tree build 决策树构造Decision tree induction 决策树归纳Decision tree model (DT model) 决策树模型Decision tree processing 决策树处理Decision tree 决策树Decision trees 决策树Delta rule delta规则DENCLUE 算法名称Dendrogram 谱系图Density-reachable 密度可达的Descriptive model 描述型模型Diameter 直径DIANA 算法名称dice 切块Dice 一种相似度量标准Dimension modeling 维数据建模Dimension table 维表Dimension 维Dimensionality curse 维数灾难Dimensionality reduction 维数约简Dimensions 维Directed acyclic graph (DAG) 有向无环图Direction relationship 方向关系Directly density-reachable 直接密度可达的Discordancy test 不一致性测试Dissimilarity measure 差别度量Dissimilarity 差别Distance based simple 基于距离简单的Distance measure 距离度量Distance scan 距离扫描Distance 距离Distiller 提取器Distributed 分布式的Division 分割Divisive clustering 分裂聚类Divisive 分裂的DMA 算法名称Domain 值域Downward closed 向下封闭的Drill down 下钻Dynamic classifier selection (DCS) 动态分类器选择EM 期望最大值Encompassing circle 包含圆Entity 实体Entity-relationship data model 实体-关系数据模型Entropy 熵Episode rule 情节规则Eps-neighborhood ϵ-邻域Equivalence classes 等价类Equivalent 等价的ER data model ER数据模型ER diagram ER图Euclidean distance 欧几里得距离Euclidean 
欧几里得Evaluation 评价Event sequence 事件序列Evolutionary computing 进化计算Executive information systems (EIS) 主管信息系统Executive support systems (ESS) 主管支持系统Exhaustive CHAID 穷尽CHAIDExpanded dimension table 扩张的维表Expectation-maximization 期望最大化Exploratory data analysis 探索性数据分析Extensible Markup Language 可扩展置标语言Extrinsic 外部的Fact table 事实表Fact 事实Fallout 错检率False negative (FN) 假反例False positive (FP) 假正例Farthest neighbor 最远邻居FEATUREMINE 算法名称Feedback 反馈Feedforward 前馈Finite state machine (FSM) 有限状态机Finite state recognizer 有限状态机识别器Firefly 算法名称Fires 点火Firing rule 点火规则Fitness function 适应度函数Fitness 适应度Flattened dimension table 扁平的维表Flattened 扁平的Focused crawler 专用爬虫Forecasting 预报Forward references 前向访问Frequency distribution 频率分布Frequent itemset 频率项目集Frequent 频率的Fuzzy association rule 模糊关联规则Fuzzy logic 模糊逻辑Fuzzy set 模糊集GA clustering 遗传算法聚类Gain 增益GainRatio 增益比率Gatherer 收集器Gaussian activation function 高斯激励函数Gaussian 高斯GDBSCAN 算法名称Gene 基因Generalization 泛化,一般化Generalized association rules 泛化关联规则Generalized suffix tree (GST) 一般化的后缀树Generate rules 生成规则Generating rules from DT 从决策树生成规则Generating rules from NN 从神经网络生成规则Generating rules 生成规则Generic algorithms 遗传算法Genetic algorithm 遗传算法Genetic algorithms (GA) 遗传算法Geographic Information Systems (GIS) 地理信息系统Gini 吉尼Gradient descent 梯度下降g-sequence g-序列GSP 一般化的序列模式Hard focus 硬聚焦Harvest rate 收获率Harvest 一个Web内容挖掘系统Hash tree 哈希树Heapify 建堆Hebb rule hebb规则Hebbian learning hebb学习Hidden layer 隐层Hidden Markov Model 隐马尔可夫模型Hidden node 隐节点Hierarchical classifier 层次分类器Hierarchical clustering 层次聚类Hierarchical 层次的High dimensionality 高维度Histogram 直方图HITS 算法名称Hmm 隐马尔可夫模型HNC Risk Suite 算法名称HPA 算法名称Hub 中心Hybrid Distribution (HD) 混合分布Hybrid OLAP (HOLAP) 混合型联机分析处理Hyper Text Markup Language 超文本置标语言Hyperbolic tangent activation function 双曲正切激励函数Hyperbolic tangent 双曲正切Hypothesis testing 假设检验ID3 算法名称IDD 算法名称Inverse document frequency(IDF) 文档频率倒数Image databases 图像数据库Incremental crawler 增量爬虫Incremental gradient descent 增量梯度下降Incremental rules 增量规则Incremental updating 增量更新Incremental 
增量的Individual 个体Induction 归纳Information gain 信息增益Information retrieval (IR) 信息检索Information 信息Informational data 情报数据Input layer 输入层Input node 输入节点Integration 集成Interconnections 相互连接Interest 兴趣度Interpretation 解释Inter-transaction association rules 事务间关联规则Inter-transaction 事务之间Intra-transaction association rules 事务内关联规则Intra-transaction 事务之内Intrinsic 内部的Introduction 引言IR 信息检索Isothetic rectangle isothetic矩形Issues 问题Itemset 项目集Iterative 迭代的JaccardJaccard’s coefficient Jaccard系数Jackknife estimate 折叠刀估计Java Data Mining (JDM) Java数据挖掘Join index 连接索引K nearest neighbors (KNN) K最近邻K-D tree K-D树KDD object KDD对象KDD process KDD过程Key 键K-means K-均值K-Medoids K-中心点K-Modes K-模KMP 算法名称Knowledge and data discovery management system (KDDMS) 知识与数据发现管理系统Knowledge discovery in databases (KDD) 数据库知识发现Knowledge discovery in spatial databases 空间数据库知识发现Knuth-Morris-Pratt algorithm 算法名称Kohonen self organizing map Kohonen自组织映射k-sequence K-序列Lag 时滞Large itemset property 大项集性质Large itemset 大项集Large reference sequence 强访问序列Large sequence property 大序列性质Large 大Learning parameter 学习参数Learning rule 学习准则Learning 学习Learning-rate 学习率Least squares estimates 最小二乘估计Levelized dimension table 层次化维表Lift 作用度Likelihood 似然Linear activation function 线性激励函数Linear discriminant analysis (LDA) 线性判别分析Linear filter 线性滤波器Linear regression 线性回归Linear 线性Linear 线性的Link analysis 连接分析Location 位置LogisticLogistic regression logistic回归Longest common subseries 最长公共子序列Machine learning 机器学习Major table 主表Manhattan distance 曼哈顿距离Manhattan 曼哈顿Map overlay 地图覆盖Market basket analysis 购物篮分析Market basket 购物篮Markov Model (MM) 马尔可夫模型Markov Property 马尔可夫性质Maximal frequent forward sequences 最长前向访问序列Maximal forward reference 最长前向访问Maximal reference sequences 最长访问序列Maximum likelihood estimate (MLE) 极大似然估计MBR 最小边界矩形Mean squared error (MSE) 均方误差Mean squared 均方Mean 均值Median 中值Medoid 中心点Merged context graph 合并上下文图Method of least squares 最小二乘法Metric 度量Minimum bounded rectangle 最小边界矩形Minimum item supports 最小项目支持度Minimum Spanning Tree algorithm 
最小生成树算法Minimum Spanning Tree (MST) 最小生成树Minor table 副表MinPts 输入参数名称MINT 一种网络查询语言MISapriori 算法名称Mismatch 失配Missing data 缺失数据Mode 模Momentum 动量Monothetic 单一的Moving average 移动平均Multidimensional Database (MDD) 多维数据库Multidimensional OLAP (MOLAP) 多维OLAP Multilayer perceptron (MLP) 多层感知器Multimedia data 多媒体数据Multiple Layered DataBase (MLDB) 多层数据库Multiple linear regression 多元线性回归Multiple-level association rules 多层关联规则Mutation 变异Naïve Bayes 朴素贝叶斯Nearest hit 同类最近Nearest miss 异类最近Nearest Neighbor algorithm 最近邻算法Nearest neighbor query 最近邻查询Nearest neighbor 最近邻Nearest Neighbors 最近邻Negative border 负边界Neighborhood graph 近邻图Neighborhood 邻居Neural network (NN) 神经网络Neural network model (NN model) 神经网络模型Neural networks 神经网络Noise 噪声Noisy data 噪声数据Noncompetitive learning 非竞争性学习Nonhierarchical 非层次的Nonlinear regression 非线性回归Nonlinear 非线性的Nonparametric model 非参数模型Nonspatial data dominant generalization 以非空间数据为主的一般化Nonspatial hierarchy 非空间层次Nonstationary 非平稳的Normalized dimension table 归一化维表NSD CLARANS 算法名称Null hypothesis 空假设OAT 算法名称Observation probability 观测概率OC curve OC曲线Ockham’s razor 奥卡姆剃刀Offline gradient descent 离线梯度下降Offline 离线Offspring 子孙OLAP 联机分析处理Online Analytic Processing 联机分析处理Online gradient descent 在线梯度下降Online transaction processing (OLTP) 联机事务处理Online 在线Operational characteristic curve 操作特征曲线Operational data 操作型数据OPTICS 算法名称OPUS 算法名称Outlier detection 异常点检测Outlier 异常点Output layer 输出层Output node 输出结点Overfitting 过拟合Overlap 重叠Page 页面PageRank 算法名称PAM 算法名称Parallel algorithms 并行算法Parallel 并行的Parallelization 并行化Parametric model 参数模型Parents 双亲Partial-completeness 部分完备性Partition 分区Partitional clustering 基于划分的聚类Partitional MST 划分MST算法Partitional 划分的Partitioning Around Medoids 围绕中心点的划分Partitioning 划分Path completion 路径补全Pattern detection 模式检测Pattern discovery 模式发现Pattern matching 模式匹配Pattern Query Language (PQL) 模式查询语言Pattern recognition 模式识别Pattern 模式PDM 算法名称Pearson’s r 皮尔逊系数rPerceptron 感知器Performance measures 性能度量Performance 性能Periodic crawler 周期性爬虫Personalization 个性化Point estimation 
点估计PolyAnalyst 附录APolythetic 多的Population 种群Posterior probability 后验概率Potentially large 潜在大的Precision 查准率Predicate set 谓词集合Prediction 预测Predictive model 预测型模型Predictive Modeling Mark-Up Language (PMML) 预测模型置标语言Predictor 预测变量Prefix 前缀Preprocessing 预处理Prior probability 先验概率PRISM 算法名称Privacy 隐私Processing element function 处理单元函数Processing elements 处理单元Profile association rule (PAR) 简档关联规则Profiling 描绘Progressive refinement 渐进求精Propagation 传播Pruning 剪枝Quad tree 4叉树Quantitative association rule 数量关联规则Quartiles 4分位数Query language 查询语言Querying 查询QUEST 算法名称R correlation coefficient r相关系数Radial basis function (RBF) 径向基函数Radial function 径向函数Radius 半径RainForest 算法名称Range query 范围查询Range 全距Rank sink 排序沉没Rare item problem 稀疏项目问题Raster 光栅Ratio rule 比率规则RBF network 径向基函数网络Recall 召回率Receiver operating characteristic curve 接受者操作特征曲线Recurrent neural network 递归神经网络Referral analysis 推荐分析Region query 区域查询Regression coefficients 回归系数Regression 回归Regressor 回归变量Related concepts 相关概念Relation 关系Relational algebra 关系代数Relational calculus 关系演算Relational model 关系模型Relational OLAP 关系OLAPRelationship 关系Relative operating characteristic curve 相对操作特征曲线Relevance 相关性Relevant 相关的Reproduction 复制Response 响应Return on investment 投资回报率RMSE 均方根误差RNN 递归神经网络Robot 机器人ROC curve ROC曲线ROCK algorithm ROCK算法ROI 投资回报率ROLAP 关系型联机分析处理Roll up 上卷Root mean square error 均方根误差Root mean square (Rms) 均方根Roulette wheel selection 轮盘赌选择R-tree R树Rule extraction 规则抽取Rules 规则Sampling 抽样SAND 算法名称Satisfy 满足Scalability 可伸缩性Scalable parallelizable induction of decision trees 决策树的可伸缩并行归纳Scatter diagram 散点图Schema 模式SD CLARANS 算法名称,空间主导的Search engine 搜索引擎Search 搜索Seed URL 种子URLSegmentation 分割Segments 片段Selection 选择Self organizing feature map (SOFM) 自组织特征映射Self organizing map (SOM) 自组织映射Self organizing neural networks 自组织神经网络Self organizing 自组织Semantic index 语义索引Sequence association rule problem 序列关联规则问题Sequence association rule 序列关联规则Sequence association rules 序列关联规则Sequence classifier 序列分类器Sequence discovery 序列发现Sequence 序列Sequential 
analysis 序列分析Sequential pattern 序列模式Sequential patterns 序列模式Serial 串行的Session 会话Set 集合SGML 一种置标语言Shock 冲击Sigmoid activation function S型激励函数Sigmoid S型的Silhouette coefficient 轮廓系数Similarity measure 相似性度量Similarity measures 相似性度量Similarity 相似性Simple distance based 简单基于距离的Simultaneous 同时的Single link 单连接Slice 切片Sliding window 滑动窗口SLIQ 算法名称Smoothing 平滑Snapshot 快照Snowflake schema 雪花模式Soft focus 软聚焦SPADE 算法名称Spatial Association Rule 空间关联规则Spatial association rules 空间关联规则Spatial characteristic rules 空间特征规则Spatial clustering 空间聚类Spatial data dominant generalization 以空间数据为主的一般化Spatial data mining 空间数据挖掘Spatial data 空间数据Spatial database 空间数据库Spatial Decision Tree 空间决策树Spatial discriminant rule 空间数据判别规则Spatial hierarchy 空间数据层次Spatial join 空间连接Spatial mining 空间数据挖掘Spatial operator 空间运算符Spatial selection 空间选择Spatial-data-dominant 空间数据主导Spider 蜘蛛Splitting attributes 分裂属性Splitting predicates 分裂谓词Splitting 分裂SPRINT 算法名称SQL 结构化查询语言Squared Error algorithm 平方误差算法Squared error 平方误差Squashing function 压缩函数Standard deviation 标准差Star schema 星型模式Stationary 平稳的Statistical inference 统计推断Statistical significance 统计显著性Statistics 统计学Step activation function 阶跃激励函数Step 阶跃Sting build 算法名称STING 算法名称Strength 强度String to String Conversion 串到串转换Subepisode 子情节Subsequence 子序列Subseries 子序列Subtree raising 子树上升Subtree replacement 子树替代Suffix tree 后缀树Summarization 汇总Supervised learning 有指导的学习Support 支持度SurfAid Analytics 附录ASurprise 惊奇度Targeting 瞄准Task parallelism 任务并行Temporal association rules 时序关联规则Temporal database 时序数据库Temporal mining 时序数据挖掘Temporal 时序Term frequency (TF) 词频Thematic map 主题地图Threshold activation function 阈值激励函数Threshold 阈值Time constraint 时间约束Time line 大事记Time series analysis 时间序列分析Time series 时间序列Topological relationship 拓扑关系Training data 训练数据Transaction time 事务时间Transaction 事务Transformation 变换Transition probability 转移概率Traversal patterns 浏览模式Trend dependency 趋势依赖Trend detection 趋势检测Trie 一种数据结构True negative (TN) 真反例True positive (TP) 真正例Unbiased 无偏的Unipolar activation function 单极激励函数Unipolar 
单极的Unsupervised learning 无指导学习Valid time 有效时间Variance 方差Vector 向量Vertical fragment 纵向片段Virtual warehouse 虚拟数据仓库Virtual Web View (VWV) 虚拟Web视图Visualization 可视化Voronoi diagram Voronoi图Voronoi polyhedron Voronoi多面体WAP-tree WAP树WaveCluster 算法名称Wavelet transform 小波变换Web access patterns Web访问模式Web content mining Web内容挖掘Web log Web日志Web mining Web挖掘Web usage mining Web使用挖掘Web Watcher 一种方法WebML 一种Web挖掘查询语言White noise 白噪声WordNet Semantic NetworkWordNet 一个英语词汇数据库。
Bitcoin Mnemonic Phrases
A Bitcoin mnemonic phrase is an important tool for recovering a Bitcoin wallet.
It takes the form of a sequence of words, each drawn from a fixed word list.
Together the words encode the seed from which the wallet's private and public keys are derived; the transaction history itself is recorded on the blockchain, not inside the mnemonic.
The generation of a mnemonic is based on a mathematical algorithm, but in this article we will avoid formulas
and instead describe the process in plain, accessible language.
A Bitcoin mnemonic is generated from a source of entropy.
This entropy source can be a random number generator or a piece of random text.
From this entropy the system produces a sequence of random values, which are then converted into a group of words.
The group usually contains 12 or 24 words, each selected from a predefined word list.
The word list exists to prevent the generated mnemonics from being ambiguous or misleading.
It has therefore been carefully curated so that every word is distinct and easy to remember and to write down.
The generation process is fully deterministic: given the same entropy and the same algorithm, the same mnemonic is always produced.
This is what makes wallet recovery convenient: entering the correct mnemonic is enough for the software to reconstruct the wallet's keys and, from them, all of its information.
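The deterministic entropy-to-words mapping described above can be sketched in code. This is a simplified illustration of the BIP-39 scheme, not wallet software: the placeholder word list below is made up (the real standard fixes a specific, published list of 2048 English words), but the mechanics, entropy plus a SHA-256 checksum sliced into 11-bit word indices, are the same.

```python
import hashlib
import secrets

# Placeholder wordlist: stands in for the fixed 2048-word BIP-39 list.
WORDLIST = [f"word{i:04d}" for i in range(2048)]

def entropy_to_mnemonic(entropy: bytes) -> list:
    """Map entropy to a word sequence, BIP-39 style (sketch)."""
    assert len(entropy) in (16, 32)            # 128 or 256 bits
    checksum_bits = len(entropy) * 8 // 32     # 4 bits or 8 bits
    digest = hashlib.sha256(entropy).digest()
    # Append the leading checksum bits to the entropy bits.
    bits = int.from_bytes(entropy, "big") << checksum_bits
    bits |= digest[0] >> (8 - checksum_bits)
    total_bits = len(entropy) * 8 + checksum_bits
    n_words = total_bits // 11                 # 12 or 24 words
    words = []
    for i in range(n_words):                   # 11 bits per word
        shift = total_bits - 11 * (i + 1)
        words.append(WORDLIST[(bits >> shift) & 0x7FF])
    return words

mnemonic = entropy_to_mnemonic(secrets.token_bytes(16))
print(len(mnemonic))  # 12
```

Because the mapping depends only on the entropy, calling the function twice with the same bytes yields the same words, which is exactly the determinism the text describes.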
The importance of the mnemonic is hard to overstate.
It is the standard way to recover a wallet and the key to keeping one's funds safe.
It should therefore be stored carefully and never disclosed to others or lost.
In short, a Bitcoin mnemonic is an essential tool: it allows a wallet to be restored and personal funds to be protected.
Understanding how mnemonics are generated and used is vital for Bitcoin users.
We hope this article helps readers understand and apply Bitcoin mnemonics better.
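A quick format check for candidate mnemonics can be written as a regular expression. This is a sketch and validates only the shape (12 or 24 lowercase words separated by single spaces); it does not check membership in the official word list or the checksum, both of which a real wallet must verify.

```python
import re

WORD = r"[a-z]+"
# Exactly 12 words, or exactly 24 words, single spaces between them.
MNEMONIC_RE = re.compile(
    rf"^(?:{WORD} ){{11}}{WORD}$|^(?:{WORD} ){{23}}{WORD}$"
)

def looks_like_mnemonic(s: str) -> bool:
    """Shape check only; not a full BIP-39 validation."""
    return MNEMONIC_RE.match(s.strip()) is not None

print(looks_like_mnemonic(" ".join(["abandon"] * 12)))  # True
print(looks_like_mnemonic("only three words"))          # False
```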
Mining multiple-level spatial association rules for objects with a broad boundary

Eliseo Clementini a,*, Paolino Di Felice a, Krzysztof Koperski b

a Department of Electrical Engineering, University of L'Aquila, 67040 Poggio di Roio, L'Aquila, Italy
b Mathsoft, Inc., 1700 Westlake Ave. N., Suite 500, Seattle, WA 98109-3044, USA

Received 30 July 1999; accepted 21 December 1999

Abstract

Spatial data mining, i.e., mining knowledge from large amounts of spatial data, is a demanding field since huge amounts of spatial data have been collected in various applications, ranging from remote sensing to geographical information systems (GIS), computer cartography, environmental assessment and planning. The collected data far exceeds people's ability to analyze it. Thus, new and efficient methods are needed to discover knowledge from large spatial databases. Most of the spatial data mining methods do not take into account the uncertainty of spatial information. In our work we use objects with broad boundaries, the concept that absorbs all the uncertainty by which spatial data is commonly affected and allows computations in the presence of uncertainty without rough simplifications of the reality. The topological relations between objects with a broad boundary can be organized into a three-level concept hierarchy. We developed and implemented a method for an efficient determination of such topological relations. Based on the hierarchy of topological relations we present a method for mining spatial association rules for objects with uncertainty. The progressive refinement approach is used for the optimization of the mining process. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Association rule; Data mining; Spatial database; Topological relation; Uncertainty

Data & Knowledge Engineering 34 (2000) 251-270

* Corresponding author. Tel.: +39-862-434431; fax: +39-862-434403. E-mail addresses: eliseo@ing.univaq.it (E. Clementini), difelice@ing.univaq.it (P. Di Felice), krisk@ (K. Koperski).

0169-023X/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0169-023X(00)00017-3

1. Introduction

Large amounts of spatial data collected through remote sensing, e-commerce, and other data collection tools, make it crucial to develop tools for the discovery of interesting knowledge from large spatial databases. This situation creates the necessity of an
automated knowledge/information discovery from data, which leads to a promising emerging field, called data mining or knowledge discovery in databases (KDD). Knowledge discovery in databases can be defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [17]. Data mining represents the integration of several fields, including machine learning, database systems, data visualization, statistics, and information theory.

The majority of the data mining algorithms was developed for the analysis of relational and transactional databases, but recently non-spatial data mining techniques have been expanded toward mining from spatial data. Generalization-based spatial data mining methods [26,27] discover spatial and non-spatial relations at a general concept level, where spatial objects are expressed as merged spatial regions [26] or clustered spatial points [27,16,18]. However, these methods do not discover rules reflecting the structure of spatial objects and spatial/spatial or spatial/non-spatial relations that contain spatial predicates, such as adjacent, inside, close, and intersects. Spatial association rules [24], i.e., the rules of the form "P ⇒ R", where P and R are sets of predicates, use spatial and non-spatial predicates in order to describe spatial objects using relations with other objects. For example, a rule "is_a(X, university) ⇒ inside(X, city)" is a spatial association rule. The refined spatial predicates used in [24] were not analyzed at different abstraction levels, as we did in this paper.

In a large database many association relations may
exist, but some may occur rarely or may not hold in most cases. To focus our study on patterns that are relatively "strong", i.e., patterns that occur frequently and hold in most cases, the concepts of minimum support and minimum confidence are used [1,2]. Informally, the support of a pattern P in a set of spatial objects S is the probability that a member of S satisfies pattern P; and the confidence of the rule P ⇒ R is the probability that the pattern R occurs, if the pattern P occurs. A user or an expert may specify thresholds to confine the discovered rules to be the strong ones, that is, using the patterns that occur frequently and the rules that demonstrate relatively strong implication relations.

Most spatial data mining methods, including previous study on spatial association rules [24], use spatial objects with exactly known location. However, in real situations the extensions of spatial objects can be known only with a finite accuracy. There are different sources that cause spatial information to be uncertain: incompleteness, inconsistency, vagueness, imprecision, and error [31]. Incompleteness is related to totally or partly missing data: the prototypical situation of this kind is when a dataset is obtained from digitizing paper maps and pieces of lines are missing. Inconsistency arises when several versions of the same object exist, due either to different time snapshots, or datasets of different sources, or different abstraction levels. Vagueness is an intrinsic property of many natural geographic features that do not have crisp or well-defined boundaries.
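The support and confidence definitions above can be made concrete with a small sketch (not the authors' code; the predicate strings and the toy dataset are invented for illustration), where each spatial object is described by the set of predicates it satisfies:

```python
# Toy dataset: each object is the set of predicates it satisfies.
objects = [
    {"is_a(X, university)", "inside(X, city)"},
    {"is_a(X, university)", "inside(X, city)", "close_to(X, park)"},
    {"is_a(X, university)", "touch(X, river)"},
    {"is_a(X, school)", "inside(X, city)"},
]

def support(pattern, data):
    """Fraction of objects satisfying every predicate in `pattern`."""
    return sum(pattern <= obj for obj in data) / len(data)

def confidence(p, r, data):
    """P(R | P) estimated as support(P union R) / support(P)."""
    return support(p | r, data) / support(p, data)

p = {"is_a(X, university)"}
r = {"inside(X, city)"}
print(support(p, objects))        # 3 of 4 objects: 0.75
print(confidence(p, r, objects))  # 2 of those 3: 0.666...
```

A rule is kept as "strong" when both numbers clear the user-specified minimum support and minimum confidence thresholds.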
Imprecision is due to a finite representation of spatial entities: the basic example of this kind is the regular tessellation used in raster data, where the element of the tessellation is the smallest unit that represents space. Error is everything that is introduced by limited means of taking measurements.

In this paper, we extend the technique for mining spatial association rules [24] toward mining in the situation when spatial information is inaccurate. Spatial predicates used in previous work are based on the assumption that the boundary of spatial objects is exactly determined, i.e., the objects are crisp. Many research papers propose an approach to deal with uncertainty in spatial data in which objects are represented by the lower and upper approximations of their extent, i.e., objects have broad boundaries [9,13,15,28]. The advantage of this approach is that it can be implemented on existing database systems at a reasonable cost: the new model can be seen as an extension of existing geometric models. Topological relations, which are the spatial predicates taken into account in this paper, have been studied for simple regions with a broad boundary in [10]. A set of topological relations for complex objects with a broad boundary has been proposed in [12], where the topological predicates have been hierarchically organized into three levels. The bottom level of the hierarchy offers detailed topological relations using an extension of the 9-intersection model [14]. The intermediate and top levels offer more abstract operators that allow users to query uncertain spatial data independently of the underlying geometric data model.

In the present work we study efficient methods for mining spatial association rules that use a progressive search technique. This technique first searches for frequent patterns at a high concept level of the topological relations. Then, only for such frequent patterns, it deepens the search to lower concept
component and A2has two components.3.The concept hierarchiesWe believe that the advancement of OLAP tools is partially related to the ability of the OLAP systems to provide multi-level,multidimensional presentation of data stored in large data ware-houses[5].The existing OLAP systems provide mainly tools for summarization and visualizationof generalized data.Therefore,a new approach to data mining,which integrates OLAP and data mining was proposed.This new approach,called On-Line Analytical Mining (OLAM)presents a promising direction for mining large databases and data warehouses [22].Analysts can interac-tively adjust the level of generalization.For example,an analyst may start with descriptions of all schools using spatial predicates such as tou h (s hool ,p rk ).If he/she wants a more detailed in-formation about di erent types of schools he/she can drill-down and use predicates such as tou h (high _s hool ,p rk ).Such progressive ``zoom in''and ``zoom out''operations can individu-alize the mining process for di erent purposes.For example,a higher level user may concentrate on general information,while other users can look for details.Concept hierarchies [20]are used in previous work on spatial association [24]to facilitate presentation of knowledge at di erent levels.As we ascend the concept hierarchy,information becomes more and more general,but it still remains consistent with the lower concept levels.For example,a concept hierarchy for ro k can be speci®ed as follows:According to it,both lowest level concepts limestone and dolomite can be generalized 1to the concept hemi l sediment ry ,which in turn can be generalized to the concept sediment ry that also includes org ni sediment ry .Concept hierarchies can be explicitly given by the experts,or in some cases they can be gen-erated automatically by data analysis [21].In some cases concept hierarchies can be encoded in a database schema.For example,there may exist separate attributes for the n me of the rock 
(e.g.,granite),group (e.g.,intrusive igneous),and type (e.g.,igneous).For the purpose of this paper we assume that non-spatial predicates are generalized based on an underlying schema.(rock(igneous (intrusive igneous (granite,diorite,F F F )),(extrusive igneous (basalt,rhyolite,F F F ))),(sedimentary (clastic sedimentary (sandstone,shale,F F F )),(chemical sedimentary (limestone,dolomite,F F F )),(organic sedimentary (chalk,coal,F F F ))),(metamorphic(foliated metamorphic (slate,gneiss,F F F )),(non-foliated metamorphic(marble,quartzite,F F F)))).posite regions with a broad boundary.1One can notice that such generalization process di ers from the cartographic map generalization process,which involves the generalization of the symbolic representations of the objects [4].254 E.Clementini et al./Data &Knowledge Engineering 34(2000)251±270The novelty of this paper is to explore the bene®ts of forming a hierarchical structure for spatial relations.Di erent kinds of spatial relations can be used for this purpose.For example,a hier-archical structure of metric relations could be based on the methods proposed in[11],but this is outside the scope of the present paper.Hereafter,we concentrate on a concept hierarchy con-cerning topological relations.Binary topological relations between two objects,e and f,in R2can be classi®ed according to the intersection of eÕs interior,boundary,and exterior with fÕs interior, boundary,and exterior.The nine intersections between the six object parts describe a topological relation and can be concisely represented by the following3Â3matrix w,called the9-interse tion [14]:MA° B°A° D B A° BÀD A B°D A D B D A BÀAÀ B°AÀ D B AÀ BÀHdIe XBy considering the values empty(0)and nonempty(1),we can distinguish between29 512 binary topological relations.For two simple regions with a1-dimensional boundary,only eight of them can be realized.For two composite regions with a1-dimensional boundary,there are eight additional matrices that can be realized 
totaling to16relations.When objects are ap-proximated using a broad boundary this number grows to56.Generally speaking,the9-in-tersection method has the advantage that users can test for a large number of spatial relations and®ne tune of the particular relation being tested.It has the disadvantage that it is too detailed for the most of the practical applications and it does not have a corresponding natural language equivalent.The topological relations between pairs of composite regions with a broad boundary at three hierarchical levels were studied in[12].Thereafter,we summarize the concepts of that paper that are necessary for the self-contained reading of the present paper.The ottom level,which deals directly with geometry,consists of the56relations de®ned in terms of the9-intersection matrices(Fig.2).Such relations can be organized in a graph where each node is labeled with a relation(for the numbering the reader may refer to[12])and the arcs express geometric proximity between relations(Fig.3).At the top level,a de®nitely smaller number of relations(namely the four relations,disjoint,tou h,overl p,and in,of the so-called g l ulusEf sed wethod(CBM)2[6,8])is su cient to describe the topological relations between pairs of composite regions with a broad boundary.Each CBM relation corresponds to a cluster of 9-intersection matrices,as depicted in Fig.3.The mapping between the two levels can be also concisely expressed by patterns of the9-intersection matrices(see Table1,where d stands for any value(0or1)).Opposite to the bottom level,the top level is much more abstract as it does not provide a user with the geometric details related to the presence of broad boundaries and multiple components.2The CBM de®nitions of the top level relations are as follows:h A Y touch Y B i X A° B° Y A B Y Yh A Y in Y B i X A B A A° B° Y Yh A Y overlap Y B i X dim A° dim B° dim A° B° A B A A B B Yh A Y disjoint Y B i X A B Y XE.Clementini et al./Data&Knowledge 
Engineering34(2000)251±270255Fig.2.The 56topological relations between composite regions with a broad boundary.256 E.Clementini et al./Data &Knowledge Engineering 34(2000)251±270More details are o ered by the intermedi te level ,structured in terms of 14relations,which results on the clustering illustrated in Fig.4.For brevity,Table 2shows only the 9-intersection patterns for the intermediate level relations descending from the top-level relations tou h and in .Table 3summarizes the three-level concept hierarchy of the topological relations.4.Determination of the topological relationsWe use decision trees to determine which predicate is satis®ed at the level l of the topological concept hierarchy.The nodes of the decision tree store tests for the intersections from the 9-in-tersection matrix.Based on the values for these intersections the search space is partitioned so,®nally a leaf node of the tree contains only a single relation at the level l .In general the average cost to determine a relation at the level l of the topological concept hierarchy can be calculated as the weighted sum of the operations needed to determine a generalization of a bottom-level re-lation i at the level l (i.e.,the length of the path)times the probability that the iexistsFig.3.The clusters of the topological relations at the top level.E.Clementini et al./Data &Knowledge Engineering 34(2000)251±270257Cost l 56i 1op num gen l Y R i ÃPr R i XThe goal is to build a decision tree that minimizes the cost of computations.Therefore,the re-lations should be found with the smallest possible number of tests.Unfortunately the task of building decision trees with the smallest number of tests is known to be NP-complete [23].The problem of constructing decision trees has been widely analyzed in the areas of machine learning and data mining,where it is used in the classi®cation of data.The decision tree algorithms try to minimize the size of the resulting trees,because usually smaller trees enable 
better accuracy of classification [29]. In the process of determining predicates, smaller trees enable faster computations. In [7] the authors use decision trees to determine which topological relation is satisfied between two objects. Their algorithm uses patterns consisting of a single matrix that uniquely defines the relation between objects with crisp boundaries. This approach would therefore be difficult to apply to topological relations between objects with broad boundaries, because for some relations the template consists of a disjunction of 9-intersection matrices (see Table 1). For the purpose of building a decision tree we use the ID3 algorithm [30], which tries to build a tree with the minimal number of nodes needed to distinguish between different groups of objects. ID3 is a greedy algorithm that uses the values of the attributes and the data distribution to decide which attribute should be used to partition the dataset. This way we can optimize the cost of determining the topological relations based on the distribution of bottom-level relations. Such a distribution can be estimated from the results of previous queries or established using sampling techniques. Based on the information-gain measure, the algorithm chooses the intersection from the 9-intersection matrix that allows for the maximum separation of classes.

Table 1. Mapping of top-level topological relations to 9-intersection matrix patterns (rows of each matrix separated by "|"; d stands for any value, 0 or 1):

disjoint: (0 d d | d 0 d | d d d)
touch: (0 d d | d 1 d | d d d)
overlap: (1 d d | d d 1 | d 1 d), (1 d d | 1 d 1 | d d d), (1 1 d | d d d | d 1 d), (1 1 d | 1 d d | d d d)
in: (d 0 d | d d 0 | d d d), (d d d | 0 d d | d 0 d)

If there are p_i pairs of objects related by the topological relation R_i and there are p pairs of objects related by any topological relation, then an arbitrary relation between a pair of objects is R_i with probability p_i/p. When a decision tree is used to determine the relation between two objects, it can be regarded as a
source of messages for the R_i's, with the expected information needed to generate this message given by

I(p_1, p_2, …, p_m) = − Σ_{i=1}^{m} (p_i / p) log2(p_i / p).

If the value of an intersection C is used to partition the dataset of relations into two subsets C_0 and C_1, then the expected information for C is obtained as the following weighted average:

E(C) = ((p_10 + ⋯ + p_m0) / p) · I(p_10, …, p_m0) + ((p_11 + ⋯ + p_m1) / p) · I(p_11, …, p_m1),

where p_i0 and p_i1 are the numbers of object pairs that satisfy the relation R_i and have the value of the intersection C equal to 0 and 1, respectively. The information gained by branching on C is

gain(C) = I(p_1, p_2, …, p_m) − E(C).

Fig. 4. The clusters of the topological relations at the intermediate level.

ID3 examines all intersections and splits the decision tree on the intersection that maximizes the information gain. Fig. 5 presents a decision tree constructed in this way for the case when all 56 bottom-level relations are equally probable and the top-level relations are determined. When all bottom-level relations are equally probable the cost is 3.0 (the leaf nodes of the tree contain different numbers of bottom-level relations). We have implemented the decision-tree algorithm and tested it using different distributions of the topological relations. The tree presented in Fig. 5 can be used for any distribution of the topological relations, but when the distribution is not even its cost can be larger than that of other trees. For example, consider a distribution of topological relations in which most of the relations are in (in particular, 79.9% are relation 40 of Fig. 2), 10% are disjoint, 10% are overlap (relation 18), and the rest is distributed equally among the other 53 relations. In this case the tree from Fig. 5 has cost 3.7 (i.e., 0.799 × 4 + 0.1 × 2 + 0.1 × 3 + ⋯), while the decision tree built by the presented algorithm has cost 2.2. Such a distribution can happen when
[Table 2. Examples of the mapping of intermediate-level topological relations (nearlyInside, nearlyEqual, nearlyContains, nearlyMeet, coveredByBoundary, coversWithBoundary, boundaryOverlap) to 9-intersection matrix patterns; the matrices are omitted here.]

one type of objects is very large in size and a coarse-resolution filter (see the next section) is used to eliminate many disjoint relations.

A decision tree can be built for any level of the topological concept hierarchy. When the topological relations are determined in a sequence, starting from the top level and proceeding to lower levels, as in the mining algorithm presented in the next section, an optimization can take place. In such a case, separate decision trees can be constructed for the relations contained in the different leaf nodes of the decision tree for a higher level. For example, there are three leaf nodes in the tree from Fig. 5 that contain children of the relation in. Node a contains the intermediate-level relations nearlyInside and nearlyEqual; node b contains the relations nearlyContains and nearlyEqual; and node c contains the single relation nearlyEqual. Therefore, when the information obtained from the decision tree for the top-level relations is known and a user wants to determine intermediate-level relations, decision trees have to be built only for nodes a and b. Also, each of these decision trees would have to distinguish just between two

Table 3. The three-level hierarchy of topological relations:

Level | Relations | Total number of operators
Top level | disjoint, touch, overlap, in (and reverse) | 4
Intermediate level | disjoint, nearlyMeet, coveredByBoundary, coversWithBoundary, boundaryOverlap, nearlyOverlaps, interiorCoveredByInterior, interiorCoversInterior, partlyInside, partlyContains, crossContainment, nearlyInside, nearlyContains, nearlyEqual | 14
Bottom level (9-intersection) | 1 / 16 / 26 / 13 relation matrices (for disjoint / touch / overlap / in, respectively) | 56
intermediate-level relations instead of three. The optimized average cost of determining the children of the relation in is 1.54, while if one tries to determine the intermediate-level relations based only on the information that the top-level relation is in, the average cost is 2.69 (assuming an even distribution of the bottom-level relations in both cases).

Fig. 5. Example of a decision tree for top-level relations.

5. Spatial association rules

Spatial association rules represent spatial and non-spatial relations between objects. For example, the following is a spatial association rule:

is_a(X, resort) ∧ overlap(X, national_park) → is_expensive(X)  (30%, 80%)

This rule states that 80% of resorts that overlap national parks are expensive, and that 30% of resorts satisfy all predicates in the rule, i.e., they are expensive and they overlap national parks.
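The support/confidence semantics of this example can be sketched in Python (the toy data and field names are ours, for illustration only):

```python
# Sketch (toy data, illustrative names): support and confidence of the rule
#   is_a(X, resort) ∧ overlap(X, national_park) → is_expensive(X)
# over a small set of objects, following the semantics described above.

resorts = [
    {"overlaps_park": True,  "expensive": True},
    {"overlaps_park": True,  "expensive": True},
    {"overlaps_park": True,  "expensive": False},
    {"overlaps_park": False, "expensive": True},
]

def support(objects, pred):
    """Fraction of objects satisfying pred."""
    return sum(1 for o in objects if pred(o)) / len(objects)

body = lambda o: o["overlaps_park"]                      # P
both = lambda o: o["overlaps_park"] and o["expensive"]   # P ∧ Q

s = support(resorts, both)                               # support of the rule
c = support(resorts, both) / support(resorts, body)      # confidence

print(f"support = {s:.0%}, confidence = {c:.1%}")
```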
Various kinds of spatial predicates can be involved in spatial association rules. They may represent the topological relations described in the previous section. They may also represent spatial orientation or ordering, such as left, right, north, east, or contain some distance information, such as close_to, far_from, etc. These single predicates can be used in conjunction, providing more detailed descriptions of spatial objects. For a systematic study of the mining of spatial association rules, we first present some concepts.

Definition 4. A conjunction of k single predicates is called a k-predicate. The support of a k-predicate R = R_1 ∧ ⋯ ∧ R_k in a set of objects S, denoted σ(R / S), is the number of objects in S that satisfy R versus the cardinality (i.e., the total number of objects) of S. A spatial association rule is a rule of the form

P → Q  (s%, c%),

where s% is the support of the rule, c% is the confidence of the rule, P is a k-predicate, Q is an m-predicate, and at least one of the predicates forming P or Q is a spatial predicate. The support of the rule is the support of the conjunction of the predicates that form the rule (i.e., P ∧ Q). The confidence3 of the rule P → Q in S is the ratio of σ(P ∧ Q / S) to σ(P / S).

Most people are interested in the patterns that occur relatively frequently (i.e., with wide support) and in the rules that have strong implications (i.e., with high confidence). Rules with large values of support and high confidence are strong rules. To mine strong rules, two kinds of thresholds, minimum support and minimum confidence, are used.

3 The concept of the confidence of a rule provides an estimator of the conditional probability Pr(Q | P), i.e., the probability that the m-predicate Q is satisfied by a randomly chosen object that satisfies the k-predicate P. This concept should not be confused with the concept of a confidence interval used in statistics: a confidence interval is the interval into which, with a certain probability, the value of a statistical variable falls.
Definition 5. A k-predicate P is frequent in the set S at level l if the support of P is no less than its minimum support threshold for level l, and all ancestors of P in the concept hierarchy are frequent at their corresponding levels. The confidence of a rule P → Q is high at level l if its confidence is no less than its corresponding minimum confidence threshold. A rule P → Q is strong if the predicate P ∧ Q is frequent in the set S and the confidence of P → Q is high.

Although it is possible to find strong spatial association rules that use only the topological relations from the lowest level of the concept hierarchy proposed in [12], such an approach would be quite inefficient, because most intersection values from the 9-intersection model would have to be determined. Moreover, because many of these relations may characterize infrequently occurring configurations, they would not pass the support threshold, and therefore the results of these computations would not be presented to the user despite the computation time spent. In our approach the mining starts with a filtration process, which is based on a predicate describing the relation between the MBRs of the objects. Later the algorithm uses the topological relations from the top level of the hierarchy presented in Table 3. The predicates disjoint, touch, overlap, or in are found. Then the algorithm proceeds to the lower levels of the topological hierarchy. In this case only the children of the topological relations that are frequent are examined in detail. For example, if the touch relation characterizes only a small number of objects, then its children, such as nearlyMeet or boundaryOverlap, would not be computed, because they cannot satisfy the conditions for frequent predicates according to Definition 5. The mining process can be summarized in the following algorithm.

Algorithm 5.1. Mining multi-level spatial association rules for objects with a broad boundary.
Input:
1. A spatial database SDB containing a set of spatial objects with broad boundaries.
2. A set of spatial and non-spatial concept hierarchies.
3. Two thresholds for each level l of description: minimum support (min_support[l]) and minimum confidence (min_confidence[l]).
4. A mining query, which consists of:
4.1. a reference set S of objects to be described,
4.2. a set of task-relevant sets of spatial objects, and
4.3. a set of relevant predicates.
Output: A set of strong multi-level spatial association rules using topological relations between objects with broad boundaries.
Method:
1. Relevant_SDB := extract_task_relevant_objects(SDB);
2. MBR_Predicate_SDB := find_MBR_predicates(Relevant_SDB);
3. Predicate_SDB := filter_with_minimum_support(min_support[1], MBR_Predicate_SDB);
4. for (Level := 1; Level <= l_max && Predicate_SDB != ∅; Level++)
5. { Predicate_SDB := find_predicates(Level, Predicate_SDB, Relevant_SDB);
6.   Predicate_SDB := filter_with_minimum_support(min_support[Level], Predicate_SDB);
7.   mine_association_rules(Predicate_SDB); }
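The level-wise pruning at the heart of the algorithm (steps 4–6) can be sketched as follows; the hierarchy fragment, counts, and function names are illustrative assumptions, not the paper's code:

```python
# Sketch (illustrative, not the paper's implementation): the level-wise loop
# of the mining algorithm. Children of a topological relation are examined
# only if the parent passed the minimum-support filter, as in Definition 5.

HIERARCHY = {  # made-up fragment of the concept hierarchy of Table 3
    "touch": ["nearlyMeet", "boundaryOverlap"],
    "in": ["nearlyInside", "nearlyContains", "nearlyEqual"],
}

def mine_levels(counts, total, min_support):
    """counts: relation -> number of object pairs; returns frequent relations."""
    frequent = []
    level = [r for r in HIERARCHY]  # level 1: top-level relations
    while level:
        passed = [r for r in level if counts.get(r, 0) / total >= min_support]
        frequent.extend(passed)
        # level l+1: only children of frequent relations are computed at all
        level = [c for r in passed for c in HIERARCHY.get(r, [])]
    return frequent

counts = {"touch": 2, "in": 60, "nearlyInside": 40, "nearlyContains": 15,
          "nearlyEqual": 5}
print(mine_levels(counts, total=100, min_support=0.10))
```

Note that because touch fails the threshold, its children nearlyMeet and boundaryOverlap are never even counted, which is exactly the saving the text describes.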
Absolute deviation, 绝对离差Absolute number, 绝对数Absolute residuals, 绝对残差Acceleration array, 加速度立体阵Acceleration in an arbitrary direction, 任意方向上的加速度Acceleration normal, 法向加速度Acceleration space dimension, 加速度空间的维数Acceleration tangential, 切向加速度Acceleration vector, 加速度向量Acceptable hypothesis, 可接受假设Accumulation, 累积Accuracy, 准确度Actual frequency, 实际频数Adaptive estimator, 自适应估计量Addition, 相加Addition theorem, 加法定理Additivity, 可加性Adjusted rate, 调整率Adjusted value, 校正值Admissible error, 容许误差Aggregation, 聚集性Alternative hypothesis, 备择假设Among groups, 组间Amounts, 总量Analysis of correlation, 相关分析Analysis of covariance, 协方差分析Analysis of regression, 回归分析Analysis of time series, 时间序列分析Analysis of variance, 方差分析Angular transformation, 角转换ANOVA (analysis of variance), 方差分析ANOVA Models, 方差分析模型Arcing, 弧/弧旋Arcsine transformation, 反正弦变换Area under the curve, 曲线面积AREG , 评估从一个时间点到下一个时间点回归相关时的误差ARIMA, 季节和非季节性单变量模型的极大似然估计Arithmetic grid paper, 算术格纸Arithmetic mean, 算术平均数Arrhenius relation, 艾恩尼斯关系Assessing fit, 拟合的评估Associative laws, 结合律Asymmetric distribution, 非对称分布Asymptotic bias, 渐近偏倚Asymptotic efficiency, 渐近效率Asymptotic variance, 渐近方差Attributable risk, 归因危险度Attribute data, 属性资料Attribution, 属性Autocorrelation, 自相关Autocorrelation of residuals, 残差的自相关Average, 平均数Average confidence interval length, 平均置信区间长度Average growth rate, 平均增长率Bar chart, 条形图Bar graph, 条形图Base period, 基期Bayes' theorem , Bayes定理Bell-shaped curve, 钟形曲线Bernoulli distribution, 伯努力分布Best-trim estimator, 最好切尾估计量Bias, 偏性Binary logistic regression, 二元逻辑斯蒂回归Binomial distribution, 二项分布Bisquare, 双平方Bivariate Correlate, 二变量相关Bivariate normal distribution, 双变量正态分布Bivariate normal population, 双变量正态总体Biweight interval, 双权区间Biweight M-estimator, 双权M估计量Block, 区组/配伍组BMDP(Biomedical computer programs), BMDP统计软件包Boxplots, 箱线图/箱尾图Breakdown bound, 崩溃界/崩溃点Canonical correlation, 典型相关Caption, 纵标目Case-control study, 病例对照研究Categorical variable, 分类变量Catenary, 悬链线Cauchy distribution, 柯西分布Cause-and-effect relationship, 因果关系Cell, 单元Censoring, 终检Center of symmetry, 
对称中心Centering and scaling, 中心化和定标Central tendency, 集中趋势Central value, 中心值CHAID -χ2 Automatic Interaction Detector, 卡方自动交互检测Chance, 机遇Chance error, 随机误差Chance variable, 随机变量Characteristic equation, 特征方程Characteristic root, 特征根Characteristic vector, 特征向量Chebshev criterion of fit, 拟合的切比雪夫准则Chernoff faces, 切尔诺夫脸谱图Chi-square test, 卡方检验/χ2检验Choleskey decomposition, 乔洛斯基分解Circle chart, 圆图Class interval, 组距Class mid-value, 组中值Class upper limit, 组上限Classified variable, 分类变量Cluster analysis, 聚类分析Cluster sampling, 整群抽样Code, 代码Coded data, 编码数据Coding, 编码Coefficient of contingency, 列联系数Coefficient of determination, 决定系数Coefficient of multiple correlation, 多重相关系数Coefficient of partial correlation, 偏相关系数Coefficient of production-moment correlation, 积差相关系数Coefficient of rank correlation, 等级相关系数Coefficient of regression, 回归系数Coefficient of skewness, 偏度系数Coefficient of variation, 变异系数Cohort study, 队列研究Column, 列Column effect, 列效应Column factor, 列因素Combination pool, 合并Combinative table, 组合表Common factor, 共性因子Common regression coefficient, 公共回归系数Common value, 共同值Common variance, 公共方差Common variation, 公共变异Communality variance, 共性方差Comparability, 可比性Comparison of bathes, 批比较Comparison value, 比较值Compartment model, 分部模型Compassion, 伸缩Complement of an event, 补事件Complete association, 完全正相关Complete dissociation, 完全不相关Complete statistics, 完备统计量Completely randomized design, 完全随机化设计Composite event, 联合事件Composite events, 复合事件Concavity, 凹性Conditional expectation, 条件期望Conditional likelihood, 条件似然Conditional probability, 条件概率Conditionally linear, 依条件线性Confidence interval, 置信区间Confidence limit, 置信限Confidence lower limit, 置信下限Confidence upper limit, 置信上限Confirmatory Factor Analysis , 验证性因子分析Confirmatory research, 证实性实验研究Confounding factor, 混杂因素Conjoint, 联合分析Consistency, 相合性Consistency check, 一致性检验Consistent asymptotically normal estimate, 相合渐近正态估计Consistent estimate, 相合估计Constrained nonlinear regression, 受约束非线性回归Constraint, 约束Contaminated distribution, 污染分布Contaminated Gausssian, 污染高斯分布Contaminated 
normal distribution, 污染正态分布Contamination, 污染Contamination model, 污染模型Contingency table, 列联表Contour, 边界线Contribution rate, 贡献率Control, 对照Controlled experiments, 对照实验Conventional depth, 常规深度Convolution, 卷积Corrected factor, 校正因子Corrected mean, 校正均值Correction coefficient, 校正系数Correctness, 正确性Correlation coefficient, 相关系数Correlation index, 相关指数Correspondence, 对应Counting, 计数Counts, 计数/频数Covariance, 协方差Covariant, 共变Cox Regression, Cox回归Criteria for fitting, 拟合准则Criteria of least squares, 最小二乘准则Critical ratio, 临界比Critical region, 拒绝域Critical value, 临界值Cross-over design, 交叉设计Cross-section analysis, 横断面分析Cross-section survey, 横断面调查Crosstabs , 交叉表Cross-tabulation table, 复合表Cube root, 立方根Cumulative distribution function, 分布函数Cumulative probability, 累计概率Curvature, 曲率/弯曲Curvature, 曲率Curve fit , 曲线拟和Curve fitting, 曲线拟合Curvilinear regression, 曲线回归Curvilinear relation, 曲线关系Cut-and-try method, 尝试法Cycle, 周期Cyclist, 周期性D test, D检验Data acquisition, 资料收集Data bank, 数据库Data capacity, 数据容量Data deficiencies, 数据缺乏Data handling, 数据处理Data manipulation, 数据处理Data processing, 数据处理Data reduction, 数据缩减Data set, 数据集Data sources, 数据来源Data transformation, 数据变换Data validity, 数据有效性Data-in, 数据输入Data-out, 数据输出Dead time, 停滞期Degree of freedom, 自由度Degree of precision, 精密度Degree of reliability, 可靠性程度Degression, 递减Density function, 密度函数Density of data points, 数据点的密度Dependent variable, 应变量/依变量/因变量Dependent variable, 因变量Depth, 深度Derivative matrix, 导数矩阵Derivative-free methods, 无导数方法Design, 设计Determinacy, 确定性Determinant, 行列式Determinant, 决定因素Deviation, 离差Deviation from average, 离均差Diagnostic plot, 诊断图Dichotomous variable, 二分变量Differential equation, 微分方程Direct standardization, 直接标准化法Discrete variable, 离散型变量DISCRIMINANT, 判断Discriminant analysis, 判别分析Discriminant coefficient, 判别系数Discriminant function, 判别值Dispersion, 散布/分散度Disproportional, 不成比例的Disproportionate sub-class numbers, 不成比例次级组含量Distribution free, 分布无关性/免分布Distribution shape, 分布形状Distribution-free method, 任意分布法Distributive laws, 分配律Disturbance, 随机扰动项Dose 
response curve, 剂量反应曲线Double blind method, 双盲法Double blind trial, 双盲试验Double exponential distribution, 双指数分布Double logarithmic, 双对数Downward rank, 降秩Dual-space plot, 对偶空间图DUD, 无导数方法Duncan's new multiple range method, 新复极差法/Duncan新法E-LEffect, 实验效应Eigenvalue, 特征值Eigenvector, 特征向量Ellipse, 椭圆Empirical distribution, 经验分布Empirical probability, 经验概率单位Enumeration data, 计数资料Equal sun-class number, 相等次级组含量Equally likely, 等可能Equivariance, 同变性Error, 误差/错误Error of estimate, 估计误差Error type I, 第一类错误Error type II, 第二类错误Estimand, 被估量Estimated error mean squares, 估计误差均方Estimated error sum of squares, 估计误差平方和Euclidean distance, 欧式距离Event, 事件Event, 事件Exceptional data point, 异常数据点Expectation plane, 期望平面Expectation surface, 期望曲面Expected values, 期望值Experiment, 实验Experimental sampling, 试验抽样Experimental unit, 试验单位Explanatory variable, 说明变量Exploratory data analysis, 探索性数据分析Explore Summarize, 探索-摘要Exponential curve, 指数曲线Exponential growth, 指数式增长EXSMOOTH, 指数平滑方法Extended fit, 扩充拟合Extra parameter, 附加参数Extrapolation, 外推法Extreme observation, 末端观测值Extremes, 极端值/极值F distribution, F分布F test, F检验Factor, 因素/因子Factor analysis, 因子分析Factor Analysis, 因子分析Factor score, 因子得分Factorial, 阶乘Factorial design, 析因试验设计False negative, 假阴性False negative error, 假阴性错误Family of distributions, 分布族Family of estimators, 估计量族Fanning, 扇面Fatality rate, 病死率Field investigation, 现场调查Field survey, 现场调查Finite population, 有限总体Finite-sample, 有限样本First derivative, 一阶导数First principal component, 第一主成分First quartile, 第一四分位数Fisher information, 费雪信息量Fitted value, 拟合值Fitting a curve, 曲线拟合Fixed base, 定基Fluctuation, 随机起伏Forecast, 预测Four fold table, 四格表Fourth, 四分点Fraction blow, 左侧比率Fractional error, 相对误差Frequency, 频率Frequency polygon, 频数多边图Frontier point, 界限点Function relationship, 泛函关系Gamma distribution, 伽玛分布Gauss increment, 高斯增量Gaussian distribution, 高斯分布/正态分布Gauss-Newton increment, 高斯-牛顿增量General census, 全面普查GENLOG (Generalized liner models), 广义线性模型Geometric mean, 几何平均数Gini's mean difference, 基尼均差GLM (General liner models), 通用线性模型Goodness 
of fit, 拟和优度/配合度Gradient of determinant, 行列式的梯度Graeco-Latin square, 希腊拉丁方Grand mean, 总均值Gross errors, 重大错误Gross-error sensitivity, 大错敏感度Group averages, 分组平均Grouped data, 分组资料Guessed mean, 假定平均数Half-life, 半衰期Hampel M-estimators, 汉佩尔M估计量Happenstance, 偶然事件Harmonic mean, 调和均数Hazard function, 风险均数Hazard rate, 风险率Heading, 标目Heavy-tailed distribution, 重尾分布Hessian array, 海森立体阵Heterogeneity, 不同质Heterogeneity of variance, 方差不齐Hierarchical classification, 组内分组Hierarchical clustering method, 系统聚类法High-leverage point, 高杠杆率点HILOGLINEAR, 多维列联表的层次对数线性模型Hinge, 折叶点Histogram, 直方图Historical cohort study, 历史性队列研究Holes, 空洞HOMALS, 多重响应分析Homogeneity of variance, 方差齐性Homogeneity test, 齐性检验Huber M-estimators, 休伯M估计量Hyperbola, 双曲线Hypothesis testing, 假设检验Hypothetical universe, 假设总体Impossible event, 不可能事件Independence, 独立性Independent variable, 自变量Index, 指标/指数Indirect standardization, 间接标准化法Individual, 个体Inference band, 推断带Infinite population, 无限总体Infinitely great, 无穷大Infinitely small, 无穷小Influence curve, 影响曲线Information capacity, 信息容量Initial condition, 初始条件Initial estimate, 初始估计值Initial level, 最初水平Interaction, 交互作用Interaction terms, 交互作用项Intercept, 截距Interpolation, 内插法Interquartile range, 四分位距Interval estimation, 区间估计Intervals of equal probability, 等概率区间Intrinsic curvature, 固有曲率Invariance, 不变性Inverse matrix, 逆矩阵Inverse probability, 逆概率Inverse sine transformation, 反正弦变换Iteration, 迭代Jacobian determinant, 雅可比行列式Joint distribution function, 分布函数Joint probability, 联合概率Joint probability distribution, 联合概率分布K means method, 逐步聚类法Kaplan-Meier, 评估事件的时间长度Kaplan-Merier chart, Kaplan-Merier图Kendall's rank correlation, Kendall等级相关Kinetic, 动力学Kolmogorov-Smirnove test, 柯尔莫哥洛夫-斯米尔诺夫检验Kruskal and Wallis test, Kruskal及Wallis检验/多样本的秩和检验/H检验Kurtosis, 峰度Lack of fit, 失拟Ladder of powers, 幂阶梯Lag, 滞后Large sample, 大样本Large sample test, 大样本检验Latin square, 拉丁方Latin square design, 拉丁方设计Leakage, 泄漏Least favorable configuration, 最不利构形Least favorable distribution, 最不利分布Least significant difference, 最小显著差法Least square method, 
最小二乘法Least-absolute-residuals estimates, 最小绝对残差估计Least-absolute-residuals fit, 最小绝对残差拟合Least-absolute-residuals line, 最小绝对残差线Legend, 图例L-estimator, L估计量L-estimator of location, 位置L估计量L-estimator of scale, 尺度L估计量Level, 水平Life expectance, 预期期望寿命Life table, 寿命表Life table method, 生命表法Light-tailed distribution, 轻尾分布Likelihood function, 似然函数Likelihood ratio, 似然比line graph, 线图Linear correlation, 直线相关Linear equation, 线性方程Linear programming, 线性规划Linear regression, 直线回归Linear Regression, 线性回归Linear trend, 线性趋势Loading, 载荷Location and scale equivariance, 位置尺度同变性Location equivariance, 位置同变性Location invariance, 位置不变性Location scale family, 位置尺度族Log rank test, 时序检验Logarithmic curve, 对数曲线Logarithmic normal distribution, 对数正态分布Logarithmic scale, 对数尺度Logarithmic transformation, 对数变换Logic check, 逻辑检查Logistic distribution, 逻辑斯特分布Logit transformation, Logit转换LOGLINEAR, 多维列联表通用模型Lognormal distribution, 对数正态分布Lost function, 损失函数Low correlation, 低度相关Lower limit, 下限Lowest-attained variance, 最小可达方差LSD, 最小显著差法的简称Lurking variable, 潜在变量M-RMain effect, 主效应Major heading, 主辞标目Marginal density function, 边缘密度函数Marginal probability, 边缘概率Marginal probability distribution, 边缘概率分布Matched data, 配对资料Matched distribution, 匹配过分布Matching of distribution, 分布的匹配Matching of transformation, 变换的匹配Mathematical expectation, 数学期望Mathematical model, 数学模型Maximum L-estimator, 极大极小L 估计量Maximum likelihood method, 最大似然法Mean, 均数Mean squares between groups, 组间均方Mean squares within group, 组内均方Means (Compare means), 均值-均值比较Median, 中位数Median effective dose, 半数效量Median lethal dose, 半数致死量Median polish, 中位数平滑Median test, 中位数检验Minimal sufficient statistic, 最小充分统计量Minimum distance estimation, 最小距离估计Minimum effective dose, 最小有效量Minimum lethal dose, 最小致死量Minimum variance estimator, 最小方差估计量MINITAB, 统计软件包Minor heading, 宾词标目Missing data, 缺失值Model specification, 模型的确定Modeling Statistics , 模型统计Models for outliers, 离群值模型Modifying the model, 模型的修正Modulus of continuity, 连续性模Morbidity, 发病率Most favorable configuration, 最有利构形Multidimensional 
Scaling (ASCAL), 多维尺度/多维标度Multinomial Logistic Regression , 多项逻辑斯蒂回归Multiple comparison, 多重比较Multiple correlation , 复相关Multiple covariance, 多元协方差Multiple linear regression, 多元线性回归Multiple response , 多重选项Multiple solutions, 多解Multiplication theorem, 乘法定理Multiresponse, 多元响应Multi-stage sampling, 多阶段抽样Multivariate T distribution, 多元T分布Mutual exclusive, 互不相容Mutual independence, 互相独立Natural boundary, 自然边界Natural dead, 自然死亡Natural zero, 自然零Negative correlation, 负相关Negative linear correlation, 负线性相关Negatively skewed, 负偏Newman-Keuls method, q检验NK method, q检验No statistical significance, 无统计意义Nominal variable, 名义变量Nonconstancy of variability, 变异的非定常性Nonlinear regression, 非线性相关Nonparametric statistics, 非参数统计Nonparametric test, 非参数检验Nonparametric tests, 非参数检验Normal deviate, 正态离差Normal distribution, 正态分布Normal equation, 正规方程组Normal ranges, 正常范围Normal value, 正常值Nuisance parameter, 多余参数/讨厌参数Null hypothesis, 无效假设Numerical variable, 数值变量Objective function, 目标函数Observation unit, 观察单位Observed value, 观察值One sided test, 单侧检验One-way analysis of variance, 单因素方差分析Oneway ANOVA , 单因素方差分析Open sequential trial, 开放型序贯设计Optrim, 优切尾Optrim efficiency, 优切尾效率Order statistics, 顺序统计量Ordered categories, 有序分类Ordinal logistic regression , 序数逻辑斯蒂回归Ordinal variable, 有序变量Orthogonal basis, 正交基Orthogonal design, 正交试验设计Orthogonality conditions, 正交条件ORTHOPLAN, 正交设计Outlier cutoffs, 离群值截断点Outliers, 极端值OVERALS , 多组变量的非线性正规相关Overshoot, 迭代过度Paired design, 配对设计Paired sample, 配对样本Pairwise slopes, 成对斜率Parabola, 抛物线Parallel tests, 平行试验Parameter, 参数Parametric statistics, 参数统计Parametric test, 参数检验Partial correlation, 偏相关Partial regression, 偏回归Partial sorting, 偏排序Partials residuals, 偏残差Pattern, 模式Pearson curves, 皮尔逊曲线Peeling, 退层Percent bar graph, 百分条形图Percentage, 百分比Percentile, 百分位数Percentile curves, 百分位曲线Periodicity, 周期性Permutation, 排列P-estimator, P估计量Pie graph, 饼图Pitman estimator, 皮特曼估计量Pivot, 枢轴量Planar, 平坦Planar assumption, 平面的假设PLANCARDS, 生成试验的计划卡Point estimation, 点估计Poisson distribution, 泊松分布Polishing, 平滑Polled 
standard deviation, 合并标准差Polled variance, 合并方差Polygon, 多边图Polynomial, 多项式Polynomial curve, 多项式曲线Population, 总体Population attributable risk, 人群归因危险度Positive correlation, 正相关Positively skewed, 正偏Posterior distribution, 后验分布Power of a test, 检验效能Precision, 精密度Predicted value, 预测值Preliminary analysis, 预备性分析Principal component analysis, 主成分分析Prior distribution, 先验分布Prior probability, 先验概率Probabilistic model, 概率模型probability, 概率Probability density, 概率密度Product moment, 乘积矩/协方差Profile trace, 截面迹图Proportion, 比/构成比Proportion allocation in stratified randomsampling, 按比例分层随机抽样Proportionate, 成比例Proportionate sub-class numbers, 成比例次级组含量Prospective study, 前瞻性调查Proximities, 亲近性Pseudo F test, 近似F检验Pseudo model, 近似模型Pseudosigma, 伪标准差Purposive sampling, 有目的抽样QR decomposition, QR分解Quadratic approximation, 二次近似Qualitative classification, 属性分类Qualitative method, 定性方法Quantile-quantile plot, 分位数-分位数图/Q-Q 图Quantitative analysis, 定量分析Quartile, 四分位数Quick Cluster, 快速聚类Radix sort, 基数排序Random allocation, 随机化分组Random blocks design, 随机区组设计Random event, 随机事件Randomization, 随机化Range, 极差/全距Rank correlation, 等级相关Rank sum test, 秩和检验Rank test, 秩检验Ranked data, 等级资料Rate, 比率Ratio, 比例Raw data, 原始资料Raw residual, 原始残差Rayleigh's test, 雷氏检验Rayleigh's Z, 雷氏Z值Reciprocal, 倒数Reciprocal transformation, 倒数变换Recording, 记录Redescending estimators, 回降估计量Reducing dimensions, 降维Re-expression, 重新表达Reference set, 标准组Region of acceptance, 接受域Regression coefficient, 回归系数Regression sum of square, 回归平方和Rejection point, 拒绝点Relative dispersion, 相对离散度Relative number, 相对数Reliability, 可靠性Reparametrization, 重新设置参数Replication, 重复Report Summaries, 报告摘要Residual sum of square, 剩余平方和Resistance, 耐抗性Resistant line, 耐抗线Resistant technique, 耐抗技术R-estimator of location, 位置R估计量R-estimator of scale, 尺度R估计量Retrospective study, 回顾性调查Ridge trace, 岭迹Ridit analysis, Ridit分析Rotation, 旋转Rounding, 舍入Row, 行Row effects, 行效应Row factor, 行因素RXC table, RXC表S-ZSample, 样本Sample regression coefficient, 样本回归系数Sample size, 样本量Sample standard deviation, 样本标准差Sampling 
error, 抽样误差SAS(Statistical analysis system ), SAS统计软件包Scale, 尺度/量表Scatter diagram, 散点图Schematic plot, 示意图/简图Score test, 计分检验Screening, 筛检SEASON, 季节分析Second derivative, 二阶导数Second principal component, 第二主成分SEM (Structural equation modeling), 结构化方程模型Semi-logarithmic graph, 半对数图Semi-logarithmic paper, 半对数格纸Sensitivity curve, 敏感度曲线Sequential analysis, 贯序分析Sequential data set, 顺序数据集Sequential design, 贯序设计Sequential method, 贯序法Sequential test, 贯序检验法Serial tests, 系列试验Short-cut method, 简捷法Sigmoid curve, S形曲线Sign function, 正负号函数Sign test, 符号检验Signed rank, 符号秩Significance test, 显著性检验Significant figure, 有效数字Simple cluster sampling, 简单整群抽样Simple correlation, 简单相关Simple random sampling, 简单随机抽样Simple regression, 简单回归simple table, 简单表Sine estimator, 正弦估计量Single-valued estimate, 单值估计Singular matrix, 奇异矩阵Skewed distribution, 偏斜分布Skewness, 偏度Slash distribution, 斜线分布Slope, 斜率Smirnov test, 斯米尔诺夫检验Source of variation, 变异来源Spearman rank correlation, 斯皮尔曼等级相关Specific factor, 特殊因子Specific factor variance, 特殊因子方差Spectra , 频谱Spherical distribution, 球型正态分布Spread, 展布SPSS(Statistical package for the social science), SPSS统计软件包Spurious correlation, 假性相关Square root transformation, 平方根变换Stabilizing variance, 稳定方差Standard deviation, 标准差Standard error, 标准误Standard error of difference, 差别的标准误Standard error of estimate, 标准估计误差Standard error of rate, 率的标准误Standard normal distribution, 标准正态分布Standardization, 标准化Starting value, 起始值Statistic, 统计量Statistical control, 统计控制Statistical graph, 统计图Statistical inference, 统计推断Statistical table, 统计表Steepest descent, 最速下降法Stem and leaf display, 茎叶图Step factor, 步长因子Stepwise regression, 逐步回归Storage, 存Strata, 层(复数)Stratified sampling, 分层抽样Stratified sampling, 分层抽样Strength, 强度Stringency, 严密性Structural relationship, 结构关系Studentized residual, 学生化残差/t化残差Sub-class numbers, 次级组含量Subdividing, 分割Sufficient statistic, 充分统计量Sum of products, 积和Sum of squares, 离差平方和Sum of squares about regression, 回归平方和Sum of squares between groups, 组间平方和Sum of squares of partial regression, 
偏回归平方和Sure event, 必然事件Survey, 调查Survival, 生存分析Survival rate, 生存率Suspended root gram, 悬吊根图Symmetry, 对称Systematic error, 系统误差Systematic sampling, 系统抽样Tags, 标签Tail area, 尾部面积Tail length, 尾长Tail weight, 尾重Tangent line, 切线Target distribution, 目标分布Taylor series, 泰勒级数Tendency of dispersion, 离散趋势Testing of hypotheses, 假设检验Theoretical frequency, 理论频数Time series, 时间序列Tolerance interval, 容忍区间Tolerance lower limit, 容忍下限Tolerance upper limit, 容忍上限Torsion, 扰率Total sum of square, 总平方和Total variation, 总变异Transformation, 转换Treatment, 处理Trend, 趋势Trend of percentage, 百分比趋势Trial, 试验Trial and error method, 试错法Tuning constant, 细调常数Two sided test, 双向检验Two-stage least squares, 二阶最小平方Two-stage sampling, 二阶段抽样Two-tailed test, 双侧检验Two-way analysis of variance, 双因素方差分析Two-way table, 双向表Type I error, 一类错误/α错误Type II error, 二类错误/β错误UMVU, 方差一致最小无偏估计简称Unbiased estimate, 无偏估计Unconstrained nonlinear regression , 无约束非线性回归Unequal subclass number, 不等次级组含量Ungrouped data, 不分组资料Uniform coordinate, 均匀坐标Uniform distribution, 均匀分布Uniformly minimum variance unbiased estimate, 方差一致最小无偏估计Unit, 单元Unordered categories, 无序分类Upper limit, 上限Upward rank, 升秩Vague concept, 模糊概念Validity, 有效性VARCOMP (Variance component estimation), 方差元素估计Variability, 变异性Variable, 变量Variance, 方差Variation, 变异Varimax orthogonal rotation, 方差最大正交旋转Volume of distribution, 容积W test, W检验Weibull distribution, 威布尔分布Weight, 权数Weighted Chi-square test, 加权卡方检验/Cochran检验Weighted linear regression method, 加权直线回归Weighted mean, 加权平均数Weighted mean square, 加权平均方差Weighted sum of square, 加权平方和Weighting coefficient, 权重系数Weighting method, 加权法W-estimation, W估计量W-estimation of location, 位置W估计量Width, 宽度Wilcoxon paired test, 威斯康星配对法/配对符号秩和检验Wild point, 野点/狂点Wild value, 野值/狂值Winsorized mean, 缩尾均值Withdraw, 失访Youden's index, 尤登指数Z test, Z检验Zero correlation, 零相关Z-transformation, Z变换。
自然语言处理(Natural Language Processing, NLP)是人工智能领域的一个重要分支,它主要研究如何使计算机能够理解、处理和生成自然语言。
语义相似度计算是NLP领域中的一个重要问题,它主要用于衡量两个句子或短语之间的语义相似程度。
在本文中,我们将讨论自然语言处理中常见的语义相似度计算评估指标。
语义相似度计算评估指标是用来评价不同模型对于语义相似度计算问题的性能表现的指标。
在实际应用中,我们需要使用这些指标来评估不同模型在语义相似度计算任务上的表现,并选择最合适的模型。
Commonly used evaluation metrics include the Pearson correlation coefficient, the Spearman correlation coefficient, the Kendall correlation coefficient, and mean squared error (MSE).
The Pearson correlation coefficient measures the degree of linear correlation between two variables and takes values between -1 and 1.
A value near 1 indicates a positive linear relationship, a value near -1 a negative one, and a value near 0 the absence of any linear relationship.
In semantic similarity evaluation, the Pearson coefficient measures the linear agreement between the similarity scores produced by a model and the human-annotated gold scores.
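As a concrete illustration, the Pearson coefficient can be computed directly from paired score lists. The sketch below uses hypothetical model scores and human annotations; the numbers are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: model similarity scores vs. human-annotated gold scores.
model_scores = [0.9, 0.7, 0.4, 0.2, 0.1]
human_scores = [4.8, 4.0, 2.5, 1.2, 0.6]
r = pearson(model_scores, human_scores)
```

A model whose scores track the annotations almost linearly, as here, yields a coefficient close to 1.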
The Spearman correlation coefficient is a non-parametric statistic that measures the strength of the monotonic relationship between two variables.
Like the Pearson coefficient, it takes values between -1 and 1.
In semantic similarity evaluation, the Spearman coefficient measures the monotonic agreement between model scores and human-annotated scores, i.e., whether the model ranks sentence pairs in the same order as the annotators.
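Spearman's rho is the Pearson coefficient applied to ranks, with ties given their average rank. A minimal sketch, again on invented scores:

```python
def average_ranks(values):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores: a monotone but non-linear relationship still gets rho = 1.
model = [0.1, 0.2, 0.5, 0.9]
human = [1.0, 2.0, 10.0, 100.0]
rho = spearman(model, human)
```

This illustrates the difference from Pearson: the relationship above is far from linear, yet the ordering agrees perfectly, so rho is exactly 1.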
The Kendall correlation coefficient measures the degree of rank agreement between two variables and likewise takes values between -1 and 1.
In semantic similarity evaluation, the Kendall coefficient measures the rank agreement between model scores and human-annotated scores.
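A minimal sketch of Kendall's tau (the tau-a variant, which assumes no ties; the data are invented):

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1   # the pair is ordered the same way in both lists
            elif s < 0:
                discordant += 1   # the pair is ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical example: one swapped pair among four items.
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
```

With one of the six item pairs discordant, tau is (5 - 1)/6.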
Mean squared error (MSE) measures the discrepancy between model outputs and ground-truth values: square the difference between each sample's prediction and its true value, then average the squared differences over all samples.
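A minimal MSE sketch on hypothetical predicted and gold scores:

```python
def mse(predictions, targets):
    """Mean squared error between predicted and gold similarity scores."""
    assert len(predictions) == len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Hypothetical predicted vs. gold scores on a 0-5 similarity scale.
error = mse([4.5, 2.0, 1.0], [5.0, 2.0, 0.0])
```

Unlike the three correlation coefficients, MSE is sensitive to the absolute scale of the scores, not just their ordering.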
Common Evaluation Metrics for Semantic Similarity Computation in Natural Language Processing

I. Introduction

Natural Language Processing (NLP) is an important branch of artificial intelligence, and semantic similarity computation is one of its core research problems.
Semantic similarity computation quantifies the degree of semantic similarity between two sentences or words; it is an important problem in NLP and the foundation of many NLP tasks.
To measure semantic similarity accurately, researchers have proposed a variety of evaluation metrics.
This article introduces and analyzes the metrics commonly used for semantic similarity computation in NLP.
II. Word-Vector-Based Metrics

1. Cosine similarity

Cosine similarity is the most basic similarity measure; it captures how closely the directions of two vectors agree.
In NLP, a word can be represented as an n-dimensional word vector, and the semantic similarity of two words can then be evaluated as the cosine similarity of their word vectors.
In general, cosine similarity takes values between -1 and 1; the closer the value is to 1, the more similar the two words are in meaning.
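A minimal sketch of cosine similarity over toy 4-dimensional word vectors; the vectors and vocabulary are invented for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 4-dimensional word vectors.
v_cat = [0.8, 0.1, 0.3, 0.0]
v_dog = [0.7, 0.2, 0.4, 0.1]
v_car = [0.0, 0.9, 0.1, 0.8]

sim_animals = cosine_similarity(v_cat, v_dog)  # similar directions, high score
sim_mixed = cosine_similarity(v_cat, v_car)    # dissimilar directions, low score
```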
2. Euclidean distance

Euclidean distance is another common measure; it quantifies the distance between two vectors.
In NLP, the Euclidean distance between word vectors can be used to evaluate the semantic similarity of two words.
Unlike cosine similarity, Euclidean distance ranges from 0 to positive infinity; the smaller the value, the more similar the two words are in meaning.
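The same kind of toy vectors can be compared by Euclidean distance; note the reversed convention, where smaller means more similar:

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two word vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical word vectors (invented for illustration).
v_cat = [0.8, 0.1, 0.3, 0.0]
v_dog = [0.7, 0.2, 0.4, 0.1]
v_car = [0.0, 0.9, 0.1, 0.8]

d_animals = euclidean_distance(v_cat, v_dog)
d_mixed = euclidean_distance(v_cat, v_car)
```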
III. Semantic-Network-Based Metrics

1. Word-sense similarity

Word-sense similarity is a semantic-network-based measure: it evaluates the similarity of two words from how close they are within a semantic network.
In NLP, commonly used semantic networks include WordNet and ConceptNet.
The computation can draw on the words' positions in the network hierarchy, their relatedness, and the semantic paths between them; to a certain extent, this yields fairly accurate estimates of the semantic similarity between words.
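A path-based word-sense similarity can be sketched with a toy is-a hierarchy standing in for a real semantic network such as WordNet. The hierarchy and the 1/(1 + path length) scoring below are illustrative assumptions, not any specific WordNet measure:

```python
# Toy is-a hierarchy standing in for a real semantic network such as WordNet.
parent = {
    "dog": "canine", "wolf": "canine", "canine": "animal",
    "cat": "feline", "feline": "animal", "animal": "entity",
}

def path_to_root(word):
    """The chain of ancestors from a word up to the hierarchy root."""
    path = [word]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def path_similarity(w1, w2):
    """1 / (1 + shortest path length through the lowest common ancestor)."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    for i, node in enumerate(p1):
        if node in p2:
            return 1.0 / (1 + i + p2.index(node))
    return 0.0

sim_dog_wolf = path_similarity("dog", "wolf")  # share the close ancestor "canine"
sim_dog_cat = path_similarity("dog", "cat")    # only share the distant "animal"
```

Words that meet low in the hierarchy ("dog" and "wolf" under "canine") score higher than words that only meet near the root.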
2. Information-retrieval models

Information-retrieval models are another semantic-network-based approach: they evaluate the similarity of two words from their relatedness in the network.
In NLP, such models are frequently applied to text-similarity computation and recommender systems.
Because they can jointly weigh the relatedness and the weights of words in the semantic network, they too can estimate the semantic similarity between words fairly accurately.
Bilingual Terminology Extraction Algorithms

Introduction: As globalization deepens, the demand for multilingual information processing grows ever more urgent.
Against this background, bilingual terminology extraction has become an active research topic.
Extracting the terminology shared by two languages helps people understand the connections between them, improving the efficiency of cross-language communication and information processing.
I. Definition

A bilingual terminology extraction algorithm analyzes and processes text in two languages and extracts the terms the two languages share.
These terms are specialized vocabulary in wide use within particular domains, and they are essential for understanding domain texts in depth.
II. Main Steps

1. Data preprocessing

Before extraction, the text data in both languages must be preprocessed.
This includes removing punctuation, stop words, and other irrelevant material, followed by tokenization and part-of-speech tagging.
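A minimal English-side preprocessing sketch; the stop-word list is a tiny illustrative stand-in for a real one:

```python
import re

# Hypothetical stop-word list; real systems would use a full per-language list.
EN_STOPWORDS = {"the", "of", "a", "is", "in", "and"}

def preprocess_en(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [t for t in text.split() if t not in EN_STOPWORDS]

tokens = preprocess_en("The extraction of bilingual terminology is important.")
```

The Chinese side would additionally require a word segmenter, since Chinese text is not whitespace-delimited.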
2. Candidate extraction

Next, statistical measures such as term frequency and mutual information are applied to the preprocessed text to extract term candidates.
These candidates are potential terms that still require filtering and validation.
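Candidate scoring by pointwise mutual information (PMI) over adjacent word pairs can be sketched as follows; the token stream and the min_count threshold are illustrative assumptions:

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(a, b) = log( p(a, b) / (p(a) p(b)) ); a high score means the pair
    co-occurs far more often than chance, suggesting a term candidate.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # drop rare pairs, whose PMI estimates are unreliable
        p_ab = c / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

# Hypothetical token stream in which "neural network" recurs as a unit.
tokens = ("neural network training uses neural network models "
          "while other training uses other models").split()
scores = pmi_bigrams(tokens)
```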
3. Filtering and validation

Some candidates are not genuine terms, so the candidate set must be filtered and validated.
Common approaches are term-recognition algorithms based on part-of-speech, word-sense, and contextual features.
These algorithms help determine which candidates are genuine terms.
4. Bilingual term alignment

Once the terms in each language have been identified, they must be aligned across the two languages.
Comparing the similarity of terms in the two languages reveals their correspondences.
This step typically relies on tools such as bilingual dictionaries and translation models.
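One simple dictionary-assisted alignment heuristic: match an English multi-word term to the Chinese term that covers the most word-by-word seed-dictionary translations. The seed dictionary and term lists below are invented for illustration:

```python
# Hypothetical seed dictionary mapping English words to Chinese words.
seed_dict = {"machine": "机器", "learning": "学习", "neural": "神经", "network": "网络"}

def align_terms(en_terms, zh_terms):
    """Pair each English term with the Chinese term covering the most
    word-by-word dictionary translations of its component words."""
    pairs = []
    for en in en_terms:
        best, best_hits = None, 0
        for zh in zh_terms:
            hits = sum(1 for w in en.split()
                       if seed_dict.get(w, "") and seed_dict[w] in zh)
            if hits > best_hits:
                best, best_hits = zh, hits
        if best is not None:
            pairs.append((en, best))
    return pairs

pairs = align_terms(["machine learning", "neural network"], ["神经网络", "机器学习"])
```

Real systems would combine such lexical overlap with translation-model probabilities and co-occurrence statistics rather than rely on it alone.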
5. Evaluation and refinement

Finally, the extracted bilingual terms must be evaluated and the algorithm refined.
Expert judgment and domain knowledge can be used to check whether the extracted terms are correct and complete, and the algorithm can be improved accordingly.
III. Application Areas

Bilingual terminology extraction is applied widely.
In machine translation, extracting terms from the source and target languages helps improve translation quality.
In natural language processing, it supports the construction of bilingual dictionaries, term banks, and similar resources that provide a foundation for other tasks.
It also plays an important role in cross-language information retrieval, knowledge-graph construction, and domain-specific information extraction.
Mining Comparable Bilingual Text Corpora for Cross-Language Information Integration

Tao Tao
Department of Computer Science, University of Illinois at Urbana-Champaign

ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign

ABSTRACT

Integrating information in multiple natural languages is a challenging task that often requires manually created linguistic resources such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources, but can exploit comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such text corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora and discover mappings between words in different languages. Such mappings can then be used to further discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation of the proposed method on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese. Since our method only relies on naturally available comparable corpora, it is generally applicable to any language pairs as long as we have comparable corpora.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Text Mining

General Terms: Algorithms

Keywords: Cross-lingual text mining, comparable corpora, frequency correlation, document alignment

1. INTRODUCTION

As more information becomes available online, we have also seen more and more information in different natural languages such as English, Spanish, and Chinese. The web today
consists of documents in many different languages. For a user who is interested in finding information from all the documents in different languages, it would be very useful if we could integrate related information in multiple languages [1].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.

Currently, cross-lingual information integration is often achieved through performing cross-lingual information retrieval (CLIR) [17], which allows a user to retrieve documents in language A with a query in language B. Most CLIR techniques rely on manually created linguistic resources such as a bilingual dictionary or examples of direct translations of words and documents [16]. Such resources may not always be available, especially for minority languages; in such a case, how to perform cross-lingual information integration would be a significant challenge. In this paper, we propose a cross-lingual text mining method that can exploit comparable bilingual text corpora to perform cross-lingual information integration without requiring any additional linguistic resources.

Comparable text corpora are collections of text documents in different languages that are about the same or similar topics. For example, news articles published in the same time period tend to report the same important international events in various topics such as politics, business, science and sports. Such data are naturally available to us, so it would be very interesting to study how to exploit them to perform cross-lingual information integration. Even when we have manually created clean
bilingual resources such as a bilingual dictionary, it may still be desirable to exploit such comparable corpora for two reasons: (1) New words and phrases are constantly introduced and it would be hard to keep updating a dictionary to include them all. Approaches such as what we propose in this paper can potentially be used to acquire translation knowledge about such new words/phrases from comparable corpora and help lexicographers update dictionaries. (2) Since comparable corpora are additional resources, we may expect to achieve better performance by combining the exploitation of comparable corpora with that of a bilingual dictionary.

We frame the problem of cross-lingual information integration as one involving mapping or linking words and documents in different languages. While comparable corpora have been studied extensively in the existing literature (e.g., [6, 10, 15, 5, 2, 8, 13]), almost all existing work assumes some kind of bilingual dictionary or translation examples to start with. We study how to map words and documents from comparable bilingual corpora without requiring any additional linguistic resources such as a bilingual dictionary.
Our basic idea is to exploit the fact that the frequency distributions of topically related words in different languages are often correlated due to the correlated coverage of the same events. For example, the earthquake and sea surge disaster that happened recently in Asia has been covered in the news articles in many different languages. We can thus expect to see a recent peak of words such as "earthquake", "India", and "Indonesia" in news articles published in multiple languages. In general, we can expect that topically related words in different languages tend to co-occur together over time. Thus if we have available comparable news articles over a sufficiently long time period, it is intuitively possible to exploit such correlations to learn the associations of words in different languages.

The general idea of exploiting frequency correlations to acquire word translations from comparable corpora has already been explored in several previous studies (e.g., [6, 10, 15]). However, none of them has adopted a direct comparison of frequency distributions of candidate words as we do; rather they tend to compute the associations between the words in the same language and then compare association patterns in two different languages. Our idea and the overall approach appear to be more similar to the method used in [7], but there the task is aligning sentences in parallel corpora.

With the word mappings, we can then try to match documents in two different languages based on how well the words in each document are correlated. We propose four different methods for computing cross-lingual document similarity, including a baseline expected correlation method, an IDF-weighted correlation method, a TF-IDF method, and a translation model method. These methods allow us to perform cross-lingual document retrieval as well as linking the most strongly correlated documents together.

We evaluated our methods on a 120MB Chinese-English news report corpus for both word association mining and document
association mining. The results show that our method can discover meaningful word mappings and can generate meaningful document alignments for information integration. The top ranked word pairs show various kinds of interesting associations between words in the two different languages. The document alignment results have a high precision of 0.8 at a cutoff of 100, meaning that 80% of the document pairs among the top 100 matching results are correctly matched.

The rest of the paper is organized as follows. In Section 2, we discuss our data set. In Section 3 and Section 4, we present our methods for word association mining and document association mining respectively. The experiment results are reported in Section 5, and we summarize our work in Section 6.

2. DATA SET

The comparable corpora we experiment with are 6 months' of news articles of Xinghua English and Chinese newswires dated from June 8, 2001 through November 7, 2001. There are altogether 43488 documents in Chinese and 34751 documents in English. The average document length is 204.6 words. In this data set, there are many articles in English that have some comparable Chinese articles.

An example of comparable news articles is given in Figure 1, which shows excerpts of two reports about the same international swimming championship in English and Chinese, respectively. While these two articles are from the same newswire source, and they cover the same event, they are not translations of each other. However, some words in the two articles are clearly translations of each other. For example, the Chinese translations of "swimming", "World Swimming Championships", and "Men's 400M Freestyle Heats" all occur in the Chinese document. (They are underlined.)

Figure 1: A fragment of an English article (top) and a comparable Chinese fragment (bottom) about an international swimming championship

3. MINING CROSS-LINGUAL WORD ASSOCIATIONS

In this section, we present our method for mining cross-lingual word associations. Our main idea is based on the observation that words that are translations of each other, or about the same topic, tend to co-occur in the comparable corpora at the same/similar time periods. Thus if we have some large comparable corpora available, it is intuitively possible to exploit such correlations to learn the associations of words in different languages.

To see if our intuition can be supported empirically, in Figure 2, we compare the frequency distribution (over time) of Megawati with that of its Chinese translation (left) and with that of another randomly chosen Chinese name ("Arafat") (right). We see that the distributions of the English "Megawati" and its Chinese translation indeed look very similar with a high correlation of 0.855, while the distributions of "Megawati" and the Chinese translation of "Arafat" are quite different with a correlation of only 0.0324. These plots show that we can indeed expect to find semantically relevant mappings between words in different languages by exploiting frequency correlations. We can thus represent each word with a frequency vector and score each candidate pair of words (in different languages) by the similarity of the two frequency vectors.

Formally, suppose we have available comparable corpora C = {(s_1, t_1), ..., (s_n, t_n)}, where s_i and t_i are a set of documents associated with an anchor point "i" in language A and language B, respectively. Let x (or y) be a source (target) word in language A and language B, respectively.
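The mining procedure just motivated can be sketched in code: normalize each word's per-anchor counts into a distribution, then correlate candidate cross-language pairs. The per-day counts and word labels below are toy assumptions, not data from the corpus:

```python
import math

def normalize(counts):
    """Turn raw per-day counts into a frequency distribution over days."""
    total = sum(counts)
    return [c / total for c in counts]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return num / den

# Toy per-day counts: an English word, its hypothetical translation
# (correlated peaks), and an unrelated word.
en_word = normalize([0, 1, 9, 12, 3, 1, 0, 2])
zh_translation = normalize([1, 2, 11, 15, 4, 0, 1, 1])
zh_unrelated = normalize([5, 4, 0, 1, 6, 5, 4, 6])

r_good = pearson(en_word, zh_translation)  # peaks line up: high correlation
r_bad = pearson(en_word, zh_unrelated)     # peaks disagree: low correlation
```

Ranking all candidate cross-language pairs by this correlation is the core of the word association mining step.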
We use c(x, s_i) (or c(y, t_i)) to denote the count of x (or y) in s_i (or t_i). The raw frequency vectors for x and y are thus (c(x, s_1), ..., c(x, s_n)) and (c(y, t_1), ..., c(y, t_n)), respectively. In order to make the frequency vectors more comparable across different languages, it is desirable to normalize a raw frequency vector so that it becomes a frequency distribution over all the time points. That is, we divide all the counts by the sum of all the counts over the entire time period. Such a normalized frequency distribution would allow us to focus on the relative frequency on different days, which is presumably more comparable across different languages than the original non-normalized counts.

Figure 2: Megawati vs. its Chinese translation (left) and a random Chinese name (Arafat) (right)

Let x = (x_1, ..., x_n) and y = (y_1, ..., y_n) be the normalized frequency vectors for x and y, respectively, where

x_i = \frac{c(x,s_i)}{\sum_{j=1}^{n} c(x,s_j)}, \qquad y_i = \frac{c(y,t_i)}{\sum_{j=1}^{n} c(y,t_j)}

In order to compute the similarity between x and y (or word x and word y), we use the Pearson's correlation coefficient, which is a commonly used statistic measure defined as

r(\mathbf{x},\mathbf{y}) = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)\left(\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right)}}

4. MINING CROSS-LINGUAL DOCUMENT ASSOCIATIONS

With the mined word correlations, we can compute the similarity between a document d_1 in language A and a document d_2 in language B. The baseline Expected Correlation method (ExpCorr) computes the expected correlation of a word pair drawn from the two documents:

s(d_1,d_2) = \sum_{x \in d_1, y \in d_2} r(x,y)\, p(x|d_1)\, p(y|d_2)

where p(x|d_1) = \frac{c(x,d_1)}{|d_1|} and p(y|d_2) = \frac{c(y,d_2)}{|d_2|}. To emphasize matches on discriminative words, we can weight each word pair by IDF(x,y) = IDF(x)\,IDF(y), where IDF(w) = \log\frac{N}{df(w)}, and df(w) is the number of documents in a language that contains word w, often called the document frequency of a word.

Adding IDF weighting to ExpCorr, we obtain the following IDF-weighted Correlation method (IDFCorr):

s(d_1,d_2) = \sum_{x \in d_1, y \in d_2} IDF(x,y)\, r(x,y)\, \frac{c(x,d_1)}{|d_1|}\, \frac{c(y,d_2)}{|d_2|}

In both ExpCorr and IDFCorr, the similarity score grows linearly to the count of a word in the document. However, intuitively, having one extra match after matching a word 100 times does not add so much extra evidence as matching the word the first time. We thus would like to have the similarity
score to grow sub-linearly according to the count of a matching word. Again, in information retrieval, many formulas have been proposed to heuristically normalize the count of words to achieve this effect. A popular and effective method is the BM25 term frequency normalization method [11, 12]. According to this formula, the normalized count of word w in document d is given by

BM25(w,d) = \frac{(k_1+1)\, c(w,d)}{k_1\left((1-b) + b\,\frac{|d|}{AvgDocLen}\right) + c(w,d)}

where k_1 and b are parameters and AvgDocLen is the average document length. In our experiments, we set k_1 = 1.2 and b = 0.75, which are the recommended default settings. The BM25 weighting formula provides an alternative way of normalizing the count of a word, so BM25(x,d_1) and BM25(y,d_2) can effectively play the same role as p(x|d_1) and p(y|d_2), possibly with more reasonable normalization of the counts. Thus we use BM25(x,d_1) and BM25(y,d_2) to replace p(x|d_1) and p(y|d_2), respectively, in the IDFCorr approach to obtain the following formula, which we refer to as BM25 Correlation (BM25Corr):

s(d_1,d_2) = \sum_{x \in d_1, y \in d_2} IDF(x,y)\, r(x,y)\, BM25(x,d_1)\, BM25(y,d_2)

Finally, motivated by the language modeling approach to information retrieval [9, 17], we can also use the correlation between words to estimate a word translation model t(x|y) [3] as

t(x|y) = \frac{r(x,y)}{\sum_{x'} r(x',y)}

and score d_1 against d_2 by the likelihood of generating the words of d_1 from a translation language model of d_2, \sum_{y \in d_2} t(x|y)\,\frac{c(y,d_2)}{|d_2|}, smoothed with the collection language model [18, 19]; p(x|C) can be estimated as

p(x|C) = \frac{\sum_{i=1}^{n} c(x,s_i)}{\sum_{w}\sum_{i=1}^{n} c(w,s_i)}

We refer to this translation model method as CorrTrans.

5. EXPERIMENT RESULTS

Table 1: 38 highest correlated word pairs with the number 8

Among the recoverable English entries of Table 1 are swimming (0.862), october (0.875), august (0.907), july (0.857), june (0.876), afghan (0.886), afghanistan (0.882), terror (0.850), terrorism (0.846), taliban (0.839), and APEC. The character matching "afghan" and "afghanistan" is one of the three characters in the Chinese translation of "Afghanistan". The character matching "swimming" is also precisely its Chinese translation, and the top two characters returned for "terror" are the exact translation of this word in Chinese. Another interesting example is the matching of "APEC" and "apec". Interestingly, from this list, we can also infer that the two major common themes in this corpora appear to be sports and terrorism, since the best matching words seem to fall into these two categories. This is an additional benefit of our word association mining algorithm.

Chinese x: (swimming), (championship), (place name), (seconds), (medal), (diving), (semi), (score), (men), (championship)

We then take the top 5 English documents from the three randomly chosen clusters, to generate 15 seed English documents. For each English document, we use the baseline method ExpCorr to score all the Chinese documents of the same day, the previous day, and the day after, and retrieve the top 20 Chinese documents for each English document. We read these documents and judge whether they are about the same topic/theme as the English seed document. The pairs judged as covering the same topic are assumed to be correct mappings. The seed English documents and the retrieved Chinese documents are combined together to form a working set of documents in both languages, which contains 15 English documents and 239 Chinese documents. We then use all the four methods to compute the matching scores of all the seed English documents and all the Chinese documents in the working set, and take the 20 top-ranked Chinese documents for each English seed document for evaluation. This way, we can compare the performances of these four methods with a controlled sample of the top-ranked pairs from the whole corpora.

Figure 3: Alignment results on the working set (precision at rank for ExpCorr, IDFCorr, BM25Corr, and CorrTrans)

Table 4: Precisions at ranks on the augmented set (columns: rank; precision at rank for ExpCorr, IDFCorr, BM25Corr, CorrTrans)

In order to understand how much bias the working set might introduce, for each English seed document, we further rank all the Chinese documents from the same day as the English seed document, the day before, and the day
after. This time, we take the top 50 Chinese documents for each English seed document and pool them together for evaluation. The results are shown in Figure 4 and Table 4. Note that because we have not judged all these Chinese documents, we assumed any matching with an unjudged document to be incorrect. This means that the performance we see actually represents a lower bound; the real performance can only be better.

Comparing Table 3 and Table 4 indicates that the baseline ExpCorr performs similarly, indicating that the additional 30 Chinese documents retrieved mostly have not made it to the top pairs. The slight decrease in the precision at rank 50 and rank 100 suggests that there may be a couple of unjudged Chinese documents showing up in the top 100 list. Note that, for the baseline method, these 30 Chinese documents are unjudged, thus can only decrease performance. The BM25Corr method also performs similarly; actually its performance on the larger set is even slightly better at rank 50 and 100. This suggests that the additional 30 Chinese documents retrieved may actually contain some correct matchings. Note that, in this case, the additional 30 Chinese documents may contain judged correct matchings, which are those documents that are among the top 20 documents returned using the baseline method, but failed to make it to the top 20 documents by the BM25Corr method. Thus giving the BM25Corr method an opportunity to retrieve more results has helped it to improve the performance slightly. Both IDFCorr and CorrTrans perform worse on the larger set, indicating that they are not very robust. Indeed, the precision of CorrTrans is all zeros for all the ranks. This suggests that the method cannot normalize the scores for different English seed documents well; as a result, some incorrect results in the additional 30 Chinese documents may have turned out to dominate the top pairs. Comparing the four methods on ranking the augmented working set, we see that both IDFCorr and BM25Corr again perform better than the other two methods, and BM25Corr is clearly the best method among the four.

6. CONCLUSION AND FUTURE WORK

In this paper, we propose and explore a completely unsupervised cross-lingual text mining method that can exploit comparable bilingual corpora to perform cross-lingual information integration. Our basic idea is to exploit the frequency correlations of words about the same topic to first mine word associations and then mine document associations. These associations can be used to integrate multilingual text information and support cross-lingual information retrieval and navigation, which has been becoming more and more important due to the rapid growth of multilingual documents available on the Web. Evaluation of the proposed method on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese.

The most important contribution of our work is that we have demonstrated the feasibility of mining word and document associations from comparable corpora without relying on any additional (manually created) linguistic resources. To the best of our knowledge, all previous attempts on cross-lingual information integration rely on some manually crafted linguistic resources such as a bilingual dictionary or translation examples. Since our approach does not depend on such resources, it is more general and robust than the existing methods.

Although we have shown promising results with our methods, our methods can be further improved in several ways. First, we could use the document matching results to induce new alignments for the whole corpora, which can then be used to improve our computation of word correlations.
The new results of word correlations can be fed back to help generate improved document alignment. This way, we have an iterative algorithm for mining both word associations and document associations. Second, we have treated the whole document as an information unit. To improve information integration accuracy, it may be beneficial to align document segments. For example, we can use a sliding window to search for the best matching segments when matching two documents. Finally, it would be very interesting to explore how to design a mixture model that can mine word associations and document associations simultaneously.

7. ACKNOWLEDGEMENTS

We thank Richard Sproat and Dan Roth for helpful discussions. This work is supported in part by DOI under the contract number NBCHC040176.

8. REFERENCES

[1] J. Allan et al. Challenges in information retrieval and language modeling: report of a workshop held at the Center for Intelligent Information Retrieval. SIGIR Forum, 37(1):31-47, 2003.
[2] L. Ballesteros and W. B. Croft. Resolving ambiguity for cross-language retrieval. In Research and Development in Information Retrieval, pages 64-71, 1998.
[3] A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222-229, 1999.
[4] T. M. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[5] M. Franz, J. S. McCarley, and S. Roukos. Ad hoc and multilingual information retrieval at IBM. In Text REtrieval Conference, pages 104-115, 1998.
[6] P. Fung. A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. In Proceedings of ACL 1995, pages 236-243, 1995.
[7] M. Kay and M. Roscheisen. Text translation alignment. Computational Linguistics, 19(1):75-102, 1993.
[8] H. Masuichi, R. Flournoy, S. Kaufmann, and S. Peters. A bootstrapping method for extracting bilingual text pairs. In Proc. 18th COLING, 2000.
[9] J. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of ACM SIGIR'98, pages 275-281, 1998.
[10] R. Rapp. Identifying word translations in non-parallel texts. In Proceedings of ACL 1995, pages 320-322, 1995.
[11] S. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of SIGIR'94, pages 232-241, 1994.
[12] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In D. K. Harman, editor, The Third Text REtrieval Conference (TREC-3), pages 109-126, 1995.
[13] F. Sadat, M. Yoshikawa, and S. Uemura. Bilingual terminology acquisition from comparable corpora and phrasal translation to cross-language information retrieval. /P/P03/P03-2025.pdf.
[14] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[15] K. Tanaka and H. Iwasaki. Extraction of lexical translations from non-aligned corpora. In Proceedings of COLING 1996, 1996.
[16] J. Veronis, editor. Parallel Text Processing: Alignment and Use of Translation Corpora. Kluwer Academic Publishers, 2000.
[17] J. Xu, R. Weischedel, and C. Nguyen. Evaluating a probabilistic model for cross-lingual information retrieval. In Proceedings of ACM SIGIR 2001, 2001.
[18] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR'01, pages 334-342, Sept 2001.
[19] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Proceedings of SIGIR'02, pages 49-56, Aug 2002.
[20] C. Zhai, A. Velivelli, and B. Yu. A cross-collection mixture model for comparative text mining. In Proceedings of KDD 2004, 2004.