Zweigenbaum P Using Word Alignment to Extend Multilingual Medical Terminologies
- 格式:pdf
- 大小:153.83 KB
- 文档页数:6
目录第五章自定义函数 (1)5.1 MA TLAB函数简介 (1)5.2 在MA TLAB中传递变量:按值传递机制 (6)例5.3 (7)5.3 选择性参数 (14)测试5.1 (16)5.4 用全局内存分享数据 (17)例5.4 (18)5.5 在函数调用两次之间本地数据的存储 (22)例5.5 运行平均数 (22)5.6 函数的函数(function functions), (26)例5.6 (27)5.7 子函数和私有函数 (29)5.8 总结 (29)5.9 练习 (30)5.1 (30)5.2 (30)5.3 (31)5.4 (31)5.5 (31)5.6 (31)5.7 (31)5.8 (31)5.9 (31)5.10 (31)5.11 (31)5.12 (32)5.13 (32)5.14 (32)5.15 (33)5.16 (33)5.17 (33)5.18 (34)5.19 (35)5.20 (35)5.21 (35)5.22 (35)5.23 (36)5.24 (36)5.25 (37)5.26 (37)第五章自定义函数在第三章中,我们强调了好的编程习惯的重要性。
我们进行开发的基本手段是自上而下的编程方法。
在自上而下的编程方法中,它开始于对所要解决问题的精确陈述和定义输入量和输出量。
下一步,我们在大面上进行算法的描述,然后把算法分解成一个一个的子问题。
再然后,程序员把这一个个子问题进行再一次的分解,直到分解成简单而且能够清晰理解的伪代码。
最后把伪代码转化为MATLAB代码。
尽管我们在前面的例子中,按照上面的步骤进行了编程。
但是产生的结果在某种程度上还是受限制的。
因为我们必须把每一个子问题产生的MATLAB代码嵌入到一个单独的大程序中。
在嵌入之前我们无法对每一次子问题的代码进行独立地验证和测试。
幸运的是,MATLAB有一个专门的机制,在建立最终的程序之前用于独立地开发与调试每一个子程序。
Word编辑法修改系统发育树序列标签教程
系统树在文章发表时一定要符合学术常规,如系统树上每个分枝的序列标签名。
这些序列标签名一般都包含菌的登录号和菌的拉丁名等。
当二者放在一起时,登录号需要用正体,而菌名的拉丁文需要用斜体来表示。
(注:或许你会发现在某些已发表的文章中有些作者却忽略了这样的处理)。
你是否也遇到了如何编辑系统树中序列标签名的问题呢?
今天的教程给大家带来另一种系统树编辑的简单方法(Word编辑法),这种方法对于一些同学可能是有些陌生的。
原因在于我们构建完系统树后,保存成的jpg、pdf、tif等格式图往往不知道如何对图片上的标签名进行更改。
Word编辑系统发育树标签名(登录号等正体、拉丁文斜体)操作:
例:MEGA建树
1.采用MEGA软件构建完成系统发育树如下图:
2.然后点击MEGA工具栏里的Image按钮,会看到如下图的保存类型的选项。
Eigen使用手册1、引言Eigen是一个高级C++库,用于进行线性代数、矩阵和向量操作、数值分析和相关的数学运算。
本手册旨在为使用Eigen库的用户提供详细的使用指导。
2、安装与配置在开始使用Eigen之前,您需要先将其安装到您的开发环境中。
请根据您所使用的操作系统和编译器,参照Eigen官方网站上的安装指南进行操作。
3、基本概念Eigen库中的核心概念包括矩阵、向量和线性方程组。
矩阵和向量是进行各种数学运算的基本数据结构。
4、主要功能Eigen库提供了丰富的功能,包括但不限于:矩阵和向量的基本操作、特征值和特征向量的计算、线性方程组的求解、稀疏矩阵的处理、优化和并行计算等。
5、矩阵运算Eigen库支持各种矩阵运算,包括矩阵的加法、减法、乘法、转置、逆等。
此外,还支持矩阵分解(如LU分解、QR分解等)和矩阵函数(如矩阵指数、行列式等)。
6、向量运算Eigen库中的向量支持各种基本运算,包括加法、减法、数乘、点积、外积等。
此外,还提供了向量函数(如向量范数、向量归一化等)。
7、特征值与特征向量Eigen库提供了计算特征值和特征向量的功能,支持多种特征值求解方法,如QR算法、Jacobi方法等。
8、线性方程组求解Eigen库提供了多种线性方程组求解方法,如Gauss-Jordan消元法、迭代法(如Jacobi方法、SOR方法等)以及直接法(如LU分解)。
9、稀疏矩阵处理Eigen库支持稀疏矩阵的存储和运算,提供了多种稀疏矩阵格式(如CSR、CSC等),并实现了高效的稀疏矩阵运算算法。
10、优化与并行计算Eigen库提供了多种优化选项,包括自动求逆、表达式模板等,以加速代码的执行。
此外,Eigen还支持多线程并行计算,可以充分利用多核处理器进行大规模计算。
11、应用案例Eigen已被广泛应用于各种领域,如计算机图形学、数值分析、机器人学、科学计算等。
一些成功的应用案例包括3D渲染引擎、机器学习算法、物理模拟等。
NotebookNotebook 的功能在于:使用户能在Word环境中“随心所欲地享用”MATLAB的浩瀚科技资源,为用户营造融文字处理、科学计算、工程设计于一体的完美工作环境。
MATLAB Notebook 制作的M-book文档不仅拥有MS-Word 的全部文字处理功能,而且具备MATLAB无与伦比的数学解算能力和灵活自如的计算结果可视化能力。
它既可以看作解决各种计算问题的字处理软件,也可以看作具备完善文字编辑功能的科技应用软件。
M-book文档最显著的特点是它的“活”性:●它为论文、科技报告、讲义教材、学生作业的撰写营造了文字语言思维和科学计算思维的和谐环境。
●用M-book写成的电子著作、电子文稿、讲义教材不仅图文并茂,而且动静结合。
那些由MATLAB指令构成的例题、演示,都可供读者亲自操作,举一反三,从而在“手脑并用”的环境中由此及彼、由浅入深。
8.1Notebook的配置和启动8.1.1Notebook的配置(1)与MATLAB适配的Word随MATLAB版本的升级,与其适配的版本也会发生变化。
以MATLAB R2007a为例,能用来配置Notebook环境的Word版本是2000,2002,2003。
(2)Notebook的配置在Windows已经装有前述Word版本的前提下,在MATLAB中配置Notebook环境十分简便。
只要在MATLAB指令窗中运行以下指令,配置过程将自动进行。
notebook –setup假如指令窗中出现如下信息,就表示配置成功。
Setup complete8.1.2Notebook的启动一创建新的M-book文件(1)在Word 默认窗口(即Normal.dot)下创建新的M-book文档●选择Word 窗口的下拉菜单项{文件:新建} ;●在弹出的对话框中,选择“M-book”模板,按[确定] 键;●于是,Word 的窗口由原先的默认式样变成“M-book”式样(如图8.1-1所示),自动开启一个新的MATLAB作为其服务器,而不管此前Windows平台上是否已经开启了MATLAB。
eigen使用方法-回复Eigen 是C++ 的一个开源线性代数库,提供了一系列用于向量、矩阵、数组和其他线性代数操作的高性能模板类和函数。
它被广泛应用于机器学习、计算机图形学等领域。
在本文中,我们将一步一步介绍Eigen 的使用方法。
第一步:安装Eigen要开始使用Eigen,首先需要安装它。
Eigen 的源代码可以在官方网站上下载,并通过解压缩后将其复制到您的项目目录中,或者将其安装到系统的标准库中。
您也可以通过包管理器(如Homebrew on Mac 或apt-get on Ubuntu)安装Eigen。
第二步:引入Eigen 头文件在您的C++ 代码中,需要包含Eigen 的头文件,以便使用其功能。
以下是引入Eigen3 库的示例代码:#include <Eigen/Dense>第三步:定义矩阵和向量在Eigen 中,可以使用Matrix 和Vector 类来表示矩阵和向量。
矩阵可以是动态大小或固定大小的,它们的大小由模板参数决定。
例如,要定义一个3x3的固定大小矩阵,可以使用以下代码:Eigen::Matrix<double, 3, 3> matrix;要定义一个动态大小的矩阵,可以使用以下代码:Eigen::MatrixXd matrix;同样,可以用VectorXd 来定义一个动态大小的向量。
第四步:填充矩阵和向量填充矩阵和向量的方法有很多种。
您可以逐个元素地赋值,也可以使用块赋值或生成特定类型的矩阵和向量。
以下是一些常见的方法示例:逐个元素赋值matrix(0, 0) = 1.0;matrix(0, 1) = 2.0;matrix(1, 1) = 3.0;块赋值matrix.block(0, 0, 2, 2) << 1.0, 2.0,3.0,4.0;生成特定类型的矩阵Eigen::MatrixXd::Zero(3, 3); 全零矩阵Eigen::MatrixXd::Identity(3, 3); 单位矩阵第五步:进行线性代数运算Eigen 提供了一系列的线性代数运算函数,包括矩阵和向量的乘法、求逆、转置、特征值分解等。
eigen3 交叉编译Eigen3是一个用于线性代数,矩阵和向量操作的C++库,它提供了丰富的功能,使得在C++中进行数值计算变得容易。
在某些情况下,您可能需要在不同的平台或架构上使用Eigen3,这就需要进行交叉编译。
下面将为您介绍如何进行Eigen3的交叉编译。
一、准备工作在进行交叉编译之前,请确保您已经安装了所需的交叉编译工具链,并且已经下载了Eigen3库的源代码。
您还需要了解目标平台的架构和编译器,以便选择适当的交叉编译设置。
二、交叉编译设置1. 配置环境变量:将交叉编译工具链的路径添加到系统环境变量中,以便编译器能够找到所需的工具。
2. 创建交叉编译构建目录:在交叉编译过程中,需要使用特定的构建目录来保存构建输出。
建议创建一个专门的交叉编译构建目录,例如“eigen3_cross_compile”。
3. 配置CMake:使用CMake作为构建系统来管理交叉编译过程。
在终端中,导航到Eigen3源代码目录,并执行以下命令以生成CMake 构建文件:```arduinomkdir -p build && cd buildcmake .. -DCMAKE_TOOLCHAIN_FILE=<path-to-toolchain-file> ```其中,`<path-to-toolchain-file>`是交叉编译工具链对应的配置文件。
4. 编译Eigen3:执行以下命令以开始交叉编译Eigen3:```rmake -j <num_threads>```其中,`<num_threads>`是用于并行编译的线程数。
5. 安装Eigen3:完成编译后,可以使用以下命令将Eigen3库安装到目标平台上:```arduinomake install```三、使用交叉编译的Eigen3库一旦Eigen3库成功交叉编译并安装到目标平台上,您就可以在您的C++项目中引用它了。
eigen基本操作摘要:1.Eigen 库简介2.Eigen 库的基本操作3.实例:使用Eigen 库进行线性代数运算正文:1.Eigen 库简介Eigen 库是一个用于线性代数、矩阵计算和其他相关领域的C++库。
它提供了大量的矩阵操作和算法,并且以模板方式实现,可以适用于各种数据类型。
Eigen 库的目标是提供高性能、易于使用的库,使得C++开发者可以方便地进行线性代数运算和相关操作。
2.Eigen 库的基本操作Eigen 库提供了许多基本的操作,包括矩阵的加法、减法、乘法、转置、求逆、行列式计算等。
下面我们以一个简单的例子来介绍一下Eigen 库的基本操作。
假设我们有一个3x3 的矩阵A:```3x3 矩阵A:1 2 34 5 67 8 9```我们可以使用Eigen 库来进行以下基本操作:(1) 矩阵加法:```cppEigen::Matrix3d A = Eigen::Matrix3d::Random(); Eigen::Matrix3d B = Eigen::Matrix3d::Random(); Eigen::Matrix3d C = A + B;```(2) 矩阵减法:```cppEigen::Matrix3d A = Eigen::Matrix3d::Random(); Eigen::Matrix3d B = Eigen::Matrix3d::Random(); Eigen::Matrix3d C = A - B;```(3) 矩阵乘法:```cppEigen::Matrix3d A = Eigen::Matrix3d::Random(); Eigen::Matrix3d B = Eigen::Matrix3d::Random(); Eigen::Matrix3d C = A * B;```(4) 矩阵转置:```cppEigen::Matrix3d A = Eigen::Matrix3d::Random();Eigen::Matrix3d B = A.transpose();```(5) 矩阵求逆:```cppEigen::Matrix3d A = Eigen::Matrix3d::Random();Eigen::Matrix3d B = A.inverse();```(6) 行列式计算:```cppEigen::Matrix3d A = Eigen::Matrix3d::Random();double det = A.determinant();```3.实例:使用Eigen 库进行线性代数运算下面我们通过一个具体的实例来展示如何使用Eigen 库进行线性代数运算。
C++矩阵处理库--Eigen初步使用(转载自: CS DN cyxcw1的博客)2013-07-16 21:20:43| 分类:计算机视觉|举报|字号订阅项目要进行比较多的矩阵操作,特别是二维矩阵。
刚开始做实验时,使用了动态二维数组,于是写了一堆Matrix函数,作矩阵的乘除加减求逆求行列式。
实验做完了,开始做代码优化,发现Matrix.h文件里适用性太低,而且动态二维数组的空间分配与释放也影响效率,于是寻找其他解决方案。
首先考虑的是与Matlab混合编程,折腾了半天把Matlab环境与VS2010环境之后,发现Matlab编译出来的函数使用起来也比较麻烦,要把数组转化成该函数适用的类型后才能使用这些函数。
我的二维数组也不是上千万维的,估计这个转化的功夫就牺牲了一部分效率了。
(如果谁有混合编程的心得,求帮忙,囧。
)接着想到使用一维数组的方法,或者把一维数组封装在一个类里边。
想着又要写一堆矩阵操作函数头就大,索性谷歌了一下矩阵处理库,除了自己之前知道的OpenCV库(之前由于转化cvarr麻烦,于是放弃),还有Eigen, Armadillo。
/houston11235/article/details/8501135该博客对这三个库的效率做了一个简单的评测,OpenCV库的矩阵操作效率是最低的,还好我没使用。
Eigen速度最快,与自己定义数组的操作效率相当(- -,才相当吗?我本来还想找个更快的呢)。
于是选择使用Eigen。
进入正题。
安装:/index.php?title=Main_Page这里是官网,直接把包下载下来,不大,也就几M,我是直接放在自己项目文件夹(考虑项目封装时,这样比较方便),放在VS2010 <INCLUDE>文件夹。
简单使用:看了一下官方文档,Eigen库除了能实现各种矩阵操作外,貌似还提供《数学分析》中的各种矩阵操作(包括L矩阵U矩阵)。
目前我使用到的还是简单的矩阵操作,如加减乘除,求行列式,转置,逆,这些基本操作只要:[cpp]view plaincopyprint?1. #include "Eigen/Eigen"2. using namespace Eigen;就能实现,别忘了名空间Eigen。
Microsoft可移植可执行文件和通用目标文件格式文件规范修订版8.1 - 2008年2月15日摘要本规范描述了Microsoft® Windows®操作系统家族下的可执行文件(映像)和目标文件的结构。
这些文件分别被称为可移植可执行(PE)文件和通用目标文件格式(COFF)文件。
注意:提供本文档是为了辅助开发用于Microsoft Windows操作系统上的工具和应用程序,但并不保证它在各个方面都是完整的规范。
Microsoft保留更改本文档而不通知的权利。
Microsoft可移植可执行文件和通用目标文件格式文件规范的此次修订版取代了本规范的6.0修订版。
本规范中的信息适用于以下操作系统:Windows Server® 2008Windows Vista®Windows Server 2003Windows XPWindows 2000本规范最后列出了参考信息和相关资源。
本规范的最新版在万维网的以下地址被维护:/whdc/system/platform/firmware/PECOFF.mspxSmartTech译电子信箱:zhzhtst@法律声明Microsoft可移植可执行文件和通用目标文件格式文件规范Microsoft Corporation修订版 8.1注意:提供本规范是为了辅助开发某些用于Microsoft Windows操作系统平台上的开发工具。
但是Microsoft并不保证它在各个方面都是完整的规范,也无法保证这里的所有信息在发布之后一直都是准确的。
Microsoft保留更改本规范而不通知的权利。
在合理的和非歧视性条款和条件下,Microsoft将针对任何Microsoft认为仅在面向Microsoft Windows的被称为编译器、链接器以及汇编程序的软件开发工具中实现和遵守本规范中所需部分这种有限用途下所需要的Microsoft权利要求书(如果存在)授予您免版税许可。
Using Word Alignment to Extend Multilingual Medical TerminologiesLouise Deléger∗,Magnus Merkel†,Pierre Zweigenbaum∗‡∗Inserm U729,Paris,France†Dept of Computer and Information Science,Linköping University,Sweden‡Assistance Publique–Paris Hospitals,STIM;Inalco,CRIM,Paris,Francelouise.deleger@spim.jussieu.fr,mme@ida.liu.se,pz@biomath.jussieu.frAbstractMedical terminologies such as those provided in the UMLS are never exhaustive and there is a constant need to enrich them,especially in terms of multilinguality.We present a methodology to acquire new French translations of English medical terms based on word alignment in a parallel corpus—i.e.pairing of corresponding words.We automatically collected a27.7-million-word parallel,English-French corpus.Based on afirst1.3-million-word extract of this corpus,we detected3,255French translations of English MeSH terms,among which1,956are new translations.1.IntroductionThe UMLS Metathesaurus is an extensive vocabulary database that gathers and provides a link between differ-ent existing biomedical terminologies.But despite being a multilingual resource,it is mostly composed of English vocabulary,and other languages such as French are under-represented in comparison to English.There is therefore a need to enrich the terminologies of the UMLS.The acqui-sition of new translations of English terms is required.This is the purpose of the VUMeF1project which aims at ex-tending the French part in the UMLS and which provides the background for this work.Plenty of multilingual texts can be found as regards a spe-cific domain but exhaustive terminologies and dictionaries are far less numerous—as can be seen in the case of the UMLS.Hence the idea of using parallel corpora(collec-tions of multilingual texts)to enlarge terminologies.So in-stead of employing a human translator,we can make use of existing translated texts from which translations at the term level can be extracted.We present a methodology to acquire medical terms based on word alignment in a parallel corpus.Word alignment is a natural language processing technique and is used in sev-eral applications such as terminology development(which is the case here),automatic translation and cross-lingual in-formation retrieval.It consists in pairing words that are translations of each other in a parallel corpus.Previous work has addressed the issue of multilingual med-ical terminologies.Chiao and Zweigenbaum(2002)collect translations from comparable corpora.Baud et al.(1998) make use of already parallel medical vocabularies to derive word translations.Widdows et al.(2002)use a statistical vector model on a corpus aligned at the document level. Névéol and Ozdowska(2005)have an approach similar to ours in that it deals with word alignment in parallel medi-cal corpora to extract French translations of English terms. However,we deal with a larger corpus and process all kinds of alignments.1.Project VUMeF(French Unified Medical V ocabulary),led by Stéfan Darmoni(Darmoni et al.,2003),is partially funded by the French Ministry of Research(National Network of Health Technologies).Our task involves issues such as dealing with errors in the alignment process that will spread from step to step,and detecting multi-word units—a term being either a single word or a multi-word expression.This work is outlined in the following way:based on a French-English corpus(2.1),we align sentences(2.2)and words(2.3,2.4).Medical terms are then selected(2.5) through the projection of a list of English terms from the MeSH.We obtain a list of bilingual English-French medi-cal terms that we review.We extract samples for evaluation purposes(2.6)and expose results(3.).We discuss(4.)and conclude(5.)on the method.2.Material and Method2.1.CorpusThe corpus used for this experiment is collected from the web.The web is indeed a powerful resource for building corpora,both in terms of quantity and multilinguality.The quality of such a corpus can nevertheless be questioned and this might account for a proportion of the noise detected in the results.Our corpus is gathered from a bilingual(French-English) Canadian health web site2.It is intended for the general public as well as for specialists and the proportion of spe-cialized terms might therefore be lower than in resources dedicated to medical specialists.Several techniques exist for building a parallel corpus from the web(Resnik and Smith,2003;Patry and Langlais, 2005).We generate pairs of parallel documents(i.e.,doc-uments that are translations of each other)using informa-tion contained in the document structure—namely,HTML links to corresponding documents in the other language.In-deed,after a study of the documents,we noticed that each document provided access to its translation page through an image or a text tag labelled in the corresponding lan-guage(specified in the“alt”attribute of the HTML tag). This gives us11,041pairs of parallel documents and a total of27.7million words.Documents from the web usually come in either HTML or PDF format,and need to be converted to text format.As2.http://www.hc-sc.gc.ca/9In Pierre Zweigenbaum,Stefan Schulz,and Patrick Ruch,editors,LREC2006Workshop on Acquiring and Representing Multilingual,Specialized Lexicons:the Case of Biomedicine.Genova,Italy,2006.ELDA.for us,we have HTML documents and thus have access to structural information that may be useful to keep even in a textfile.After cleaning the HTMLfiles and converting them to XML format,we use a XSLT stylesheet to trans-form them to text format while keeping a number of infor-mation—title,paragraph and link tags which will be used as correspondence points for sentence alignment.Indeed, we assume that a title in a source language corresponds to a title in a target language,a link to a link,and in most cases a paragraph to a paragraph.The resulting texts are segmented into sentences to prepare the way for further processing. 2.2.Sentence AlignmentThefirst step towards word alignment is to align the corpus at sentence-level.Sentence alignment is a mandatory task since there is not a full one-to-one correspondence between the sentences of two parallel texts.Although it is most com-mon that one sentence in a source language corresponds to one sentence in a target language,there are instances where one sentence is translated with two—or sometimes even three or more—sentences,and this needs to be determined before working at the word level.To do so,we use Dan Melamed’s GMA3(Geometric Map-ping and Alignment)(Melamed,2000),a robust tool which performs sentence alignment of parallel texts using both statistical and linguistic techniques.It is based,among other things,on length measurements,bilingual lexicons and cognates(words sharing similar spelling and meaning). Though sentence aligners in general and GMA in particular achieve high-quality performances,any mistake at this level will be reflected at the next one—i.e.,word alignment—and will make things even harder for this already complex process.So,in order to work on cleaner data,we attempt to automatically detect and remove incorrect sentence align-ments as well as bad document pairing(documents that are not parallel)using criteria such as sentence length and qual-ity evaluation of sentence alignment.2.3.Word AlignmentOnce sentences are aligned,we can proceed to word align-ment.This task is far more problematic than sentence align-ment.There is no true word-to-word correspondence be-tween the words of two sentences.A word is often trans-lated with several ones,or can be omitted in the corre-sponding sentence(this is typically the case for grammat-ical words that are specific to a language).Parallel sen-tences,though being translations of each other,can differ considerably in terms of structure.In that case even a hu-man has trouble determining which words should be paired together.The results we expect are therefore on a lower level than from the previous sentence alignment task.The issue of the type of word alignment should be raised. That is,are we satisfied with a word-to-word alignment? The objective of this work is to obtain medical terms.A term can be either a single word or a multi-word unit.A common approach is tofirst extract candidate terms using a separate tool—a term extractor—and then to proceed to their alignment(Daille et al.,1994;Gaussier,1998).The3./GMA/originality of this work is that we do not separate the de-tection of candidate terms from the alignment process.In other words we use a tool that is able to detect multi-word units and to align both single words and multi-word expres-sions.Word alignment systems usually derive from either statis-tical approaches or linguistic ones,or a combination of both.Statistical methods(Brown et al.,1993)involve co-occurrence measures and probability scores,and are especially effective on large corpora with high-frequency words but performances decrease with low-frequency oc-currences.Linguistic ones(Wu,2000)make use of infor-mation such as syntactic parsing.They are less robust de-spite being able to deal with low-frequency words.Hybrid approaches(Ahrenberg et al.,2000;Barbu,2004)seem to be a good compromise.2.4.Aligning Words with the I*ToolsWe use the I*Tools suite(developed at Linköping Univer-sity,Sweden)to perform word alignment.We chose these tools partly because they are based on a hybrid approach, using both statistical and linguistic techniques.They also align multi-word units which suits our terminological pur-pose.They make use of resources such as co-occurrence measures,bilingual dictionaries,POS tagging and syntacti-cal analysis.A pre-processing step is required:the corpus is tagged and lemmatized(using Treetagger4)and syntactically an-notated(with the syntactic analyzer SYNTEX(Bourigault et al.,2005)).Thefiles are transformed into XML format encoding this information.The alignment process with the I*Tools can be divided into three steps:training,automatic alignment and review of the results;each one corresponding to a specific tool of the suite.Training and review are both done with graphical, interactive tools that are fast to work with.Training of the system is manually done using a special tool of the suite—the I*Link5interactive aligner(Merkel et al.,2003).This tool proposes word pairings to the user who accepts or rejects them.The user’s decisions are stored into the resources of the system and by learn-ing from them,the performances become increasingly ac-curate.These resources provide training data for the auto-matic word aligner.The corpus is then automatically aligned by I*Trix,the au-tomatic aligner of the suite,using the resources created with I*Link.We obtain a list of word alignments—i.e.,source words paired with target words.The system can also exploit data created during the next step(the reviewing phase).In that case,the automatic alignment is repeated after afirst run and takes into account the review made by the user. This is useful if the resultsfirst obtained are not as good as expected.Results are reviewed with the I*View tool which enables the user to confirm,reject,or simply remove an alignment. An alignment is«removed»when it is neither an error nora correct alignment,meaning it is a partial alignment(some4.http://www.ims.uni-stuttgart.de/projekte/corplex/ TreeT agger/DecisionTreeT agger.html5.http://www.ida.liu.se/∼nlplab/ILink/10LREC2006Workshop on Acquiring and Representing Multilingual,Specialized Lexiconsparts are correct).This tool also indicates for each align-ment a quality score,which enables the user to rank the alignments.The quality score used in the I*Tools is based on the mutual information formula(Stolz,1965).Mu-tual information has been used in several works,including (Church and Hanks,1990)which derives a new measure for estimating word associations and(Fabry et al.,2005)which uses mutual information for term extraction to build a ter-minology.In our case,the measure is expressed in terms of the frequency of the words as a pair,and the frequency of each source and target word of the pair independently in other pairs.This means that for a proposed word pair which occurs with a high frequency and where the source word and the target word only occur in this pair and not in any other suggested word pairs,we have good reasons to as-sume that the quality is high.On the other hand if the word pair has a low frequency and the source and target words of the pair are found in several other suggested pairs,then there is reason to be more doubtful to the suggestion.The formula is Q=f(st)/n(s)+n(t),where f(st)is the frequency of the word pair and n(s)and n(t)are the number of differ-ent word pairs in which the source and target words occur respectively.2.5.Term Selection and reviewIn practice,there is no need to review all of the results since we are only looking for medical terms.We thus retain only those likely to be of interest.We select them using an En-glish medical terminology—namely MeSH(as extracted from the2005AC version of the UMLS).We project this list of terms onto the English entries of our alignment pairs and select those present in the pairs.Only then do we re-view the alignments.These alignments constitute French candidate translations of English MeSH terms.The review can be done by a linguist engineer.Afterwards,we can determine the proportion of new translations retrieved and submit these translations to medical experts.In order to restrain manual interaction,we also tested a dif-ferent solution,that is,not reviewing the results directly after the term selection,but only after somefiltering—elimination of duplicates,verbs,translations already ex-isting in the French version of the MeSH,and alignments with a poor quality score.Indeed,we consider that MeSH terms are mainly nouns and that verbs are not needed in the selection—they introduce too much noise.Low qual-ity score alignments are also likely to be errors and might not be worth reviewing.Thisfiltering phase will reduce the amount of terms to be reviewed.2.6.Implementation and EvaluationThe methodology described above was implemented as fol-lows:1.conversion of the corpus into text format;2.sentence alignment;3.training of the automatic word aligner on a set of600sentences randomly taken from our corpus,by inter-acting manually with I*Link;4.automatic word alignment with I*Trix.If results seempoor,afirst review may also be done followed by a second run of I*Trix.5.selection of medical terms(see section2.5); Implementation1:6.review,with I*View,of the alignments for the termsselected.Implementation2:6.filtering of the results:–elimination of duplicates–elimination of verbs(the MESH entries are consid-ered to be nouns)–elimination of terms already registered in theFrench MeSH–elimination of pairs with a poor quality score(equalto0);however,there may be correct alignmentsamong the low quality score ones.In that case,asmall proportion of translations will be lost.Wehave tested the implementation both with this stepand without it.7.review of the results.Evaluation was performed at several points of the imple-mentation.First,we performed an evaluation of the quality of the alignment at step2by checking100sentences ran-domly taken from the corpus and measuring the percentage of correct alignments(precision measure).The quality of word alignment was evaluated at step4by measuring precision on two samples:sample1consists of 100word pairs randomly taken from the whole resulting pairing,and sample2of100word pairs taken from the best word alignments(alignments with a frequency higher than 1and a good quality score—equal to or higher than1). Last,step6of implementation1allowed us to have a gold standard to evaluate word alignment for the medical terms. We evaluated the performances using information retrieval evaluation techniques,namely precision-recall measures. Other teams have also used these measures for evaluating tasks aside from information retrieval—text categoriza-tion for instance(Larkey and Croft,1996).In information retrieval,precision is computed at11recall points from a list of retrieved documents.In our case,we used a ranked list of alignments instead of documents,considering that an alignment being correct is similar to a document be-ing relevant for a query.We used trec_eval6to compute these recall points and obtained a precision-recall curve. These measures are calculated on the basis of the align-ments ranked by frequency and quality score,meaning that thefirst alignments are expected to be the best ones.These recall-precision points also allowed us to measure the mean average precision.This step was also useful to determine the proportion of errors and correct alignments in thefiltered results at step 6of implementation2,thereby allowing us to experiment with the setting of a threshold for the quality score of the alignments to befiltered out.3.Results3.1.Valid Alignments(Language Engineer)We completed steps1and2on the whole corpus,thus ob-taining1.1million sentence pairs.As the corpus is huge,6./trec_eval/trec_eval.8.0.tar.gzL.Deléger,M.Merkel,P.Zweigenbaum11Sample1Sample2Set of med-ical terms Precision50%92.2%52%Errors19.6% 4.9%30.3%Partialalignments30.4% 2.9%17.7%Table1:Evaluationfigures for word alignmentwe have currently processed only part of it from step3to the end—a set of540pairs of documents(1.3million words)gathered in two correspondingfiles.From this set, we obtained91,171word alignments and selected10,392 pairs of medical terms.Among these pairs,there are2,567different source terms (a term can have several translations),so we have a mean value of4.05French translations per English term.5,403 alignments were confirmed as correct ones—by a lan-guage engineer(LD)—which gives a precision of52% (see table1)and a mean value of2.1correct translations per term.We count2,159different source terms in the con-firmed alignments,meaning that408terms only had incor-rect translations.Table1shows that evaluation results for the overall quality of word alignments(step4of2.6)are very good for the top alignments(sample2,taken from the7,366best alignments as described in2.6)and average for the whole aligned cor-pus(sample1).As for sentence alignment(step2of2.6), we achieved a precision of95%,which is excellent. Precision-recallfigures for the evaluation of the set of med-ical terms(as described in2.6,step6)with trec_eval are detailed in the table onfigure1and emphasize the previous statement that precision is excellent for thefirst alignments and decrease afterwards.To be more accurate,these re-call points were computed on a scale of10,000alignments instead of the standard of1,000documents used in infor-mation retrieval.The increasingly descending slope of the curve onfigure2shows that the ranking algorithm does push the majority of incorrect alignments towards the end of the list,with an inflection around60%recall,obtain-ing more than80%precision.The mean average precision measured is indeed82%.A proportion of the noise in word alignment can be at-tributed to errors in the sentence alignment process:17% of the incorrect alignments are due to bad sentence pairing. Other factors include errors in POS tagging,bad document pairing(in our case we observed some English-English document pairs)and low quality of the data—misspelling of words,insertion of spaces inside a word,missing spaces between words.eful Medical Translations(Medical Specialist) If we take a look at the resulting list of5,403confirmed medical term alignments,we notice306pairs that are not real translations but merely pairs of English words—i.e. the English words have not been translated.These are con-sidered correct alignments but are of no use for our pur-pose,so we simply ignore them.Among the remaining 5,097,we eliminate a number of duplicates(pairs that areTable2:Precision at11recall levels,measured with trec_eval on a scale of10,000alignmentsthe same but were not considered as such by the alignment tools due to case differences)and obtain3,607word pairs. As stated in2.5,we do not consider verbs as valid candidate translations and we eliminate them,thus lowering down the number of translations to3,255.In this set,we look at the number of different concepts(CUI in the UMLS),terms al-ready present in the French version of the MeSH and new translations(see table3).The translations include morpho-logical variants—for instance adjectives instead of prepo-sitional phrases—and synonyms.However we do not con-sider plural/singular and masculine/feminine morphologi-cal variants as new translations.A sample of145MeSH terms(seefigures in table3)was also extracted for validation purposes.79terms had new translations which were submitted to expert validation.64 have been validated.Examples of translations are given in table4.Complete set ofMeSH termsValidationsample Translation pairs3,255145 Concepts1,868138 Already regis-tered1,29966 New1,95679 New and valid64(81%) Table3:Figures for the MeSH translationsEnglish French Validbone cancer cancer des os Yesbreast milk lait maternel Yes reproduction rights droits de reproduction NoTable4:Translation examplesThe second implementation tested—i.e.,reviewing only thefiltered results—gives the followingfigures.From the10,392pairs of selected terms,we eliminate duplicates (8,699resulting pairs),verbs(7,985resulting pairs),and se-lect only the new translations(not registered in the French12LREC2006Workshop on Acquiring and Representing Multilingual,Specialized LexiconsMeSH),which gives us6,670candidate translations to be reviewed.Thanks to thefirst implementation,we can eas-ily determine the proportion of noise.Since we expect 1,956new translations,4,714should be eliminated.Among these,there are incorrect alignments(4,452)and English-English pairs(262).We can see that the precision is very low,but that was expected.Since we are only looking at new translations pairs,incorrect word alignments are bound to be considered as new.Most of the noise is therefore se-lected and we do have the same balance as in implementa-tion1.Evaluation of the alignment is here meaningless and should only be performed on the non-filtered results.But this implementation enables us to considerably reduce the amount of word pairs to be reviewed:we have to review 6,670alignments instead of10,392,that is,35.8%less. We also testedfiltering out alignments with a low quality score.We set the threshold to0:all alignments with a score equal to0are removed.With this criterion,we lower down the selection to4,493alignments,which means that we now have56.8%less to review and2,766noisy align-ments.Among the2,177removed,238were actually cor-rect.So there is a loss of information,but the proportion remains small(12%).Alignments with a score equal to0 have an89%chance of being incorrect,which can justify for their being removed.4.DiscussionOur approach presents a number of advantages as well as some drawbacks.It allows us to acquire medical terms which are actually used in French documents for certain MeSH descriptors but are not registered in the current French version of the MeSH.It does not require a human translator and makes the best of existing resources.In terms of word alignment,we are able to process noisy data quite efficiently.We do not use monolingual term extractors and we align single words and multiword expressions with a uniform approach unlike other methods which concentrate on1-1and2-2word alignments(Névéol and Ozdowska, 2005).Though being an automatic approach,it still needs human help in the process(training and validation).The success of this method is also heavily dependent on the efficiency of word alignment which is a complex task.However,the remaining processing of the rest of the medical web corpus, if done incrementally,could steadily increase the quality of word ing the techniques outlined in this pa-per to minimize the reviewing process it should be possible to rapidly include verified data in each step and include this as positive training data for each new iteration.If the cor-pus is divided into roughly25sets containing just over a million words per set,and these subcorpora were processed one by one,with a short reviewing process included,the confirmed entries of each run could be fed into the training data for the next run of the automatic alignment.We also assume that interactive training using I*Link will not be necessary for each subsequent iteration,which means that the manual time spent on each new iteration will decrease and the precision will likely increase due to new training data.The quality of the corpus is an important feature and its choice is a major issue.In our case,we used the Web as a resource and processed a whole website.A study of the documents would have been useful in order to best char-acterize the type of data acquired—which documents are intended for medical specialists,which ones are for the gen-eral public and which ones have no medical content(in-dex pages for instance).Interesting developments of this method will include the specific search for patient-oriented translations(consumer vocabulary)which are even more lacking in medical terminologies.This can be achieved, for instance,to look for candidate translations of Medline Plus vocabulary.5.ConclusionWe described a methodology to acquire new translations of English medical terms in order to enrich existing medical terminologies.We argued that a natural language process-ing technique such as word alignment is an efficient way to do so.Indeed,we were able tofind a number of new trans-lations of English MeSH terms.Moreover,it is an auto-matic process which only requires limited human interven-tion.Finally,this method raises interesting prospects such as the acquisition of patient vocabulary,and more generally its application to other parallel corpora.AcknowledgementsWe thank Stéfan Darmoni for reviewing the candidate MeSH translations,and Didier Bourigault for providing SYNTEX.6.ReferencesLars Ahrenberg,M Andersson,and Magnus Merkel.2000.A knowledge-lite approach to word alignment.In Jean Véronis,editor,Parallel Text Processing:Alignment and Use of Translation Corpora.Springer.Ana-Maria Barbu.2004.Simple linguistic methods for im-proving a word alignment algorithm.In Proceedings7th International Conference on the Statistical Analysis of Textual Data,pages88–98,Louvain-la-Neuve,Belgium. Presses universitaires de Louvain.Robert H Baud,C Lovis,AM Rassinoux,PA Michel,and Scherrer JR.1998.Automatic extraction of linguis-tic knowledge from an international classification.In C Safran B Cesnik and P Degoulet,editors,Proc9th World Congress on Medical Informatics,pages581–5. Didier Bourigault,Cécile Fabre,Cécile Fréerot,Marie-Paule Jacques,and Sylwia Ozdowska.2005.Syntex, analyseur syntaxique de corpus.In Proceedings Traite-ment automatique des langues naturelles(Traitement au-tomatique des langues naturelles),Dourdan.PF Brown,SAD Pietra,VJD Pietra,and RL Mercer. 1993.The mathematics of statistical machine transla-tion:parameter putational Linguistics, 19(2):263–311.Yun-Chuang Chiao and Pierre Zweigenbaum.2002.Look-ing for French-English translations in comparable medi-cal corpora.Journal of the American Medical Informat-ics Association,8(suppl):150–154.L.Deléger,M.Merkel,P.Zweigenbaum1314LREC2006Workshop on Acquiring and Representing Multilingual,Specialized Lexicons K Church and P Hanks.1990.Word association norms,mutual information,and putationalLinguistics,16:22–29.Béatrice Daille,Éric Gaussier,and Jean-Marie Langé.1994.Towards automatic extraction of monolingual andbilingual terminology.In Proceedings of the15th COL-ING,pages515–521,Kyoto,Japan,August.Stéfan J.Darmoni,Éric Jarrousse,Pierre Zweigenbaum,Pierre Le Beux,Fiammetta Namer,Robert Baud,MichelJoubert,Huguette Vallée,Roger A.Côté,AntoineBuemi,Didier Bourigault,Gaelle Recourcé,S.Jeanneau,and Jean-Marie Rodrigues.2003.Extending the Frenchpart of the UMLS.In Mark Musen,editor,ProceedingsAMIA Annual Fall Symposium2003,page824,Washing-ton,DC,November.AMIA.(poster).P Fabry,R Baud,P Ruch,C Despont-Gros,and C Lovis.2005.Methodology to ease the construction of a termi-nology of problems.Int J Med Inform.Éric Gaussier.1998.Flow network models for word align-ment and terminology extraction from bilingual corpora.In Christian Boitet,editor,Proceedings of the17thCOLING,Montréal,Canada,10–14August.L S Larkey and W B bining classifiersin text categorization.In Proceedings of SIGIR,pages289–297.ACM Press,New York.I.Dan Melamed.2000.Bitext maps and alignments viapattern recognition.In Jean Véronis,editor,Parallel TextProcessing:Alignment and Use of Translation Corpora.Springer.Magnus Merkel,M Petterstedt,and Lars Ahrenberg.2003.Interactive word alignment for corpus linguistics.InProceedings Corpus Linguistics.Aurélie Névéol and Sylwia Ozdowska.2005.Extractionbilingue de termes médicaux dans un corpus parallèleanglais/français.In Proceedings EGC’05.A Patry and Philippe Langlais.2005.Paradocs:un sys-tèeme d’identification automatique de documents paral-lèles.In Michèle Jardino,editor,Proceedings of TALN2005(Traitement automatique des langues naturelles),pages223–232,Dourdan,June.ATALA,LIMSI.Philip Resnik and N.A.Smith.2003.The Web as a parallelputational Linguistics,29:349–380.Spe-cial Issue on the Web as a Corpus.W Stolz.1965.A probabilistic procedure for groupingwords into nguage and Speech,8:219–235.D Widdows,B Dorrow,and ing par-allel corpora to enrich multilingual lexical resources.InProceedings LREC,pages240–244,Las Palmas,Spain,May.ELRA.Dekai Wu.2000.Bracketing and aligning words and con-stituents in parallel text using stochastic inversion trans-duction grammar.In Jean Véronis,editor,Parallel TextProcessing:Alignment and Use of Translation Corpora.Springer.。