Automatic Conceptual Indexing of French Pharmaceutical

格式：pdf
大小：53.52 KB
文档页数：5

下载文档原格式

/ 5

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

AutomaticConceptualIndexingofFrench

Pharmaceuticaltheses

VincentMARY†,BrunoPOULIQUEN‡,FranckLEDUFF†

StefanJ.DARMONI󰀂,AlainSEGUI∝,PierreLEBEUX†

†Laboratoired’informatiquem´edicale,Facult´edeM´edecine-Rennes,France

‡EuropeanCommission,JointResearchCentre,IPSCIspra,Italie

󰀂Directiondel’InformatiqueetdesR´eseaux,CHU-Rouen,France

∝Laboratoiredemath´ematiquesetdephysiquepharmaceutique

Facult´edePharmacie-Rennes,France

Abstract.Frenchpharmaceuticalthesesarerarelyquoted.Ifthemainobstaclesorig-inatefromlanguageoraccessbarriers,properindexationcouldalsobeblamed.Man-uallyextractedkeywordsdon’tnecessarycomefromastructuredthesaurus.Inthefollowingwork,thismanualindexingmethodiscomparedtoanautomatedone,”No-mindex”,basedonUMLS.Theautomatedmethodisimprovedbytheadditionofarelevancescoringsystem.Theﬁrstindexingstepconsistsofdownloading,adaptingandindexingthesesinelectronicformat.Resultswillthenbeanalyzedandsortedbyrelevance,bycomparingclassicstatisticalindices(noise/silence/relevance).Itwasassumedthatthemanuallyobtainedkeywordswerealwaysrelevant.Thesilenceofmanualindexingisneverthelesshigh:sevennewkeywordsareproposedbyNomin-dex,whoseresultsaremixed(10%ofsilence,but50%ofnoise).Thesesresultsarepromisingfortheﬁrstexperiment:pharmaceuticaldocumentwithoutlexiconim-provement.Theindexing,ifitiscurrentlyinsufﬁcientforareallifeuse,couldeasilybeimprovedbyspeciﬁcupdatesofthelexicon.

1Introduction

OverathousandpharmaceuticalthesesarepublishedeveryyearinFrance,whichrepresents

averylargeamountofacademicwork.Thesepublicationscanbebasedonbibliography,

laboratoryorresearchwork,andarevalidatedbyapprovedprofessors.Itcouldthereforebe

referredtoonthesamegroundasapublicationintheclassicsense.Pharmacyisinvolved

inmanyaspectsofmodernmedicine.However,pharmaceuticalthesesarerarelyreferredto

inFrenchorinternationalpublications.Thethesescatalogueisupdatedevery6month,and

thesesareonlyexceptionallyavailableonline.Thisaddstotheproblemofthelanguage.

Anewautomatedindexingmethod,”Nomindex”[1],isproposedhere,whichcombinesa

conceptualautomaticindexingmethod,viatheUMLSmetathesaurus[2][3],andasimpliﬁed

searchthroughthemetathesaurussemanticnetworkandthecomputationoftherelevance

scoreTFIDF[4].Theindexingisalsomadeavailableonlinewithawebbrowser.

2Objective

Themainobjectiveofthisworkistocomparethemanualindexingsystemtoitsautomated

counterpart,onasampleofthesesdownloadedfromtheinternet.Theﬁrststepconsistsin2VincentMARYandal.

downloading,adaptingandindexingthesesinelectronicformat.Resultswillthenbeanalyzed

andsortedbyrelevance,byusingclassicstatisticalmeasures(noise/silence/relevance).

3StateoftheArt

3.1Docth`ese/Sudoc

Docthesewas,uptorecently,thesystemholdingthecatalogueofFrenchtheses.Itwasup-

datedbyCDROMevery6monthsandwasavailableatuniversitylibraries.Sudoc1(managedbyABES2)wasdevelopedinparallelandaimedtoreplaceDocthese.Availableonline,SU-

DOCoffersthesesdetailssuchasauthor’sname,title,director’snameand,usually,fouror

ﬁverelevantkeywords.ThesekeywordsarenotnecessarilyconsistentwiththeMeSH[5],

orwithanotherthesaurus.Thisindexingisperformedbytheauthorandthedirector,andis

validatedbyalibrarian.SUDOCisupdatedaccordingtothepublicationofnewtheses.

3.2IndexingEngine

Thepurposeoftheindexingengineistorecognizeconceptsfromatextandtousethemto

createadatabaseallowingtosearchdocuments.

TheengineusealocallexiconextractedfromtheADM[6]knowledgebase.Keywordsare

groupedbyconceptsconsideringassociatedwords,coumpoundwords,preﬁxesandsufﬁxes.

EachconceptcorrespondstoanotionfromtheFrenchMeSH.Weidentifyeachconceptusing

theUMLSCUI(ConceptUniqueIdentiﬁer).

Arelevancescore,TFIDF[7],isattachedtoeachconceptofoundintoadocument.This

scoreweightstheconceptsaccordingtotheirfrequencyinthedocument,relativelytotheir

frequencyinthewholecorpus.Afteraﬁrstsearch,itispossibletolookforthebroaderand

narrowconcepts,ortolookforthesimilardocuments(i.e.thedocumentssharingthesame

concepts).

4MaterialandMethod

4.1Downloadingthetheses,convertingtoHTML,indexing.

Pharmaceuticalthesesareavailableonlyintheirhardcopyforms,makingitdifﬁculttocon-

vertbacktoanelectronicformat.Ourcatalogueofthesesisthereforecomposedoftheses

downloadedfrompersonalsitesorwiththehelpofprofessorsofpharmaceuticaluniversities.

Theyareusuallyintwodifferentformats:MSWorddocumentorPDF.

TheindexingengineonlyworkswithHTMLformateddocuments.Word97allowsadirect

conversionbetweenworddocumentsandHTML.TwomodulesareofferedbyAdobetodeal

withthePDFdocuments:makeaccessiblepluginandsaveasXMLplugin.Thetheseswere

savedinHTML3.2format,withoutCSS(CascadingStyleSheets).Thisformatretainsthe

H1andH2tags,usedbytheenginetoweighttheconcepts.Thethesesarefedtotheengine,

whichextractstheconcepts.Therelevancescoreiscomputedonceallthetheseshavebeen

1Syst`emeUniversitairedeDocumentation2AgenceBibliographiquedel’EnseignementSup´erieur:http://corail.abes.sudoc.fr

Automatic Conceptual Indexing of French Pharmaceutical

合集下载

文档推荐

最新文档