Automatic Conceptual Indexing of French Pharmaceutical
- 格式:pdf
- 大小:53.52 KB
- 文档页数:5
AutomaticConceptualIndexingofFrench
Pharmaceuticaltheses
VincentMARY†,BrunoPOULIQUEN‡,FranckLEDUFF†
StefanJ.DARMONI,AlainSEGUI∝,PierreLEBEUX†
†Laboratoired’informatiquem´edicale,Facult´edeM´edecine-Rennes,France
‡EuropeanCommission,JointResearchCentre,IPSCIspra,Italie
Directiondel’InformatiqueetdesR´eseaux,CHU-Rouen,France
∝Laboratoiredemath´ematiquesetdephysiquepharmaceutique
Facult´edePharmacie-Rennes,France
Abstract.Frenchpharmaceuticalthesesarerarelyquoted.Ifthemainobstaclesorig-inatefromlanguageoraccessbarriers,properindexationcouldalsobeblamed.Man-uallyextractedkeywordsdon’tnecessarycomefromastructuredthesaurus.Inthefollowingwork,thismanualindexingmethodiscomparedtoanautomatedone,”No-mindex”,basedonUMLS.Theautomatedmethodisimprovedbytheadditionofarelevancescoringsystem.Thefirstindexingstepconsistsofdownloading,adaptingandindexingthesesinelectronicformat.Resultswillthenbeanalyzedandsortedbyrelevance,bycomparingclassicstatisticalindices(noise/silence/relevance).Itwasassumedthatthemanuallyobtainedkeywordswerealwaysrelevant.Thesilenceofmanualindexingisneverthelesshigh:sevennewkeywordsareproposedbyNomin-dex,whoseresultsaremixed(10%ofsilence,but50%ofnoise).Thesesresultsarepromisingforthefirstexperiment:pharmaceuticaldocumentwithoutlexiconim-provement.Theindexing,ifitiscurrentlyinsufficientforareallifeuse,couldeasilybeimprovedbyspecificupdatesofthelexicon.
1Introduction
OverathousandpharmaceuticalthesesarepublishedeveryyearinFrance,whichrepresents
averylargeamountofacademicwork.Thesepublicationscanbebasedonbibliography,
laboratoryorresearchwork,andarevalidatedbyapprovedprofessors.Itcouldthereforebe
referredtoonthesamegroundasapublicationintheclassicsense.Pharmacyisinvolved
inmanyaspectsofmodernmedicine.However,pharmaceuticalthesesarerarelyreferredto
inFrenchorinternationalpublications.Thethesescatalogueisupdatedevery6month,and
thesesareonlyexceptionallyavailableonline.Thisaddstotheproblemofthelanguage.
Anewautomatedindexingmethod,”Nomindex”[1],isproposedhere,whichcombinesa
conceptualautomaticindexingmethod,viatheUMLSmetathesaurus[2][3],andasimplified
searchthroughthemetathesaurussemanticnetworkandthecomputationoftherelevance
scoreTFIDF[4].Theindexingisalsomadeavailableonlinewithawebbrowser.
2Objective
Themainobjectiveofthisworkistocomparethemanualindexingsystemtoitsautomated
counterpart,onasampleofthesesdownloadedfromtheinternet.Thefirststepconsistsin2VincentMARYandal.
downloading,adaptingandindexingthesesinelectronicformat.Resultswillthenbeanalyzed
andsortedbyrelevance,byusingclassicstatisticalmeasures(noise/silence/relevance).
3StateoftheArt
3.1Docth`ese/Sudoc
Docthesewas,uptorecently,thesystemholdingthecatalogueofFrenchtheses.Itwasup-
datedbyCDROMevery6monthsandwasavailableatuniversitylibraries.Sudoc1(managedbyABES2)wasdevelopedinparallelandaimedtoreplaceDocthese.Availableonline,SU-
DOCoffersthesesdetailssuchasauthor’sname,title,director’snameand,usually,fouror
fiverelevantkeywords.ThesekeywordsarenotnecessarilyconsistentwiththeMeSH[5],
orwithanotherthesaurus.Thisindexingisperformedbytheauthorandthedirector,andis
validatedbyalibrarian.SUDOCisupdatedaccordingtothepublicationofnewtheses.
3.2IndexingEngine
Thepurposeoftheindexingengineistorecognizeconceptsfromatextandtousethemto
createadatabaseallowingtosearchdocuments.
TheengineusealocallexiconextractedfromtheADM[6]knowledgebase.Keywordsare
groupedbyconceptsconsideringassociatedwords,coumpoundwords,prefixesandsuffixes.
EachconceptcorrespondstoanotionfromtheFrenchMeSH.Weidentifyeachconceptusing
theUMLSCUI(ConceptUniqueIdentifier).
Arelevancescore,TFIDF[7],isattachedtoeachconceptofoundintoadocument.This
scoreweightstheconceptsaccordingtotheirfrequencyinthedocument,relativelytotheir
frequencyinthewholecorpus.Afterafirstsearch,itispossibletolookforthebroaderand
narrowconcepts,ortolookforthesimilardocuments(i.e.thedocumentssharingthesame
concepts).
4MaterialandMethod
4.1Downloadingthetheses,convertingtoHTML,indexing.
Pharmaceuticalthesesareavailableonlyintheirhardcopyforms,makingitdifficulttocon-
vertbacktoanelectronicformat.Ourcatalogueofthesesisthereforecomposedoftheses
downloadedfrompersonalsitesorwiththehelpofprofessorsofpharmaceuticaluniversities.
Theyareusuallyintwodifferentformats:MSWorddocumentorPDF.
TheindexingengineonlyworkswithHTMLformateddocuments.Word97allowsadirect
conversionbetweenworddocumentsandHTML.TwomodulesareofferedbyAdobetodeal
withthePDFdocuments:makeaccessiblepluginandsaveasXMLplugin.Thetheseswere
savedinHTML3.2format,withoutCSS(CascadingStyleSheets).Thisformatretainsthe
H1andH2tags,usedbytheenginetoweighttheconcepts.Thethesesarefedtotheengine,
whichextractstheconcepts.Therelevancescoreiscomputedonceallthetheseshavebeen
1Syst`emeUniversitairedeDocumentation2AgenceBibliographiquedel’EnseignementSup´erieur:http://corail.abes.sudoc.fr