c ○ World Scientific Publishing Company CHINESE WORD SEARCHING IN IMAGED DOCUMENTS
- 格式:pdf
- 大小:527.28 KB
- 文档页数:18
InternationalJournalofPatternRecognitionandArtificialIntelligenceVol.18,No.2(2004)229–246cWorldScientificPublishingCompany
CHINESEWORDSEARCHINGINIMAGEDDOCUMENTS
YUELU∗andCHEWLIMTAN†
DepartmentofComputerScience,SchoolofComputing,NationalUniversityofSingapore,KentRidge,Singapore117543∗luy@comp.nus.edu.sg†tancl@comp.nus.edu.sg
Anapproachtosearchingforuser-specifiedwordsinimagedChinesedocuments,with-outtherequirementsoflayoutanalysisandOCRprocessingoftheentiredocuments,isproposedinthispaper.AsmallnumberofChinesecharactersthatcannotbesuccess-fullyboundedusingconnectedcomponentanalysisduetolargergapsbetweenelementswithinthecharactersareblacklisted.Asuitablecharacterthatisnotincludedintheblacklistischosenfromtheuser-specifiedwordastheinitialcharactertosearchforamatchingcandidateinthedocument.Onceamatchedcandidateisfound,theadja-centcharactersinthehorizontalandverticaldirectionsareexaminedformatchingwithothercorrespondingcharactersintheuser-specifiedword,subjecttotheconstraintsofalignment(eitherhorizontalorverticaldirection)andsizesimilarity.AweightedHausdorffdistanceisproposedforthecharactermatching.Experimentalresultsshowthatthepresentmethodcaneffectivelysearchtheuser-specifiedChinesewordsfromthedocumentimageswiththeformatofeitherhorizontalorverticaltextlines,orbothappearingonthesameimage.
Keywords:Chinesedocumentimage;wordsearching;charactersegmentation;charactermatching;weightedHausdorffdistance.
1.Introduction
Withtheadventandrapiddevelopmentofcomputertechnology,digitaldocuments
havebecomepopularforstorageandtransmissioninsteadofthetraditionalpaper
documents.Themostwidespreadformatofthesedigitaldocumentsistextinwhich
thecharactersofthedocumentsarerepresentedbymachine-readablecodes(forex-
ample,ASCIIcodes).Undoubtedly,thetextformatfacilitatesnotonlythestorage
andtransmissionofdocuments,butalsodocumentinformationretrieval.There
hasbeenactiveresearchonWebcontentextractionusingtext-basedtechniques,on
whichthetextretrievalcommunityhasmadesignificantprogress.20,21,28
Althoughmostofthenewlygenerateddocumentsareintextformat,billionsof
volumesoftraditionalnewspapers,books,magazines,etc.areinthepaperformat,
andneedtobetransferredtothedigitaldomain.Ingeneral,thesepaperdocu-
mentsareeasilyscannedandconvertedtodigitalimagesbythedigitizationequip-
ment.ThetechnologyofOpticalCharacterRecognition(OCR)maybeutilizedto
229230Y.Lu&C.L.Tan
automaticallytransferthedigitalimagesofthesepaperdocumentstotheirmachine-
readabletextformat.However,OCRhasitsinherentweaknessesintherecognition
ability,especiallyfortherecognitionofChinesecharacters.Althoughmanycom-
mercialproductsofmachineprintedChinesecharacterrecognitionsystemshave
appearedonthemarketinrecentyears,5theprocessstillcarriesasignificantcost.
Anobviousreasonistheneedformanualcorrection/proofingofOCRresults,which
istypicallynotcosteffectivefortransferringahugeamountofpaperdocumentsto
theirtextformat.
ItisgenerallyacknowledgedthattheOCRaccuracyrequirementsforin-
formationretrievalareconsiderablylowerthanformanydocumentprocessing
applications.23Motivatedbythisobservation,themethodwiththeabilityoftol-
eratingrecognitionerrorsofOCRhasbeenresearchedrecently.10,19Additionally,
somemethodsarereportedtoimprovetheretrievalperformancebyusingtheOCR
candidates.12,13IntheseOCR-basedmethods,thelayoutanalysisandcharacter
segmentationareunavoidable.However,theresearchonlayoutanalysisandunder-
standingisstillimmature,especiallyfordocumentswithcomplexlayout.
Nowadays,manydocumentsarestoredintheimageformatduetovariousrea-
sons.Itisofsignificantmeaningtostudythestrategiesofretrievinginformation
fromthesedocumentimagesdirectly.Animage-basedapproachtosearchingand
locatingauserspecifiedwordinthedocumentimageshasitspracticalvalueforre-
trievinginformationfromthedocumentrepositories,suchasdigitallibraryarchives
ortheInternet.Forexample,byusingthistechniquetheusercandetectaspeci-
fiedwordinthedocumentimageswithouttherequirementofOCRingtheentire
document.
InthecasewheretherearealargenumberofimagedocumentsintheInternetor
digitallibraries,theuserhastodownloadeachonetoseeitscontentsbeforeknowing
whetherthedocumentisrelevanttohisinterest.Theimage-basedwordsearching
technologyiscapableofnotifyingtheuserwhetheradocumentimagecontains
wordsofinteresttohim,andindicatingwhichdocumentsareworthdownloading,
priortotransmittingthedocumentimagesthroughtheInternet.
Theimage-basedwordsearchingmethodissomewhatdifferentfromthemethod
ofOCRingthedocumentimagesfollowedbyanalysisoftheresultingtext.The
formerconcentratesonlyonthepresenceoftheuserspecifiedwordwhilepaying
noattentiontotheunrelatedwords.Thelattermustassigneachcharacterinthe
documenttoadefiniteclassintheOCRprocedurenomatterwhetherthecharacter
isrelevanttotheuser’sinterestornot.Ofcourse,comparingwiththeimage-based
approach,OCRhasanadvantageofproducingmachinereadabletextforsubsequent
use.However,ifthegoalisonlytoretrieveahandfulofdocumentsfromahuge
collectionofdocumentimages,itmakessensenottoOCRtheentirecollection
whichwillneverbeusedagain.
Inrecentyears,anumberofattemptshavealsobeenmadetoavoidtheuseof
OCRforvarioustextprocessingapplications,1,6buthavebeenmainlyonEnglish
textdocuments.Chineselanguageisquitedifferentfromthewesternlanguages.