© World Scientific Publishing Company. CHINESE WORD SEARCHING IN IMAGED DOCUMENTS


International Journal of Pattern Recognition and Artificial Intelligence Vol. 18, No. 2 (2004) 229–246
© World Scientific Publishing Company

CHINESE WORD SEARCHING IN IMAGED DOCUMENTS

YUE LU∗ and CHEW LIM TAN†

Department of Computer Science, School of Computing, National University of Singapore, Kent Ridge, Singapore 117543
∗luy@comp.nus.edu.sg
†tancl@comp.nus.edu.sg

An approach to searching for user-specified words in imaged Chinese documents, without the requirements of layout analysis and OCR processing of the entire documents, is proposed in this paper. A small number of Chinese characters that cannot be successfully bounded using connected component analysis due to larger gaps between elements within the characters are blacklisted. A suitable character that is not included in the blacklist is chosen from the user-specified word as the initial character to search for a matching candidate in the document. Once a matched candidate is found, the adjacent characters in the horizontal and vertical directions are examined for matching with other corresponding characters in the user-specified word, subject to the constraints of alignment (either horizontal or vertical direction) and size similarity. A weighted Hausdorff distance is proposed for the character matching. Experimental results show that the present method can effectively search the user-specified Chinese words from the document images with the format of either horizontal or vertical text lines, or both appearing on the same image.

Keywords: Chinese document image; word searching; character segmentation; character matching; weighted Hausdorff distance.

1. Introduction

With the advent and rapid development of computer technology, digital documents have become popular for storage and transmission instead of the traditional paper documents. The most widespread format of these digital documents is text in which the characters of the documents are represented by machine-readable codes (for example, ASCII codes). Undoubtedly, the text format facilitates not only the storage and transmission of documents, but also document information retrieval. There has been active research on Web content extraction using text-based techniques, on which the text retrieval community has made significant progress [20, 21, 28].

Although most of the newly generated documents are in text format, billions of volumes of traditional newspapers, books, magazines, etc. are in the paper format, and need to be transferred to the digital domain. In general, these paper documents are easily scanned and converted to digital images by the digitization equipment. The technology of Optical Character Recognition (OCR) may be utilized to automatically transfer the digital images of these paper documents to their machine-readable text format. However, OCR has its inherent weaknesses in the recognition ability, especially for the recognition of Chinese characters. Although many commercial products of machine printed Chinese character recognition systems have appeared on the market in recent years [5], the process still carries a significant cost. An obvious reason is the need for manual correction/proofing of OCR results, which is typically not cost effective for transferring a huge amount of paper documents to their text format.

It is generally acknowledged that the OCR accuracy requirements for information retrieval are considerably lower than for many document processing applications [23]. Motivated by this observation, methods that tolerate OCR recognition errors have been researched recently [10, 19]. Additionally, some methods are reported to improve the retrieval performance by using the OCR candidates [12, 13]. In these OCR-based methods, layout analysis and character segmentation are unavoidable. However, the research on layout analysis and understanding is still immature, especially for documents with complex layout.

Nowadays, many documents are stored in the image format for various reasons. It is therefore worthwhile to study strategies for retrieving information from these document images directly. An image-based approach to searching for and locating a user-specified word in the document images has practical value for retrieving information from document repositories, such as digital library archives or the Internet. For example, by using this technique the user can detect a specified word in the document images without the requirement of OCRing the entire document.
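The search procedure outlined in the abstract can be sketched roughly as follows. This is only an illustrative sketch under strong assumptions, not the authors' implementation: character bounding boxes are taken as already available, the `Box` type, the `tol` tolerance, the two-character gap limit and the caller-supplied `match` predicate are all hypothetical, and blacklisted characters are simply skipped when choosing the initial character.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Hypothetical character bounding box; `glyph` stands in for its image."""
    x: float
    y: float
    w: float
    h: float
    glyph: str

def similar_size(a, b, tol=0.3):
    # Size-similarity constraint (the tolerance value is an assumption).
    return (abs(a.w - b.w) <= tol * max(a.w, b.w)
            and abs(a.h - b.h) <= tol * max(a.h, b.h))

def next_box(boxes, ref, sign, direction, tol=0.3):
    """Nearest aligned, similarly sized box before (sign=-1) or after (sign=+1) ref."""
    best, best_gap = None, None
    for b in boxes:
        if b is ref or not similar_size(b, ref, tol):
            continue
        dx = (b.x + b.w / 2) - (ref.x + ref.w / 2)
        dy = (b.y + b.h / 2) - (ref.y + ref.h / 2)
        if direction == "horizontal":
            along, across, pitch, span = dx, dy, ref.w, ref.h
        else:
            along, across, pitch, span = dy, dx, ref.h, ref.w
        # Wrong side, misaligned, or too far away (gap limit is an assumption).
        if along * sign <= 0 or abs(across) > tol * span or abs(along) > 2 * pitch:
            continue
        if best is None or abs(along) < best_gap:
            best, best_gap = b, abs(along)
    return best

def trace(boxes, anchor, word, i0, direction, match):
    """Extend a matched initial character to the whole word along one direction."""
    found = {i0: anchor}
    for sign, indices in ((1, range(i0 + 1, len(word))),
                          (-1, range(i0 - 1, -1, -1))):
        cur = anchor
        for i in indices:
            cur = next_box(boxes, cur, sign, direction)
            if cur is None or not match(cur, word[i]):
                return None
            found[i] = cur
    return [found[i] for i in range(len(word))]

def search_word(boxes, word, blacklist, match):
    """Return every box sequence that spells `word` horizontally or vertically."""
    # Initial character: the first one that is not blacklisted.
    i0 = next((i for i, ch in enumerate(word) if ch not in blacklist), None)
    if i0 is None:
        return []
    hits = []
    for box in boxes:
        if not match(box, word[i0]):
            continue
        if len(word) == 1:
            hits.append([box])
            continue
        for direction in ("horizontal", "vertical"):
            seq = trace(boxes, box, word, i0, direction, match)
            if seq is not None:
                hits.append(seq)
    return hits
```

In practice `match` would compare character images rather than labels; a caller might pass something like `lambda box, ch: box.glyph == ch` only for experimentation with synthetic layouts.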

In the case where there are a large number of image documents in the Internet or digital libraries, the user has to download each one to see its contents before knowing whether the document is relevant to his interest. The image-based word searching technology is capable of notifying the user whether a document image contains words of interest to him, and indicating which documents are worth downloading, prior to transmitting the document images through the Internet.

The image-based word searching method is somewhat different from the method of OCRing the document images followed by analysis of the resulting text. The former concentrates only on the presence of the user-specified word while paying no attention to the unrelated words. The latter must assign each character in the document to a definite class in the OCR procedure no matter whether the character is relevant to the user's interest or not. Of course, compared with the image-based approach, OCR has the advantage of producing machine-readable text for subsequent use. However, if the goal is only to retrieve a handful of documents from a huge collection of document images, it makes sense not to OCR the entire collection which will never be used again.
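To make the contrast concrete, an image-based matcher only has to score how closely a candidate region resembles the query character, rather than assign it to a class. A minimal sketch of such a score is the classical (unweighted) Hausdorff distance between the foreground pixels of two small binary bitmaps. The weighted variant proposed in this paper is not reproduced here, and the function names and the list-of-lists bitmap representation are assumptions made for illustration.

```python
import math

def foreground_points(bitmap):
    """(row, col) coordinates of the nonzero pixels in a 2-D list of 0/1 values."""
    return [(r, c)
            for r, row in enumerate(bitmap)
            for c, value in enumerate(row)
            if value]

def directed_hausdorff(a, b):
    """h(A, B): the largest distance from a point of A to its nearest point of B."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Classical symmetric Hausdorff distance H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

A smaller distance means the two shapes are more alike; a search would accept a candidate when the distance falls below some threshold, whose value would have to be chosen empirically.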

In recent years, a number of attempts have also been made to avoid the use of OCR for various text processing applications [1, 6], but these have mainly addressed English text documents. The Chinese language is quite different from the western languages.
