Guide to OCR for Indic Scripts: Document Recognition and Retrieval (Google eBook)

Venu Govindaraju, Srirangaraj (Ranga) Setlur
Springer Science & Business Media, Sep 25, 2009 - Computers - 325 pages
The original motivations for developing optical character recognition technologies were modest: to convert printed text on flat physical media to digital form, producing machine-readable digital content. By doing this, words that had been inert and bound to physical material would be brought into the digital realm and thus gain new and powerful functionalities and analytical possibilities.

First-generation digital OCR researchers in the 1970s quickly realized that by limiting their ambitions primarily to contemporary documents printed in standard font type from the modern Roman alphabet (and of these, mostly English language materials), they were constraining the possibilities for future research and technologies considerably. Domain researchers also saw that the trajectory of OCR technologies if left unchanged would exclude a large portion of the human record. Digital conversion of documents and manuscripts in other alphabets, scripts, and cursive styles was of critical importance. Embedded in non-Roman alphabet source documents, including ancient manuscripts, papyri scrolls, clay tablets, and other inscribed artifacts was not only a wealth of scholarly information but also new opportunities and challenges for advancing OCR, imaging sciences, and other computational research areas. The limiting circumstances at the time included the rudimentary capability (and high cost) of computational resources and lack of network-accessible digital content. Since then computational technology has advanced at a very rapid pace and networking infrastructure has proliferated. Over time, this exponential decrease in the cost of computation, memory, and communications bandwidth combined with the exponential increase in Internet-accessible digital content has transformed education, scholarship, and research. Large numbers of researchers, scholars, and students use and depend upon Internet-based content and computational resources.

The chapters in this book describe a critically important area of investigation: addressing conversion of Indic script into machine-readable form. Rough estimates have it that currently more than a billion people use Indic scripts. Collectively, Indic historic and cultural documents contain a vast richness of human knowledge and experience. The state-of-the-art research described in this book demonstrates the multiple values associated with these activities. Technically, the problems associated with Indic script recognition are very difficult and will contribute to and inform related script recognition efforts. The work also has enormous consequence for enriching and enabling the study of Indic cultural heritage materials and the historic record of its people. This in turn broadens the intellectual context for domain scholars focusing on other societies, ancient and modern.

Digital character recognition has brought about another milestone in collective communication by bringing inert, fixed-in-place, text into an interactive digital realm. In doing so, the information has gained additional functionalities which expand our abilities to connect, combine, contextualize, share, and collaboratively pursue knowledge making. High-quality Internet content continues to grow in an explosive fashion. In the new global cyber environment, the functionalities and applications of digital information continue to transform knowledge into new understandings of human experience and the world in which we live. The possibilities for the future are limited only by available research resources and capabilities and the imagination and creativity of those who use them.

Arlington, Virginia Stephen M.
  

Contents

Building Data Sets for Indian Language OCR Research
Bangla and Devanagari
A Complete Machine-Printed Gurmukhi OCR System
Progress in Gujarati Document Processing and Character Recognition
Design of a Bilingual Kannada-English OCR
Recognition of Malayalam Documents
A Complete OCR System for Tamil Magazine Documents
Experiments on Urdu Text Recognition
The BBN Byblos Hindi OCR System
Generalization of Hindi OCR Using Adaptive Segmentation and Font Files
Online Handwriting Recognition for Indic Scripts
Part II Retrieval of Indic Documents
Enhancing Access to Primary Cultural Heritage Materials of India
Digital Image Enhancement of Indic Historical Manuscripts
GFG-Based Compression and Retrieval of Document Images in Indian Scripts
Word Spotting for Indic Documents to Facilitate Retrieval
Indian Language Information Retrieval
Colour Plates

About the author (2009)

Dr Venu Govindaraju is a UB Distinguished Professor of Computer Science and Engineering at the University at Buffalo (SUNY Buffalo) and the founder of the Center for Unified Biometrics and Sensors (CUBS). He has coauthored more than 300 reviewed technical papers, four U.S. patents and two books.