Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names

Universal-Publishers, 2008 - Computers - 292 pages
Automatic toponym resolution is the task of computing the mapping from occurrences of place names in a text to an unambiguous spatial footprint of the location referred to, such as a geographic latitude/longitude centroid. The task is difficult to automate because geographic databases are insufficient and error-prone, and because place names are highly ambiguous: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is itself ambiguous (London can refer to the capital of the UK, to London, Ontario, Canada, or to about forty other Londons on earth). This thesis investigates how referentially ambiguous spatial named entities can be grounded, or resolved, robustly with respect to an extensional coordinate model on open-domain news text, by collecting a repertoire of linguistic heuristics and extra-linguistic knowledge sources such as population. I then investigate how to combine these sources of evidence to obtain a superior method. Noise effects introduced by the named entity tagging that toponym resolution relies on are also studied. While a few attempts have been made to solve toponym resolution, these were either not evaluated, or evaluated by manual inspection of system output rather than against a re-usable reference corpus. A systematic comparison leads to an inventory of heuristics and other sources of evidence. A comparative evaluation requires an evaluation resource, so a reference gazetteer and an associated novel reference corpus with human-labelled referent annotation were created for this thesis. They are used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory.
The performance of the same resolution algorithms is compared under two conditions: applying them to the output of human named entity annotation, and applying them to automatic annotation produced by an existing Maximum Entropy sequence tagging model.
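To make the idea of resolving an ambiguous place name with an extra-linguistic knowledge source concrete, here is a minimal sketch of a population-based heuristic: among all gazetteer candidates for a name, pick the most populous one. It is an illustration only, not the thesis's implementation. The `Candidate` type, the `GAZETTEER` dictionary, the function name `resolve_by_population`, and the rounded coordinates and population figures are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One possible referent of a place name (hypothetical record layout)."""
    name: str
    country: str
    lat: float
    lon: float
    population: int


# Toy gazetteer: hypothetical structure with rough, illustrative values only.
GAZETTEER = {
    "London": [
        Candidate("London", "United Kingdom", 51.51, -0.13, 8_000_000),
        Candidate("London", "Canada", 42.98, -81.25, 400_000),
    ],
}


def resolve_by_population(toponym: str) -> Optional[Candidate]:
    """Return the most populous candidate for a name, or None if unknown."""
    candidates = GAZETTEER.get(toponym, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.population)


if __name__ == "__main__":
    best = resolve_by_population("London")
    if best is not None:
        print(best.country, best.lat, best.lon)
```

A single heuristic like this ignores the surrounding text entirely, which is one reason to combine several sources of evidence rather than rely on any one of them.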
 


Contents

Background
2.2 Geographic Information Systems (GIS) and Spatial Databases
2.3 Gazetteers
2.4 Textual Information Access and Natural Language Processing
2.4.2 Information Retrieval
2.4.3 Information Extraction
2.4.4 Question Answering
2.4.5 Word Sense Disambiguation
2.5 The Language of Geographic Space
2.6 Chapter Summary
Previous and Related Work
3.2 Previous Work in Toponym Resolution
TR for Speech Data
Smith and Crane (2001): Centroid-based TR
A Hybrid Approach to TR
Confidence-based TR
Multilingual TR and Mapping
Web-a-Where
Cost Optimisation for German TR
3.3 Comparative Analysis
3.4 Chapter Summary
Dataset
4.2 Corpus Sampling
Global News from REUTERS
FBIS Central American Intelligence Reports
4.3.2 Problems of Gazetteer Selection
4.3.3 Gazetteer Ambiguity and Heterogeneity
4.4 Gazetteer
4.5 Document Annotation
4.5.2 Tool-Chain and Markup Process
Corpus Profile
4.6.2 TR-MUC4
4.6.4 Toponym Distribution in Documents
4.6.5 Referential Ambiguity in the Corpora
4.7 Chapter Summary
Methods
5.3 A New Algorithm Based on Two Minimality Heuristics
5.4 Machine Learning Methods
5.4.2 Decision Tree Induction (DTI)
Learning Voting Weights
Design and Implementation
5.5.2 Design
5.6 Chapter Summary
Evaluation
6.2 Evaluation Methodology
6.2.2 Task-Specific Evaluation Metrics
6.3 Component Evaluation Using a Named Entity Oracle (in vitro)
6.4 Component Evaluation Over System Output (in vivo)
6.4.1 Using a Maximum Entropy NERC Model
6.5 Discussion
6.6 Chapter Summary
Applications
Generating Map Surrogates for Stories
Geo-Spatial News Browsing
Spatial Filtering for Document Retrieval
7.4.2 Evaluation in a GEOCLEF Context
7.4.3 Discussion
Knowledge-Based Approach
7.6 Chapter Summary
Summary and Conclusion
8.2 Future Work
8.3 Conclusions
Notational Conventions
Annotation Guidelines
Minimal Bounding Rectangles Extracted from NGA
TR-CoNLL Sample Used in Prose Only Evaluation
TR-CoNLL Evaluation All Documents Used
Performance Plots for Individual Heuristics
Stories Used in the Visualization Study
G.2 Story: News Crews Wait and Watch as Police Search Home of Missing Woman
Distance Queries
ADL Feature Type Thesaurus
Zusammenfassung in deutscher Sprache (Summary in German)
Bibliography