Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names

Automatic toponym resolution is the task of mapping each occurrence of a place name in a text to an unambiguous spatial footprint of the location referred to, such as a latitude/longitude centroid. The task is difficult because geographic databases are incomplete and error-prone, and because place names are highly ambiguous: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is itself ambiguous (London can refer to the capital of the UK, to London, Ontario, Canada, or to about forty other Londons on earth). This thesis investigates how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text by collecting a repertoire of linguistic heuristics and extra-linguistic knowledge sources such as population. I then investigate how to combine these sources of evidence to obtain a superior method. Noise effects introduced by the named entity tagging that toponym resolution relies on are also studied.

Although a few attempts have been made to solve toponym resolution, these were either not evaluated, or evaluated by manual inspection of system output rather than against a re-usable reference corpus. A systematic comparison of this earlier work leads to an inventory of heuristics and other sources of evidence. A comparative evaluation requires an evaluation resource, so a reference gazetteer and an associated novel reference corpus with human-labelled referent annotation were created for this thesis. They are used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory. Performance of the same resolution algorithms is compared under different conditions, namely applying them to the output of human named entity annotation and to automatic annotation using an existing Maximum Entropy sequence tagging model.
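As a rough illustration of what a population-based knowledge source could look like in practice, the following minimal Python sketch resolves a toponym to the candidate referent with the largest population. It is not the thesis's implementation: the `Referent` class, the `GAZETTEER` dictionary and the `resolve_max_population` function are hypothetical names, and the coordinates and population figures are rounded, illustrative values rather than data from the reference gazetteer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Referent:
    """One candidate location for a place name (hypothetical structure)."""
    label: str
    lat: float
    lon: float
    population: int


# Toy gazetteer: rounded, illustrative values only (an assumption,
# not taken from the thesis or from any particular database).
GAZETTEER: Dict[str, List[Referent]] = {
    "London": [
        Referent("London, England, UK", 51.51, -0.13, 8_000_000),
        Referent("London, Ontario, Canada", 42.98, -81.25, 400_000),
    ],
    "Paris": [
        Referent("Paris, France", 48.86, 2.35, 2_000_000),
        Referent("Paris, Texas, USA", 33.66, -95.56, 25_000),
    ],
}


def resolve_max_population(toponym: str) -> Optional[Referent]:
    """Pick the most populous candidate referent, or None if unknown."""
    candidates = GAZETTEER.get(toponym, [])
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.population)


if __name__ == "__main__":
    for name in ("London", "Paris", "Atlantis"):
        print(name, "->", resolve_max_population(name))
```

A heuristic of this kind ignores the surrounding text entirely, which is one reason the abstract speaks of combining several sources of evidence rather than relying on a single one.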
Contents
2.4 Textual Information Access and Natural Language Processing | 52
2.4.2 Information Retrieval | 53
2.4.3 Information Extraction | 55
2.4.4 Question Answering | 62
2.4.5 Word Sense Disambiguation | 65
2.5 The Language of Geographic Space | 71
2.6 Chapter Summary | 75
Previous and Related Work | 77
3.2 Previous Work in Toponym Resolution | 82
TR for Speech Data | 83
Smith and Crane (2001): Centroid-based TR | 86
A Hybrid Approach to TR | 87
Confidence-based TR | 92
Multilingual TR and Mapping | 94
Web-a-Where | 96
Cost Optimisation for German TR | 98
3.3 Comparative Analysis | 102
3.4 Chapter Summary | 110
Dataset | 117
4.2 Corpus Sampling | 118
Global News from REUTERS | 119
FBIS Central American Intelligence Reports | 121
4.3.2 Problems of Gazetteer Selection | 123
4.3.3 Gazetteer Ambiguity and Heterogeneity | 125
4.4 Gazetteer | 127
4.5 Document Annotation | 129
4.5.2 Tool-Chain and Markup Process | 131
Corpus Profile | 135
4.6.2 TR-MUC4 | 136
4.6.4 Toponym Distribution in Documents | 139
4.6.5 Referential Ambiguity in the Corpora | 141
4.7 Chapter Summary | 143
Methods | 145
5.3 A New Algorithm Based on Two Minimality Heuristics | 146
5.4 Machine Learning Methods | 153
5.4.2 Decision Tree Induction (DTI) | 154
Learning Voting Weights | 156
Design and Implementation | 157
5.5.2 Design | 159
5.6 Chapter Summary | 160
Evaluation | 163
6.2 Evaluation Methodology | 164
6.2.2 Task-Specific Evaluation Metrics | 169
6.3 Component Evaluation Using a Named Entity Oracle (in vitro) | 172
6.4 Component Evaluation Over System Output (in vivo) | 177
6.4.1 Using a Maximum Entropy NERC Model | 179
6.5 Discussion | 184
6.6 Chapter Summary | 185
Applications | 187
Generating Map Surrogates for Stories | 188
Geo-Spatial News Browsing | 192
Spatial Filtering for Document Retrieval | 198
7.4.2 Evaluation in a GEOCLEF Context | 204
7.4.3 Discussion | 210
Knowledge-Based Approach | 211
7.6 Chapter Summary | 214
Summary and Conclusion | 215
8.2 Future Work | 217
8.3 Conclusions | 219
Notational Conventions | 221
Annotation Guidelines | 223
Minimal Bounding Rectangles Extracted from NGA | 225
TR-CoNLL Sample Used in Prose-Only Evaluation | 227
TR-CoNLL Evaluation: All Documents Used | 229
Performance Plots for Individual Heuristics | 233
Stories Used in the Visualization Study | 239
G.2 Story: News Crews Wait and Watch as Police Search Home of Missing Woman | 240
Distance Queries | 245
ADL Feature Type Thesaurus | 253
Zusammenfassung in deutscher Sprache | 261
Bibliography | 265
Common terms and phrases
Algorithm, annotation, applications, automatic toponym resolution, Berlin, Bounding Rectangles, Cambridge CITY, candidate referents, centroid, Chapter, classification, computed, confidence, CoNLL, contains, context, coordinates, corpus, dataset, defined, definition, described, digital libraries, disambiguation, distance, document, example, F-SCORE, feature type, field, Figure, find, first, geo-coding, geographic information, geographic information retrieval, Geographic Information Systems, Geographic References, global, gold standard, heuristics, implemented, information retrieval, latitude/longitude, Leidner, lines, London, machine learning, markup, MaxEnt, MAXPOP, metrics, minimality heuristic, named entity recognition, named entity tagging, natural language processing, NERC, number of toponym, outperforms, performance, PERSEUS, place names, polygon, population, Pouliquen, precision, query, query expansion, recall, referent per discourse, relevant, reported, resolved toponym, RMSD, score, significant, significantly, Smith and Crane, spatial, specific, tagger, TextGISR, textual, thesis, toponym instances, toponym recognition, toponym resolution, toponym resolution task, TR-MUC4, United States COUNTRY, Word Sense Disambiguation, YAROWSKY
Popular passages
Page 281 - School of Ocean and Earth Science and Technology University of Hawaii at Manoa Mr.
Page 274 - Paul A. Longley, Michael F. Goodchild, David J. Maguire, and David W. Rhind, Geographic Information Systems and Science (West Sussex, England: John Wiley & Sons, August 2001).
Page 274 - Infoxtract location normalization: a hybrid approach to geographic references in information extraction. In Proc.
Page 280 - Complementary video and audio analysis for broadcast news archives. Communications of the ACM, 43(2):42-47, February 2000.
Page 280 - The TREC-8 question answering track evaluation. In E. M. Voorhees and D. K. Harman, editors, The Eighth Text REtrieval Conference (TREC 8), NIST Special Publication 500-246, pages 1-24.