A Resource-light Approach to Morpho-syntactic Tagging
While supervised corpus-based methods are highly accurate for a range of NLP tasks, including morphological tagging, they are difficult to port to other languages because they require resources that are expensive to create. As a result, many languages have no realistic prospect of morpho-syntactic annotation in the foreseeable future. The method presented in this book aims to overcome this problem by significantly limiting the necessary data and instead extrapolating the relevant information from another, related language. The approach has been tested on Catalan, Portuguese, and Russian. Although these languages are only relatively resource-poor, the same method can in principle be applied to any inflected language, as long as an annotated corpus of a related language is available. The time needed to adjust the system to a new language is a fraction of the time needed for systems with extensive, manually created resources: days instead of years.
This book touches upon a number of topics: typology, morphology, corpus linguistics, contrastive linguistics, linguistic annotation, computational linguistics and Natural Language Processing (NLP). Researchers and students who are interested in these scientific areas as well as in cross-lingual studies and applications will greatly benefit from this work. Scholars and practitioners in computer science and linguistics are the prospective readers of this book.
Previous resource-light approaches to NLP
Languages, corpora, and tagsets
List of tables
Quantifying language properties
Resource-light morphological analysis
Cross-language morphological tagging
accuracy adjectives algorithm ambiguity tag/w analysis animacy approach Association for Computational automatically bilingual Catalan chapter classifiers CLiC-TALP clitics cognates combination Computational Linguistics Computational Linguistics ACL context Czech and Russian Czech emissions Czech tagset declension English epenthesis Evaluation example experiments feminine filtering first five forms Full tag fusional language gender genitive grammar Guesser Hajič Hladká Indefinite infinitive inflected languages learning lemma lexical lexicon manually Markov model masculine morphemes morpho-syntactic morphological analyzer n-gram Named Entity Recognition Natural Language Processing negation neuter noun paradigms parallel corpora participle plural Portuguese POS tagging positional tagset Possessor’s Preposition Proceedings reduced tagset reflexive Romance languages Russian Russian tagset similar singular Slavic languages slots source language Spanish specific stem SubPOS subtag suffix syntactic Table tagger tagging Russian tagset target language tense training corpus training data transitions translation Treebank Unsupervised Unsupervised Learning variant verb vocative wildcards word order Yarowsky