Computational Methods for Corpus Annotation and Analysis

In the past few decades, the use of increasingly large text corpora has grown rapidly in language and linguistics research. This growth was enabled by remarkable strides in natural language processing (NLP) technology, which allows computers to automatically and efficiently process, annotate and analyze large amounts of spoken and written text in linguistically and/or pragmatically meaningful ways. It has become more desirable than ever for language and linguistics researchers who use corpora to gain an adequate understanding of the relevant NLP technology so that they can take full advantage of its capabilities. This volume provides these researchers with an accessible introduction to the state-of-the-art NLP technology that facilitates automatic annotation and analysis of large text corpora at both shallow and deep linguistic levels. The book covers a wide range of computational tools for lexical, syntactic, semantic, pragmatic and discourse analysis, together with detailed instructions on how to obtain, install and use each tool on different operating systems and platforms. It also illustrates how NLP technology has been applied in recent corpus-based language studies and suggests effective ways to better integrate such technology into future corpus linguistics research, making it a valuable reference for corpus annotation and analysis.