What does the Ngram Viewer do?
When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years. Let's look at a sample graph:
This shows trends in three ngrams from 1950 to 2000: "nursery school" (a 2-gram or bigram), "kindergarten" (a 1-gram or unigram), and "child care" (another bigram). What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are "nursery school" or "child care"? Of all the unigrams, what percentage of them are "kindergarten"? Here, you can see that use of the phrase "child care" started to rise in the late 1960s, overtaking "nursery school" around 1970 and then "kindergarten" around 1973. It peaked shortly after 1990 and has been falling steadily since.
(Interestingly, the results are noticeably different when the corpus is switched to British English.)
You can also specify particular parts of speech, or add, subtract, and divide ngrams. More on those under Advanced Usage.
Two features of the Ngram Viewer may appeal to users who want to dig a little deeper into phrase usage: part-of-speech tags and ngram compositions.
Consider the word tackle, which can be a verb ("tackle the problem") or a noun ("fishing tackle"). You can distinguish between these different forms by appending _VERB or _NOUN:
The full list of tags is as follows:
These tags can either stand alone (_PRON_) or be appended to a word (she_PRON):

_NOUN_: noun
_VERB_: verb
_ADJ_: adjective
_ADV_: adverb
_PRON_: pronoun
_DET_: determiner or article
_ADP_: an adposition: either a preposition or a postposition
_NUM_: numeral
_CONJ_: conjunction
_PRT_: particle

These tags must stand alone (e.g., _START_):

_ROOT_: root of the parse tree
_START_: start of a sentence
_END_: end of a sentence
Since the part-of-speech tags needn't attach to particular words, you can use the _DET_ tag to search for read a book, read the book, read that book, read this book, and so on as follows:
The Ngram Viewer now tags sentence boundaries, allowing you to identify ngrams at starts and ends of sentences with the _START_ and _END_ tags:
Sometimes it helps to think about words in terms of dependencies rather than patterns. Let's say you want to know how often tasty modifies dessert. That is, you want to tally mentions of tasty frozen dessert, crunchy, tasty dessert, tasty yet expensive dessert, and all the other instances in which the word tasty is applied to dessert. For that, the Ngram Viewer provides dependency relations with the => operator:
Every parsed sentence has a _ROOT_. Unlike other tags, _ROOT_ doesn't stand for a particular word or position in the sentence. It's the root of the parse tree constructed by analyzing the syntax; you can think of it as a placeholder for what the main verb of the sentence is modifying. So here's how to identify how often will was the main verb of a sentence:
The above graph would include the sentence Larry will decide. but not Larry said that he will decide, since will isn't the main verb of that sentence.
"Pure" part-of-speech tags can be mixed freely with regular words in 1-, 2-, and 3-grams (e.g., the _ADJ_ toast or _DET_ _ADJ_ toast), but not with 4- or 5-grams.
The Ngram Viewer provides five operators that you can use to combine ngrams: +, -, /, *, and :.
- + sums the expressions on either side, letting you combine multiple ngram time series into one.
- - subtracts the expression on the right from the expression on the left, giving you a way to measure one ngram relative to another. Because users often want to search for hyphenated phrases, put spaces on either side of the - sign.
- / divides the expression on the left by the expression on the right, which is useful for isolating the behavior of an ngram with respect to another.
- * multiplies the expression on the left by the number on the right, making it easier to compare ngrams of very different frequencies.
- : applies the ngram on the left to the corpus on the right, allowing you to compare ngrams across different corpora.
The Ngram Viewer will try to guess whether to apply these behaviors. You can use parentheses to force them on, and square brackets to force them off. Example: and/or will divide and by or; to measure the usage of the phrase and/or, use [and/or]. And well-meaning will search for the phrase well-meaning; if you want to subtract meaning from well, use (well - meaning).
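To make the arithmetic concrete, here is a rough Python sketch (not the Ngram Viewer's actual implementation) of how these operators act on yearly frequency series; the numbers are made up purely for illustration:

```python
# Toy yearly frequency series (percent of all ngrams per year).
# The values are invented solely to illustrate the operator arithmetic.
applesauce = {1980: 0.0010, 1990: 0.0015, 2000: 0.0020}
apple_sauce = {1980: 0.0008, 1990: 0.0005, 2000: 0.0003}

years = sorted(applesauce)

combined = {y: applesauce[y] + apple_sauce[y] for y in years}    # the + operator
difference = {y: applesauce[y] - apple_sauce[y] for y in years}  # the - operator
ratio = {y: applesauce[y] / apple_sauce[y] for y in years}       # the / operator
scaled = {y: apple_sauce[y] * 10 for y in years}                 # the * operator
```

Each operator produces a new time series, which the Ngram Viewer then plots just like an ordinary ngram.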
To demonstrate the + operator, here's how you might find the sum of of game, sport, and play:
When determining whether people wrote more about choices over the years, you could compare choice, selection, option, and alternative, specifying the noun forms to avoid the adjective forms (e.g., choice delicacy, alternative music):
Ngram subtraction gives you an easy way to compare one set of ngrams to another:
Here's how you might combine + and / to show how the word applesauce has blossomed at the expense of apple sauce:
The * operator is useful when you want to compare ngrams of widely varying frequencies, like violin and the more esoteric theremin:
The : corpus selection operator lets you compare ngrams in different languages, or American versus British English (or fiction), or between the 2009 and 2012 versions of our book scans. Here's chat in English versus the same unigram in French:
When we generated the original Ngram Viewer corpora in 2009, our OCR wasn't as good as it is today. This was especially obvious in pre-19th century English, where the elongated medial-s (ſ) was often interpreted as an f, so best was often read as beft. Here's evidence of the improvements we've made since then, using the corpus operator to compare the 2009 and 2012 versions:
By comparing fiction against all of English, we can see that uses of wizard in general English have been gaining recently compared to uses in fiction:
Below are descriptions of the corpora that can be searched with the Google Books Ngram Viewer. All corpora were generated in either July 2009 or July 2012; we will update these corpora as our book scanning continues, and the updated versions will have distinct persistent identifiers. Books with low OCR quality and serials were excluded.
Informal corpus name | Shorthand | Persistent identifier | Description
American English 2012 | eng_us_2012 | googlebooks-eng-us-all-20120701 | Books predominantly in the English language that were published in the United States.
American English 2009 | eng_us_2009 | googlebooks-eng-us-all-20090715 |
British English 2012 | eng_gb_2012 | googlebooks-eng-gb-all-20120701 | Books predominantly in the English language that were published in Great Britain.
British English 2009 | eng_gb_2009 | googlebooks-eng-gb-all-20090715 |
Chinese 2012 | chi_sim_2012 | googlebooks-chi-sim-all-20120701 | Books predominantly in simplified Chinese script.
Chinese 2009 | chi_sim_2009 | googlebooks-chi-sim-all-20090715 |
English 2012 | eng_2012 | googlebooks-eng-all-20120701 | Books predominantly in the English language published in any country.
English 2009 | eng_2009 | googlebooks-eng-all-20090715 |
English Fiction 2012 | eng_fiction_2012 | googlebooks-eng-fiction-all-20120701 | Books predominantly in the English language that a library or publisher identified as fiction.
English Fiction 2009 | eng_fiction_2009 | googlebooks-eng-fiction-all-20090715 |
English One Million | eng_1m_2009 | googlebooks-eng-1M-20090715 | The "Google Million". All are in English with dates ranging from 1500 to 2008. No more than about 6,000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980).
French 2012 | fre_2012 | googlebooks-fre-all-20120701 | Books predominantly in the French language.
French 2009 | fre_2009 | googlebooks-fre-all-20090715 |
German 2012 | ger_2012 | googlebooks-ger-all-20120701 | Books predominantly in the German language.
German 2009 | ger_2009 | googlebooks-ger-all-20090715 |
Hebrew 2012 | heb_2012 | googlebooks-heb-all-20120701 | Books predominantly in the Hebrew language.
Hebrew 2009 | heb_2009 | googlebooks-heb-all-20090715 |
Spanish 2012 | spa_2012 | googlebooks-spa-all-20120701 | Books predominantly in the Spanish language.
Spanish 2009 | spa_2009 | googlebooks-spa-all-20090715 |
Russian 2012 | rus_2012 | googlebooks-rus-all-20120701 | Books predominantly in the Russian language.
Russian 2009 | rus_2009 | googlebooks-rus-all-20090715 |
Italian 2012 | ita_2012 | googlebooks-ita-all-20120701 | Books predominantly in the Italian language.
Compared to the 2009 versions, the 2012 versions contain more books, improved OCR, and improved library and publisher metadata. Unlike the 2009 versions, the 2012 versions also don't form ngrams that cross sentence boundaries, but do form ngrams that cross page boundaries.
With the 2012 corpora, the tokenization has improved as well, using a set of manually devised rules (except for Chinese, where a statistical system is used for segmentation). In the 2009 corpora, tokenization was based simply on whitespace.
Searching inside Google Books
Below the graph, we show "interesting" year ranges for your query terms. Clicking on those will submit your query directly to Google Books. Note that the Ngram Viewer is case-sensitive, but Google Books search results are not.
Those searches will yield phrases in the language of whichever corpus you selected, but the results are returned from the full Google Books corpus. So if you use the Ngram Viewer to search for a French phrase in the French corpus and then click through to Google Books, that search will be for the same French phrase -- which might occur in a book predominantly in another language.
Why am I not seeing the results I expect?
Perhaps for one of these reasons:
- The Ngram Viewer is case-sensitive. Try capitalizing.
- You're searching in an unexpected corpus. For instance, Frankenstein doesn't appear in Russian books, so if you search in the Russian corpus you'll see a flatline. You can choose the corpus via the dropdown menu below the search box, or through the corpus selection operator, e.g., Frankenstein:eng_2012.
- Your phrase has a comma, plus sign, hyphen, asterisk, colon, or forward slash in it. Those have special meanings to the Ngram Viewer; see Advanced Usage. Try enclosing the phrase in square brackets (although this won't help with commas).
How does the Ngram Viewer handle punctuation?
We apply a set of tokenization rules specific to the particular language. In English, contractions become two words (they're becomes the bigram they 're, we'll becomes we 'll, and so on). The possessive 's is also split off, but R'n'B remains one token. Negations (n't) are normalized so that don't becomes do not. In Russian, the diacritic ё is normalized to e, and so on. The same rules are applied to parse both the ngrams typed by users and the ngrams extracted from the corpora, which means that if you're searching for don't, don't be alarmed by the fact that the Ngram Viewer rewrites it to do not; it is accurately depicting usages of both don't and do not in the corpus. However, this means there is no way to search explicitly for the specific forms can't (or cannot): you get can't and can not and cannot all at once.
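As a toy sketch of the English contraction handling described above (the real tokenizer's rule set is far more extensive, and this is not its actual code):

```python
import re

def toy_tokenize(text):
    """Toy illustration of the English tokenization rules described above:
    normalize n't negations (don't -> do not), and split off contractions
    and possessive 's (they're -> they 're). Not the real rule set."""
    text = re.sub(r"\bcan't\b", "can not", text)          # can't -> can not
    text = re.sub(r"(\w+)n't\b", r"\1 not", text)         # don't -> do not
    text = re.sub(r"(\w+)'(re|ll|ve|s)\b", r"\1 '\2", text)  # we'll -> we 'll
    return text.split()
```

Note that a string like R'n'B matches none of these rules, so it survives as a single token, just as the text above describes.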
How can I see sample usages in context?
Below the Ngram Viewer chart, we provide a table of predefined Google Books searches, each narrowed to a range of years. We choose the ranges according to interestingness: if an ngram has a huge peak in a particular year, that will appear by itself as a search, with other searches covering longer durations.
Unlike the 2012 Ngram Viewer corpus, the Google Books corpus isn't part-of-speech tagged. One can't search for, say, the verb form of cheer in Google Books. So any ngrams with part-of-speech tags (e.g., cheer_VERB) are excluded from the table of Google Books searches.
The Ngram Viewer has 2009 and 2012 corpora, but Google Books doesn't work that way. When you're searching in Google Books, you're searching all the currently available books, so there may be some differences between what you see in Google Books and what you would expect to see given the Ngram Viewer chart.
Why do I see more spikes and plateaus in early years?
Publishing was a relatively rare event in the 16th and 17th centuries. (There are only about 500,000 books published in English before the 19th century.) So if a phrase occurs in one book in one year but not in the preceding or following years, that creates a taller spike than it would in later years.
Plateaus are usually just smoothed spikes. Change the smoothing to 0 to see the underlying spikes.
What does "smoothing" mean?
Often trends become more apparent when data is viewed as a moving average. A smoothing of 1 means that the data shown for 1950 will be an average of the raw count for 1950 plus one value on either side: ("count for 1949" + "count for 1950" + "count for 1951"), divided by 3. A smoothing of 10 means that 21 values will be averaged: the target value plus 10 values on either side.
At the left and right edges of the graph, fewer values are averaged. With a smoothing of 3, the leftmost value (pretend it's the year 1950) will be calculated as ("count for 1950" + "count for 1951" + "count for 1952" + "count for 1953"), divided by 4.
A smoothing of 0 means no smoothing at all: just raw data.
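The smoothing described above is a simple moving average whose window shrinks at the edges of the graph. A minimal Python sketch (not the Ngram Viewer's actual code):

```python
def smooth(values, smoothing):
    """Moving average as described above: each point is averaged with up to
    `smoothing` values on either side; at the edges of the series, fewer
    values are available, so the window simply shrinks."""
    out = []
    n = len(values)
    for i in range(n):
        lo = max(0, i - smoothing)          # clip the window at the left edge
        hi = min(n, i + smoothing + 1)      # ...and at the right edge
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out
```

With smoothing 3, the leftmost point averages four values (itself plus the next three), matching the worked example above; with smoothing 0, the raw data comes back unchanged.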
Many more books are published in modern years. Doesn't this skew the results?
It would if we didn't normalize by the number of books published in each year.
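A tiny sketch of what normalization does, with invented numbers purely for illustration:

```python
# Hypothetical counts, purely illustrative: raw occurrences of one unigram,
# and the total number of unigrams in the corpus for each year.
raw_counts = {1900: 50, 2000: 500}
total_unigrams = {1900: 1_000_000, 2000: 10_000_000}

# Normalizing turns raw counts into the percentages shown on the y-axis.
frequency = {y: raw_counts[y] / total_unigrams[y] * 100 for y in raw_counts}
```

Here the raw count grew tenfold, but so did the corpus, so the normalized frequency is flat: relative usage of the word didn't change.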
Why are you showing a 0% flatline when I know the phrase in my query occurred in at least one book?
Under heavy load, the Ngram Viewer will sometimes return a flatline; reload to confirm that there are actually no hits for the phrase. Also, we only consider ngrams that occur in at least 40 books. Otherwise the dataset would balloon in size and we wouldn't be able to offer it.
How accurate is the part-of-speech tagging?
The part-of-speech tags and dependency relations are predicted automatically. Assessing the accuracy of these predictions is difficult, but for modern English we expect the accuracy of the part-of-speech tags to be around 95% and the accuracy of dependency relations around 85%. On older English text and for other languages the accuracies are lower, but likely above 90% for part-of-speech tags and above 75% for dependencies. This implies a significant number of errors, which should be taken into account when drawing conclusions.
The part-of-speech tags are constructed from a small training set (a mere million words for English). This will sometimes underrepresent uncommon usages, such as green or dog or book as verbs, or ask as a noun.
An additional note on Chinese: Before the 20th century, classical Chinese was traditionally used for all written communication. Classical Chinese is based on the grammar and vocabulary of ancient Chinese, and the syntactic annotations will therefore be wrong more often than they're right.
Also, note that the 2009 corpora have not been part-of-speech tagged.
I'm writing a paper based on your results. How can I cite your work?
If you're going to use this data for an academic publication, please cite the original paper:
Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010)
We also have a paper on our part-of-speech tagging:
Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, William Brockman, Slav Petrov. Syntactic Annotations for the Google Books Ngram Corpus. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics Volume 2: Demo Papers (ACL '12) (2012)
Can I download your data to run my own experiments?
Yes! The ngram data is available for download here. To make the file sizes manageable, we've grouped them by their starting letter and then grouped the different ngram sizes in separate files. The ngrams within each file are not alphabetically sorted.
To generate machine-readable filenames, we transliterated the ngrams for languages that use non-roman scripts (Chinese, Hebrew, Russian) and used the starting letter of the transliterated ngram to determine the filename. The same approach was taken for characters such as ä in German. Note that the transliteration was used only to determine the filename; the actual ngrams are encoded in UTF-8 using the language-specific alphabet.
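As a sketch of reading a downloaded file: the 2012 files are tab-separated, one line per ngram/year pair, with a match count and a volume count. Treat that exact column layout as an assumption here and check the download page's description; the sample lines below are invented:

```python
import csv
from io import StringIO

# Two made-up lines in the tab-separated layout the 2012 ngram files use
# (ngram, year, match_count, volume_count) -- verify the layout against
# the download page before relying on it.
sample = "nursery school\t1950\t1234\t567\nnursery school\t1951\t1300\t590\n"

# Build a {(ngram, year): match_count} lookup from the raw lines.
counts = {}
for ngram, year, match_count, volume_count in csv.reader(StringIO(sample), delimiter="\t"):
    counts[(ngram, int(year))] = int(match_count)
```

To reproduce the Ngram Viewer's y-axis from these raw counts, you would divide each match count by the total number of ngrams of that size in the corpus for that year (the totals ship in a separate file).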
I'd like to publish an Ngram graph in my book/magazine/blog/presentation. What are your licensing terms?
Ngram Viewer graphs and data may be freely used for any purpose, although acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a link to http://books.google.com/ngrams, would be appreciated.
The Google Ngram Viewer Team, part of Google Research