When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years. Let's look at a sample graph:
This shows trends in three ngrams from 1950 to 2000: "nursery school" (a 2-gram or bigram), "kindergarten" (a 1-gram or unigram), and "child care" (another bigram). What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are "nursery school" or "child care"? Of all the unigrams, what percentage of them are "kindergarten"? Here, you can see that use of the phrase "child care" started to rise in the late 1960s, overtaking "nursery school" around 1970 and then "kindergarten" around 1973. It peaked shortly after 1990 and has been falling steadily since.
(Interestingly, the results are noticeably different when the corpus is switched to British English.)
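As a rough illustration of what the y-axis measures, the sketch below divides an ngram's yearly count by the yearly count of all ngrams of the same length. The function name `yearly_percentage` and all the numbers are invented for the example; this is not the Ngram Viewer's own code.

```python
def yearly_percentage(ngram_counts, total_counts):
    """Return {year: percent of all same-length ngrams} for one ngram."""
    return {
        year: 100.0 * ngram_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0  # skip years with no data to avoid dividing by zero
    }


# Made-up counts, for illustration only.
child_care_counts = {1960: 20, 1990: 300}
all_bigram_counts = {1960: 1_000_000, 1990: 2_000_000}

percentages = yearly_percentage(child_care_counts, all_bigram_counts)
for year in sorted(percentages):
    print(year, f"{percentages[year]:.4f}%")
```

Because each year is divided by its own total, a year with many more books does not automatically produce a higher line.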
You can hover over the line plot for an ngram, which highlights it. With a left-click on a line plot, you can focus on a particular ngram, greying out the other ngrams in the chart, if any. On subsequent left clicks on other line plots in the chart, multiple ngrams can be focused on. You can double click on any area of the chart to reinstate all the ngrams in the query.
You can also specify wildcards in queries, search for inflections, perform case insensitive search, look for particular parts of speech, or add, subtract, and divide ngrams. More on those under Advanced Usage.
A few features of the Ngram Viewer may appeal to users who want to dig a little deeper into phrase usage: wildcard search, inflection search, case insensitive search, part-of-speech tags and ngram compositions.
When you put a * in place of a word, the Ngram Viewer will display the top ten substitutions. For instance, to find the most popular words following "University of", search for "University of *".
You can right click on any of the replacement ngrams to collapse them all into the original wildcard query, with the result being the yearwise sum of the replacements. A subsequent right click expands the wildcard query back to all the replacements. Note that the Ngram Viewer only supports one * per ngram.
Note that the top ten replacements are computed for the specified time range. You might therefore get different replacements for different year ranges. We've filtered punctuation symbols from the top ten list, but for words that often start or end sentences, you might see one of the sentence boundary symbols (_START_ or _END_) as one of the replacements.
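The two behaviors described above (replacements ranked within the chosen year range, and collapsing them into a yearwise sum) can be sketched as follows. This assumes the ranking is by total frequency over the range, which the text does not spell out; `top_replacements`, `collapse`, and the sample values are hypothetical.

```python
def top_replacements(series_by_ngram, year_start, year_end, limit=10):
    """Rank ngrams by their summed value inside [year_start, year_end]."""
    def total_in_range(ngram):
        series = series_by_ngram[ngram]
        return sum(v for year, v in series.items() if year_start <= year <= year_end)

    ranked = sorted(series_by_ngram, key=total_in_range, reverse=True)
    return ranked[:limit]


def collapse(series_by_ngram, ngrams):
    """Yearwise sum of the selected ngrams, like collapsing a wildcard query."""
    summed = {}
    for ngram in ngrams:
        for year, value in series_by_ngram[ngram].items():
            summed[year] = summed.get(year, 0) + value
    return summed


# Invented values for three possible expansions of "University of *".
data = {
    "University of California": {1900: 1, 2000: 9},
    "University of Oxford": {1900: 5, 2000: 2},
    "University of Chicago": {1900: 3, 2000: 3},
}

early = top_replacements(data, 1900, 1900, limit=2)
late = top_replacements(data, 2000, 2000, limit=2)
combined = collapse(data, early)
print(early, late, combined)
```

With these invented numbers the two year ranges yield different top lists, which mirrors the note that replacements depend on the selected range.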
An inflection is the modification of a word to represent various grammatical categories such as aspect, case, gender, mood, number, person, tense and voice. You can search for them by appending _INF to an ngram. For instance, searching "book_INF a hotel" will display results for "book", "booked", "books", and "booking":
Right clicking any inflection collapses all forms into their sum. Note that the Ngram Viewer only supports one _INF keyword per query.
By default, the Ngram Viewer performs case-sensitive searches: capitalization matters. You can perform a case-insensitive search by selecting the "case-insensitive" checkbox to the right of the query box. The Ngram Viewer will then display the yearwise sum of the most common case-insensitive variants of the input query. Here are two case-insensitive ngrams, "Fitzgerald" and "Dupont":
Right clicking any yearwise sum results in an expansion into the most common case-insensitive variants. For example, a right click on "Dupont (All)" results in the following four variants: "DuPont", "Dupont", "duPont" and "DUPONT".
Warning: You can't freely mix wildcard searches, inflections and case-insensitive searches for one particular ngram. However, you can use any one of these features for separate ngrams in a query: "book_INF a hotel, book * hotel" is fine, but "book_INF * hotel" is not.
Consider the word tackle, which can be a verb ("tackle the problem") or a noun ("fishing tackle"). You can distinguish between these different forms by appending _VERB or _NOUN:
The full list of tags is as follows:
|_NOUN_||noun||These tags can either stand alone (_PRON_) or can be appended to a word (she_PRON)|
|_DET_||determiner or article|
|_ADP_||an adposition: either a preposition or a postposition|
|_ROOT_||root of the parse tree||These tags must stand alone (e.g., _START_)|
|_START_||start of a sentence|
|_END_||end of a sentence|
Since the part-of-speech tags needn't attach to particular words, you can use the _DET_ tag to search for read a book, read the book, read that book, read this book, and so on with the query read _DET_ book:
If you wanted to know what the most common determiners in this context are, you could combine wildcards and part-of-speech tags to read *_DET book:
To get all the different inflections of the word book which have been followed by a NOUN in the corpus you can issue the query book_INF _NOUN_:
Most frequent part-of-speech tags for a word can be retrieved with the wildcard functionality. Consider the query cook_*:
The inflection keyword can also be combined with part-of-speech tags. For example, consider the query cook_INF, cook_VERB_INF below, that separates out the inflections of the verbal sense of "cook":
The Ngram Viewer tags sentence boundaries, allowing you to identify ngrams at starts and ends of sentences with the _START_ and _END_ tags:
Sometimes it helps to think about words in terms of dependencies rather than patterns. Let's say you want to know how often tasty modifies dessert. That is, you want to tally mentions of tasty frozen dessert, crunchy, tasty dessert, tasty yet expensive dessert, and all the other instances in which the word tasty is applied to dessert. For that, the Ngram Viewer provides dependency relations with the => operator:
Every parsed sentence has a _ROOT_. Unlike other tags, _ROOT_ doesn't stand for a particular word or position in the sentence. It's the root of the parse tree constructed by analyzing the syntax; you can think of it as a placeholder for what the main verb of the sentence is modifying. So here's how to identify how often will was the main verb of a sentence:
The above graph would include the sentence "Larry will decide." but not "Larry said that he will decide," since will isn't the main verb of that sentence.
Dependencies can be combined with wildcards. For example, consider the query drink=>*_NOUN below:
"Pure" part-of-speech tags can be mixed freely with regular words in 1-, 2-, and 3-grams (e.g., the _ADJ_ toast or _DET_ _ADJ_ toast), but not with 4- or 5-grams.
The Ngram Viewer provides five operators that you can use to combine ngrams: +, -, /, *, and :.
|+||sums the expressions on either side, letting you combine multiple ngram time series into one.|
|-||subtracts the expression on the right from the expression on the left, giving you a way to measure one ngram relative to another. Because users often want to search for hyphenated phrases, put spaces on either side of the - sign.|
|/||divides the expression on the left by the expression on the right, which is useful for isolating the behavior of an ngram with respect to another.|
|*||multiplies the expression on the left by the number on the right, making it easier to compare ngrams of very different frequencies. (Be sure to enclose the entire ngram in parentheses so that * isn't interpreted as a wildcard.)|
|:||applies the ngram on the left to the corpus on the right, allowing you to compare ngrams across different corpora.|
The Ngram Viewer will try to guess whether to apply these behaviors. You can use parentheses to force them on, and square brackets to force them off. Example: and/or will divide and by or; to measure the usage of the phrase and/or, use [and/or]. And well-meaning will search for the phrase well-meaning; if you want to subtract meaning from well, use (well - meaning).
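To make the arithmetic concrete, here is a minimal sketch of how +, -, / and * could act on ngram time series, one year at a time. It assumes each series is a `{year: value}` mapping and that only years present in both series are combined; the helper names and the numbers are invented, and the Ngram Viewer's real evaluation may differ.

```python
import operator


def combine(left, right, op):
    """Apply op year by year to two {year: value} series over their shared years."""
    years = sorted(left.keys() & right.keys())
    return {year: op(left[year], right[year]) for year in years}


def scale(series, factor):
    """Multiply every yearly value by a plain number, like the * operator."""
    return {year: value * factor for year, value in series.items()}


# Invented frequencies for two ngrams.
a = {1950: 2.0, 2000: 6.0}
b = {1950: 6.0, 2000: 2.0}

total = combine(a, b, operator.add)           # a + b
difference = combine(a, b, operator.sub)      # a - b
share = combine(a, total, operator.truediv)   # a / (a + b)
boosted = scale(a, 1000)                      # (a * 1000)
print(total, difference, share, boosted)
```

A ratio such as `share` stays between 0 and 1, which makes it handy for showing one spelling gaining ground on another.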
To demonstrate the + operator, here's how you might find the sum of game, sport, and play:
When determining whether people wrote more about choices over the years, you could compare choice, selection, option, and alternative, specifying the noun forms to avoid the adjective forms (e.g., choice delicacy, alternative music):
Ngram subtraction gives you an easy way to compare one set of ngrams to another:
Here's how you might combine + and / to show how the word applesauce has blossomed at the expense of apple sauce:
The * operator is useful when you want to compare ngrams of widely varying frequencies, like violin and the more esoteric theremin:
The : corpus selection operator lets you compare ngrams in different languages, or American versus British English (or fiction), or between the 2009 and 2012 versions of our book scans. Here's chat in English versus the same unigram in French:
When we generated the original Ngram Viewer corpora in 2009, our OCR wasn't as good as it is today. This was especially obvious in pre-19th century English, where the elongated medial-s (ſ) was often interpreted as an f, so best was often read as beft. Here's evidence of the improvements we've made since then, using the corpus operator to compare the 2009 and 2012 versions:
By comparing fiction against all of English, we can see that uses of wizard in general English have been gaining recently compared to uses in fiction:
Below are descriptions of the corpora that can be searched with the Google Books Ngram Viewer. All corpora were generated in either July 2009 or July 2012; we will update these corpora as our book scanning continues, and the updated versions will have distinct persistent identifiers. Books with low OCR quality and serials were excluded.
|Informal corpus name||Shorthand||Persistent identifier||Description|
|American English 2012||eng_us_2012||googlebooks-eng-us-all-20120701||Books predominantly in the English language that were published in the United States.|
|American English 2009||eng_us_2009||googlebooks-eng-us-all-20090715||Books predominantly in the English language that were published in the United States.|
|British English 2012||eng_gb_2012||googlebooks-eng-gb-all-20120701||Books predominantly in the English language that were published in Great Britain.|
|British English 2009||eng_gb_2009||googlebooks-eng-gb-all-20090715||Books predominantly in the English language that were published in Great Britain.|
|Chinese 2012||chi_sim_2012||googlebooks-chi-sim-all-20120701||Books predominantly in simplified Chinese script.|
|English 2012||eng_2012||googlebooks-eng-all-20120701||Books predominantly in the English language published in any country.|
|English Fiction 2012||eng_fiction_2012||googlebooks-eng-fiction-all-20120701||Books predominantly in the English language that a library or publisher identified as fiction.|
|English Fiction 2009||eng_fiction_2009||googlebooks-eng-fiction-all-20090715||Books predominantly in the English language that a library or publisher identified as fiction.|
|English One Million||eng_1m_2009||googlebooks-eng-1M-20090715||The "Google Million". All are in English with dates ranging from 1500 to 2008. No more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980).|
|French 2012||fre_2012||googlebooks-fre-all-20120701||Books predominantly in the French language.|
|German 2012||ger_2012||googlebooks-ger-all-20120701||Books predominantly in the German language.|
|Hebrew 2012||heb_2012||googlebooks-heb-all-20120701||Books predominantly in the Hebrew language.|
|Spanish 2012||spa_2012||googlebooks-spa-all-20120701||Books predominantly in the Spanish language.|
|Russian 2012||rus_2012||googlebooks-rus-all-20120701||Books predominantly in the Russian language.|
|Italian 2012||ita_2012||googlebooks-ita-all-20120701||Books predominantly in the Italian language.|
Compared to the 2009 versions, the 2012 versions have more books, improved OCR, and improved library and publisher metadata. Unlike the 2009 versions, the 2012 versions don't form ngrams that cross sentence boundaries, and do form ngrams across page boundaries.
With the 2012 corpora, the tokenization has improved as well, using a set of manually devised rules (except for Chinese, where a statistical system is used for segmentation). In the 2009 corpora, tokenization was based simply on whitespace.
Below the graph, we show "interesting" year ranges for your query terms. Clicking on those will submit your query directly to Google Books. Note that the Ngram Viewer is case-sensitive, but Google Books search results are not.
Those searches will yield phrases in the language of whichever corpus you selected, but the results are returned from the full Google Books corpus. So if you use the Ngram Viewer to search for a French phrase in the French corpus and then click through to Google Books, that search will be for the same French phrase -- which might occur in a book predominantly in another language.
If you aren't seeing the results you expect, it is perhaps for one of these reasons:
- The Ngram Viewer is case-sensitive. Try capitalizing your query or check the "case-insensitive" box to the right of the search box.
- You're searching in an unexpected corpus. For instance, Frankenstein doesn't appear in Russian books, so if you search in the Russian corpus you'll see a flatline. You can choose the corpus via the dropdown menu below the search box, or through the corpus selection operator, e.g., Frankenstein:eng_2012.
- Your phrase has a comma, plus sign, hyphen, asterisk, colon, or forward slash in it. Those have special meanings to the Ngram Viewer; see Advanced Usage. Try enclosing the phrase in square brackets (although this won't help with commas).
We apply a set of tokenization rules specific to the particular language. In English, contractions become two words (they're becomes the bigram they 're, we'll becomes we 'll, and so on). The possessive 's is also split off, but R'n'B remains one token. Negations (n't) are normalized so that don't becomes do not. In Russian, the diacritic ё is normalized to e, and so on. The same rules are applied to parse both the ngrams typed by users and the ngrams extracted from the corpora, which means that if you're searching for don't, don't be alarmed by the fact that the Ngram Viewer rewrites it to do not; it is accurately depicting usages of both don't and do not in the corpus. However, this means there is no way to search explicitly for the specific forms can't (or cannot): you get can't and can not and cannot all at once.
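The English examples above can be approximated with a few regular expressions. This is only a sketch under stated assumptions: it splits on whitespace, ignores punctuation and curly apostrophes, and special-cases only can't. The function `tokenize` and its patterns are invented for illustration and are not the rules the Ngram Viewer actually uses.

```python
import re

# Clitics split off as their own token: they're -> they 're, dog's -> dog 's.
_CLITIC = re.compile(r"^(\w+)('(?:re|ll|ve|s|d|m))$", re.IGNORECASE)
# Negative contractions are normalized: don't -> do not.
_NEGATION = re.compile(r"^(\w+)n't$", re.IGNORECASE)


def tokenize(text):
    """Split text on whitespace, then apply the contraction rules above."""
    tokens = []
    for word in text.split():
        if word.lower() == "can't":
            # Irregular: the generic rule would wrongly yield "ca not".
            tokens.extend([word[:3], "not"])
            continue
        negation = _NEGATION.match(word)
        if negation:
            tokens.extend([negation.group(1), "not"])
            continue
        clitic = _CLITIC.match(word)
        if clitic:
            tokens.extend([clitic.group(1), clitic.group(2)])
            continue
        tokens.append(word)  # e.g. R'n'B matches neither pattern and stays whole
    return tokens


print(tokenize("they're sure we'll say don't"))
print(tokenize("R'n'B"))
```

Running the same function over both the query and the corpus text is what makes a search for don't line up with do not, at the cost of no longer being able to tell the two apart.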
Below the Ngram Viewer chart, we provide a table of predefined Google Books searches, each narrowed to a range of years. We choose the ranges according to interestingness: if an ngram has a huge peak in a particular year, that will appear by itself as a search, with other searches covering longer durations.
Unlike the 2012 Ngram Viewer corpus, the Google Books corpus isn't part-of-speech tagged. One can't search for, say, the verb form of cheer in Google Books. So any ngrams with part-of-speech tags (e.g., cheer_VERB) are excluded from the table of Google Books searches.
The Ngram Viewer has 2009 and 2012 corpora, but Google Books doesn't work that way. When you're searching in Google Books, you're searching all the currently available books, so there may be some differences between what you see in Google Books and what you would expect to see given the Ngram Viewer chart.
Publishing was a relatively rare event in the 16th and 17th centuries. (There are only about 500,000 books published in English before the 19th century.) So if a phrase occurs in one book in one year but not in the preceding or following years, that creates a taller spike than it would in later years.
Plateaus are usually simply smoothed spikes. Change the smoothing to 0.
Often trends become more apparent when data is viewed as a moving average. A smoothing of 1 means that the data shown for 1950 will be an average of the raw count for 1950 plus 1 value on either side: ("count for 1949" + "count for 1950" + "count for 1951"), divided by 3. So a smoothing of 10 means that 21 values will be averaged: 10 on either side, plus the target value in the center of them.
At the left and right edges of the graph, fewer values are averaged. With a smoothing of 3, the leftmost value (pretend it's the year 1950) will be calculated as ("count for 1950" + "count for 1951" + "count for 1952" + "count for 1953"), divided by 4.
A smoothing of 0 means no smoothing at all: just raw data.
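The smoothing rule described above amounts to a centered moving average whose window shrinks at the edges. A minimal sketch, assuming the yearly values arrive as a list in chronological order (the function name `smooth` and the sample values are invented):

```python
def smooth(values, smoothing):
    """Average each value with up to `smoothing` neighbors on either side."""
    if smoothing < 0:
        raise ValueError("smoothing must be non-negative")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - smoothing)                 # window is cut off at the left edge
        end = min(len(values), i + smoothing + 1)     # and at the right edge
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Invented values for 1950 through 1954.
counts = [4, 8, 12, 16, 20]
print(smooth(counts, 1))
print(smooth(counts, 3)[0])  # leftmost value averages 1950 through 1953
print(smooth(counts, 0))
```

With a smoothing of 3 the first value here is (4 + 8 + 12 + 16) / 4, matching the four-value average described for the leftmost year.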
The growth in the number of books published over time would skew the results if we didn't normalize by the number of books published in each year.
Under heavy load, the Ngram Viewer will sometimes return a flatline; reload to confirm that there are actually no hits for the phrase. Also, we only consider ngrams that occur in at least 40 books. Otherwise the dataset would balloon in size and we wouldn't be able to offer them all.
The part-of-speech tags and dependency relations are predicted automatically. Assessing the accuracy of these predictions is difficult, but for modern English we expect the accuracy of the part-of-speech tags to be around 95% and the accuracy of dependency relations around 85%. On older English text and for other languages the accuracies are lower, but likely above 90% for part-of-speech tags and above 75% for dependencies. This implies a significant number of errors, which should be taken into account when drawing conclusions.
The part-of-speech tags are constructed from a small training set (a mere million words for English). This will sometimes underrepresent uncommon usages, such as green or dog or book as verbs, or ask as a noun.
An additional note on Chinese: Before the 20th century, classical Chinese was traditionally used for all written communication. Classical Chinese is based on the grammar and vocabulary of ancient Chinese, and the syntactic annotations will therefore be wrong more often than they're right.
Also, note that the 2009 corpora have not been part-of-speech tagged.
If you're going to use this data for an academic publication, please cite the original paper:
Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010)
We also have a paper on our part-of-speech tagging:
Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, William Brockman, Slav Petrov. Syntactic Annotations for the Google Books Ngram Corpus. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics Volume 2: Demo Papers (ACL '12) (2012)
Yes! The ngram data is available for download here. To make the file sizes manageable, we've grouped them by their starting letter and then grouped the different ngram sizes in separate files. The ngrams within each file are not alphabetically sorted.
To generate machine-readable filenames, we transliterated the ngrams for languages that use non-roman scripts (Chinese, Hebrew, Russian) and used the starting letter of the transliterated ngram to determine the filename. The same approach was taken for characters such as ä in German. Note that the transliteration was used only to determine the filename; the actual ngrams are encoded in UTF-8 using the language-specific alphabet.
Ngram Viewer graphs and data may be freely used for any purpose, although acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a link to http://books.google.com/ngrams, would be appreciated.
The Google Ngram Viewer Team, part of Google Research