In the last decade or so, various fields with the suffix “–omics” have risen in the biological and biomedical sciences. The oldest and best known is genomics, the high-throughput study of all the genes in the genome. Together with other emerging fields such as transcriptomics, proteomics and connectomics, culturomics is taking its place in the omics family.
What is Culturomics?
Culturomics is the collection and analysis of large amounts of data for the study of human culture. The study of human cultures has traditionally been the domain of the social sciences and the humanities. While these disciplines differ in their interests and goals, they share some common methods, such as the close human examination of materials like books and historical artifacts. The culturomic approach employs a more quantitative methodology. It applies the power of modern computers to collect and analyze huge amounts of data in order to gain statistical insights into human cultural trends. It is a new science applied to age-old questions about human cultures: How did they evolve? What drives their evolution? How do ideas and popularity spread?
A forerunner of this new science was a major quantitative analysis of printed books, published recently in the journal Science (Michel et al., 2011). The research was a collaboration between technology giant Google and a multi-disciplinary team of academics based mainly at Harvard University. Over 15 million printed books, mostly in the English language, were scanned using custom scanning machines and converted into data that could be analyzed by computers. This dataset is believed to contain around 12% of all books ever printed. A subset of this massive collection, comprising over 500 billion words, was analyzed to determine how often certain words appeared in print. The number of times a word appeared in a given year was divided by the total number of words printed in that year. As a result, a change in a word's frequency reflects a real change in how prominently it featured in print, not the general increase in the number of printed words over time.
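The normalization described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `relative_frequency` and the two-year toy corpus are hypothetical, invented purely to show the arithmetic of dividing a word's count by the total word count for the same year.

```python
from collections import Counter

def relative_frequency(word, tokens_by_year):
    """Return {year: occurrences of `word` / total words printed that year}."""
    target = word.lower()
    frequencies = {}
    for year, tokens in tokens_by_year.items():
        total = len(tokens)
        if total == 0:
            continue  # no text for this year, so no frequency can be computed
        counts = Counter(token.lower() for token in tokens)
        frequencies[year] = counts[target] / total
    return frequencies

# Toy corpus (invented for illustration; not data from the study).
toy_corpus = {
    1850: "slavery was debated and slavery was opposed".split(),
    1950: ("the war ended and the debate over slavery "
           "in history books went on and on").split(),
}

freqs = relative_frequency("slavery", toy_corpus)
for year in sorted(freqs):
    print(year, round(freqs[year], 4))
```

Because each year's count is divided by that year's total, the later year does not score higher simply for containing more text.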
This analysis provided interesting perspectives on diverse aspects of history, linguistics and cultural evolution. For example, as shown in Figure 1a, the usage frequency of the word “slavery” in these books rose steadily in the 1800s and peaked during the American Civil War (1861-1865). It rose again during the civil rights movement of the 1950s and 1960s, highlighting the potential of this method to reflect known cultural shifts over time. The analysis was not restricted to single words. Phrases could also be analyzed, as shown in Figure 1b, which documents the rise and fall of the phrases “the Great War”, “World War I” and “World War II”.
This research went beyond just correlating word usage frequency with known cultural trends. It also provided insights into the evolution of language. For example, the authors quantified the rate of regularization of verbs, such as the switch from the irregular form “burnt” to the regular form “burned”. The most interesting analyses pertained to “cultural memory”. The authors found that just as individuals remember and forget, so do societies. Using the digits of years such as “1883”, they could trace the rise and fall of the memory of specific years. For example, mentions of “1883” peaked around 1883, after which the year was gradually “forgotten” (Figure 1c). Intriguingly, both the rate of rise in popularity and the rate of decline (forgetting) were much faster for more recent years (such as “1950”) than for older years. These trends were replicated in the frequency of celebrity names: celebrities of the 20th Century rose to fame more quickly but were also forgotten twice as quickly as their peers in the 19th Century. The authors quipped that “in the future, everyone will be famous for 7.5 minutes.”
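One simple way to put a number on a shift like “burnt” to “burned” is to track, year by year, the share of all uses that take the regular form. The sketch below is only an illustration under that assumption and is not necessarily the measure the authors used. The function name `regular_share` and the counts are hypothetical.

```python
def regular_share(regular_counts, irregular_counts):
    """Return {year: regular / (regular + irregular)} for years present in both."""
    shares = {}
    for year in sorted(set(regular_counts) & set(irregular_counts)):
        total = regular_counts[year] + irregular_counts[year]
        if total > 0:
            shares[year] = regular_counts[year] / total
    return shares

# Invented counts for illustration only; not data from the study.
burned = {1800: 20, 1900: 50, 2000: 90}
burnt = {1800: 80, 1900: 50, 2000: 10}

shares = regular_share(burned, burnt)
for year, share in shares.items():
    print(year, share)
```

A share that climbs toward 1 over time would indicate that the regular form is displacing the irregular one.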
Like any new science, culturomics has its fair share of detractors, who criticize not only its methods but also its usefulness. Since its publication, the pioneering culturomics paper by Michel et al. (2011) has drawn various criticisms. One set of criticisms revolves around the technical limitations of the scanning system that converts images into bits of data. Mistakes can be made, such as “s” being systematically mistaken for “f”, especially in older works. But such technical limitations, including the use of mostly English language books as data, will likely be overcome as technology improves over time. Another, more potent criticism deals with the kind of computational analysis that is possible in such large-scale “distant reading”. The analyses do not actually differentiate parts of speech or meaning, the latter being the staple of close reading in literature and cultural studies. Counting words does not count for much in the world of printed literature, as critics would put it; counting does not reveal the subtleties of the written word in the way that reading does. Proponents counter that some of these criticisms may reflect an old, reflexive fear of turning the humanities and social sciences into the domain of computers and computation. In any case, culturomics does not replace the close reading that is prominent in the humanities, but instead serves as yet another tool that can be used to understand human cultures.
Future plans for culturomics include launching large-scale analyses of newspapers, maps, artwork, music and films, with the hope of providing further insights into cultural evolution. While it is not clear what hypotheses can be tested with such data and what insights will be produced, it is certain that new computational tools will need to be developed. Large-scale linguistic tools were used for analyzing words and texts, but machine vision and image analysis tools will be necessary for deciphering art, maps and other cultural images. It is likely that the resources, methods, and algorithms pioneered on the cultural data will benefit researchers within and beyond the field. In this sense, culturomics is just like genomics: a discovery-driven science that creates data and resources for all to use.
Farhan Ali is a PhD student at the Department of Organismic & Evolutionary Biology.
Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014), 176-182.