The GDELT Project

3.5 Million Books 1800-2015: GDELT Processes Internet Archive and HathiTrust Book Archives, Now Available in Google BigQuery

Today we are enormously excited to announce that more than 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes), have been processed using the GDELT Global Knowledge Graph and are now available in Google BigQuery.  More than a billion pages stretching back 215 years have been examined to compile a list of all people, organizations, and other names, fulltext geocoded to render them fully mappable, and more than 4,500 emotions and themes compiled.  All of this computed metadata is combined with all available book-level metadata, including title, author, publisher, and subject tags as provided by the contributing libraries.  Even more excitingly, the complete fulltext of all Internet Archive books published 1800-1922 is included to allow you to perform your own near-realtime analyses.  All of this is housed in Google BigQuery, making it possible to perform sophisticated analyses across 122 years of history in just seconds.  A single line of SQL can execute even the most complex regular expression or complete JavaScript algorithm over nearly half a terabyte of fulltext in just 11 seconds and combine it with all of the extracted data above.  Track emotions or themes over time or map the geography of the world as seen through books – the sky is the limit!

The Internet Archive books include all books in the Archive’s American Libraries collection for which English-language fulltext was available using the search “collection:(americana)”.  For 1800-1922, a few hundred books per year met these criteria but are not included here, because the combined size of their extracted metadata, library-provided metadata, and fulltext exceeded 2MB, which is the maximum record size for Google BigQuery.  For HathiTrust, all English-language public domain books 1800-2015 were provided by HathiTrust as part of a special research extract.  Only public domain volumes were requested.

You will find a much higher degree of error in both the computed and library-provided metadata in this collection compared with other GDELT collections.  This stems from the high level of OCR error in earlier works, the inadvertent inclusion of non-English works due to errors in language metadata, language differences over time, the lack of robust global gazetteers and support databases for historical placenames and organization names in the 1800s and early 1900s, changes in word connotations and spellings, and so on.  Library-provided metadata can differ substantially in quality and accuracy, even within the same collection over time.  Differences in the handling of publication date, especially with regard to periodicals and reprintings, can introduce certain nuances into the datasets.

In particular, keep in mind that both collections change substantially in 1923 with the beginning of the so-called “copyright barrier”, so most analyses should stay within the 1800-1922 period for consistent results over time.

Examples of BigQuery SQL code that can be used with these tables can be found at Google BigQuery + GKG 2.0: Sample Queries and we will be releasing another set of sample queries demonstrating various kinds of analyses with these two collections.  When using the two collections in BigQuery, use the following “FROM” clauses:

While the code above looks a bit complex, you can copy-paste it as-is and simply change the “1800” and “2015” dates to the start and end years that you want to use (or set them to the same year to examine just a single year).

ACKNOWLEDGEMENTS

We'd like to extend our tremendous thanks to Google, Clemson University, the Internet Archive, HathiTrust, and OCLC for making this project possible.

For the Internet Archive collection, we'd like to thank the Internet Archive and the contributing libraries and digitization sponsors that have made all of these digitized books available through the Archive.  Their incredible work in creating an archive of millions of freely available books made this dataset possible.

We'd like to thank HathiTrust for creating the customized research collection used in this project and for the staff time they provided to assist in answering questions and generating the extract.

We'd like to thank OCLC for permitting the inclusion of their subject tags in the HathiTrust BigQuery table.  Subject tags found in the HathiTrust table are derived from the OCLC WorldCat® database. OCLC has granted permission for the subject tags to be included in this dataset. The subject tags may not be used outside the bounds of this dataset.

We'd like to thank Google for the computing power that made the processing of this collection possible (stay tuned for a blog post detailing the technical details of how we actually processed all of this material, which included one of Google's largest Compute Engine machines with 32 cores, 200GB of RAM, 10TB of persistent solid state disk, and four 375GB direct-attached local solid state scratch disks).

GET STARTED NOW

You can jump right in and start working with the two Google BigQuery tables below.  You'll need to sign up for a Google account.  Note that BigQuery is a commercial service by Google and that Google requires billing to be enabled in case you exceed the free monthly quota it provides all users.  We recommend that you start small, examining the early 1800s to minimize the amount of data you process until you are comfortable using BigQuery.

 

TECHNICAL DOCUMENTATION

The remainder of this blog post details the contents of each field in the Google BigQuery tables for the two collections and how to use them.

Several of the following fields (marked with “NOT USED”) are not used for either of the books collections and will be NULL for all records.  They were included so that the column ordering of the dataset matches other GDELT datasets, making it easier to load into databases designed to hold other GDELT datasets.

 

These two fields are not used and were included to ensure that column ordering remained the same as other GDELT datasets.

 

BOOK-LEVEL METADATA

Finally, the remaining metadata fields are prefaced with “BookMeta_” and contain the record-level metadata about each book held by the Internet Archive or HathiTrust.  These fields are provided as-is in their raw form directly as they appear in the record metadata fields (compiled from the MARC XML records in the case of HathiTrust).  The set of available fields differs between the Internet Archive and HathiTrust collections due to their use of different metadata schemas.  Note that even common fields like subject tags may have very different contents or formatting between the two collections.

 

Internet Archive

Books published from 1800 to 1922, inclusive, are stored within yearly tables named simply as the year (1800, 1801, 1802, etc).  Books published 1923 and after are stored in yearly tables ending in “notxt” (such as “1923notxt”) to indicate that fulltext is not available (fulltext is only available for books published from 1800-1922).
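As a rough illustration of this naming scheme, the short Python sketch below maps a publication year to the name of the yearly Internet Archive table described above.  The function names are hypothetical helpers made up for this example, and the 1800-2015 bounds are an assumption taken from the date range in the title of this post; the sketch only builds table names and does not query BigQuery.

```python
# Hypothetical helpers illustrating the yearly table naming described above.
# Assumption: the collection spans 1800-2015, as stated in the post's title.
FIRST_YEAR = 1800
LAST_YEAR = 2015
FULLTEXT_LAST_YEAR = 1922  # fulltext is only available through 1922


def ia_table_name(year: int) -> str:
    """Return the yearly Internet Archive table name for a publication year."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"year {year} is outside {FIRST_YEAR}-{LAST_YEAR}")
    # Tables through 1922 are named as the bare year; later tables end in "notxt".
    if year <= FULLTEXT_LAST_YEAR:
        return str(year)
    return f"{year}notxt"


def ia_table_names(start_year: int, end_year: int) -> list[str]:
    """Return the table names for an inclusive range of publication years."""
    if start_year > end_year:
        raise ValueError("start_year must not be later than end_year")
    return [ia_table_name(year) for year in range(start_year, end_year + 1)]


if __name__ == "__main__":
    # Print the table names on either side of the 1923 boundary.
    print(ia_table_names(1921, 1924))
```

Setting the start and end years to the same value yields a single table name, which mirrors examining just one year.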

 

HathiTrust

HathiTrust books are stored in yearly tables by publication year.  No fulltext is available for any HathiTrust works, so there is no difference in the table names before and after 1923.