
3.5 Million Books 1800-2015: GDELT Processes the Internet Archive and HathiTrust Book Archives, Now Available in Google BigQuery

Today we are enormously excited to announce that more than 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes), have been processed using the GDELT Global Knowledge Graph and are now available in Google BigQuery.  More than a billion pages stretching back 215 years have been examined to compile a list of all people, organizations, and other names; fulltext geocoded to render every location fully mappable; and assessed across more than 4,500 emotions and themes.  All of this computed metadata is combined with all available book-level metadata, including title, author, publisher, and subject tags as provided by the contributing libraries.  Even more excitingly, the complete fulltext of all Internet Archive books published 1800-1922 is included to allow you to perform your own analyses.  All of this is housed in Google BigQuery, making it possible to perform sophisticated analyses across 122 years of history in just seconds.  A single line of SQL can execute even the most complex regular expression or complete JavaScript algorithm over nearly half a terabyte of fulltext in just 11 seconds and combine it with all of the extracted data above.  Track emotions or themes over time or map the geography of the world as seen through books – the sky is the limit!

The Internet Archive books include all books in the Archive’s American Libraries collection for which English-language fulltext was available using the search “collection:(americana)”.  From 1800-1922, a few hundred books per year met these criteria but had a combined size of extracted metadata, library-provided metadata, and fulltext exceeding 2MB, the maximum record size for Google BigQuery, and so are not included here.  For HathiTrust, all English-language public domain books 1800-2015 were provided by HathiTrust as part of a special research extract; only public domain volumes were requested.

You will find a much higher degree of error in both the computed and library-provided metadata in this collection compared with other GDELT collections.  This stems from the high level of OCR error in earlier works, the inadvertent inclusion of non-English works due to errors in language metadata, language differences over time, the lack of robust global gazetteers and support databases for historical placenames and organization names in the 1800s and early 1900s, changes in word connotations and spellings, and so on.  Library-provided metadata can differ substantially in quality and accuracy, even within the same collection over time.  Differences in the handling of publication dates, especially with regard to periodicals and reprintings, can introduce certain nuances into the datasets.

In particular, keep in mind that both collections change substantially in 1923 with the beginning of the so-called “copyright barrier”, so most analyses will likely wish to stay within the 1800-1922 period for consistent results over time.

Examples of BigQuery SQL code that can be used with these tables can be found at Google BigQuery + GKG 2.0: Sample Queries, and we will be releasing another set of sample queries demonstrating various kinds of analyses with these two collections.  When using the two collections in BigQuery, use the following “FROM” clauses:

  • Internet Archive: FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "2015"'))
  • HathiTrust: FROM (TABLE_QUERY([gdelt-bq:hathitrustbooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "2015"'))

While the code above looks a bit complex, you can copy-paste it as-is and simply change the “1800” and “2015” dates to the start and end years that you want to use (or set them to the same year to examine just a single year).
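As a concrete illustration, the following minimal sketch follows the FROM clause pattern above to count HathiTrust books per year whose Themes field (described in the technical documentation later in this post) mentions the EDUCATION theme.  Adjust the years and theme to suit:

```sql
-- Sketch (legacy SQL): count HathiTrust books per year mentioning the
-- EDUCATION theme across 1800-1922.
SELECT DATE, COUNT(*) AS books
FROM (TABLE_QUERY([gdelt-bq:hathitrustbooks],
  'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "1922"'))
WHERE Themes LIKE '%EDUCATION%'
GROUP BY DATE
ORDER BY DATE
```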

ACKNOWLEDGEMENTS

We'd like to extend our tremendous thanks to Google, Clemson University, the Internet Archive, HathiTrust, and OCLC for making this project possible.

For the Internet Archive collection, we'd like to thank the Internet Archive and the contributing libraries and digitization sponsors that have made all of these digitized books available through the Archive.  Their incredible work in creating an archive of millions of freely available books made this dataset possible.

We'd like to thank HathiTrust for creating the customized research collection used in this project and for the staff time they provided to assist in answering questions and generating the extract.

We'd like to thank OCLC for permitting the inclusion of their subject tags in the HathiTrust BigQuery table.  Subject tags found in the HathiTrust table are derived from the OCLC WorldCat® database. OCLC has granted permission for the subject tags to be included in this dataset. The subject tags may not be used outside the bounds of this dataset.

We'd like to thank Google for the computing power that made the processing of this collection possible (stay tuned for a blog post detailing the technical details of how we actually processed all of this material, which included one of Google's largest Compute Engine machines with 32 cores, 200GB of RAM, 10TB of persistent solid state disk, and four 375GB direct-attached local solid state scratch disks).

GET STARTED NOW

You can jump right in and start working with the two Google BigQuery tables below.  You'll need to sign up for a Google account.  Note that BigQuery is a commercial service from Google and that Google requires billing to be enabled in case you exceed the free monthly quota it provides all users.  We recommend that you start small, examining the early 1800s to minimize the amount of data you process until you are comfortable using BigQuery.

 

TECHNICAL DOCUMENTATION

The remainder of this blog post details the contents of each field in the Google BigQuery tables for the two collections and how to use them.

  • GKGRECORDID. This is a unique identifier assigned to every book.  For Internet Archive books it begins with “iabook-“ followed by the unique Internet Archive identifier for the book, while for HathiTrust books it begins with “htbook-“ followed by the unique HathiTrust identifier for the book.
  • DATE. This is the four-digit year of publication of the book.  Note that both the Internet Archive and HathiTrust have a certain level of error in this field and may differ over time in how they assign a publication year to periodicals and reprints.
  • SourceCollectionIdentifier. This is a numeric code used by GDELT to indicate the collection the document came from.  It will be “7” for all Internet Archive books and “8” for all HathiTrust books.
  • SourceCommonName. This is a human-friendly textual label indicating the collection the document came from.  It will be “InternetArchiveBooks” for all Internet Archive books and “HathiTrustBooks” for all HathiTrust books.
  • DocumentIdentifier. This is the unique Internet Archive or HathiTrust-provided identifier for the book.
  • Counts. (semicolon-delimited blocks, with pound symbol (“#”) delimited fields)  This is the list of “Counts” found in this document, where a Count is defined as one of a subset of selected GDELT Themes appearing beside a numeric indicator, such as a phrase like “22 killed”.  Each Count found is separated with a semicolon, while the fields within a Count are separated by the pound symbol (“#”).  Unlike the primary GDELT event stream, these records are not issued unique identifier numbers, nor are they dated.  As an example of how to interpret this field, an entry with CountType=KILL, Number=47, ObjectType=”jihadists” indicates that the text contained a passage like “47 jihadists were killed.”
    • Count Type. (text)  This is the category this count is of.  At the time of this writing, this is most often AFFECT, ARREST, KIDNAP, KILL, PROTEST, SEIZE, or WOUND, though other categories may appear here as well in certain circumstances when they appear in context with one of these categories, or as other Count categories are added over time.  A value of “PROTEST” in this field would indicate that this is a count of the number of protesters at a protest.
    • Number. (integer)  This is the actual count being reported.  If CountType is “PROTEST” and Number is 126, this means that the source article contained a mention of 126 protesters.
    • Object Type. (text)  This records any identifying information as to what the number refers to.  For example, a mention of “20 Christian missionaries were arrested” will result in “Christian missionaries” being captured here.  This field will be blank in cases where no identifying information could be identified.
    • Location Type. See the documentation for the Locations field below.
    • Location FullName. See the documentation for the Locations field below.
    • Location CountryCode. See the documentation for the Locations field below.
    • Location ADM1Code. See the documentation for the Locations field below.
    • Location Latitude. See the documentation for the Locations field below.
    • Location Longitude. See the documentation for the Locations field below.
    • Location FeatureID. See the documentation for the Locations field below.
  • V2Counts. (semicolon-delimited blocks, with pound symbol (“#”) delimited fields)  This field is identical to the Counts field except that it adds a final additional field to the end of each entry that records its approximate character offset in the document, allowing it to be associated with other entries from other “V2” fields that appear in closest proximity to it.  Note: unlike the other location-related fields, the Counts field does NOT add ADM2 support at this time.  This is to maintain compatibility with assumptions that many applications make about the contents of the Count field.  Those applications needing ADM2 support for Counts should cross-reference the FeatureID field of a given Count against the V2Locations field to determine its ADM2 value.
  • Themes. (semi-colon-delimited)  This is the list of all recognized GDELT Themes found in the document.  The complete list of themes is constantly growing, but you can browse a sample of records to see the kind of themes present in this field.    Note that this field is NOT arbitrary topic extraction, it contains a list of which predefined GDELT Themes were found in the document.
  • V2Themes. (semicolon-delimited blocks, with comma-delimited fields)  This contains a list of all GDELT Themes referenced in the document, along with the character offsets of approximately where in the document they were found.  Each theme reference is separated by a semicolon, and within each reference, the name of the theme is specified first, followed by a comma, and then the approximate character offset of the reference to that theme in the document, allowing it to be associated with other entries from other “V2” fields that appear in closest proximity to it.  If a theme is mentioned multiple times in a document, each mention will appear separately in this field.
  • Locations. (semicolon-delimited blocks, with pound symbol (“#”) delimited fields)  This is a list of all locations found in the text, extracted through a second-generation version of the Leetaru (2012) algorithm.  The algorithm is run in a more aggressive stance here than usual in order to extract every possible locative referent, so it may have a slightly elevated level of false positives.  The gazetteers used reflect the world circa 1979-2015, so may not properly capture all historical placenames.  NOTE: some locations have multiple accepted formal or informal names and this field is collapsed on name, rather than feature (since in some applications the understanding of a geographic feature differs based on which name was used to reference it).  In cases where it is necessary to collapse by feature, the Geo_FeatureID column should be used, rather than the Geo_Fullname column.  This is because the Geo_Fullname column captures the name of the location as expressed in the text and thus reflects differences in transliteration, alternative spellings, and alternative names for the same location.  For example, Mecca is often spelled Makkah, while Jeddah is commonly spelled Jiddah or Jaddah.  The Geo_Fullname column will reflect each of these different spellings, while the Geo_FeatureID column will resolve them all to the same unique GNS or GNIS feature identification number.  For more information on the GNS and GNIS identifiers, see Leetaru (2012).
    • Location Type. (integer) This field specifies the geographic resolution of the match type and holds one of the following values:  1=COUNTRY (match was at the country level), 2=USSTATE (match was to a US state), 3=USCITY (match was to a US city or landmark), 4=WORLDCITY (match was to a city or landmark outside the US), 5=WORLDSTATE (match was to an Administrative Division 1 outside the US – roughly equivalent to a US state).  This can be used to filter counts by geographic specificity, for example, extracting only those counts with a landmark-level geographic resolution for mapping.  Note that matches with codes 1 (COUNTRY), 2 (USSTATE), and 5 (WORLDSTATE) will still provide a latitude/longitude pair, which will be the centroid of that country or state, but the FeatureID field below will contain its textual country or ADM1 code instead of a numeric featureid.
    • Location FullName. (text) This is the full human-readable name of the matched location.  In the case of a country it is simply the country name.  For US and World states it is in the format of “State, Country Name”, while for all other matches it is in the format of “City/Landmark, State, Country”.  This can be used to label locations when placing counts on a map.  Note: this field reflects the precise name used to refer to the location in the text itself, meaning it may contain multiple spellings of the same location – use the FeatureID column to determine whether two location names refer to the same place.
    • Location CountryCode. (text) This is the 2-character FIPS10-4 country code for the location.  Note: GDELT continues to use the FIPS10-4 codes under USG guidance while GNS continues its formal transition to the successor Geopolitical Entities, Names, and Codes (GENC) Standard (the US Government profile of ISO 3166).
    • Location ADM1Code. (text) This is the 2-character FIPS10-4 country code followed by the 2-character FIPS10-4 administrative division 1 (ADM1) code for the administrative division housing the landmark.  In the case of the United States, this is the 2-character shortform of the state’s name (such as “TX” for Texas).  Note: see the notice above for CountryCode regarding the FIPS10-4 / GENC transition.  Note: to obtain ADM2 (district-level) assignments for locations, use V2Locations below instead.
    • Location Latitude. (floating point number) This is the centroid latitude of the landmark for mapping.  In the case of a country or administrative division this will reflect the centroid of that entire country/division.
    • Location Longitude. (floating point number) This is the centroid longitude of the landmark for mapping.  In the case of a country or administrative division this will reflect the centroid of that entire country/division.
    • Location FeatureID. (text OR signed integer) This is the numeric GNS or GNIS FeatureID for this location OR a textual country or ADM1 code.  More information on these values can be found in Leetaru (2012).  Note: This field will be blank or contain a textual ADM1 code for country or ADM1-level matches – see above.  Note: For numeric GNS or GNIS FeatureIDs, this field can contain both positive and negative numbers; see Leetaru (2012) for more information on this.
  • V2Locations. (semicolon-delimited blocks, with pound symbol (“#”) delimited fields)  This field is identical to the Locations field with the primary exception of an extra field appended to the end of each location block after its FeatureID that lists the approximate character offset of the reference to that location in the text.  In addition, if a location appears multiple times in the article, it will be listed multiple times in this field.  The only other modification from Locations is the addition of a single new field “Location ADM2Code” in between “Location ADM1Code” and “Location Latitude”.
  • Persons. (semicolon-delimited)  This is the list of all person names found in the text, extracted through the Leetaru (2012) algorithm.  This name recognition algorithm is unique in that it is specially designed to recognize the African, Asian, and Middle Eastern names that yield significantly reduced accuracy with most name recognition engines.
  • V2Persons. (semicolon-delimited blocks, with comma-delimited fields)  This contains a list of all person names referenced in the document, along with the character offsets of approximately where in the document they were found.  Each person reference is separated by a semicolon, and within each reference, the person name is specified first, followed by a comma, and then the approximate character offset of the reference of that person in the document, allowing it to be associated with other entries from other “V2” fields that appear in closest proximity to it.  If a person is mentioned multiple times in a document, each mention will appear separately in this field.
  • Organizations. (semicolon-delimited)  This is the list of all company and organization names found in the text, extracted through the Leetaru (2012) algorithm.  This is a combination of corporations, IGOs, NGOs, and any other local organizations such as a local fair or council.  This engine is highly adaptive and is currently tuned to err on the side of inclusion when it is less confident about a match to ensure maximal recall of smaller organizations around the world that are of especial interest to many users of the GKG.  Conversely, certain smaller companies with names and contexts that do not provide a sufficient recognition latch may be missed or occasionally misclassified as a person name depending on context.  It is highly recommended that users of the Persons and Organizations fields histogram the results and discard names appearing just once or twice to eliminate most of these false positive matches.  Users wishing to compile lists of major NGOs, aid organizations, terror and political groups, and so on should take a look at the Themes and V2Themes fields, since there are specific “TAX_” categories that capture these more extensively.
  • V2Organizations. (semicolon-delimited blocks, with comma-delimited fields)  This contains a list of all organizations/companies referenced in the document, along with the character offsets of approximately where in the document they were found.  Each organization reference is separated by a semicolon, and within each reference, the name of the organization is specified first, followed by a comma, and then the approximate character offset of the reference of that organization in the document, allowing it to be associated with other entries from other “V2” fields that appear in closest proximity to it.  If an organization is mentioned multiple times in a document, each mention will appear separately in this field.
  • V2Tone. (comma-delimited floating point numbers)  This field contains a comma-delimited list of six core emotional dimensions, described in more detail below.  Each is recorded as a single precision floating point number.
    • Tone. (floating point number)  This is the average “tone” of the document as a whole.  The score ranges from -100 (extremely negative) to +100 (extremely positive).  Common values range between -10 and +10, with 0 indicating neutral.  This is calculated as Positive Score minus Negative Score.  Note that both Positive Score and Negative Score are available separately below as well.  A document with a Tone score close to zero may either have low emotional response or may have a Positive Score and Negative Score that are roughly equivalent to each other, such that they nullify each other.  These situations can be detected either through looking directly at the Positive Score and Negative Score variables or through the Polarity variable.
    • Positive Score. (floating point number)  This is the percentage of all words in the article that were found to have a positive emotional connotation.  Ranges from 0 to +100.
    • Negative Score. (floating point number)  This is the percentage of all words in the article that were found to have a negative emotional connotation.  Ranges from 0 to +100.
    • Polarity. (floating point number)  This is the percentage of words that had matches in the tonal dictionary as an indicator of how emotionally polarized or charged the text is.  If Polarity is high, but Tone is neutral, this suggests the text was highly emotionally charged, but had roughly equivalent numbers of positively and negatively charged emotional words.
    • Activity Reference Density. (floating point number)  This is the percentage of words that were active words offering a very basic proxy of the overall “activeness” of the text compared with a clinically descriptive text.
    • Self/Group Reference Density. (floating point number)  This is the percentage of all words in the article that are pronouns, capturing a combination of self-references and group-based discourse.  News media material tends to have very low densities of such language, but this can be used to distinguish certain classes of news media and certain contexts.
    • Word Count. (integer)  This is the total number of words in the document.
  • Dates. (semicolon-delimited blocks, with comma-delimited fields)  This contains a list of all date references in the document, along with the character offsets of approximately where in the document they were found.  If a date was mentioned multiple times in a document, it will appear multiple times in this field, once for each mention.  Each date reference is separated by a semicolon, while the fields within a date are separated by commas.
    • Date Resolution. This indicates whether the date was a month-day date that did not specify a year (4), a fully-resolved day-level date that included the year (3), a month-level date that included the year but not a day (2), or a year-level (1) date that did not include month or day-level information.
    • Month. This is the month of the date, represented as 1-12.  For year-level dates this field will contain a 0.
    • Day. This is the day of the date, represented as 1-31.  For month- and year-level dates this field will contain a 0.
    • Year. This is the year of the date.  For Resolution=4 dates that include a month and day but not a year, this field will contain a 0.
    • Offset. This is the character offset of the date within the document, indicating approximately where it was found in the body.  This can be used to associate the date with the entries from other “V2” fields that appeared in closest proximity to it.
  • V2GCAM. (comma-delimited blocks, with colon-delimited key/value pairs)  The Global Content Analysis Measures (GCAM) system runs an array of content analysis systems over each document and compiles their results into this field.  New content analysis systems will be constantly added to the GCAM pipeline over time, meaning the set of available fields will constantly grow over time.  Given that the GCAM system debuted with over 2,300 dimensions and continues to grow over time, it differs in its approach to encoding matches from the GKG’s native thematic coding system.  Instead of displaying the full English name of a content analysis dictionary or dimension, it assigns each dictionary a unique numeric identifier (DictionaryID) and each dimension within that dictionary is assigned a unique identifier from 1 to the number of dimensions in the dictionary (DimensionID).  Each dimension of each dictionary is assessed on a document and ONLY those dimensions that had one or more matches onto the document are reported.  If a dimension did not register any matches on a document, it is not reported in order to save space.  Thus, the absence of a dimension in this field can be interpreted as a score of 0.  Each dimension’s score is written to the V2GCAM field separated by a comma.  For each dimension, a numeric “key” identifies it of the form “DictionaryID.DimensionID”, followed by a colon, followed by its score.  Most dictionaries are count-based, meaning they report how many words in the document were found in that dictionary.  Thus, a score of 18 would mean that 18 words from the document were found in that dictionary.  Count-based dimensions have a key that begins with “c”.  Some dictionaries, such as SentiWordNet and SentiWords actually assign each word a numeric score and the output of the tool is the average of those scores for that document.  
For those dictionaries, an entry will report the number of words in the document that matched into that dictionary, and a separate entry, beginning with a “v” instead of a “c”, will report its floating-point average value.  The very first entry in the field has the special reserved key of “wc” and reports the total number of words in the document – this can be used to divide the score of any word-count field to convert it to a percentage density score.  As an example, assume a document with 125 words.  The General Inquirer dictionary has been assigned the DictionaryID of 2 and its “Bodypt” dimension has a DimensionID of 21.  SentiWordNet has a DictionaryID of 10 and its “Positive” dimension has a DimensionID of 1.  Thus, the V2GCAM field for a document might look like “wc:125,c2.21:4,c10.1:40,v10.1:3.21111111” indicating that the document had 125 words, that 4 of those words were found in the General Inquirer “Bodypt” lexicon, that 40 of those words were found in the SentiWordNet lexicon, and that the average numeric score of all of the words found in the SentiWordNet lexicon was 3.21111111.  For a complete list of the available dimensions, along with their assigned DictionaryID and DimensionID codes, their assigned key, and their human name and full citation to cite that dimension, please see the GCAM Master Codebook.
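All of the delimited fields above share a small set of conventions: semicolons between blocks, with “#” or commas between the fields of a block.  As a rough illustration (this is a sketch, not official GDELT tooling, and the sample values below are hypothetical), they can be unpacked with a few lines of Python:

```python
# Illustrative parsers for the delimited fields documented above.  Field
# orderings follow the descriptions in this post; sample values are
# hypothetical.

COUNT_FIELDS = ["CountType", "Number", "ObjectType", "LocationType",
                "LocationFullName", "LocationCountryCode",
                "LocationADM1Code", "LocationLatitude",
                "LocationLongitude", "LocationFeatureID"]
DATE_FIELDS = ["Resolution", "Month", "Day", "Year", "CharOffset"]

def parse_blocks(raw, field_names, sep="#"):
    """Split ';'-separated blocks into dicts of sep-separated fields."""
    return [dict(zip(field_names, block.split(sep)))
            for block in raw.split(";") if block]

def parse_name_offsets(raw):
    """Parse 'name,offset' blocks (V2Themes, V2Persons, V2Organizations,
    AllNames).  rpartition tolerates commas inside the name itself."""
    entries = []
    for block in raw.split(";"):
        if block:
            name, _, offset = block.rpartition(",")
            entries.append((name, int(offset)))
    return entries

def parse_gcam(raw):
    """Parse V2GCAM 'key:score' pairs: 'wc' is the word count, 'c*' keys
    are dictionary word counts, 'v*' keys are average scores."""
    return {key: float(score)
            for key, _, score in (entry.partition(":")
                                  for entry in raw.split(",") if entry)}

# A Count meaning roughly "47 jihadists were killed" (location fields
# here are invented for illustration)
counts = parse_blocks("KILL#47#jihadists#1#Iraq#IZ#IZ#33#44#IZ",
                      COUNT_FIELDS)
print(counts[0]["CountType"], counts[0]["Number"])  # KILL 47

# A fully-resolved (Resolution=3) date: 4 July 1863 at offset 1205
dates = parse_blocks("3,7,4,1863,1205", DATE_FIELDS, sep=",")
print(dates[0]["Year"])  # 1863

# V2GCAM: convert a count dimension to a percentage density using 'wc'
gcam = parse_gcam("wc:125,c2.21:4,c10.1:40,v10.1:3.21111111")
print(100.0 * gcam["c2.21"] / gcam["wc"])  # 3.2
```

The same `parse_blocks` helper works for Locations (seven “#”-delimited fields) and V2Locations (nine fields, with ADM2Code and the character offset added), since only the field list changes.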

Several of the following fields (marked with “NOT USED”) are not used for either of the book collections and will be NULL for all records.  They were included so that the column ordering of the dataset matches other GDELT datasets, making it easier to load into databases designed to hold other GDELT datasets.

  • SharingImage. NOT USED. This field is not used and is NULL for all records.  It was included so that the column ordering matched other GDELT datasets for easy database loading.
  • RelatedImages. NOT USED. This field is not used and is NULL for all records.  It was included so that the column ordering matched other GDELT datasets for easy database loading.
  • SocialImageEmbeds. NOT USED. This field is not used and is NULL for all records.  It was included so that the column ordering matched other GDELT datasets for easy database loading.
  • SocialVideoEmbeds. NOT USED. This field is not used and is NULL for all records.  It was included so that the column ordering matched other GDELT datasets for easy database loading.

 

  • Quotations. (pound-delimited (“#”) blocks, with pipe-delimited (“|”) fields).  This field contains results only for the Internet Archive collection; it is blank for the HathiTrust collection at their request due to a subset of restricted HathiTrust books.  Books frequently feature excerpted statements, and these quotations can offer critical insights into the differing perspectives and emotions surrounding a topic.  GDELT identifies and extracts all quoted statements from each book and additionally attempts to identify the verb introducing the quote to help lend additional context, separating “John retorted…” from “John agreed…” to show whether the speaker was agreeing with or rejecting the statement being made.  Each quoted statement is separated by a “#” character, and within each block the following fields appear, separated by pipe (“|”) symbols:
    • Offset. This is the character offset of the quoted statement within the document, indicating approximately where it was found in the body.  This can be used to associate the quotation with the entries from other “V2” fields that appeared in closest proximity to it.
    • Length. This is the length of the quoted statement in characters.
    • Verb. This is the verb used to introduce the quote, allowing for separation of agreement versus disagreement quotes.  May not be present for all quotes, and not all verbs are recognized for this field.
    • Quote. The actual quotation itself.
  • AllNames. (semicolon-delimited blocks, with comma-delimited fields)  This field contains a list of all proper names referenced in the document, along with the character offsets of approximately where in the document they were found.  Unlike the V2Persons and V2Organizations fields, which are restricted to person and organization names, respectively, this field records ALL proper names referenced in the article, ranging from named events like the Orange Revolution, Umbrella Movement, and Arab Spring, to movements like the Civil Rights Movement, to festivals and occurrences like the Cannes Film Festival and World Cup, to named wars like World War I, to named dates like Martin Luther King Day and Holocaust Remembrance Day, to named legislation like the Iran Nuclear Weapon Free Act, Affordable Care Act, and Rouge National Urban Park Initiative.  This field goes beyond people and organizations to capture a much broader view of the named events, objects, initiatives, laws, and other types of names in each article.  Each name reference is separated by a semicolon, and within each reference, the name is specified first, followed by a comma, and then the approximate character offset of the reference to that name in the document, allowing it to be associated with other entries from other “V2” fields that appear in closest proximity to it.  If a name is mentioned multiple times in a document, each mention will appear separately in this field.  This field is designed to be maximally inclusive and in cases of ambiguity, to err on the side of inclusion of a name.
  • Amounts. (semicolon-delimited blocks, with comma-delimited fields)  This field contains a list of all precise numeric amounts referenced in the document, along with the character offsets of approximately where in the document they were found.  This field contains results only for the Internet Archive collection; it is blank for the HathiTrust collection at their request due to a subset of restricted HathiTrust books.  Its primary role is to allow for rapid numeric assessment of evolving situations (such as mentions of everything from the number of affected households to the estimated dollar amount of damage to the number of relief trucks and troops being sent into the area, to the price of food and medicine in the affected zone) and general assessment of geographies and topics.  Both textual and numeric formats are supported (“twenty-five trucks”, “two million displaced civilians”, “hundreds of millions of dollars”, “$1.25 billion was spent”, “75 trucks were dispatched”, “1,345 houses were affected”, “we spent $25m on it”, etc).  At this time, percentages are not supported due to the large amount of additional document context required for meaningful deciphering (“reduced by 45%” is meaningless without understanding what was reduced and whether the reduction was good or bad, often requiring looking across the entire enclosing paragraph for context).  This field is designed to be maximally inclusive and in cases of ambiguity, to err on the side of inclusion of an amount even if the object of the amount is more difficult to decipher.
    • Amount. This is the precise numeric value of the amount.  Embedded commas are removed (“1,345,123” becomes 1345123), but decimal numbers are left as is (thus this field can range from a floating point number to a “long long” integer).  Numbers in textual or mixed numeric-textual format (such as “2m” or “two million” or “tens of millions”) are converted to numeric digit representation.
    • Object. This is the object that the amount is of or refers to.  Thus, “20,000 combat soldiers” will result in “20000” in the Amount field and “combat soldiers” in this field.
    • Offset. This is the character offset of the amount within the document, indicating approximately where it was found in the body.  This can be used to associate the amount with the entries from other “V2” fields that appeared in closest proximity to it.
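
Both of the fields above share the same encoding — semicolon-delimited blocks whose last comma-delimited field is a character offset — so they can be unpacked with a few lines of Python. A minimal parsing sketch (the sample strings are illustrative, not real records; splitting on the last comma keeps names and objects that themselves contain commas intact):

```python
def parse_all_names(field):
    """Unpack an AllNames value into (name, offset) pairs."""
    pairs = []
    for block in field.split(";"):
        if not block:
            continue
        # Split on the LAST comma so names containing commas survive.
        name, _, offset = block.rpartition(",")
        pairs.append((name, int(offset)))
    return pairs

def parse_amounts(field):
    """Unpack an Amounts value into (amount, object, offset) triples."""
    triples = []
    for block in field.split(";"):
        if not block:
            continue
        amount, _, rest = block.partition(",")   # the amount comes first
        obj, _, offset = rest.rpartition(",")    # the offset comes last
        value = float(amount) if "." in amount else int(amount)
        triples.append((value, obj, int(offset)))
    return triples

print(parse_all_names("Orange Revolution,1024;World Cup,2311"))
# → [('Orange Revolution', 1024), ('World Cup', 2311)]
print(parse_amounts("20000,combat soldiers,1532"))
# → [(20000, 'combat soldiers', 1532)]
```

The same semicolon/last-comma pattern applies to the other “V2”-style fields that carry character offsets.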

The following two fields are not used; they are included only to ensure that column ordering remains the same as in other GDELT datasets.

  • TranslationInfo. NOT USED. This field is not used and is NULL for all records.  It was included so that the column ordering matched other GDELT datasets for easy database loading.
  • Extras. NOT USED. This field is not used and is NULL for all records.  It was included so that the column ordering matched other GDELT datasets for easy database loading.

 

BOOK-LEVEL METADATA

Finally, the remaining metadata fields are prefaced with “BookMeta_” and contain the record-level metadata about each book held by the Internet Archive or HathiTrust.  These fields are provided as-is in their raw form directly as they appear in the record metadata fields (compiled from the MARC XML records in the case of HathiTrust).  The set of available fields differs between the Internet Archive and HathiTrust collections due to their use of different metadata schemas.  Note that even common fields like subject tags may have very different contents or formatting between the two collections.

 

Internet Archive

Books published from 1800 to 1922, inclusive, are stored in yearly tables named simply for the year (1800, 1801, 1802, etc).  Books published in 1923 and after are stored in yearly tables ending in “notxt” (such as “1923notxt”) to indicate that fulltext is not available (fulltext is only available for books published 1800-1922).
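
The naming convention above can be captured in a one-line helper — a sketch that assumes only the 1800-1922/“notxt” rule described here:

```python
def ia_table_name(year):
    """Return the Internet Archive yearly table name for a publication year.

    Tables for 1800-1922 carry fulltext and are named for the year alone;
    later years append "notxt" to signal that no fulltext is included.
    """
    return str(year) if 1800 <= year <= 1922 else str(year) + "notxt"

print(ia_table_name(1805))   # → 1805
print(ia_table_name(1926))   # → 1926notxt
```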

  • BookMeta_Identifier. This field contains the Internet Archive unique identifier for the work.  This can be appended to “https://archive.org/details/” to access the full details page for a book.  For example, the full details page for the book “foundationsofbot00berg” can be accessed via the URL “https://archive.org/details/foundationsofbot00berg”.
  • BookMeta_Title. This is the full title, including subtitle, of the book as listed in the Internet Archive’s metadata.
  • BookMeta_Creator. This is the full author listing for the book as listed in the Internet Archive’s metadata.  The Archive’s metadata does not distinguish between individual and corporate authors, so this field is a mixture of both.  Multiple authors are separated with semicolons.  Authors are often listed in lastname-firstname format, frequently with the years the author was active, such as “Ridpath, John Clark, 1840-1900.” The same author name may occasionally appear with and without year information or in firstname-lastname format, so careful filtering and parsing may be necessary in some applications.
  • BookMeta_Subjects. This is the full list of library-provided subject tags as listed in the Internet Archive’s metadata.  Subject tags can vary substantially from formal Library of Congress headings through more informal library-specific tags in some cases.  There are 201,919 unique tags in the collection.
  • BookMeta_Publisher. This is the full name of the book’s publisher as listed in the Internet Archive’s metadata.  Publisher metadata is highly varied: it may contain only the city of publication, such as “Urbana”, may contain mismatched brackets, such as “[Helena”, and may include multiple spellings, typographical errors, and other variants of a given publisher’s name.  Careful filtering and parsing is often required when using this field.
  • BookMeta_Language. This is the language name or code of the primary language of the work.  The contents of this field vary, from “eng” to “ENG” to “English” to “English.”, and the field may be blank.  Language metadata can be inconsistent and this field may be inaccurate in some cases.  While this project attempted to process only English works, accuracy issues with the language field (especially missing and incorrect language codes) caused us to ignore the language designation when selecting the works to be processed.
  • BookMeta_Year. Internet Archive metadata records can contain both a “Publication Date” and “Publication Year” field.  The year field is frequently blank and may contain mixed values such as “1992-1996” or “Fall 2006” or even non-date information.  To compute the value in the DATE field from earlier, this field was combined with the BookMeta_Date field below.
  • BookMeta_Date. Internet Archive metadata records can contain both a “Publication Date” and “Publication Year” field.  The Date field contains the primary known publication date information for the book.  It is present for most works and is frequently simply a year, but can also contain dates in a range of formats like “1924-05-08” or “Dec 18-31 1949” or “Jan-Nov 1951” or “May 1938 – April 1939” or “4/8/1938” (sic).  To compute the value in the DATE field from earlier, this field was combined with the BookMeta_Year field above.
  • BookMeta_Sponsor. This is the organization which paid the digitizing costs of scanning this book.  There are 265 unique sponsors, ranging from “Google” and “MSN” to the “Sloan Foundation” to the “Getty Research Institute” to the “Perkins School of Theology.”  Note that there can be variations in this field like “MSN” and “msn” and “University of Illinois Urbana-Champaign” and “University of Illinois Urbana-Champaign Alternates.”
  • BookMeta_Contributor. This is the library or library system which made the book available from its collections to allow it to be scanned.  There are 965 unique values, such as “University of California Libraries” or “The Library of Congress” or “Sonoma County Wine Library.”  There are also values like “unknown library” and alternate entries like “University of California” versus “University of California Libraries.”
  • BookMeta_ScanningCenter. This is a code that represents the specific physical digitization center where the book was scanned.  This field will likely only be of interest for specific research questions regarding the flow of books through the Archive’s digitization facilities.  There are 48 values, such as “capitolhill” or “iala.”
  • BookMeta_Collections. All books in the Internet Archive’s holdings are part of a specific Collection and can be a member of multiple collections.  This field contains a semicolon-delimited list of all collections a book is a member of.  There are 1,357 unique values.  To learn more about a collection, append it to “https://archive.org/details/” like “https://archive.org/details/cdl”.
  • BookMeta_AddedDate. This is the precise date in “YYYY-MM-DD” format that the book was added to the Internet Archive’s collection.  This field is likely only of interest for specific research questions regarding the flow of books through the Archive’s digitization facilities.  There are 2,680 unique values and not all books may have a valid entry in this field.
  • BookMeta_ScannedImages. This records the number of unique scanned page images generated from a given book, including the cover, cover inside pages, and any additional closeup images taken of foldout or other special materials.  It can be used as a rough approximation of the number of pages in the book.
  • BookMeta_DownloadsJune2015. This is the total number of times the fulltext version of a book has been downloaded from the Internet Archive’s website as of June 15, 2015 and can be used as a rough approximation of its “popularity” or interest among Archive users.
  • BookMeta_CallNumber. This field contains the library-specific call number for the work as provided by the Contributing Library.  The format of this field differs substantially across books and may be null.  An example value is “SRLF_UCLA:LAGE-2822855”.
  • BookMeta_IdentifierBib. This field contains additional identifying information provided by the Contributing Library.  The format of this field differs substantially across books and may be null.  An example value is “LAGE-2822855”.
  • BookMeta_IdentifierArk. This field contains the Archival Resource Key (ARK) identifier for the work.  This field may be null if no ARK identifier was provided.  An example value is “ark:\/13960\/t20c57d7p”.
  • BookMeta_OCLDID. The name of this field is in error and should have been “OCLCID”, containing the OCLC Collection Identifier for the book.  This field may be null if no OCLC ID was provided.
  • BookMeta_FullText. This field contains the complete raw OCR text of the book for works published between 1800 and 1922, inclusive.  The text is as-is from the ASCII OCR file, with the exception that hyphenated words at the end of lines have been reconstituted and carriage returns have been removed.  Books published in 1923 and after do not contain this field and their table names contain “notxt” to reflect this (i.e. “1805” vs “1926notxt”).
    • WARNING: this field contains a massive total volume of text, totaling 445,349,814,757 bytes (445GB). A single query scanning this field will consume roughly one half of your monthly free BigQuery quota, so use this field with extreme caution.  However, for those queries that need access to the fulltext of books, this entire field (nearly half a terabyte) can be scanned in just 11 seconds, making it possible to perform near-realtime fulltext analyses over 122 years of books.
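
Two of the fields above lend themselves to small helpers: building the details-page URL from BookMeta_Identifier, and deriving a single year from the messy BookMeta_Year/BookMeta_Date pair. The sketch below is illustrative only — the exact year-combination logic GDELT used is not specified in this documentation, so the fallback order and the year-matching regex here are assumptions:

```python
import re

def ia_details_url(identifier):
    """Details-page URL for a BookMeta_Identifier, per the rule above."""
    return "https://archive.org/details/" + identifier

def infer_publication_year(meta_year, meta_date):
    """Take the first plausible 4-digit year (1000-2099) found in
    BookMeta_Year, falling back to BookMeta_Date.  An assumption: the
    documentation only says the two fields were "combined"."""
    for value in (meta_year, meta_date):
        match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", value or "")
        if match:
            return int(match.group(1))
    return None

print(ia_details_url("foundationsofbot00berg"))
# → https://archive.org/details/foundationsofbot00berg
print(infer_publication_year("", "Dec 18-31 1949"))   # → 1949
print(infer_publication_year("Fall 2006", ""))        # → 2006
```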

 

HathiTrust

HathiTrust books are stored in yearly tables by publication year.  No fulltext is available for any HathiTrust works, so there is no difference in the table names before and after 1923.

  • BookMeta_Identifier. This field contains the HathiTrust unique identifier for the work.  This can be appended to “http://babel.hathitrust.org/cgi/pt?id=” to access the full details page for a book.  For example, the full details page for the HathiTrust book “hvd.32044004968673” can be accessed via the URL “http://babel.hathitrust.org/cgi/pt?id=hvd.32044004968673”.
  • BookMeta_Date. This is the year of publication of the book, computed by HathiTrust from all available date information for the book.
  • BookMeta_AddedDate. This is the precise date in “YYYYMMDD” format that the book was added to the HathiTrust collection. This field is likely only of interest for specific research questions regarding the flow of books through the HathiTrust collection.  There are 660 unique values.
  • BookMeta_Contributor. This is the identifying code of the library or library system which made the book available from its collections to allow it to be scanned.  There are 55 unique values, such as “MIU” or “HVD”.
  • BookMeta_InternetArchiveIdentifier. A subset of HathiTrust books have been linked against their corresponding Internet Archive entries, in which case this field contains the unique Internet Archive identifier for the book.  If this field is blank it does not necessarily mean that the Internet Archive does not have its own copy of this book, only that HathiTrust has not provided a link to a corresponding Internet Archive book record.
  • This is the full title, including subtitle, of the book as listed in the HathiTrust metadata.
  • This is the full list of semicolon-delimited OCLC subject tags as listed in the HathiTrust metadata.  Unlike Internet Archive subject tags, these tags are largely consistent and have few non-standardized entries.  There are 298,000 unique tags in the collection.
    • NOTICE: Subject tags are derived from the OCLC WorldCat® database. OCLC has granted permission for the subject tags to be included in this dataset. The subject tags may not be used outside the bounds of this dataset.
  • The complete semicolon-delimited list of individual authors of the book.  Unlike the Internet Archive, HathiTrust distinguishes between individual and corporate authors.  Authors are often listed in lastname-firstname format, frequently with the years the author was active, such as “Shakespeare, William, 1564-1616.” The same author name may occasionally appear with and without year information or in firstname-lastname format.  There can also be non-standard entries like “Maw, Thomas E. (Thomas Edward), b. 1873.” and “Kirbye, G. W. [from old catalog].”  Careful filtering and parsing may be necessary in some applications.
  • The complete semicolon-delimited list of corporate authors of the book.  Unlike the Internet Archive, HathiTrust distinguishes between individual and corporate authors.  There are 153,207 unique values.
  • This is the full name of the book’s publisher as listed in the HathiTrust metadata.  Publisher metadata is highly varied: it may contain only the city of publication, such as “Urbana”, may contain mismatched brackets, such as “[Helena”, and may include multiple spellings, typographical errors, and other variants of a given publisher’s name.  Careful filtering and parsing is often required when using this field.  There are 156,142 unique values.
  • The HathiTrust MARC record for each book contains a field indicating the source of the book, ranging from “google” to “ia” to “getty” to “yale”.  There are 15 unique values and all books have a value in this field.  Google accounts for 93% of the entries in this field.
  • In addition to the HathiTrust-computed publication date for each book, the Publisher MARC field can also contain publication date information, which is provided here.  This can range from a specific year like “1913” to ranges like “1875-1893” to months like “April 1903” to combined copyright+publication dates like “`1908 [c1871]” to entries like “between 1812 and 1814” or “1963? c1898.”  This field will frequently, but not always, match the HathiTrust-computed publication date.  Mismatches can be due to a reprinting of an earlier copyrighted work, a series of reprintings over time (perhaps with slight differences in each), and so on.  This field is provided for those who need more precise publication date information and wish to reconcile the available date information for each book.
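
As with the Internet Archive fields, a couple of helper sketches may be useful here: applying the details-URL rule from BookMeta_Identifier above, and pulling candidate years out of the free-form publisher date strings for reconciliation against BookMeta_Date. The year-extraction regex is an assumption of this sketch; it simply collects every 4-digit run in the range 1000-2099:

```python
import re

def ht_details_url(identifier):
    """HathiTrust details-page URL for a BookMeta_Identifier."""
    return "http://babel.hathitrust.org/cgi/pt?id=" + identifier

def marc_date_years(field):
    """Collect plausible 4-digit years from a raw publisher date string,
    for reconciling against the HathiTrust-computed BookMeta_Date."""
    return [int(y) for y in re.findall(r"1[0-9]{3}|20[0-9]{2}", field or "")]

print(ht_details_url("hvd.32044004968673"))
# → http://babel.hathitrust.org/cgi/pt?id=hvd.32044004968673
print(marc_date_years("between 1812 and 1814"))   # → [1812, 1814]
print(marc_date_years("1963? c1898."))            # → [1963, 1898]
```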