500 Years Of Images From The World’s Books As An AI Training Dataset

Eight years ago, GDELT's Kalev Leetaru collaborated with the Internet Archive to extract the images from 600 million digitized public domain book pages, dating back 500 years and drawn from over 1,000 libraries worldwide. The images were made browseable and searchable (via both the metadata of the original book and the text surrounding each image), "reimagining" how we interact with books. The resulting project received global media coverage, and around 4.7 million of the images, roughly half of the collection, were made available on Flickr.

The entire collection of more than 12 million extracted book images includes the text immediately surrounding each image, and both the imagery and the text are in the public domain. Unlike contemporary web-scraped image collections, which are of unknown provenance and tend to skew toward modern representations of past imagery, each image in the Book Images collection is connected back to its source book for clear sourcing and shows how its event or theme was depicted at the time the book was published.

When the dataset first came out, a number of fascinating computational art projects showcased how these images could be used to create new art. Fast forward nearly a decade, and with the rise of neural generation tools like Midjourney, DALL-E, Stable Diffusion and others, this unique dataset of 500 years of public domain imagery and its associated textual descriptions offers a rare opportunity for fundamentally new kinds of artistic creation.

The entire collection can be downloaded from the Internet Archive's collections. Due to its size, the collection was split into more than 1,000 items. Each item contains a series of ZIP files, one per book, and each ZIP file contains both the images as JPEG files and a metadata file with the associated text for each image. You can learn more in the user manual and the generation script.
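
As a starting point for working with one downloaded book, the layout described above (a ZIP file holding JPEG images plus a metadata file) can be inspected with a few lines of Python. This is only a minimal sketch: the function name `split_book_zip` and the file name `book.zip` are hypothetical, and because the metadata format is not described here, the code only separates JPEG entries from everything else and does not parse the metadata. The user manual is the authority on the actual file names and format.

```python
import zipfile
from pathlib import Path

# Extensions treated as JPEG images (assumption: images use one of these).
JPEG_SUFFIXES = {".jpg", ".jpeg"}


def split_book_zip(zip_path):
    """Return (image_names, other_names) for one per-book ZIP file.

    Hypothetical helper: entries with a JPEG extension are treated as
    images, and every other file (including the metadata file) is
    returned separately without being parsed.
    """
    images, others = [], []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            if Path(info.filename).suffix.lower() in JPEG_SUFFIXES:
                images.append(info.filename)
            else:
                others.append(info.filename)
    return images, others
```

Calling `split_book_zip("book.zip")` on a downloaded per-book archive would return the image entries and the remaining entries as two lists, which makes it easy to confirm which file carries the text before writing a parser for it.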

We'd love to see what can be done with this vast archive. Please get in touch with any questions or ideas.

Learn More About The Collection.
Download The Book Images.
Metadata Documentation.

The GDELT Project