Google Books Data Set
The Google Books Dataset (GDS) is a collection of scanned books totaling approximately 3 million volumes of text, or 2.9 terabytes (2,970 gigabytes) of data in zipped form. The books in the dataset are public domain works digitized by Google and made available by the HathiTrust Digital Library.
The subset generator provides a means of accessing these texts. On-campus users may search the collection, compile collections of their own, and download full text and metadata files. (Users are not permitted to reproduce the downloaded data in any way.) It is also possible to access the entire collection directly; however, the data is not organized for browsing (paths to texts are based on unique identifiers, not author name or title), and search is not available. The subset generator was created (using the Python web framework Django with a MySQL database and a Solr index) to allow users to build their own sets of materials based on their particular research interests.
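As a rough illustration of how a search against a Solr index might be expressed, the sketch below builds a query URL for Solr's standard select handler. It is not the subset generator's actual code: the host, the core name (books), the field names (id, title, author), and the function build_subset_query are all assumptions made for the example.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real Solr host and core name are not given in the text.
SOLR_SELECT_URL = "http://localhost:8983/solr/books/select"


def build_subset_query(field, term, rows=10):
    """Return a Solr select URL that searches one metadata field for a phrase.

    The field names used here (id, title, author) are assumed for illustration.
    The term is wrapped in double quotes as a phrase; quotes inside the term
    itself are not escaped in this sketch.
    """
    params = {
        "q": f'{field}:"{term}"',   # e.g. author:"Mark Twain"
        "fl": "id,title,author",    # fields to return for each match
        "rows": rows,               # maximum number of results
        "wt": "json",               # ask Solr for a JSON response
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_subset_query("author", "Mark Twain"))
```

A query of this kind returns the unique identifiers of matching volumes, which is what lets a user work by author or title even though the files themselves are stored under identifier-based paths.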