Michigan State University

United States Congressional Collection Dataset

Download Text and Data

U.S. Congressional Collection

The dataset is available to download in full or in part by on-campus users. 


The U.S. Congressional Collection at Michigan State contains plain text and PDF files for the Congressional Record and its precursor publications (the Annals of Congress, Register of Debates, and Congressional Globe) for the years 1789 through 1918.

This collection has been compiled specifically for data mining and computational text analysis. For more robust search functionality and other tools for finding particular documents to read, use ProQuest Congressional.

The collection contains daily digest files for Congressional proceedings. Something approaching verbatim transcripts of debates did not appear until the 32nd Congress in 1852. Previous Congressional documents (including the Annals of Congress, Register of Debates, and Congressional Globe) were compiled from third-party accounts and contain paraphrased content. For more information, see the Library of Congress American Memory pages or THOMAS.

Some daily files are specifically labeled as "appendices", "supplementary", or as containing "laws." PDF filenames with the word "laws" contain the full text of acts of Congress. The appendices and other supplementary materials (housed in file names containing "app" or "supplemental") contain additional speeches, the text of public acts, and other documents, depending on year.  

Data Summary


The congressional data is provided in several formats, along with a couple of "helper files" that make the collection easier to use. All files and directories are annotated below. For additional assistance, email dts@lib.msu.edu.

Use caution when clicking file links below -- some files are VERY large!

congressional_by_year/
Daily digest TXT and XML files organized by year, including zipped files of all content.
The XML file for each day contains the full text of the day's proceedings along with limited metadata about the file, including where to find the corresponding PDF file with the same content. Files from the later years of the collection also include limited subject headings describing the content.

congressional_directory.csv (1.6 MB)
CSV file containing one line for each text file in the collection. 
Each line has three comma-separated fields: the first is a year, the second is a title (for which both XML and TXT files should be available), and the third is a path to the corresponding PDF document (see below under gis_congrecord).
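A minimal sketch of reading the three fields described above, using Python's standard csv module. The sample rows, titles, and paths are hypothetical placeholders, not real entries from the collection, and whether the real file has a header row is an assumption the code guards against rather than knows.

```python
import csv
import io

# Hypothetical sample rows; real titles and paths in the collection may differ.
SAMPLE = (
    "1873,example_title_a,gis_congrecord/example/a.pdf\n"
    "1873,example_title_b,gis_congrecord/example/b.pdf\n"
    "1874,example_title_c,gis_congrecord/example/c.pdf\n"
)


def read_directory(handle):
    """Yield (year, title, pdf_path) tuples from an open CSV file handle."""
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        year, title, pdf_path = (field.strip() for field in row)
        if not year.isdigit():
            continue  # skip a header row, if the file has one
        yield int(year), title, pdf_path


records = list(read_directory(io.StringIO(SAMPLE)))
for year, title, pdf_path in records:
    print(year, title, pdf_path)
```

To run this against the downloaded helper file, open it with open(path, newline="", encoding="utf-8") and pass the handle to read_directory in place of the in-memory sample. The UTF-8 encoding is also an assumption.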


congressional_directory.json (1.6 MB)
JSON file for looking up the location of PDF files based on TXT or XML file titles, or by year.
Provides the same information as the CSV file above.
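The internal layout of the JSON file is not described here, so the following sketch does not assume one. Instead it builds the two lookups the description mentions (by title and by year) from (year, title, pdf_path) rows of the kind listed in the CSV file. The rows are hypothetical, and the idea that a title equals a TXT or XML file name without its extension is an assumption.

```python
import os
from collections import defaultdict

# Hypothetical (year, title, pdf_path) rows; not real entries from the collection.
ROWS = [
    (1873, "example_title_a", "gis_congrecord/example/a.pdf"),
    (1873, "example_title_b", "gis_congrecord/example/b.pdf"),
    (1874, "example_title_c", "gis_congrecord/example/c.pdf"),
]


def build_lookups(rows):
    """Return (title -> pdf_path, year -> [pdf_path, ...]) dictionaries."""
    by_title = {}
    by_year = defaultdict(list)
    for year, title, pdf_path in rows:
        by_title[title] = pdf_path
        by_year[year].append(pdf_path)
    return by_title, dict(by_year)


def pdf_for_file(by_title, filename):
    """Look up the PDF path for a TXT or XML file name.

    Assumes the title is the file name without directory or extension.
    Returns None when the title is not in the lookup.
    """
    title, _ext = os.path.splitext(os.path.basename(filename))
    return by_title.get(title)


by_title, by_year = build_lookups(ROWS)
print(pdf_for_file(by_title, "1874/example_title_c.xml"))
print(len(by_year[1873]))
```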

congressional_pdfs_all.tar (99 GB - Caution!)
Uncompressed archive of all PDF files in the collection (very large!).
The gis_congrecord/ directory (described below) is the "untarred" version of this file, in which all PDF files can be browsed individually.
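Because the archive is so large, it may be preferable to list its contents or pull out a single PDF rather than untar everything. The sketch below uses Python's standard tarfile module and demonstrates on a tiny throwaway archive. The member names in the demo are hypothetical and say nothing about the real layout inside the tar file.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def iter_pdf_names(tar_path):
    """Yield names of PDF members, reading the archive sequentially."""
    with tarfile.open(tar_path, mode="r:") as archive:  # "r:" means uncompressed
        for member in archive:
            if member.isfile() and member.name.lower().endswith(".pdf"):
                yield member.name


def extract_one(tar_path, member_name, dest_dir):
    """Copy a single member into dest_dir and return the new file's path."""
    with tarfile.open(tar_path, mode="r:") as archive:
        source = archive.extractfile(member_name)
        if source is None:
            raise ValueError(f"{member_name} is not a regular file")
        # Keep only the base name so the member cannot write outside dest_dir.
        target = Path(dest_dir) / Path(member_name).name
        target.write_bytes(source.read())
        return target


def _make_demo_tar(tar_path):
    """Build a tiny archive with hypothetical member names for the demo."""
    with tarfile.open(tar_path, mode="w") as archive:
        for name in ("gis_congrecord/demo/a.pdf", "gis_congrecord/demo/notes.txt"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    demo_tar = Path(tmp) / "demo.tar"
    _make_demo_tar(demo_tar)
    pdf_names = list(iter_pdf_names(demo_tar))
    extracted = extract_one(demo_tar, pdf_names[0], tmp)
    extracted_bytes = extracted.read_bytes()

print(pdf_names)
```

A tar archive has no central index, so even extracting one file means scanning from the start of the archive. On a 99 GB file that can take a while.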

congressional_txt_files.zip (4.5 GB)
Zip file containing all .txt files in the collection. 
Once unzipped, the directory structure should match the year-by-year arrangement of the congressional_by_year/ archive described above.
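The text files can also be read directly from the zip archive without unzipping it first. The following sketch uses Python's standard zipfile module on a small in-memory archive. The year folders and file names in the demo are hypothetical, and the UTF-8 decoding is an assumption about the text files' encoding.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_txt_by_folder(zip_source):
    """Count .txt members by the name of their immediate parent folder."""
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            counts[PurePosixPath(info.filename).parent.name] += 1
    return counts


def read_text(zip_source, member_name):
    """Return one member decoded as text, replacing undecodable bytes."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member_name).decode("utf-8", errors="replace")


# Build a tiny in-memory archive with hypothetical names for the demo.
demo = io.BytesIO()
with zipfile.ZipFile(demo, mode="w") as archive:
    archive.writestr("1873/example_a.txt", "hello")
    archive.writestr("1873/example_b.txt", "world")
    archive.writestr("1874/example_c.txt", "again")
    archive.writestr("1874/readme.md", "not a text file of interest")

counts = count_txt_by_folder(demo)
first_text = read_text(demo, "1873/example_a.txt")
print(dict(counts))
```

Passing the path to the downloaded zip file in place of the in-memory buffer should work the same way, since zipfile.ZipFile accepts either.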

congressional_xml_all.xml (4.6 GB)
XML file containing all XML files in the collection within one single file.
The size of the file may pose challenges to accessing the data. Consider instead the file below. 
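If the single large file is still needed, one option is to stream it instead of loading it whole. The sketch below uses iterparse from Python's standard xml.etree.ElementTree and discards each record after it has been handled. The element names (collection, document, title, text) are hypothetical, since the file's actual schema is not described here.

```python
import io
import xml.etree.ElementTree as ET


def iter_record_text(xml_source, record_tag):
    """Yield the text of each record_tag element without keeping the whole tree."""
    context = ET.iterparse(xml_source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            parts = (piece.strip() for piece in elem.itertext())
            text = " ".join(piece for piece in parts if piece)
            root.clear()  # drop finished records so memory use stays bounded
            yield text


# Hypothetical structure for the demo; the real tag names may differ.
DEMO = (
    b"<collection>"
    b"<document><title>a</title><text>first day</text></document>"
    b"<document><title>b</title><text>second day</text></document>"
    b"</collection>"
)

texts = list(iter_record_text(io.BytesIO(DEMO), "document"))
print(texts)
```

The same function accepts a file path in place of the in-memory buffer. Inspecting the first few kilobytes of the real file is a reasonable way to find the actual record tag before running it.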

congressional_xml_files.zip (4.5 GB)
Zip file containing all .xml files in the collection. 
Once unzipped, the directory structure should match the year-by-year arrangement of the congressional_by_year/ archive above.

gis_congrecord/
Directory containing all PDF files.
To find paths to particular PDF content, use the JSON or CSV files above.


Full TXT for Congressional files - 4.5 GB
Full XML for Congressional files - 4.5 GB
All PDF Congressional files - 99 GB
Helper files in CSV / JSON - 1.5 MB

Data Quality

Metadata is minimal, and the quality of the OCR text varies.


Data description prepared by Devin Higgins and Julia Frankosky.