Deep Linking Democracy: An Index Of Half A Million Legislative Mentions On CSPAN Spanning 2009-2023

Today in collaboration with the Internet Archive's TV News Archive, we are immensely excited to announce a massive new index of more than 573,000 legislative mentions of specific bills, resolutions and other Congressional actions spanning more than a quarter-million broadcasts representing 1.3 billion seconds of airtime over the past 14 years across CSPAN, CSPAN2 and CSPAN3. Using an automated scanner, we processed the Internet Archive's complete archives of the three channels to identify any spoken mention of a specific piece of legislation by number (such as "H.R.101"). Each row in the JSON file represents a single mention of bill/resolution/amendment, with its immediate context, citation information to the broadcast and a link to the complete legislative record of the legislation on Congress.gov, making the Visual Explorer's "Deep Linking Democracy" experiment searchable for the first time!

This past May we unveiled a powerful new metaphor for understanding democracy in action: Deep Linking Democracy, connecting our nation's legislation to the legislative process through deep linking of CSPAN coverage in the Visual Explorer in collaboration with the Internet Archive's TV News Archive. In short, when any of the more than quarter-million CSPAN, CSPAN2 and CSPAN3 broadcasts are viewed in the Visual Explorer, the broadcast transcript is scanned live and all legislative references automatically converted to hyperlinks that link directly to the underlying legislative record on Congress.gov. A mention of "s. res. 198" in a congressional debate becomes a hyperlink to "A resolution designating the week of April 23 through April 29, 2023, as 'National Water Week'", making it instantly clear what is being debated, negotiated and discussed. Until now that legislative indexing has occurred only on-demand when a given broadcast was viewed in the Visual Explorer, meaning it wasn't possible to search the index by legislation or perform basic analyses like ask which piece of legislation has been the most-mentioned. Today we are releasing the complete underlying index of all half-million legislative mentions in JSON format.

Note that this index was 100% automatically compiled and identifies only explicit numeric legislation mentions. It counts only explicit mentions of a piece of legislation, not the total time spent discussing or debating it. A bill mentioned once, followed by 3 hours of debate will count as only a single mention in this index. The index also has several limitations, but nonetheless, offers a first glimpse at the power and potential of automated legislative indexing.

In all, of the 256,941 broadcasts spanning the three channels, 156,157 (61%) contained at least one explicit numeric legislative mention. In total, 573,040 mentions were identified to 13,895 distinct legislative numbers. Given that numbers are reused across Congresses ("S.1" refers to a different piece of legislation in each Congress), there are 28,315 unique Congress-legislation pairs in the index.

What are the top ten most-commonly mentions pieces of legislation over the past 14 years?

S.1     217918
S.2     24540
S.3     11552
H.RES.8 10966
S.4     8758
S.5     6214
H.RES.965       6027
S.8     5604
H.R.1   4922
S.6     4639

Remember that legislative numbers are reused with each Congress, so the table above simply shows which are the most-debated legislative numbers (with the number referring to different legislation in each Congress). Let's combine with the specific Congress in session at the time so that we have a list of distinct pieces of legislation:

S.1-CONGRESS:114        35752
S.1-CONGRESS:115        35035
S.1-CONGRESS:117        32627
S.1-CONGRESS:116        29634
S.1-CONGRESS:113        28729
S.1-CONGRESS:112        26863
S.1-CONGRESS:111        15621
S.1-CONGRESS:118        13657
H.RES.8-CONGRESS:117    10946
H.RES.965-CONGRESS:116  5936

You can download the complete versions of the two lists above:

You can download the complete legislative mention index (through mid-day September 27, 2023) below:

You can see an example entry below:

{
	"preText": "WHY ANYONE ACROSS THE AISLE WOULD VOTE AGAINST THAT. THE OTHER BILL, ",
	"IAID": "CSPAN_20230516_155900_U.S._House_of_Representatives_U.S._House_of_Representatives",
	"IAURL": "https://archive.org/details/CSPAN_20230516_155900_U.S._House_of_Representatives_U.S._House_of_Representatives",
	"postText": ", THE FEDERAL LAW ENFORCEMENT SERVICE WEAPON PURCHASE ACT. MR. SPEAKER, WE ",
	"broadcastDate": "2023-05-16T15:59:00.000Z",
	"legislation": "H.R.3091",
	"channel": "CSPAN",
	"legislationNum": 3091,
	"congressGovLink": "https://www.congress.gov/quick-search/legislation?wordsPhrases=\"H.R.3091\"&wordVariants=on&congresses%5B%5D=118",
	"legislationType": "H.R.",
	"TVEURL": "https://api.gdeltproject.org/api/v2/tvv/tvv?id=CSPAN_20230516_155900_U.S._House_of_Representatives_U.S._House_of_Representatives",
	"congressNum": 118,
	"program": "U.S. House of Representatives U.S. House of Representatives"
}

The index is in JSON-NL format, with each extracted mention of a piece of legislation appearing on its own line. If legislation is mentioned multiple times in a broadcast it will have multiple entries in the file, one per mention. Each mention has 13 fields:

  • IAID: The unique Internet Archive item identifier for this broadcast.
  • IAURL: The URL to the Internet Archive's entry for this broadcast.
  • TVEURL: The URL to the Visual Explorer interface for this broadcast where all legislative mentions are automatically converted to clickable inline hyperlinks.
  • broadcastDate: The JSON timestamp of the start of the broadcast in GMT. Note that this is not the timestamp of the actual mention – all mentions in a broadcast will have the same timestamp – the start of the broadcast.
  • channel: The specific channel (CSPAN, CSPAN2, CSPAN3).
  • program: The EPG (Electronic Program Guide) name of the broadcast.
  • congressNum: The specific Congress in session during the broadcast.
  • preText: To provide context for the mention, up to 75 characters (including spaces) to the left of the mention are extracted here (if the first word is a partial word it is removed).
  • legislation: The fully-qualified normalized form of the legislation like "S.3".
  • legislationType: The normalized form of the legislation type like "S.", "H.AMDT.", etc.
  • legislationNum: The normalized number of the legislation like "1", "2321", etc. Spelled-out legislation numbers are normalized to numeric form as possible.
  • congressGovLink: The URL of the Congress.gov official record for this legislation.
  • postText: To provide context for the mention, up to 75 characters (including spaces) to the right of the mention are extracted here (if the last word is a partial word it is removed).

To build this database, we took our Visual Explorer in-browser legislative extractor and converted it into a standalone Perl script that can be run on any SRT closed captioning file and in collaboration with the Internet Archive's TV News Archive, used it to process their complete archive of CSPAN, CSPAN2 and CSPAN3 broadcasts back to 2009.

For those interested in the complete technical workflow, you can see the complete sequence below. On a 64-core VM using RAMdisk, it takes around 25 minutes to complete the full processing workflow:

#run over all of the files...
wget https://storage.googleapis.com/data.gdeltproject.org/blog/2023-deeplinkingdemocracycspanlegislativedatabase/cspanlegislationextractor_srt.pl
mkdir OUTPUT
time find ./CACHE/ | parallel --eta './cspanlegislationextractor_srt.pl {} ./OUTPUT/{/}.json'

#how many shows had legislative mentions?
find CACHE/ -type f | wc -l
256941
find OUTPUT/ -type f | wc -l
156157

#concatenate all of the outputs together and verify...
time find OUTPUT/ -type f -exec cat {} + > CSPANLEGISLATIONMENTIONS-2009-20230927.json
wc CSPANLEGISLATIONMENTIONS-2009-20230927.json
573040  17278123 446975524
time find OUTPUT/ -type f -print0 | xargs -0 cat | wc 
573040  17278123 446975524

#compute two sample histograms...
time jq -r '.legislation' CSPANLEGISLATIONMENTIONS-2009-20230927.json | tr '[:lower:]' '[:upper:]' | grep -v '^$' | sort | uniq -c | sort -nr | awk '{print $2 "\t" $1}' > histogram-2009-20230927-legislation.txt
time jq -r '.legislation + "-Congress:" + (.congressNum | tostring) + ""' CSPANLEGISLATIONMENTIONS-2009-20230927.json | tr '[:lower:]' '[:upper:]' | grep -v '^$' | sort | uniq -c | sort -nr | awk '{print $2 "\t" $1}' > histogram-2009-20230927-legislationxcongress.txt
wc -l histogram-2009-20230927-legislation.txt
13895
wc -l histogram-2009-20230927-legislationxcongress.txt
28315

We're excited to see what journalists and scholars are able to do with this massive new index!