Earlier this week we examined the storage growth of the Internet Archive's TV News Archive over the past quarter-century, charting the total bytes of new video content stored daily in its smallest video format (either the original MPEG format or the derived MP4). Total daily new storage volume has remained sharply elevated since early 2021 even while the number of new channels has not increased sufficiently over that period to fully explain that surge. What might explain this large increase in storage volume?
Unsurprisingly, the Archive's storage growth closely tracks its transition from SD to HD resolution, which began in March 2021 and has since reached around 65% of broadcasts and 60% of airtime. The graphs below also demonstrate the immense potential of digital twins like BigQuery + Bigtable + GCS to provide advanced archive-scale analytic insights into how vast digital archives evolve over time.
Let's start by looking at our previous graph of daily bytes of new ingested content stored in the smallest available video format for each broadcast. There is a sharp sustained increase in daily storage ingest in early 2021.
The underlying filesize of a video is typically driven by several factors, including compression algorithm, level of visual detail and resolution. That last variable, resolution, is of particular interest given the transition over the past quarter-century from SD to HD resolution by most television news channels across the world. Let's use our BigQuery + Bigtable + GCS digital twin to examine the underlying resolution of the Archive over time.
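To get a rough sense of why resolution matters so much, here is a minimal Python sketch comparing pixels per frame at SD and at two common HD resolutions. The resolutions listed are standard figures chosen for illustration, not values measured from the Archive, and compressed file size does not scale linearly with pixel count, so treat the ratios as an intuition aid rather than a storage estimate.

```python
# Pixel counts per frame for common broadcast resolutions. These are
# standard figures used for illustration, not measurements from the Archive.
RESOLUTIONS = {
    "SD 640x480": (640, 480),
    "HD 1280x720": (1280, 720),
    "HD 1920x1080": (1920, 1080),
}


def pixels_per_frame(width: int, height: int) -> int:
    """Return the number of pixels in a single frame."""
    return width * height


sd_pixels = pixels_per_frame(*RESOLUTIONS["SD 640x480"])

# Ratio of each resolution's pixel count to the SD baseline.
ratios = {
    name: pixels_per_frame(w, h) / sd_pixels
    for name, (w, h) in RESOLUTIONS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the pixels of SD")
```

A 1920x1080 frame carries 6.75 times as many pixels as a 640x480 frame, so even with efficient compression a shift to HD can plausibly move storage volume substantially.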
Once again, the BigQuery + Bigtable connector makes it trivial to ask this question of our GCS digital twin. To account for the influence of outliers, let's look at both the average and median height and width of all videos ingested into the archive each day, using BigQuery's AVG() and APPROX_QUANTILES() functions respectively:
SELECT
  DATE,
  COUNT(1) totShows,
  ROUND(SUM(sec)) totSec,
  ROUND(SUM(bytes)) totBytes,
  ROUND(SUM(pixels)) totVEPixels,
  ROUND(AVG(vidDims)) avgVidDims,
  ROUND(AVG(vidHeight)) avgHeight,
  ROUND(AVG(vidWidth)) avgWidth,
  APPROX_QUANTILES(vidDims, 100)[OFFSET(50)] AS medVidDims,
  APPROX_QUANTILES(vidHeight, 100)[OFFSET(50)] AS medHeight,
  APPROX_QUANTILES(vidWidth, 100)[OFFSET(50)] AS medWidth,
  COUNT(CASE WHEN (vidHeight>480 AND vidWidth>640) THEN 1 END) totShowsHD,
  ROUND(SUM(CASE WHEN (vidHeight>480 AND vidWidth>640) THEN sec END)) totSecHD
FROM (
  SELECT
    FORMAT_DATE('%m/%d/%Y', PARSE_DATE('%Y%m%d', SUBSTR(rowkey, 0, 8))) AS DATE,
    CAST(SUBSTR(rowkey, 0, 8) AS NUMERIC) DATESQL,
    CAST(JSON_EXTRACT_SCALAR(DOWN, '$.durSec') AS FLOAT64) sec,
    CAST(JSON_EXTRACT_SCALAR(DOWN, '$.sizeBytes') AS FLOAT64) bytes,
    ( (CAST(JSON_EXTRACT_SCALAR(DOWN, '$.durSec') AS FLOAT64) / 4)
      * CAST(JSON_EXTRACT_SCALAR(DOWN, '$.height') AS FLOAT64)
      * CAST(JSON_EXTRACT_SCALAR(DOWN, '$.width') AS FLOAT64) ) AS pixels,
    ( CAST(JSON_EXTRACT_SCALAR(DOWN, '$.height') AS NUMERIC)
      * CAST(JSON_EXTRACT_SCALAR(DOWN, '$.width') AS NUMERIC) ) AS vidDims,
    CAST(JSON_EXTRACT_SCALAR(DOWN, '$.height') AS NUMERIC) vidHeight,
    CAST(JSON_EXTRACT_SCALAR(DOWN, '$.width') AS NUMERIC) vidWidth
  FROM (
    SELECT
      rowkey,
      ( SELECT ARRAY(SELECT value FROM UNNEST(cell))[OFFSET(0)]
        FROM UNNEST(cf.column)
        WHERE name IN ('DOWN') ) DOWN
    FROM `[PROJECTID].bigtableconnections.digtwin`
    WHERE CAST(SUBSTR(rowkey, 0, 8) AS NUMERIC) > 20000000
  )
  WHERE DOWN LIKE '%DOWNLOADED_SUCCESS%'
)
GROUP BY DATE
ORDER BY MIN(DATESQL)
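For readers who find the per-row logic easier to follow outside of SQL, here is a minimal Python sketch of what the inner SELECT does to each record, followed by the average-versus-median comparison. The JSON field names (durSec, sizeBytes, height, width), the YYYYMMDD rowkey prefix and the HD test come from the query above. The sample rows, the rowkey suffixes and the `parse_row` helper are hypothetical and exist only for illustration. Note also that this sketch computes an exact median, whereas APPROX_QUANTILES() returns an approximation.

```python
import json
from datetime import datetime
from statistics import mean, median


def parse_row(rowkey: str, down_json: str) -> dict:
    """Mirror the inner SELECT: date from the rowkey prefix, fields from the DOWN JSON."""
    rec = json.loads(down_json)
    height = float(rec["height"])
    width = float(rec["width"])
    return {
        "date": datetime.strptime(rowkey[:8], "%Y%m%d").date(),
        "sec": float(rec["durSec"]),
        "bytes": float(rec["sizeBytes"]),
        "height": height,
        "width": width,
        # Same HD test as the query: strictly larger than 640x480 in both dimensions.
        "is_hd": height > 480 and width > 640,
    }


# Hypothetical sample data: four SD broadcasts and one HD broadcast on one day.
sd = json.dumps({"durSec": 1800, "sizeBytes": 2.0e8, "height": 480, "width": 640})
hd = json.dumps({"durSec": 3600, "sizeBytes": 9.0e8, "height": 1080, "width": 1920})
rows = [(f"20210320_show{i}", sd) for i in range(4)] + [("20210320_show4", hd)]

parsed = [parse_row(key, value) for key, value in rows]
heights = [r["height"] for r in parsed]

avg_height = mean(heights)    # pulled upward by the single HD broadcast
med_height = median(heights)  # stays at the SD height
hd_shows = sum(r["is_hd"] for r in parsed)

print(avg_height, med_height, hd_shows)
```

With a single HD broadcast among five, the average height moves to 600 while the median stays at 480, which is why it helps to chart both.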
Plotting the average resolution over time by broadcast, we see that the Archive remains largely at SD resolution until January 15, 2019, when there is a sustained jump, followed by another jump on March 20, 2021 that leads to a long-term sustained rise in the resolution of the overall archive.
Because averages are so susceptible to outliers, the median offers a more robust view, and plotting it shows the gradual shift of the Archive to HD resolution. Interestingly, the increase appears in width first, reflecting the transition to widescreen formats:
What if we simply plot the percentage of the archive with resolution greater than 640×480 (SD resolution)? Here we can see some early glimmers in 2009-2012, followed by a slow, sustained experimental increase from early 2019. A wholesale switchover then begins in the late evening of March 19, 2021, when HD rises from 3.3% of airtime to 6.3%, before sweeping archive-wide on March 20, 2021 to reach 32% of airtime.
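The percentages here can be derived from the daily totals returned by the query (totShows, totShowsHD, totSec and totSecHD). Below is a minimal Python sketch of that arithmetic. The `hd_share` helper and the sample totals are hypothetical, and the sketch assumes totSecHD may arrive as a missing value on days with no HD broadcasts, since a SQL SUM() over no matching rows yields NULL rather than zero.

```python
def hd_share(tot_shows, tot_shows_hd, tot_sec, tot_sec_hd):
    """Return (percent of broadcasts, percent of airtime) that exceed 640x480."""
    # Treat missing values (e.g. NULL from SUM over no HD rows) as zero.
    tot_shows_hd = tot_shows_hd or 0
    tot_sec_hd = tot_sec_hd or 0
    pct_shows = 100.0 * tot_shows_hd / tot_shows if tot_shows else 0.0
    pct_sec = 100.0 * tot_sec_hd / tot_sec if tot_sec else 0.0
    return pct_shows, pct_sec


# Hypothetical daily totals, shaped like one row of the query's output.
day = {"totShows": 200, "totShowsHD": 50, "totSec": 720000, "totSecHD": 230400}

pct_shows, pct_sec = hd_share(
    day["totShows"], day["totShowsHD"], day["totSec"], day["totSecHD"]
)
print(f"{pct_shows:.1f}% of broadcasts, {pct_sec:.1f}% of airtime")
```

Computing the two percentages separately matters because HD broadcasts can differ in length from SD broadcasts, so the share of broadcasts and the share of airtime need not match.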
Zooming into the January 2021 – present period, we can more clearly see the archive's transition to HD. The widening gap between the percentage of broadcasts and the percentage of airtime that are HD stems from how EPG data chops airtime into discrete broadcasts and from the fact that EPG data is not available for all channels. The HD share rose steadily from March 2021 to an October 2022 peak of around 60% of the archive, before falling slowly and then surging again. Last month it represented around 60% of daily airtime and around 65% of broadcasts, dropping sharply on July 25th, possibly due to the addition of new channels.