When exploring datasets like the Visual Global Entity Graph 2.0 (VGEG 2.0), one key challenge in interpreting the results is distinguishing news content from advertisements. We're excited to announce that in the coming weeks we will be releasing a special lookup file for each broadcast, for use alongside the VGEG 2.0 dataset, that records with extremely high accuracy whether each second of airtime was an advertisement or news content.
To perform advertisement detection, we run the widely used open source utility "ccextractor" over the raw original MPEG2 capture files, writing its output in the TTXT format. The actual command used is:
cat VIDEOFILE | mbuffer -q -m 10M | ccextractor -out=ttxt -bi - -o ./OUTPUTTRANSCRIPT.TTXT
One unique aspect of ccextractor's TTXT format, compared with its more widely used TXT and SRT output formats, is that it includes what is known as the "Caption Mode", which can be one of the following:
- Roll-Up. In the case of television news this is the live transcription generated as the show airs and uniquely identifies the news content of the show.
- Pop-On & Paint-On. In the case of television news this indicates prerecorded captioning provided with the video content and uniquely identifies advertising content.
In ccextractor's TTXT output, advertisements appear almost exclusively under the "POP" captioning mode, with a small number of transition lines appearing as "PAI". In contrast, news programming appears exclusively as RU1 through RU4.
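The mapping from caption mode to content type described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the production pipeline: it assumes each TTXT line is pipe-delimited as start time, end time, caption mode, and caption text, which may differ depending on the ccextractor version and options used. The function names `classify_mode` and `classify_line` are hypothetical.

```python
# Caption modes as described in the text: roll-up for news,
# pop-on and paint-on for advertisements.
NEWS_MODES = {"RU1", "RU2", "RU3", "RU4"}
AD_MODES = {"POP", "PAI"}


def classify_mode(mode):
    """Map a caption mode code to 'news', 'ad', or 'unknown'."""
    mode = mode.strip().upper()
    if mode in NEWS_MODES:
        return "news"
    if mode in AD_MODES:
        return "ad"
    return "unknown"


def classify_line(line):
    """Split one TTXT line and label it by its caption mode.

    ASSUMPTION: the line layout is start|end|mode|text. Verify this
    against your own ccextractor output before relying on it.
    Returns (start, end, label, text), or None if the line does not
    have at least four pipe-delimited fields.
    """
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) < 4:
        return None
    start, end, mode, text = parts
    return start.strip(), end.strip(), classify_mode(mode), text.strip()
```

Any mode outside these two sets is reported as "unknown" rather than guessed, so unexpected codes surface during spot checking instead of being silently counted as news or advertising.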
Through extensive manual exploration and spot checking across the past decade of CNN, MSNBC and Fox News and the ABC/CBS/NBC evening news broadcasts, these codes appear to be almost perfect filters for news versus advertising content. To date, we have not identified a single line of captioning in our tests that the captioning mode field failed to classify correctly.
It will take a number of weeks to reprocess a decade of 24/7 broadcasts across three channels. As this reprocessing completes, we will be releasing a special inventory dataset that contains the timecodes of all advertising content in each broadcast on these channels, making it easier for you to exclude those time ranges from your analysis of the VGEG 2.0 and other datasets. Over time we will be updating the TV Explorer to incorporate this data as well.
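As a rough illustration of how such an inventory might be used, the sketch below collapses per-second labels into advertising time ranges and tests whether a given second falls inside one. The format of the released files has not been published, so the input here (a sequence of "ad" / "news" strings, one per second of airtime) and the helper names `ad_ranges` and `in_ad` are purely hypothetical.

```python
def ad_ranges(labels):
    """Collapse per-second labels into half-open [start, end) ad ranges.

    ASSUMPTION: `labels` holds one string per second of airtime, with
    "ad" marking advertising. The real inventory format may differ.
    """
    ranges = []
    start = None   # first second of the ad run currently open, if any
    second = -1    # keeps the final check valid for empty input
    for second, label in enumerate(labels):
        if label == "ad":
            if start is None:
                start = second
        elif start is not None:
            ranges.append((start, second))
            start = None
    if start is not None:
        # The broadcast ended while still inside an ad run.
        ranges.append((start, second + 1))
    return ranges


def in_ad(second, ranges):
    """Return True if `second` falls inside any advertising range."""
    return any(begin <= second < end for begin, end in ranges)
```

Records from a per-second dataset could then be filtered by keeping only those whose second offset makes `in_ad` return False.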
We're tremendously excited by the possibilities these new inventory files will enable!