Summarizing An Entire Day Of Russian TV News Using Google's New 32K PaLM LLM

Last month we demonstrated the use of Google's new 32K PaLM LLM to summarize entire evening news and CSPAN broadcasts in a single operation, avoiding prompt chaining, which can dramatically increase hallucination and loss of attention and coherence. Given that 32K-token models are sufficient to process broadcasts of an hour or two (less for languages like Russian due to tokenization differences), but not an entire day, what would it look like to take an entire day of Russian television news from Russia 24, split it into 30-minute chunks and summarize each in turn? While there is the risk that stories will be split in the middle at the 30-minute boundaries, half-hour chunks should give the model sufficient content volume to yield reasonably high-quality summaries while remaining far enough under the 32K token limit to leave as much of the token budget as possible for output. Given that PaLM now natively supports Russian-language content, we tested the impact of language and machine translation on the summaries by running both the native Russian broadcast transcript (via Google's Chirp ASR model) and its Google Translate English translation through the model. To test how stable the resulting summaries are, we ran each 30-minute broadcast chunk through the model five times each for the native Russian transcript and the English machine translation, yielding 10 total summaries per chunk.

The end result is that there is substantial promise in using large-token models to summarize global television news broadcasts by combining large-model ASR like Chirp with large-token LLMs like PaLM's 32K model, though also substantial limitations at this time. The native Russian-language content yields considerably more detailed and richer summaries, but with slightly lower coherence, as the model occasionally ignores its instructions and outputs a Russian-language rather than English-language summary, or exhibits codeswitching. Performance is sufficiently higher on native content to suggest that LLM workflows be applied directly to native source material rather than to translations. As with previous summarization workflows, summaries are strongly non-deterministic in scope, with the list of stories output in each summary varying considerably from run to run. Some stories appear in all five runs, while others differ in scope, detail or existence between runs. This offers a potent reminder that the quality of LLM-based summarization comes down substantially to the random chance of the model's output.

No attempts were made to examine hallucination or to quantify the "quality" of the summaries below.

You can see the full technical details at the end of this post, including all code, the most important piece of which is a Perl script we've put together to make this process easier. It accepts any standard SRT file, converts it into plain text, crafts a Bison-32K request to PaLM via the GCP Vertex AI API and returns the results (the script's download link appears in the Technical Details section below).
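
To make the SRT-to-text step concrete, the snippet below is a minimal shell approximation of what the script does to the transcript before submission (an illustrative sketch, not the script itself, assuming standard SRT formatting with numeric index lines and "-->" timestamp lines):

#strip carriage returns, SRT index lines and timestamp lines, then join the captions into a single block of text:
sed -e 's/\r$//' -e '/^[0-9][0-9]*$/d' -e '/ --> /d' BROADCAST.srt | tr -s '\n' ' ' > BROADCAST.txt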

To explore this workflow, we chose a complete day of coverage of Russia 24 from September 28th, 2023. Due to a lack of EPG data, this day's coverage was naturally split into uniform 30-minute chunks, making it trivial to run through our LLM workflow. Each 30-minute chunk was processed five times for Russian and five times for English using the following prompt:

Divide the television news transcript below into a set of discrete stories, make them into a bulleted list with a title and short description and summary of each in English. Each story should include a Title, Description and Summary and be written in English. Do not use any information beyond what is in the transcript below. \n\nTELEVISION NEWS TRANSCRIPT:\n
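
For those curious what the script's underlying API call looks like, below is a minimal sketch of a Vertex AI "predict" request against the Bison-32K text model. The project ID and prompt text are placeholders, and the temperature and maxOutputTokens values are illustrative defaults, not necessarily the exact settings our script uses:

#submit the prompt and transcript to text-bison-32k via the Vertex AI REST API:
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECT]/locations/us-central1/publishers/google/models/text-bison-32k:predict" \
  -d '{"instances": [{"prompt": "[PROMPT AND TRANSCRIPT TEXT]"}], "parameters": {"temperature": 0.2, "maxOutputTokens": 8192}}'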

We have concatenated all of the 30-minute chunk summaries for the day in order into the following day-long summaries:

You can view the individual half-hour summaries below. For these we have also provided the raw JSON API response from PaLM for each chunk, which includes the trust and safety category metrics for that chunk (a snippet for inspecting these metrics appears after the list below). Three of the broadcasts triggered guardrail false positives, listed below. Interestingly, one chunk triggered guardrails in all five English runs, but did not trigger a guardrail in its original Russian form, demonstrating how guardrails are often highly language-specific for commercial LLM vendors (indeed, this is one of the reasons models are often not publicly available in certain languages – the model supports the language, but language-specific guardrails have not yet been tuned):

  • Russian Transcripts:
    • RUSSIA24_20230928_023000_RIK_Rossiya_24 (Run 3)
    • RUSSIA24_20230928_223000_RIK_Rossiya_24 (Runs 2 & 3)
  • Translated English Transcripts:
    • RUSSIA24_20230928_023000_RIK_Rossiya_24 (Run 3)
    • RUSSIA24_20230928_183000_RIK_Rossiya_24 (Runs 1-5)
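
As referenced above, the trust and safety metadata can be inspected directly in each saved raw API response. A minimal sketch, assuming the ".fromapi" files written by our script contain the raw Vertex AI JSON response (Vertex AI's text models return a "safetyAttributes" block alongside each prediction); the filename here is a placeholder:

jq '.predictions[].safetyAttributes' ./RUN3/[CHUNKFILE].fromapi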

The outputs are available below. Compare the summaries across the five runs for a given chunk to get a sense of how they can be similar and how they can differ across runs. Also compare how they can differ across languages – for example, the English and Russian versions of RUSSIA24_20230928_153000_RIK_Rossiya_24:

Technical Details

For those interested in replicating this workflow, you can find the complete code below:

#download the original Russian and Google Translate-translated English translations of the transcripts for Russia 24:
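#build the list of broadcast dates to process (a single day here – adjust start/end to cover a longer range):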
start=20230928; end=20230928; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
rm -rf INVENTORIES
mkdir INVENTORIES
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24.{}.inventory.json -P ./INVENTORIES/'
#extract the broadcast IDs from each day's inventory:
rm -f IDS; find ./INVENTORIES/ -depth -name '*.json' | parallel --eta 'cat {} | jq -r .shows[].id >> IDS'
wc -l IDS
rm -rf ./INVENTORIES/
mkdir ORIGINAL; time cat IDS | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/{}.transcript.srt -P ./ORIGINAL/'
mkdir TRANSLATED; time cat IDS | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/{}.transcript.en.srt -P ./TRANSLATED/'

#download our SRT LLM summarizer script:
wget https://storage.googleapis.com/data.gdeltproject.org/blog/2023-llmsummarizer-tvnews/llmsummarizebroadcast_srt.pl
chmod 755 ./llmsummarizebroadcast_srt.pl
#summarize Russian transcripts...
mkdir RUN1; time find ./ORIGINAL/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN1/{/}.txt SRT SUMMARIZE'
mkdir RUN2; time find ./ORIGINAL/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN2/{/}.run2.txt SRT SUMMARIZE'
mkdir RUN3; time find ./ORIGINAL/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN3/{/}.run3.txt SRT SUMMARIZE'
mkdir RUN4; time find ./ORIGINAL/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN4/{/}.run4.txt SRT SUMMARIZE'
mkdir RUN5; time find ./ORIGINAL/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN5/{/}.run5.txt SRT SUMMARIZE'

#summarize Google Translate-generated English translations of the transcripts...
mkdir RUN1EN; time find ./TRANSLATED/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN1EN/{/}.txt SRT SUMMARIZE'
mkdir RUN2EN; time find ./TRANSLATED/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN2EN/{/}.run2.txt SRT SUMMARIZE'
mkdir RUN3EN; time find ./TRANSLATED/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN3EN/{/}.run3.txt SRT SUMMARIZE'
mkdir RUN4EN; time find ./TRANSLATED/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN4EN/{/}.run4.txt SRT SUMMARIZE'
mkdir RUN5EN; time find ./TRANSLATED/*.srt | parallel --eta -j 10 './llmsummarizebroadcast_srt.pl {} ./RUN5EN/{/}.run5.txt SRT SUMMARIZE'

#copy summaries and raw API output to GCS (shown here for RUN2 – repeat for each run directory):
gsutil -m -h "Content-Type:text/plain; charset=utf-8" cp ./RUN2/*.txt gs://[YOURBUCKET]/
gsutil -m -h "Content-Type:text/plain; charset=utf-8" cp ./RUN2/*.fromapi gs://[YOURBUCKET]/
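#(optional) an equivalent sketch applying the same copy to all ten run directories in one pass:
for dir in RUN1 RUN2 RUN3 RUN4 RUN5 RUN1EN RUN2EN RUN3EN RUN4EN RUN5EN; do gsutil -m -h "Content-Type:text/plain; charset=utf-8" cp ./$dir/*.txt ./$dir/*.fromapi gs://[YOURBUCKET]/; done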

#generate the daily summaries...
mkdir COMBINED
rm COMBINED/*
for file in ./RUN1/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run1.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run1.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run1.txt; done
for file in ./RUN2/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run2.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run2.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run2.txt; done
for file in ./RUN3/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run3.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run3.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run3.txt; done
for file in ./RUN4/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run4.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run4.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run4.txt; done
for file in ./RUN5/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run5.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run5.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run5.txt; done

for file in ./RUN1EN/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run1.en.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run1.en.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run1.en.txt; done
for file in ./RUN2EN/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run2.en.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run2.en.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run2.en.txt; done
for file in ./RUN3EN/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run3.en.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run3.en.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run3.en.txt; done
for file in ./RUN4EN/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run4.en.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run4.en.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run4.en.txt; done
for file in ./RUN5EN/*.txt; do echo -e "===========================================\n$file\n-------------------------------------------\n" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run5.en.txt; cat "$file" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run5.en.txt; echo "" >> ./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run5.en.txt; done
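
#the ten concatenation loops above can equivalently be generated programmatically – a compact sketch producing the same output files:
for run in 1 2 3 4 5; do
  for suffix in "" "EN"; do
    [ "$suffix" = "EN" ] && lang=".en" || lang=""
    out="./COMBINED/RUSSIA24_20230928_230000_RIK_Rossiya_24.dayconcat.run${run}${lang}.txt"
    for file in ./RUN${run}${suffix}/*.txt; do
      echo -e "===========================================\n$file\n-------------------------------------------\n" >> "$out"
      cat "$file" >> "$out"
      echo "" >> "$out"
    done
  done
done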

#make a list of guardrail false positives:
grep '"blocked": true' RUN1/*.fromapi
grep '"blocked": true' RUN2/*.fromapi
grep '"blocked": true' RUN3/*.fromapi
grep '"blocked": true' RUN4/*.fromapi
grep '"blocked": true' RUN5/*.fromapi
grep '"blocked": true' RUN1EN/*.fromapi
grep '"blocked": true' RUN2EN/*.fromapi
grep '"blocked": true' RUN3EN/*.fromapi
grep '"blocked": true' RUN4EN/*.fromapi
grep '"blocked": true' RUN5EN/*.fromapi
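
#equivalently, all ten run directories can be scanned in a single pass:
grep '"blocked": true' RUN*/*.fromapi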

#examine the log for any API errors...
cat SUBMIT.LOG

We hope this serves as a template for those interested in leveraging large-token LLMs for at-scale television news summarization.