The GDELT Project

Television News Segmentation Using Our New TV News Sentence Embeddings

Using our new Global Similarity Graph Television News Sentence Embeddings, can we segment television news broadcasts into discrete "stories" simply by computing the similarity of each closed captioning line to the previous line or lines and flagging dissimilar transitions? To explore this, we tested two scenarios: comparing each line against the line before it, and comparing the average of four consecutive lines against the average of the next four lines.

For our sample broadcast, we selected ABC World News Tonight With David Muir on June 18, 2021 at 3:30PM PDT.

You can see the final results below:

To show different thresholds, we color-coded similarities below 0 in red, from 0 to 0.1 in orange and from 0.1 to 0.15 in yellow. You can see that, while not perfect, most of the transition points are indeed captured even using this simple approach. You can compare these machine-annotated transition points with Vanderbilt Television News Archive's human-annotated story delineation for the same broadcast.
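The color thresholds above can be sketched as a small lookup. This is only an illustration: the sub name `color_for_similarity` and the `none` label are hypothetical, and treating each range as half-open (so exactly 0 is orange and exactly 0.1 is yellow) is an assumption, since the text does not say how boundary values were handled. The scores in the loop are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map a cosine similarity score to one of the color buckets described above.
# Assumption: ranges are half-open, i.e. [0, 0.1) is orange and [0.1, 0.15) is yellow.
sub color_for_similarity {
   my ($sim) = @_;
   return 'red'    if $sim < 0;
   return 'orange' if $sim < 0.1;
   return 'yellow' if $sim < 0.15;
   return 'none';   # hypothetical label for scores that were left uncolored
}

# Made-up scores, for illustration only.
for my $sim (-0.0312, 0.0541, 0.1203, 0.4710) {
   printf("%7.4f => %s\n", $sim, color_for_similarity($sim));
}
```

Lower similarity means a sharper change of topic, so red marks the strongest candidate transitions and yellow the weakest.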

Below you can see the simple Perl script we wrote to take the GSG TV News Sentence Embedding file for this broadcast and compute both comparisons: each line against the previous line, and each set of four lines against the next four.

#!/usr/bin/perl

use JSON::XS;

$INFILE = $ARGV[0];

#########################################################
#read the file in and load the embeddings into an array...
open(FILE, $INFILE) or die "Unable to open '$INFILE': $!";
while(<FILE>) {
   #decode_json dies on malformed input, so trap the error and skip bad lines...
   undef($ref); $ref = eval { decode_json($_) }; if (!defined($ref)) { next; }
   push(@OFFSETS, $ref->{'offset'}); 
   push(@PREVIEWURL, $ref->{'previewUrl'}); 
   $ref->{'lead'}=~s/\s+/ /gs; $ref->{'lead'}=~s/,//g; push(@LEAD, $ref->{'lead'}); 
   push(@EMBED, $ref->{'sentEmbed'}); if ($NUMDIMS < 1) { $NUMDIMS = scalar(@{$ref->{'sentEmbed'}}); }
   $NUMRECS++;
}
close(FILE);

#and define our cosine sub...
sub cossim {
   my ($sumt, $suma, $sumb) = (0.0, 0.0, 0.0);
   for(my $i=0;$i<scalar(@{$_[0]});$i++) {
      $sumt += ($_[0]->[$i]*$_[1]->[$i]);
      $suma += ($_[0]->[$i]*$_[0]->[$i]);
      $sumb += ($_[1]->[$i]*$_[1]->[$i]);
   }
   $suma = sqrt($suma); $sumb = sqrt($sumb);
   if ($suma == 0 || $sumb == 0) { return 0; } #avoid dividing by zero on an all-zero vector
   return $sumt / ($suma * $sumb);
}

#########################################################

##################################################################################################################
##################################################################################################################
#now loop through all of the lines and compute their rolling similarities...

open(OUT1, ">./COMP1.csv") or die "Unable to write COMP1.csv: $!"; binmode(OUT1, ":utf8");
open(OUT2, ">./COMP2.csv") or die "Unable to write COMP2.csv: $!"; binmode(OUT2, ":utf8");
for($linei=0;$linei<$NUMRECS;$linei++) {

   #this line against the previous line...
   if ($linei > 0) {
      $sim = cossim($EMBED[$linei], $EMBED[$linei - 1]); $sim = sprintf("%0.4f", $sim);
      print OUT1 "$OFFSETS[$linei],$sim,$LEAD[$linei],$PREVIEWURL[$linei]\n";
   }

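   #the last four lines (ending at this line) against the next four lines...
   #needs lines $linei-3 through $linei+4, so stop once fewer than four lines remain after this one
   if ($linei > 2 && $linei < ($NUMRECS-4)) {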
      #first average the last four lines into a single vector (sum each dimension, then divide by four)...
      my @vec1;
      for ($j=0;$j<$NUMDIMS;$j++) {
         $vec1[$j] = ($EMBED[$linei]->[$j] + $EMBED[$linei-1]->[$j] + $EMBED[$linei-2]->[$j] + $EMBED[$linei-3]->[$j]) / 4;
      }
      #then average the next four lines into a single vector...
      my @vec2;
      for ($j=0;$j<$NUMDIMS;$j++) {
         $vec2[$j] = ($EMBED[$linei+1]->[$j] + $EMBED[$linei+2]->[$j] + $EMBED[$linei+3]->[$j] + $EMBED[$linei+4]->[$j]) / 4;
      }
      #now compute the similarity of the two averaged vectors...
      $sim = cossim(\@vec1, \@vec2); $sim = sprintf("%0.4f", $sim);
      print OUT2 "$OFFSETS[$linei],$sim,$LEAD[$linei],$PREVIEWURL[$linei]\n";
   }

}
close(OUT1);
close(OUT2);
##################################################################################################################
##################################################################################################################
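As a follow-on to the script above, here is a minimal sketch of how its CSV output might be turned into a list of candidate story boundaries. It assumes rows in the `offset,similarity,lead,previewUrl` layout the script writes to COMP1.csv and COMP2.csv. The sub name `candidate_boundaries`, the sample rows and the example.com URLs are all hypothetical, and the cutoff is passed in as a parameter because the right value depends on the broadcast.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the offsets of rows whose similarity falls below $threshold.
# Each row is "offset,similarity,lead,previewUrl".
sub candidate_boundaries {
   my ($threshold, @rows) = @_;
   my @offsets;
   for my $row (@rows) {
      chomp(my $line = $row);
      # The script strips commas from the lead text, so the first three
      # commas delimit the fields; anything after that stays in the URL.
      my ($offset, $sim, $lead, $url) = split(/,/, $line, 4);
      # Skip rows whose similarity field is missing or not a plain decimal.
      next unless defined($sim) && $sim =~ /^-?\d+(?:\.\d+)?$/;
      push(@offsets, $offset) if $sim < $threshold;
   }
   return @offsets;
}

# Made-up rows for illustration; real rows would be read from the CSV files.
my @rows = (
   "10,0.4312,first story line one,https://example.com/a",
   "14,0.3875,first story line two,https://example.com/b",
   "19,0.0421,second story opens,https://example.com/c",
   "23,0.5120,second story continues,https://example.com/d",
   "27,-0.0133,third story opens,https://example.com/e",
);

print join(",", candidate_boundaries(0.1, @rows)), "\n";   # prints 19,27
```

Raising the cutoff catches more transitions at the cost of more false positives, so it is worth trying several values against a human-annotated broadcast.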

We hope this inspires you for your own experiments! Further refinements could include using the onscreen OCR, shot change and other visual cues to add additional signals for story transitions.