Data Science Accelerated: Gemini Cloud Assist In BigQuery

One of the most cumbersome tasks in data science is all of the bespoke ad hoc code that must be written on the fly: initial surveys of the data, the myriad rounds of cleaning and casting, various statistical intermediaries, and the ultimate iterative pursuit of a given finding. When working with immense datasets at GDELT scale, this kind of interactive analysis typically happens in BigQuery, where it turns out Gemini Cloud Assist can be a huge accelerator, as we learned today while interactively compiling statistics on our story inventory work to date: we let Gemini write all of the core analytic code we needed to assess Gemini's performance on our segmentation task, coding with it interactively and iteratively! We started by co-coding with Gemini, giving it schemas and operators and asking it to write SQL to perform specific tasks, then eased into just giving it plain-English descriptions of what we wanted and having it write the SQL. Incredibly, for our final analysis we simply asked Gemini in plain English to write SQL to compute pairwise cross-correlations of all columns across our dataset and output them as a list, then reformat the list as a correlation matrix, then give us step-by-step instructions to turn it into a heatmap in Excel. Amazingly, every line of SQL was flawless and we didn't have to debug or change a single line!

To compile overall statistics on the performance of our Gemini-based story segmentation work, we need to extract and analyze various fields from our logging metadata stored in Bigtable, via BigQuery's Bigtable integration. This means extracting each field (remembering to use JSON_EXTRACT_SCALAR()) and casting it to a numeric type (remembering to use SAFE_CAST(), since some legacy values may be null). Instead of having to remember all of this and copy-paste and edit multiple lines of code, we simply copy-pasted the textual schema and asked Gemini to write the code for us using the Cloud Assist pane that appears directly in BigQuery:

write me the query to extract the following fields from JSON field FIELD:
status STRING apiFinishReason STRING durSec NUMERIC tokensPrompt NUMERIC tokensCandidates NUMERIC tokensThoughts NUMERIC promptWords NUMERIC promptChars NUMERIC cntStories NUMERIC cntStoryClips NUMERIC emotionsCnt NUMERIC entitiesCnt NUMERIC keywordsCnt NUMERIC locationsCnt NUMERIC

This yielded the following, to which we added a few filtering lines. Look at all of that extraction and casting SQL we didn't have to write! Note how Gemini even correctly recognized that the first two fields are STRINGs, so didn't attempt to cast them to numeric!

SELECT
  JSON_EXTRACT_SCALAR(FIELD, '$.status') status,
  JSON_EXTRACT_SCALAR(FIELD, '$.apiFinishReason') apiFinishReason,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.durSec') AS NUMERIC) durSec,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.tokensPrompt') AS NUMERIC) tokensPrompt,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.tokensCandidates') AS NUMERIC) tokensCandidates,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.tokensThoughts') AS NUMERIC) tokensThoughts,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.promptWords') AS NUMERIC) promptWords,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.promptChars') AS NUMERIC) promptChars,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.cntStories') AS NUMERIC) cntStories,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.cntStoryClips') AS NUMERIC) cntStoryClips,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.emotionsCnt') AS NUMERIC) emotionsCnt,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.entitiesCnt') AS NUMERIC) entitiesCnt,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.keywordsCnt') AS NUMERIC) keywordsCnt,
  SAFE_CAST(JSON_EXTRACT_SCALAR(FIELD, '$.locationsCnt') AS NUMERIC) locationsCnt
FROM (
  SELECT
    rowkey,
    SUBSTR(rowkey, 0, 8) AS day,
    ( SELECT ARRAY(SELECT value FROM UNNEST(cell))[OFFSET(0)] FROM UNNEST(cf.column) WHERE name IN ('STORYINV') ) FIELD,
    ( SELECT ARRAY(SELECT value FROM UNNEST(cell))[OFFSET(0)] FROM UNNEST(cf.column) WHERE name IN ('DOWN') ) DOWN
  FROM bigtableconnections.digtwin
  WHERE SAFE_CAST(SUBSTR(rowkey, 0, 8) AS NUMERIC) > 20000000 AND rowkey LIKE '20%'
) WHERE JSON_EXTRACT_SCALAR(DOWN, '$.status') = 'SUCCESS' AND JSON_EXTRACT_SCALAR(FIELD, '$.status') IS NOT NULL
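The SAFE_CAST() calls above are what make this query robust: they quietly return NULL for any legacy null or malformed values instead of failing the whole query. A quick Python analog of the extract-and-safe-cast pattern, as a local illustration only (the record and field names are hypothetical samples modeled on the schema above):

```python
import json
from decimal import Decimal, InvalidOperation

def safe_cast_numeric(value):
    """Mimic BigQuery SAFE_CAST(x AS NUMERIC): return None instead of erroring."""
    if value is None:
        return None
    try:
        return Decimal(str(value))
    except InvalidOperation:
        return None

def extract_fields(record_json, numeric_fields):
    """Mimic JSON_EXTRACT_SCALAR + SAFE_CAST over one logging record."""
    record = json.loads(record_json)
    out = {"status": record.get("status")}  # STRING fields pass through uncast
    for name in numeric_fields:
        out[name] = safe_cast_numeric(record.get(name))
    return out

# A sample record: one good value, one legacy null, one missing field.
row = extract_fields('{"status": "SUCCESS", "durSec": "12.5", "tokensPrompt": null}',
                     ["durSec", "tokensPrompt", "cntStories"])
print(row)  # {'status': 'SUCCESS', 'durSec': Decimal('12.5'), 'tokensPrompt': None, 'cntStories': None}
```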

We ran this query to generate a temporary table with the results and then asked Gemini to write the SQL to sum all the fields:

write the SQL to SUM() all the fields

Note how Gemini even correctly excludes the two STRING columns!

SELECT SUM(durSec) AS sum_dursec,
  SUM(tokensPrompt) AS sum_tokensprompt,
  SUM(tokensCandidates) AS sum_tokenscandidates,
  SUM(tokensThoughts) AS sum_tokensthoughts,
  SUM(promptWords) AS sum_promptwords,
  SUM(promptChars) AS sum_promptchars,
  SUM(cntStories) AS sum_cntstories,
  SUM(cntStoryClips) AS sum_cntstoryclips,
  SUM(emotionsCnt) AS sum_emotionscnt,
  SUM(entitiesCnt) AS sum_entitiescnt,
  SUM(keywordsCnt) AS sum_keywordscnt,
  SUM(locationsCnt) AS sum_locationscnt FROM `[TMPTABLE]`

Let's go for broke and not even hint to Gemini what SQL operators to use: we'll just describe what we want in plain English and let it figure things out for us:

make me a histogram of tokensThoughts, sorted by number of rows with each value

This yields:

SELECT
  tokensThoughts,
  COUNT(*) AS row_count
FROM
  `[TMPTABLE]`
WHERE
  tokensThoughts IS NOT NULL
GROUP BY tokensThoughts
ORDER BY row_count DESC;
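The GROUP BY / COUNT(*) / ORDER BY pattern Gemini chose is easy to sanity-check locally. A minimal Python sketch of the same tally using collections.Counter (the sample values are illustrative, not real data):

```python
from collections import Counter

# Hypothetical tokensThoughts values as they might come back from the table.
tokens_thoughts = [1024, 2048, 1024, None, 4096, 1024, 2048]

# WHERE tokensThoughts IS NOT NULL, then GROUP BY value with COUNT(*) per group.
histogram = Counter(v for v in tokens_thoughts if v is not None)

# ORDER BY row_count DESC.
for value, row_count in histogram.most_common():
    print(value, row_count)
```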

Now let's try something that would ordinarily be much, much messier and time-consuming: asking Gemini for the SQL to compute the pairwise correlation of every numeric column over the entire dataset. This can be done several ways, but requires either brute-force coding all of the various column pairs or some complex nesting. Either way would take time. Let's just ask Gemini to do it for us:

perfect. now give me the pairwise pearson correlation of each of the numeric values and output each as a row sorted from highest to lowest correlation.

Yielding the following code that worked perfectly (truncated for space):

SELECT
  field1,
  field2,
  corr_value
FROM (
  SELECT
    'durSec' AS field1,
    'tokensPrompt' AS field2,
    CORR(durSec, tokensPrompt) AS corr_value
  FROM
    `[TMPTABLE]`
  WHERE
    durSec IS NOT NULL AND tokensPrompt IS NOT NULL

  UNION ALL

  SELECT
    'durSec' AS field1,
    'tokensCandidates' AS field2,
    CORR(durSec, tokensCandidates) AS corr_value
  FROM
    `[TMPTABLE]`
  WHERE
    durSec IS NOT NULL AND tokensCandidates IS NOT NULL

  UNION ALL

  SELECT
    'durSec' AS field1,
    'tokensThoughts' AS field2,
    CORR(durSec, tokensThoughts) AS corr_value
  FROM
    `[TMPTABLE]`
  WHERE
    durSec IS NOT NULL AND tokensThoughts IS NOT NULL

  UNION ALL
............
)
ORDER BY
  ABS(corr_value) DESC;
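The brute-force enumeration Gemini saved us from typing is mechanical enough to generate programmatically. A short Python sketch that emits the same UNION ALL query from a column list (shown with a subset of our columns; `[TMPTABLE]` is the same placeholder used above):

```python
from itertools import combinations

columns = ["durSec", "tokensPrompt", "tokensCandidates", "tokensThoughts"]

# One SELECT per unordered column pair, mirroring the UNION ALL blocks above.
blocks = []
for a, b in combinations(columns, 2):
    blocks.append(
        f"SELECT '{a}' AS field1, '{b}' AS field2, CORR({a}, {b}) AS corr_value\n"
        f"FROM `[TMPTABLE]`\nWHERE {a} IS NOT NULL AND {b} IS NOT NULL"
    )

sql = ("SELECT field1, field2, corr_value FROM (\n"
       + "\nUNION ALL\n".join(blocks)
       + "\n) ORDER BY ABS(corr_value) DESC;")
print(sql)
```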

Running the query produced exactly what we expected:

field1 field2 corr_value
tokensPrompt promptChars 0.997927421
promptWords promptChars 0.996994463
tokensPrompt promptWords 0.993297727
emotionsCnt keywordsCnt 0.983410468
durSec tokensPrompt 0.926492994
durSec promptChars 0.924034464
durSec promptWords 0.919612691
cntStories keywordsCnt 0.914936234
cntStories emotionsCnt 0.902522682
keywordsCnt locationsCnt 0.741347453
tokensCandidates entitiesCnt 0.733144453
emotionsCnt locationsCnt 0.73139142
cntStories locationsCnt 0.689335368
tokensPrompt tokensCandidates 0.678399271

In retrospect, though, this isn't the best way to present this information, since a long list makes it difficult to visually spot underlying patterns. Pairwise correlations are typically presented instead in what's called a "correlation matrix". Let's ask Gemini to give us that.
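The reshaping involved is simple to sketch in Python: pivot the (field1, field2, corr_value) list into a symmetric matrix with 1.0 on the diagonal, rows and columns sorted alphabetically. This is a local illustration using a subset of the values above, not the BigQuery approach:

```python
# Subset of the pairwise correlations from the list above.
pairs = [
    ("tokensPrompt", "promptChars", 0.997927421),
    ("promptWords", "promptChars", 0.996994463),
    ("tokensPrompt", "promptWords", 0.993297727),
]

# Alphabetical row/column order.
fields = sorted({f for a, b, _ in pairs for f in (a, b)})

# Symmetric matrix: corr(a, b) == corr(b, a); corr(a, a) == 1.0.
matrix = {a: {b: (1.0 if a == b else None) for b in fields} for a in fields}
for a, b, r in pairs:
    matrix[a][b] = matrix[b][a] = r

for row in fields:
    print(row, [matrix[row][col] for col in fields])
```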

that's perfect, now give me the code to output as a correlation matrix instead with the rows and columns ordered alphabetically top to bottom and left to right.

Incredibly, Gemini instantly transforms the output from a long list into the requested correlation matrix (truncated for space):

SELECT
  field_name,
  CORR(cntStories, value) AS cntStories,
  CORR(cntStoryClips, value) AS cntStoryClips,
  CORR(durSec, value) AS durSec,
  CORR(emotionsCnt, value) AS emotionsCnt,
  CORR(entitiesCnt, value) AS entitiesCnt,
  CORR(keywordsCnt, value) AS keywordsCnt,
  CORR(locationsCnt, value) AS locationsCnt,
  CORR(promptChars, value) AS promptChars,
  CORR(promptWords, value) AS promptWords,
  CORR(tokensCandidates, value) AS tokensCandidates,
  CORR(tokensPrompt, value) AS tokensPrompt,
  CORR(tokensThoughts, value) AS tokensThoughts
FROM
  (
    SELECT
      cntStories,
      cntStoryClips,
      durSec,
      emotionsCnt,
      entitiesCnt,
      keywordsCnt,
      locationsCnt,
      promptChars,
      promptWords,
      tokensCandidates,
      tokensPrompt,
      tokensThoughts,
      'cntStories' AS field_name,
      cntStories AS value
    FROM
      `[TMPTABLE]`
    WHERE
      cntStories IS NOT NULL
    UNION ALL
    SELECT
      cntStories,
      cntStoryClips,
      durSec,
      emotionsCnt,
      entitiesCnt,
      keywordsCnt,
      locationsCnt,
      promptChars,
      promptWords,
      tokensCandidates,
      tokensPrompt,
      tokensThoughts,
      'cntStoryClips' AS field_name,
      cntStoryClips AS value
    FROM
      `[TMPTABLE]`
    WHERE
      cntStoryClips IS NOT NULL
    UNION ALL
..........
  )
GROUP BY
  field_name
ORDER BY
  field_name;

Yielding exactly what we requested:

field_name cntStories cntStoryClips durSec emotionsCnt entitiesCnt keywordsCnt locationsCnt promptChars promptWords tokensCandidates tokensPrompt tokensThoughts
cntStories 1 0.200296346 0.179297425 0.902522682 0.573129814 0.914936234 0.689335368 0.131207921 0.126558097 0.599625358 0.141428972 0.357401833
cntStoryClips 0.200296346 1 0.452408735 0.20810114 0.294740727 0.211894857 0.221829594 0.418794685 0.4134904 0.43909004 0.420511932 0.293430561
durSec 0.179297425 0.452408735 1 0.181291738 0.491692803 0.184496431 0.234972858 0.924034464 0.919612691 0.650416452 0.926492994 0.538289572
emotionsCnt 0.902522682 0.20810114 0.181291738 1 0.613528235 0.983410468 0.73139142 0.129973678 0.124968629 0.631453648 0.14093968 0.374128798
entitiesCnt 0.573129814 0.294740727 0.491692803 0.613528235 1 0.61843528 0.562146762 0.498507607 0.493484693 0.733144453 0.50380652 0.431885144
keywordsCnt 0.914936234 0.211894857 0.184496431 0.983410468 0.61843528 1 0.741347453 0.133116151 0.128084048 0.63924357 0.144089898 0.375932311
locationsCnt 0.689335368 0.221829594 0.234972858 0.73139142 0.562146762 0.741347453 1 0.239508566 0.23789482 0.629665871 0.241643895 0.423105498
promptChars 0.131207921 0.418794685 0.924034464 0.129973678 0.498507607 0.133116151 0.239508566 1 0.996994463 0.674127833 0.997927421 0.568473166
promptWords 0.126558097 0.4134904 0.919612691 0.124968629 0.493484693 0.128084048 0.23789482 0.996994463 1 0.66984916 0.993297727 0.563727512
tokensCandidates 0.599625358 0.43909004 0.650416452 0.631453648 0.733144453 0.63924357 0.629665871 0.674127833 0.66984916 1 0.678399271 0.426388285
tokensPrompt 0.141428972 0.420511932 0.926492994 0.14093968 0.50380652 0.144089898 0.241643895 0.997927421 0.993297727 0.678399271 1 0.570853657
tokensThoughts 0.357401833 0.293430561 0.538289572 0.374128798 0.431885144 0.375932311 0.423105498 0.568473166 0.563727512 0.426388285 0.570853657 1

Usually these kinds of tables are color-coded, with each cell shaded based on its value. Can Gemini even tell us what to do in Excel?

how can i color code this into a heatmap in excel?

Incredibly, Gemini is even able to give us step-by-step instructions on how to find and use Conditional Formatting in Excel, producing the color-coded heatmap we were after. It is hard to believe that Gemini not only coded the SQL to make this table, but even walked us through color-coding it in Excel.
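For anyone who would rather stay out of Excel entirely, the idea behind a 3-color scale is just a linear blend from red (low) through white (middle) to green (high). A minimal Python sketch of that mapping, our own illustration rather than anything Gemini produced:

```python
def heat_color(value, lo=0.0, hi=1.0):
    """Map a value to an RGB tuple: red at lo, white at the midpoint, green at hi."""
    mid = (lo + hi) / 2
    if value <= mid:
        # Blend red -> white over the lower half of the range.
        t = (value - lo) / (mid - lo)
        return (255, round(255 * t), round(255 * t))
    # Blend white -> green over the upper half.
    t = (value - hi) / (mid - hi)
    return (round(255 * t), 255, round(255 * t))

# Color a few cells from the correlation matrix.
for v in (0.0, 0.5, 1.0):
    print(v, heat_color(v))  # (255,0,0), (255,255,255), (0,255,0)
```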