The GDELT Project

Generative AI Experiments: Asking Gemini Ultra To Reformat BigTable's CBT CLI Output Into JSON

GCP has a wealth of data storage offerings, ranging from full-fledged warehouse and database platforms like BigQuery and Spanner to key-value stores like BigTable. For the purposes of building a highly scalable digital twin infrastructure, BigTable is an especially well-suited platform. For a developer entirely new to BigTable, how might GCP's own Gemini Ultra (whose coding abilities have been heavily emphasized) help that developer quickly get up to speed and write sample test scripts to examine whether BigTable might work for their application? As a first step, most developers will likely run some sample inserts and queries with BigTable's CBT CLI tool to understand how BigTable works.

To test how a coding copilot like Gemini Ultra might help, let's ask it to take the output of BigTable's CBT CLI tool and reformat it from prettified, human-readable text into machine-friendly JSON so we can wrap it into a quick demo showing how BigTable might be used in one of our workflows. Before deep diving into the rich Python or other language bindings for BigTable, most developers new to BigTable are similarly likely to run their first test queries using the CBT CLI and want JSON-formatted output to test with their workflows. Unfortunately, as the results below show, despite Gemini Ultra and BigTable both being GCP products, Ultra first hallucinates a JSON output parameter, then is ultimately unable to provide a shell one-liner that can parse the CBT output into JSON. Even providing example inputs, and eventually paired example inputs and outputs, fails to yield workable results despite extensive testing. This demonstrates one of the great challenges of copilot systems in the real world: they often hallucinate and struggle to produce workable results when deployed on even trivial real-world use cases whose answers aren't already found widely across the web.
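For readers entirely new to the CBT CLI, the quickstart setup assumed in the next step (creating a small test table and inserting a few sample rows) looks roughly like the following. This is only a minimal sketch using standard cbt subcommands, in which the instance, table, and column names simply mirror those used in the commands and output below:

# Minimal sketch only: create a test table, add a column family, and insert two sample rows.
cbt -instance myinstance createtable mytable
cbt -instance myinstance createfamily mytable content
cbt -instance myinstance set mytable value1 content:dateAdded="2024-06-01 23:36:37.395313"
cbt -instance myinstance set mytable value2 content:dateAdded="2024-06-02 14:03:35.510540"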

Let's say we've already followed a tutorial to create a new BigTable table and populate it with sample data. Now we want to read that data, so we might run:

time cbt -instance myinstance read mytable count=20

This yields output like the following, which, while readable to a human, is less than ideal for machine processing:

----------------------------------------
value1
  content:dateAdded                        @ 2024/06/01-23:36:37.395000
    "2024-06-01 23:36:37.395313"

----------------------------------------
value2
  content:dateAdded                        @ 2024/06/02-14:03:35.510000
    "2024-06-02 14:03:35.510540"

If we want to actually try processing these results in a downstream application as part of a proof-of-concept demo, we'll need to convert the output to JSON. A quick review of the CBT documentation turns up no obvious option for requesting JSON output rather than this prettified human-readable output. Let's ask Gemini:

Using the CBT BigTable CLI, output the results from "read" as JSON.

This yields the following results, which at first glance seem like exactly what we're looking for – a trivial CLI parameter to request JSON output. The only problem? There is no such flag:

Absolutely! Here's how you can achieve that using the `cbt` CLI tool, along with some important considerations:

**Core Command**

The fundamental command to read data from Bigtable using the CLI is:

```bash
cbt read  [options]
```

**Options for JSON Output**

To get JSON-like output, you'll primarily use these options:

* **`--format=json`:** This tells `cbt` to format the results as JSON.
* **`--timestamp`:** Includes timestamps (important if your data has versions).

**Example**

```bash
cbt read my-table --format=json --timestamp
```

**Sample Output (Simplified)**

```json
[
  {
    "row_key": "row1",
    "cells": [
      {
        "family": "cf1",
        "qualifier": "col1",
        "value": "value1",
        "timestamp_micros": 1624572800000000
      }
    ]
  },
  {
    "row_key": "row2",
    "cells": [
      {
        "family": "cf2",
        "qualifier": "col2",
        "value": "value2",
        "timestamp_micros": 1624572800000000
      }
    ]
  }
]
```

**Important Considerations**

* **Complex Data Types:**  If your cells contain more complex data types (e.g., protocol buffers or JSON structures within Bigtable), the `--format=json` output might still be somewhat nested. You may need to parse it further using a tool like `jq`.
* **Encoding:** By default, Bigtable cell values are byte arrays. If your data is meant to be text (like JSON strings), you might need to decode it accordingly in your processing script.

**Parsing and Post-Processing with `jq`**

```bash
cbt read my-table --format=json --timestamp | jq '.[] | {row_key, cells: [.cells[] | {family, qualifier, value, timestamp_micros}]}'
```

This refined `jq` command does the following:

1. Iterates over each row in the result.
2. Extracts the `row_key` and processes the `cells` array.
3. For each cell, extracts the desired fields (`family`, `qualifier`, `value`, `timestamp_micros`).

**Additional Tips**

* **Filtering:** Use `--regex`, `--prefix`, or `--start`/`--end` options to read specific rows.
* **Limiting:** Use `--count` or `--cells-per-column` to control the amount of data retrieved. 
* **Specific Columns:** Use `--columns` to only read certain column families and qualifiers.

**Script Example (Bash)**

```bash
#!/bin/bash
table_name="my-table"

# Read from Bigtable
json_output=$(cbt read $table_name --format=json --timestamp)

# Parse and Process JSON
parsed_output=$(echo "$json_output" | jq '.[] | {row_key, cells: [.cells[] | {family, qualifier, value, timestamp_micros}]}')

# Example: Output to a file (replace with your processing logic)
echo "$parsed_output" > output.json
```

Let me know if you have any specific data types or formatting requirements. I can help you tailor a more precise solution for your Bigtable data.

Let's tell it that there is no JSON output option:

There is no format=json option.

This yields the following script:

cbt read my-table | awk -F'[: =]+' '{ 
    printf "{ \"row_key\": \"%s\", \"family\": \"%s\", \"qualifier\": \"%s\", \"value\": \"%s\", \"timestamp_micros\": %s },\n", $1, $2, $3, $4, $5 
}' | sed '$s/,$//'

Which unfortunately does not correctly reformat CBT's output into JSON:

{ "row_key": "value1", "family": "", "qualifier": "", "value": "", "timestamp_micros": },
{ "row_key": "", "family": "content", "qualifier": "dateAdded", "value": "@", "timestamp_micros": 2024/06/02-14 },
{ "row_key": "", "family": ""2024-06-02", "qualifier": "14", "value": "03", "timestamp_micros": 35.510540" },
{ "row_key": "", "family": "", "qualifier": "", "value": "", "timestamp_micros": }

Let's ask it to try again:

This is not correct. The JSON is valid, but it does not properly parse the output from CBT.

Which outputs:

cbt read my-table | awk -F'[: =]+' 'BEGIN { printf "[" } { printf "{ \"row_key\": \"%s\", \"family\": \"%s\", \"qualifier\": \"%s\", \"value\": \"%s\", \"timestamp_micros\": %s },", $1, $2, $3, $4, $5 } END { print "]"}' | sed '$s/,$//'

Which yields the same kind of broken results:

{ "row_key": "", "family": ""2024-06-01", "qualifier": "17", "value": "59", "timestamp_micros": 09.181951" },
{ "row_key": "", "family": "", "qualifier": "", "value": "", "timestamp_micros": },
{ "row_key": "----------------------------------------", "family": "", "qualifier": "", "value": "", "timestamp_micros": }

Repeated attempts continued to fail, so let's switch gears to giving it an example of the output instead:

Write a Linux one-liner that parses this into JSON:

----------------------------------------
value1
  content:dateAdded                        @ 2024/06/02-14:03:20.102000
    "2024-06-02 14:03:20.102800"

----------------------------------------
value2
  content:dateAdded                        @ 2024/06/01-17:59:09.181000
    "2024-06-01 17:59:09.181951"

----------------------------------------
value3
  content:dateAdded                        @ 2024/06/01-19:52:56.643000
    "2024-06-01 19:52:56.643232"

----------------------------------------
value4
  content:dateAdded                        @ 2024/06/01-23:36:37.395000
    "2024-06-01 23:36:37.395313"

----------------------------------------
value5
  content:dateAdded                        @ 2024/06/02-14:03:35.510000
    "2024-06-02 14:03:35.510540"

Which yields:

sed -n '/^[0-9]/p' bigtable_output.txt | awk -F'[: @]+' '{ print "{\"row_key\": \""$1"\", \"dateAdded\": \"" $NF "\"}" }' | jq -s '.'

Which produces the following output that is still wrong:

[
  {
    "row_key": "value1",
    "dateAdded": "value1"
  },
  {
    "row_key": "value2",
    "dateAdded": "value2"
  },
  {
    "row_key": "value3",
    "dateAdded": "value3"
  },

Let's ask for a new formulation:

The dateAdded isn't being properly extracted.

Which yields:

sed -n '/^[0-9]/p' | awk -F'[: @]+' '{ print "{\"row_key\": \""$1"\", \"dateAdded\": \"" $6 " " $7 "\"}" }' | jq -s '.'

Which is still wrong:

[
  {
    "row_key": "value1",
    "dateAdded": " "
  },
  {
    "row_key": "value2",
    "dateAdded": " "
  },
  {
    "row_key": "value3",
    "dateAdded": " "
  },

Let's try one more time by giving it both a sample input and the desired JSON output:

Write a Linux one-liner that parses this into JSON: 

INPUT:
----------------------------------------
value1
  content:dateAdded                        @ 2024/06/02-14:03:20.102000
    "2024-06-02 14:03:20.102800"

----------------------------------------
value2
  content:dateAdded                        @ 2024/06/01-17:59:09.181000
    "2024-06-01 17:59:09.181951"


JSON OUTPUT:
{"value":"value1", "dateAdded":"2024-06-02 14:03:20.102800"},
{"value":"value2", "dateAdded":"2024-06-01 17:59:09.181951"},

Which yields:

awk -v RS="----------------------------------------\n" -v ORS= 'NF>1{sub(/.*@ /,"", $2); sub(/\..*/,"", $3); print "{\"value\":\""$1"\", \"dateAdded\":\""$3"\"},"}' input.txt | sed 's/},$/}]/'

Which produces:

{"value":"value1", "dateAdded":"@"},
{"value":"value2", "dateAdded":"@"},
{"value":"value3", "dateAdded":"@"},

Repeated requests merely yielded various refinements that never actually produced correct results.
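For reference, a hand-written command that does handle the sample output above might look something like the following. This is only a sketch, not one of Gemini's suggestions, and it assumes the exact record layout shown earlier: a dashed separator line, an unindented row key, and a single indented, quoted content:dateAdded cell value per row (spread over several lines here for readability, but it collapses to a single pipeline):

awk '
  /^-+$/  { key = ""; next }    # dashed separator line: start of a new record
  /^[^ ]/ { key = $0; next }    # unindented line: the row key
  /^ +"/  {                     # indented quoted line: the cell value
            gsub(/^ +"|"$/, "")
            printf "{\"value\":\"%s\", \"dateAdded\":\"%s\"}\n", key, $0
          }
' bigtable_output.txt | jq -s '.'

Run against the sample rows above, this emits one JSON object per row and lets jq's slurp mode assemble them into a proper JSON array, rather than the trailing-comma lines shown in the example output given to Gemini.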