The GDELT Project

Behind The Scenes: Comparing Bigtable's Python & Go Libraries & Using Gemini 2.5 Pro To Translate Python To Go

GDELT brings together myriad tools, APIs, libraries, scripts and binaries written across a range of programming languages, only some of which offer native client libraries for GCP's Bigtable platform, which we use as a digital twin over GCS. To allow all of these languages and tools to interact with our Bigtable twin, we wrote a simple Python script that accepts a JSON-NL file containing a list of search prefixes, one per line, searches the specified Bigtable instance and table, and writes all of the matching results to a JSON-NL output file. Any GDELT system can thus query our Bigtable digital twin by simply invoking this script on the command line. While performant once it begins streaming results, the script typically incurs around one second of cold startup time. It also consumes a substantial amount of CPU, completely saturating the underlying VM when one copy is run per core. Would converting the script to Go reduce cold startup time and CPU load?
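To make the interface concrete, here is a hypothetical sketch of the line formats, based on the Python script shown at the bottom of this post (the row keys and cell values are invented; DOWN and DOWN_SYNC are the column names used in the benchmarks below). Each input line is a JSON object with either a "key" or a "prefix" field, each output line is a JSON object containing the matching row key and its requested columns, and key lookups that return no rows are reported as "noMatches" records:

#ILLUSTRATIVE INPUT - hypothetical example
{"prefix": "20250409"}
{"key": "20250409-missingkey"}

#ILLUSTRATIVE OUTPUT - hypothetical example
{"key": "20250409000000-somerow", "DOWN": "...", "DOWN_SYNC": "..."}
{"noMatches": "20250409-missingkey"}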

Without knowing a line of Go, we handed Gemini 2.5 Pro our Python script and asked it to translate it into Go and provide us with installation and execution instructions. The first output contained a number of errors and would not build, but after providing those errors back to the model and asking it to correct them, the second iteration worked perfectly, demonstrating modern models' considerable improvements in code translation. The Go version was more than 70% faster than the Python script while reducing CPU usage from 90% to just 10-20%! This is a critical reminder that, while ubiquitous, Python is not necessarily the most performant choice and that even the most trivial GCP service interactions should be benchmarked across different languages.

For those interested in the technical details, below are the commands we used for benchmarking, and you can find both scripts at the bottom of this blog post. All tests were conducted on a brand-new 64-core N1 GCE VM using a RAMDISK to prevent storage overhead from impacting the benchmarks.

First we tested a prefix that returns 15,413 results, running it both a single time and 1,000 times in parallel across all 64 cores. The 1,000 parallel runs download 17GB of data in total. CPU totals are reported as the median from "top" over the course of the run.

#SEARCH FILE - search1.json
{"prefix": "2025040"}

#PYTHON SINGLE QUERY - 1.76s
rm results.json;time ./bigtable_read.py --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search1.json --output-file ./results.json --batch-size 50; wc -l results.json

#PYTHON BULK QUERY - 0m54.564s real / 28m4.740s user @ 90% CPU
rm -rf CACHE; mkdir CACHE
time seq 0 1000 | parallel './bigtable_read.py --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search1.json --output-file ./CACHE/results.{}.json --batch-size 100'&

#GO SINGLE QUERY - 0.99s
rm results.json;time ./bigtable_query --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search1.json --output-file ./results.json --batch-size 50; wc -l results.json

#GO BULK QUERY - 0m45.800s real / 7m29.565s user @ 10% CPU
rm -rf CACHE; mkdir CACHE
time seq 0 1000 | parallel './bigtable_query --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search1.json --output-file ./CACHE/results.{}.json --batch-size 100'&

We also tested a query with a smaller result count that returns just 1,712 results and 1.8GB in total:

#SEARCH FILE - search2.json
{"prefix": "20250409"}

#PYTHON SINGLE QUERY - 0.94s 
rm results.json;time ./bigtable_read.py --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search2.json --output-file ./results.json --batch-size 50; wc -l results.json

#PYTHON BULK QUERY - 0m20.343s real / 14m18.881s user @ 70% CPU
rm -rf CACHE; mkdir CACHE
time seq 0 1000 | parallel './bigtable_read.py --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search2.json --output-file ./CACHE/results.{}.json --batch-size 50'

#GO SINGLE QUERY - 0.5s
rm results.json;time ./bigtable_query --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search2.json --output-file ./results.json --batch-size 50; wc -l results.json

#GO BULK QUERY - 0m7.841s real / 2m1.211s user @ 22% CPU
rm -rf CACHE; mkdir CACHE
time seq 0 1000 | parallel './bigtable_query --project-id [PROJECTID] --instance-id [INSTANCEID] --table-id [TABLE] --columns DOWN,DOWN_SYNC --input-file ./search2.json --output-file ./CACHE/results.{}.json --batch-size 50'

Below you can find the original Python script. When we first built it last year, we attempted to have Gemini 1.5 Pro write it for us, but after many hours of iteration it was unable to produce code that actually functioned properly, so we ended up writing the script by hand, adapting and blending numerous code snippets from the GCP documentation pages:

import argparse
import json
import logging
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet
from google.cloud.bigtable import row_filters
import re

def parse_arguments():
    parser = argparse.ArgumentParser(description="Process JSON-NL file and query Bigtable.")
    parser.add_argument('--project-id', required=True, help='GCP Project ID')
    parser.add_argument('--instance-id', required=True, help='Bigtable Instance ID')
    parser.add_argument('--table-id', required=True, help='Bigtable Table ID')
    parser.add_argument('--input-file', required=True, help='Path to the JSON-NL input file')
    parser.add_argument('--output-file', required=True, help='Path to the JSON-NL output file')
    parser.add_argument('--columns', required=False, help='Comma-separated list of columns to retrieve')
    parser.add_argument('--batch-size', type=int, default=100, help='Batch size for Bigtable reads')
    return parser.parse_args()

def read_input_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield json.loads(line)

def create_row_set(records):
    row_set = RowSet()
    for record in records:
        key = record.get('key')
        prefix = record.get('prefix')
        if key:
            row_set.add_row_key(key.encode('utf-8'))
        elif prefix:
            # Convert the prefix into a [start, end) row range by incrementing
            # the final byte of the prefix (assumes that byte is not 0xff).
            start_key = prefix.encode('utf-8')
            end_key = start_key[:-1] + bytes([start_key[-1] + 1])
            row_set.add_row_range_from_keys(start_key, end_key)
    return row_set


def query_bigtable(table, row_set, columns):
    if columns:
        # Restrict the read to the requested column qualifiers via a regex filter.
        col_regex = f"({'|'.join(re.escape(key) for key in columns)})"
        col_filter = row_filters.ColumnQualifierRegexFilter(col_regex.encode("utf-8"))
        rows = table.read_rows(row_set=row_set, filter_=col_filter)
    else:
        rows = table.read_rows(row_set=row_set)
    rows.consume_all()

    results = []
    for row_key, row in rows.rows.items():
        result = {'key': row_key.decode('utf-8')}
        for cf, cols in row.cells.items():
            for col, cells in sorted(cols.items()):
                result[col.decode('utf-8')] = cells[0].value.decode('utf-8')        
        results.append(result)
    return results


def process_batch(table, batch, columns, output_file):
    row_set = create_row_set(batch)
    try:
        results = query_bigtable(table, row_set, columns)
        found_keys = set(result['key'] for result in results)
        for record in batch:
            key_or_prefix = record.get('key') #or record.get('prefix')
            if key_or_prefix not in found_keys:
                results.append({"noMatches": key_or_prefix})
        for result in results:
            json.dump(result, output_file, ensure_ascii=False)
            output_file.write('\n')
    except Exception as e:
        logging.error(f"Error querying Bigtable: {e}")
        for record in batch:
            error_record = {"error": str(e)}
            key_or_prefix = record.get('key') or record.get('prefix')
            error_record["key_or_prefix"] = key_or_prefix
            json.dump(error_record, output_file, ensure_ascii=False)
            output_file.write('\n')

def main():
    args = parse_arguments()

    client = bigtable.Client(project=args.project_id, admin=True)
    instance = client.instance(args.instance_id)
    table = instance.table(args.table_id)

    if args.columns:
        columns = args.columns.split(',')
    else:
        columns = []
    batch_size = args.batch_size

    input_records = read_input_file(args.input_file)
    batch = []

    with open(args.output_file, 'w', encoding='utf-8') as outfile:
        for record in input_records:
            batch.append(record)
            if len(batch) >= batch_size:
                process_batch(table, batch, columns, outfile)
                batch = []

        # Process remaining records
        if batch:
            process_batch(table, batch, columns, outfile)

if __name__ == '__main__':
    main()

You can see the version of the same script as ported by Gemini 2.5 Pro below:

package main

import (
        "bufio"
        "cloud.google.com/go/bigtable"
        "context"
        "encoding/json"
        "flag"
        "fmt"
        "log"
        "os"
        "regexp"
        "strings"
        "sync" // Used for safe access to shared maps in ReadRows callback
)

// InputRecord defines the structure expected in the input JSON-NL file
type InputRecord struct {
        Key    *string `json:"key"`    // Use pointers to distinguish between missing and empty
        Prefix *string `json:"prefix"`
        // Add other fields from input if needed, though they aren't used for querying
}

// OutputRecord defines the structure for successful Bigtable lookups
type OutputRecord map[string]string // Flexible map: "key": rowKey, "colName": value

// NoMatchRecord defines the structure when a key/prefix is not found
type NoMatchRecord struct {
        NoMatches string `json:"noMatches"`
}

// ErrorRecord defines the structure for errors during processing
type ErrorRecord struct {
        Error        string `json:"error"`
        KeyOrPrefix string `json:"key_or_prefix"`
}

func main() {
        projectID := flag.String("project-id", "", "GCP Project ID (required)")
        instanceID := flag.String("instance-id", "", "Bigtable Instance ID (required)")
        tableID := flag.String("table-id", "", "Bigtable Table ID (required)")
        inputFile := flag.String("input-file", "", "Path to the JSON-NL input file (required)")
        outputFile := flag.String("output-file", "", "Path to the JSON-NL output file (required)")
        columnsStr := flag.String("columns", "", "Comma-separated list of columns (qualifiers) to retrieve (e.g., cf1:col1,cf2:col2)")
        batchSize := flag.Int("batch-size", 100, "Batch size for Bigtable reads")

        flag.Parse()

        // Basic validation
        if *projectID == "" || *instanceID == "" || *tableID == "" || *inputFile == "" || *outputFile == "" {
           flag.Usage()
           log.Fatal("Missing required arguments")
        }
        if *batchSize <= 0 {
                log.Fatal("Batch size must be positive")
        }

        ctx := context.Background()

        // --- Bigtable Client Setup ---
        client, err := bigtable.NewClient(ctx, *projectID, *instanceID)
        if err != nil {
                log.Fatalf("Failed to create Bigtable client: %v", err)
        }
        defer client.Close()

        tbl := client.Open(*tableID)

        // --- Column Filter Setup ---
        var columns []string
        var colFilter bigtable.Filter
        if *columnsStr != "" {
                columns = strings.Split(*columnsStr, ",")
                // Escape special characters for regex and join with |
                escapedCols := make([]string, len(columns))
                for i, c := range columns {
                        // Basic parsing assuming "family:qualifier" or just "qualifier"
                        // For simplicity, we'll filter only on qualifier part if no family provided,
                        // though Bigtable filters often benefit from specifying the family.
                        parts := strings.SplitN(c, ":", 2)
                        qualifier := c
                        if len(parts) == 2 {
                                qualifier = parts[1]
                        }
                        escapedCols[i] = regexp.QuoteMeta(qualifier)
                }
                // Match exact qualifiers anchored to start/end. If you need partial matches, adjust regex.
                // Note: Bigtable regex is RE2. This matches full qualifier.
                colRegex := fmt.Sprintf("^(%s)$", strings.Join(escapedCols, "|"))
                colFilter = bigtable.ColumnFilter(colRegex)
                log.Printf("Using column filter regex: %s", colRegex)
        } else {
                log.Println("No column filter specified, retrieving all columns.")
                // PassAllFilter retrieves everything
                colFilter = bigtable.PassAllFilter()
        }

        // --- File Handling Setup ---
        inFile, err := os.Open(*inputFile)
        if err != nil {
                log.Fatalf("Failed to open input file %s: %v", *inputFile, err)
        }
        defer inFile.Close()

        outFile, err := os.Create(*outputFile)
        if err != nil {
                log.Fatalf("Failed to create output file %s: %v", *outputFile, err)
        }
        defer outFile.Close()

        writer := bufio.NewWriter(outFile)
        defer writer.Flush()               // Ensure buffer is written at the end
        encoder := json.NewEncoder(writer) // Use encoder for efficient JSON writing

        scanner := bufio.NewScanner(inFile)
        // Increase buffer size for scanner if lines can be very long
        // const maxCapacity = 1024 * 1024 // 1MB for example
        // buf := make([]byte, maxCapacity)
        // scanner.Buffer(buf, maxCapacity)

        batch := make([]InputRecord, 0, *batchSize)

        // --- Processing Loop ---
        log.Println("Starting processing...")
        lineNum := 0
        for scanner.Scan() {
                lineNum++
                var record InputRecord
                line := scanner.Bytes() // Use Bytes() to avoid string conversion if possible

                // Skip empty lines
                if len(line) == 0 {
                        continue
                }

                if err := json.Unmarshal(line, &record); err != nil {
                        log.Printf("WARN: Skipping invalid JSON line #%d: %v (line: %s)", lineNum, err, string(line))
                        errorRec := ErrorRecord{
                                Error:       fmt.Sprintf("Invalid JSON: %v", err),
                                KeyOrPrefix: string(line), // Log the raw line on error
                        }
                        _ = encoder.Encode(errorRec) // Best effort write
                        continue
                }

                // Basic validation of input record
                if record.Key == nil && record.Prefix == nil {
                        log.Printf("WARN: Skipping record at line #%d with neither key nor prefix: %s", lineNum, string(line))
                        errorRec := ErrorRecord{
                                Error:       "Record missing 'key' or 'prefix'",
                                KeyOrPrefix: string(line),
                        }
                        _ = encoder.Encode(errorRec) // Best effort write
                        continue
                }
                if record.Key != nil && record.Prefix != nil {
                        log.Printf("WARN: Record at line #%d has both key and prefix, using key: %s", lineNum, string(line))
                        record.Prefix = nil // Prioritize key if both exist
                }

                batch = append(batch, record)
                if len(batch) >= *batchSize {
                        processBatch(ctx, tbl, batch, colFilter, encoder)
                        batch = make([]InputRecord, 0, *batchSize) // Reset batch
                }
        }

        // Process any remaining records in the last batch
        if len(batch) > 0 {
           processBatch(ctx, tbl, batch, colFilter, encoder)
        }

        if err := scanner.Err(); err != nil {
           // Check for buffer too small error specifically
           if err == bufio.ErrTooLong {
                log.Fatalf("Error reading input file: Line too long. Consider increasing scanner buffer size.")
           } else {
                log.Fatalf("Error reading input file: %v", err)
           }
        }

        log.Println("Processing complete.")
}

// processBatch queries Bigtable for a batch of records and writes results/errors
func processBatch(ctx context.Context, tbl *bigtable.Table, batch []InputRecord, colFilter bigtable.Filter, encoder *json.Encoder) {
        log.Printf("Processing batch of %d records", len(batch))

        rowList := bigtable.RowList{}
        rowRangeList := bigtable.RowRangeList{}
        originalKeys := make(map[string]string)    // Map original input -> original input (used for tracking misses)
        prefixInputs := make(map[string]struct{}) // Store which inputs were prefixes

        for _, record := range batch {
           if record.Key != nil {
                keyStr := *record.Key
                rowList = append(rowList, keyStr)
                originalKeys[keyStr] = keyStr // Track by the exact key
           } else if record.Prefix != nil {
                prefixStr := *record.Prefix
                startKey := []byte(prefixStr)
                prefixInputs[prefixStr] = struct{}{} // Mark this input as a prefix

                // Create end key for prefix scan
                var currentRange bigtable.RowRange
                if len(startKey) == 0 {
                        // Empty prefix means scan everything from the beginning
                        currentRange = bigtable.InfiniteRange("")
                } else {
                        // Calculate the end of the range for the prefix
                        // Find the smallest byte string strictly greater than the prefix
                        endKey := make([]byte, len(startKey))
                        copy(endKey, startKey)
                        i := len(endKey) - 1
                        for i >= 0 {
                                if endKey[i] < 0xff {
                                        endKey[i]++
                                        endKey = endKey[:i+1] // Truncate any theoretical rolled-over bytes
                                        break
                                }
                                // If byte is 0xff, conceptually reset it to 0x00 and carry over to the next byte
                                endKey[i] = 0x00
                                i--
                        }
                        // If i < 0, it means all bytes were 0xff. The range extends infinitely after the prefix.
                        // The most practical range for "starts with prefix" is PrefixRange.
                        // Otherwise, use the calculated endKey.
                        if i < 0 {
                                currentRange = bigtable.PrefixRange(prefixStr)
                        } else {
                                currentRange = bigtable.NewRange(prefixStr, string(endKey))
                        }
                }
                rowRangeList = append(rowRangeList, currentRange)
                originalKeys[prefixStr] = prefixStr // Track by the prefix string
           }
        }

        foundResults := make(map[string]OutputRecord) // key -> formatted output record
        var resultsMutex sync.Mutex                   // Protect concurrent writes to foundResults map
        var batchErr error                            // Store error from reads

        // --- Query Bigtable (Separate calls for keys and ranges) ---

        // Callback function to process rows found
        processRowFunc := func(row bigtable.Row) bool {
           key := row.Key()
           outputRec := make(OutputRecord)
           outputRec["key"] = key // Always include the row key

           // Process cells for this row
           for cf, items := range row {
                _ = cf // cf is column family name
                for _, item := range items {
                        // item.Column is "family:qualifier"
                        // Only take the latest value (default behavior of ReadRows without version filters)
                        outputRec[item.Column] = string(item.Value)
                }
           }

           resultsMutex.Lock()
           foundResults[key] = outputRec
           resultsMutex.Unlock()
           return true // Continue processing rows
        }

        // Read exact rows first
        if len(rowList) > 0 {
           log.Printf("Reading %d specific keys", len(rowList))
           err := tbl.ReadRows(ctx, rowList, processRowFunc, bigtable.RowFilter(colFilter))
           if err != nil {
                log.Printf("ERROR: Failed to read rows for keys: %v", err)
                batchErr = fmt.Errorf("reading keys failed: %w", err) // Store the error
           }
        }

        // Read row ranges if no error occurred yet (or choose if partial results are OK)
        // If an error occurred reading keys, we skip ranges for this batch to report the first error.
        if batchErr == nil && len(rowRangeList) > 0 {
           log.Printf("Reading %d row ranges", len(rowRangeList))
           err := tbl.ReadRows(ctx, rowRangeList, processRowFunc, bigtable.RowFilter(colFilter))
           if err != nil {
                log.Printf("ERROR: Failed to read rows for ranges: %v", err)
                batchErr = fmt.Errorf("reading ranges failed: %w", err) // Store the error
           }
        }

        // --- Process Results and Write Output ---

        // Handle overall batch error first - if either read failed, report error for all inputs.
        if batchErr != nil {
           log.Printf("ERROR: Failing batch due to previous read error: %v", batchErr)
           for inputKeyOrPrefix := range originalKeys {
                errorRec := ErrorRecord{
                        Error:        batchErr.Error(),
                        KeyOrPrefix: inputKeyOrPrefix, // Report the original input key/prefix
                }
                if encErr := encoder.Encode(errorRec); encErr != nil {
                        log.Printf("ERROR: Failed to write error record to output: %v", encErr)
                }
           }
           return // Stop processing this batch
        }

        // Post-process to find misses and write results/misses
        foundInputKeys := make(map[string]bool) // Track which original keys/prefixes had at least one match

        // Write found results and mark corresponding inputs as found
        log.Printf("Processing %d results found for batch", len(foundResults))
        for key, outputRec := range foundResults {
           if encErr := encoder.Encode(outputRec); encErr != nil {
                log.Printf("ERROR: Failed to write result record to output: %v", encErr)
           }

           // Mark the corresponding input key/prefix as found
           // Iterate through original inputs to see which one this result satisfies
           for inputKeyOrPrefix := range originalKeys {
                _, inputWasPrefix := prefixInputs[inputKeyOrPrefix]

                if !inputWasPrefix && key == inputKeyOrPrefix {
                        // Exact match for a key input
                        foundInputKeys[inputKeyOrPrefix] = true
                        // Don't break here; a found key might satisfy multiple inputs
                        // if inputs were redundant (e.g., key "abc" and prefix "ab")
                } else if inputWasPrefix && strings.HasPrefix(key, inputKeyOrPrefix) {
                        // Match for a prefix input
                        foundInputKeys[inputKeyOrPrefix] = true
                }
           }
        }

        // Identify and write "noMatches" records
        missingCount := 0
        for keyOrPrefix := range originalKeys {
           if !foundInputKeys[keyOrPrefix] {
                missingCount++
                noMatchRec := NoMatchRecord{NoMatches: keyOrPrefix}
                if encErr := encoder.Encode(noMatchRec); encErr != nil {
                        log.Printf("ERROR: Failed to write noMatches record for '%s': %v", keyOrPrefix, encErr)
                }
           }
        }
        if missingCount > 0 {
           log.Printf("%d inputs in batch had no matching rows", missingCount)
        }
}

These were the installation instructions provided by Gemini 2.5 Pro that worked on the test VM:

apt-get -y install golang
apt-get -y update
apt-get -y upgrade

#go to https://go.dev/dl/ to download the latest version then follow instructions at https://go.dev/doc/install
rm -rf /usr/local/go
wget https://go.dev/dl/go1.24.2.linux-amd64.tar.gz
rm -rf /usr/local/go && tar -C /usr/local -xzf go1.24.2.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
ln -s /usr/local/go/bin/go /usr/bin/go

#compile/build...
vi bigtable_query.go
go mod init bigtable_query
go get cloud.google.com/go/bigtable
go mod tidy
go build -o bigtable_query bigtable_query.go