Generative AI Experiments: LLM Knowledge Freshness & GSUTIL vs GCLOUD – The Dangers Of LLMs For Fast-Moving Fields

As LLMs are increasingly positioned as "coding copilots" for developers, we've been exploring how useful they are for keeping up with the fast-moving field of generative AI and the steadily evolving landscape of cloud operations. For example, in just under one year, GCP has introduced multiple versions of multiple public LLMs (Bison, Unicorn and Gemini Pro), with Gemini debuting an entirely new API that is disjoint from that of the previous generation models, while the underlying parameters, default settings and recommended configurations for all of its models have changed numerous times. Rather than constantly checking multiple documentation websites and wading through pages of technical documentation for each model, hunting for changes relevant to each deployed use case, developers could in theory use LLMs to vastly reduce this workload by always having access to the latest official best practices for each model. Similarly, rather than keeping track of the myriad weekly developments across the entire GCP platform, LLMs augmented with RAG, RLHF and proprietary updating processes would seem a natural fit for always producing workflows that follow the latest guidance. In other words, a developer could simply ask for a code sample to use Gemini Pro and rest assured that the returned code reflects the latest best practices, including all newly added or changed parameter settings, or ask how to copy a file to GCS and know that the returned CLI snippet represents the latest recommendations from GCP. The reality is quite different.

Here we ask Bing Copilot, GPT-4 and even Google's own Bard the simple question "How do I copy a file to GCS from my local harddrive under Linux?" and follow up by asking whether gsutil or gcloud offers faster performance for copying files to GCS. All three models incorrectly recommend gsutil over gcloud, falsely and confidently asserting that gsutil offers superior performance. This is despite Google itself claiming that gcloud achieves up to 94% faster transfers and placing a prominent warning at the top of all GCS pages stating that "gsutil is not the recommended CLI for Cloud Storage" and that users should use gcloud instead.

In short, a year and a half after Google officially recommended that most GCS CLI use cases switch to gcloud and prominently touted its greatly improved performance over gsutil, all three models incorrectly recommended gsutil as the superior and more performant choice. Two of the advanced dedicated post-GPT-4 coding copilots we tested also incorrectly recommended gsutil over gcloud and produced gsutil examples for most CLI use cases. Even having GPT-4 perform a web search and summarize the results yielded the wrong answer, demonstrating the limitations of RAG for freshness in topics where an outdated answer dominates.

The reason for this is simple: after more than a decade of existence, the vast majority of all CLI code on the Internet for GCS uses gsutil, while only newer code from the past year and a half uses gcloud. Statistically, gsutil is deeply ingrained in all LLMs. Therein lies one of the most dangerous and underappreciated limitations of current LLMs: the very use case where they would be most helpful (keeping up with the myriad updates to platforms) is the very area where they are most limited, because they rely on fixed knowledge cutoffs and the statistical frequency of content on the Internet rather than reading the documentation and summarizing what it says.

 

For a trivial example, let's look at the simple question: how do I copy a file to GCS?

How do I copy a file to GCS from my local harddrive under Linux?

Since the public debut of GCS, the answer to this question has always been "gsutil", a Python-based CLI produced by Google as the official CLI for interacting with its GCS service. However, gcloud has now replaced gsutil as the utility that GCP itself officially recommends for most CLI GCS use cases.

What does a simple Google web search turn up for this question? If we type that question into Google, the first result is actually a Filestore page that uses gsutil to copy files over. This reflects the complexity of updating vast legacies of documentation and the nuances of specific edge use cases. The second search result, "cp – Copy files and objects", is the one we want. In fact, if we instead search "copy file to gcs" we get the latter as the first result.

At the very top of the page is the highlighted blue warning:

Important: gsutil is not the recommended CLI for Cloud Storage. Use gcloud storage commands in the Google Cloud CLI instead.

According to the October 7, 2022 announcement, gcloud offers "up to 94% faster data transfers".

Thus, for a simple file copy, we would expect major LLMs to recommend the newer gcloud over gsutil (some edge cases may not be fully supported by gcloud, but simple file copying is a key use case for gcloud over gsutil).
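For reference, the answer we would hope to see can be sketched in a few lines. The snippet below is only an illustration, not an official example: build_copy_command is a hypothetical helper, and myfile.txt and gs://my-bucket/ are placeholder names. It places the currently recommended gcloud storage cp invocation next to the legacy gsutil cp one for a single-file upload.

```python
import shlex


def build_copy_command(src, dst, tool="gcloud"):
    """Return the argv list for copying src to dst with the chosen CLI.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    if tool == "gcloud":
        # Form recommended by the warning banner quoted above.
        return ["gcloud", "storage", "cp", src, dst]
    if tool == "gsutil":
        # Legacy form that the models keep suggesting.
        return ["gsutil", "cp", src, dst]
    raise ValueError(f"unknown tool: {tool}")


# myfile.txt and gs://my-bucket/ are placeholder names.
recommended = build_copy_command("myfile.txt", "gs://my-bucket/")
legacy = build_copy_command("myfile.txt", "gs://my-bucket/", tool="gsutil")

print(shlex.join(recommended))  # gcloud storage cp myfile.txt gs://my-bucket/
print(shlex.join(legacy))       # gsutil cp myfile.txt gs://my-bucket/
```

Either list can be handed to subprocess.run on a machine where the Google Cloud CLI is installed and authenticated. For this simple case the two invocations differ only in the command prefix.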

Interestingly, if one asks Bing the same question, the top answer is the gsutil documentation page. Bing helpfully excerpts the text below in a pull quote, but fails to include the highlighted blue warning that appears just above it, which states that gcloud is now the replacement for gsutil for most use cases:

The gsutil cp command allows you to copy data between your local file system and the cloud, within the cloud, and between cloud storage providers. For example, to upload all text files from the local directory to a bucket, you can run: gsutil cp *.txt gs://my-bucket You can also download data from a bucket.

Let's first see what Bing Copilot produces:

To copy a file to Google Cloud Storage (GCS) from your local hard drive under Linux, you need to install and configure the gsutil tool, which is a Python package that allows you to interact with GCS using the command line. You can install it using pip install gsutil or follow the instructions here.

Once you have gsutil installed and configured, you can use the gsutil cp command to copy files and objects between your local file system and the cloud, within the cloud, or between cloud storage providers. For example, to upload a file named myfile.txt from your current directory to a bucket named my-bucket, you can run:

gsutil cp myfile.txt gs://my-bucket

You can also use wildcards, recursive options, and parallel options to copy multiple files or directories. For more details and examples, see the documentation of the gsutil cp command.

I hope this helps you with your task. If you have any other questions, feel free to ask me. 😊

What about Bard, given that it is Google's own LLM and thus should be updated to reflect the company's latest guidance for its own products? It too falsely claims that gsutil offers higher performance than gcloud:

The choice between gsutil and gcloud for copying files to GCS depends on your specific needs and preferences. Here's a breakdown:

gsutil:

    • Pros:

      • Specialized for GCS: Offers a wider range of features and options specifically for managing GCS buckets and objects.
      • Mature and stable: Has been around longer and is generally considered more reliable.
      • Possibly faster: May offer slightly faster transfer speeds in some cases.
    • Cons:

      • Limited scope: Only works with GCS, not other GCP services.
      • Complex syntax: Can have a steeper learning curve due to its extensive set of commands and flags.
      • Less discoverable: Documentation and resources may be harder to find than for gcloud.

gcloud:

    • Pros:

      • Multi-service tool: Can manage multiple GCP services besides GCS, like Compute Engine, Kubernetes Engine, and Cloud SQL.
      • Simpler syntax: Generally easier to learn and use due to its unified command structure for different services.
      • Better integration: Integrates seamlessly with other GCP tools and services.
    • Cons:

      • Fewer GCS-specific features: Lacks some advanced features found in gsutil specifically for GCS management.
      • May be slower: In some cases, gsutil may offer faster transfer speeds for GCS tasks.
      • Potentially less stable: As a newer tool, it may encounter more bugs or unexpected behavior compared to gsutil.

Verdict:

In most cases, gcloud is the recommended choice for beginners and general use thanks to its simplicity and multi-service capabilities. It seamlessly integrates with other GCP tools and offers a more consistent experience.

However, if you need advanced features specifically for GCS management, or prioritize maximum performance and stability, gsutil might be a better option. It provides more granular control over GCS operations and may offer slight speed advantages, especially for complex tasks.

Ultimately, the best choice depends on your specific needs and preferences. If you're unsure, starting with gcloud for its ease of use and then switching to gsutil if you need more advanced features later is a perfectly valid approach.

I hope this clarifies the strengths and weaknesses of each tool and helps you make an informed decision!

What if we ask Bard explicitly which is better?

Which tool offers the fastest performance for copying files to GCS: gsutil or gcloud?

Unfortunately, it once again falsely recommends gsutil as offering higher performance, showing that companies are not yet focusing on ensuring that their own LLMs are properly updated to reflect their own products:

While both gsutil and gcloud are powerful tools for interacting with GCS, in terms of raw speed for file transfers, gsutil generally holds the edge. Here's why:

Factors favoring gsutil:

    • Dedicated functionality: gsutil is specifically designed for GCS, which allows it to be more optimized for transfer efficiency. It bypasses some of the general-purpose logic in gcloud that might add overhead.
    • Command-line efficiency: gsutil offers a wider range of specialized flags and options for optimizing transfer performance, like parallel uploads, multi-threaded downloads, and chunked transfers.
    • Direct path manipulation: gsutil can directly work with paths on the local filesystem and GCS buckets, potentially avoiding unnecessary data serialization or copying stages compared to gcloud's more abstract service-oriented approach.

However, the performance difference may not be significant in all cases. Several factors can influence transfer speed, including:

    • Network bandwidth: Your internet connection speed is the most important bottleneck in most cases. Uploading large files will likely be limited by your ISP's upload limit, regardless of the tool used.
    • GCS performance: GCS itself has various storage classes and regions with varying throughput capabilities. Choose the appropriate storage class and region for your needs to maximize transfer speeds.
    • File size and type: Small files generally transfer quickly regardless of the tool, while large files can benefit more from gsutil's optimization techniques. Additionally, transferring many small files might see less gain from gsutil due to overhead from individual file operations.

Overall:

    • For pure upload speed, gsutil is generally the better choice due to its specialized focus and optimization options.
    • For general use and ease of use, gcloud remains a good option, especially if you're managing multiple GCP services beyond GCS.
    • Benchmarking your specific file transfer scenario would give you the most accurate comparison of gsutil vs. gcloud performance.

Remember, optimizing other factors like network connectivity and GCS storage class could significantly impact your transfer speed regardless of the chosen tool.

I hope this helps you make an informed decision based on your specific needs and priorities!

Finally, let's test a RAG scenario, asking GPT-4 to perform a Bing search and summarize the results:

Conduct a web search and tell me whether gsutil or gcloud is better for copying a file to GCS from my local Linux computer.

Once again it incorrectly recommends gsutil over gcloud:

The comparison between `gsutil` and `gcloud` for copying files to Google Cloud Storage (GCS) from a local Linux computer primarily centers on their functionality and use cases.

`gsutil` is a command-line tool specifically designed for managing files and objects in Google Cloud Storage. It offers a wide range of commands for operations like uploading, downloading, moving, or copying objects, and setting object permissions. One of its notable features is the ability to perform parallel uploads and downloads using the `-m` flag, which can significantly improve performance for large files or numerous files. It's well-suited for batch operations and scripting, making it a powerful tool for automating tasks related to GCS.

On the other hand, `gcloud` is a more general command-line tool for interacting with various Google Cloud Platform services, including Cloud Storage. While it can handle some basic operations with Cloud Storage, its capabilities in this area are not as extensive as those of `gsutil`.

In terms of ease of use and functionality, `gsutil` tends to be the preferred choice for tasks specifically focused on Google Cloud Storage, such as copying files. Its specialized command set and support for advanced features like parallel transfers make it highly efficient for managing GCS objects.

For more detailed instructions and examples of using `gsutil` for copying files to GCS, you can refer to the guides provided by [bluexp.netapp.com](https://bluexp.netapp.com) and [cloudbooklet.com](https://www.cloudbooklet.com/developer/gsutil-cp-copy-and-move-files-on-google-cloud). These sources offer comprehensive information on using `gsutil` for various operations, including file uploads and downloads to and from GCS, as well as managing permissions and working with different storage buckets.
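None of the responses above is backed by an actual measurement. As a rough sketch of how one might check the performance claims on one's own files and network (this is not Google's benchmark methodology), the snippet below times the same upload with both CLIs and reports the ratio. The helper names time_command, median_seconds and compare are hypothetical, as are any file and bucket names passed to them, and the sketch assumes both gsutil and gcloud are installed and authenticated.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd once and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def median_seconds(cmd, repeats=3):
    """Median of several runs, to dampen network noise."""
    timings = sorted(time_command(cmd) for _ in range(repeats))
    return timings[len(timings) // 2]


def compare(src, dst, repeats=3):
    """Time the same copy with both CLIs; src and dst are placeholders."""
    gsutil = median_seconds(["gsutil", "cp", src, dst], repeats)
    gcloud = median_seconds(["gcloud", "storage", "cp", src, dst], repeats)
    # speedup > 1 means gcloud finished faster than gsutil.
    return {"gsutil_s": gsutil, "gcloud_s": gcloud, "speedup": gsutil / gcloud}
```

A call such as compare("bigfile.bin", "gs://my-bucket/") would upload the same placeholder file several times with each tool. Results will vary with file size, file count and network conditions, so a single run should not be read as confirming or refuting the published figure.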