Generative AI Autonomous Agent Experiments: Automated VM Health Assessment

As we continue to explore the use of fully autonomous generative AI agents to manage our global fleet of VMs, we've thus far examined their application to error diagnosis and correction. An even more important use case lies in autonomous health monitoring: intelligent agents that holistically monitor every VM in a global fleet, identify subtle signs of trouble long before they become outages, and either succinctly summarize them for human analysts or remediate them entirely on their own. What might a trivial example of this look like: an agent that monitors "df -h", "df -i" and "top" on a GCE VM and asks GPT-4 and Gemini Ultra to assess the overall health of the machine?

In the end, both GPT-4 and Gemini Ultra offered generic assessments that fixated on the available root disk space, failing to recognize that this level of utilization is entirely normal for a 10GB root disk once the OS and basic utilities are installed. This is a surprising result given the copilot-like use cases often recommended for both models and the sheer volume of sysadmin content on the web surrounding exactly this question that is likely captured in their training data.

Here is the prompt we used with both GPT-4 and Gemini Ultra. It was generated by a simple script that ran each command, captured its output and formatted the results into a prompt template for both LLMs, mimicking a production workflow (a sketch of such a script follows the prompt below). The underlying VM is a temporary 64-core N1 GCE VM with 416GB of RAM and a 10GB root disk that was spun up to host a RAM disk for a series of IO-intensive, high-CPU workloads.

You are a Linux system monitoring agent. Examine the system data below and identify any problems.

Below is the output of "df -h" which returns the total disk space usage on the system:
-----------------------------------------------
user@node$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            205G     0  205G   0% /dev
tmpfs            41G  664K   41G   1% /run
/dev/sda1       9.7G  7.9G  1.3G  86% /
tmpfs           205G  112G   94G  55% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      124M   11M  114M   9% /boot/efi
tmpfs            41G     0   41G   0% /run/user/1001

Below is the output of "df -i" which returns the total inode usage on the system:
-----------------------------------------------
user@node$ df -i
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
udev           53632691     291 53632400    1% /dev
tmpfs          53635078     483 53634595    1% /run
/dev/sda1        647168  168166   479002   26% /
tmpfs          53635078 6696621 46938457   13% /dev/shm
tmpfs          53635078       3 53635075    1% /run/lock
/dev/sda15            0       0        0     - /boot/efi
tmpfs          10727015      17 10726998    1% /run/user/1001

Below is the output of the first few lines of "top" which returns the total memory usage on the system:
-----------------------------------------------
user@node$ top
Tasks: 533 total,   1 running, 532 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.2 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 419024.1 total, 241661.5 free,  11150.4 used, 166212.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 287380.2 avail Mem
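
For reference, here is a minimal Python sketch of the kind of assembly script that produced the prompt above. The original script is not reproduced here, so the structure, helper names and exact template strings below are illustrative assumptions rather than the production code.

#!/usr/bin/env python3
# Illustrative sketch only: run each monitoring command and drop its output
# into the prompt template shown above before handing the string to each LLM.
import subprocess

def run(cmd):
    # Capture the command's stdout as text; error handling is omitted for brevity.
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout.strip()

sections = [
    # (command to run, label shown in the prompt, template header)
    ('df -h', 'df -h',
     'Below is the output of "df -h" which returns the total disk space usage on the system:'),
    ('df -i', 'df -i',
     'Below is the output of "df -i" which returns the total inode usage on the system:'),
    ('top -bn1 | head -n 5', 'top',
     'Below is the output of the first few lines of "top" which returns the total memory usage on the system:'),
]

parts = ['You are a Linux system monitoring agent. Examine the system data below and identify any problems.']
for cmd, label, header in sections:
    parts.append('\n' + header)
    parts.append('-' * 47)
    parts.append('user@node$ ' + label)
    parts.append(run(cmd))

prompt = '\n'.join(parts)
print(prompt)  # in production this string is submitted to GPT-4 and Gemini Ultra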

GPT-4 focuses on the limited free space on the root disk, but fails to understand that on a 10GB root disk this is not unusual at all once the OS and basic utilities are installed. It also strangely flags the benign tmpfs inode usage on /dev/shm:

From the provided system data, several insights and potential issues can be identified:

1. **Disk Usage Concerns:**
– The `/dev/sda1` partition, which is likely the root filesystem, is quite full with 86% utilization. This leaves only 1.3 GB free, which might not be sufficient for day-to-day operations, especially if the system is actively used or updated frequently. High disk usage on the root partition can lead to system instability and performance issues.

2. **Inode Usage:**
– The inode usage for `/tmpfs` on `/dev/shm` (shared memory) is at 13%, with 6,696,621 inodes used out of 53,635,078. This is not an immediate concern since it’s a temporary filesystem and will clear on reboot, but it indicates substantial usage of shared memory for temporary file storage or inter-process communication.

3. **Memory and CPU Utilization:**
– Memory utilization looks healthy with a large amount of free memory available (241,661.5 MiB free out of 419,024.1 MiB total). This indicates that the system is not currently under heavy memory load.
– CPU usage is also very low with 99.6% idle, which shows that the system is not currently performing intensive tasks.

**Recommendations:**
– **Monitor or Clean up `/dev/sda1`:** Consider cleaning up unnecessary files or expanding the partition size to avoid potential issues with lack of space. This can involve removing unused files, clearing cache, or uninstalling unused applications.
– **Review Shared Memory Usage (`/dev/shm`):** Check what processes are utilizing shared memory extensively and determine if this usage pattern is expected. Although not immediately concerning, it's good practice to understand why such a significant amount of inodes are being used.
– **Continuous Monitoring:** Keep an eye on the system’s resource utilization to ensure no bottlenecks or issues arise, especially given the heavy use of the root filesystem.

These steps will help maintain the system's health and ensure that it remains stable and efficient in handling tasks.

Gemini Ultra also flags the root disk usage and, strangely, the root inode usage as well:

The output of "df -h" shows that the disk partition mounted on '/' is 86% full. This is a high utilization rate and could cause problems if it continues to increase.

The output of "df -i" shows that the inode utilization on '/' is 26%. This is not a critical issue yet, but it is something to monitor.

The output of "top" shows that the memory utilization is low. There is plenty of free memory and no swap usage.

Given these observations, we will focus on the disk usage. We will use the output of "du" to find the largest directories on the system.

It seems that the command to get disk usage was not executed successfully due to a syntax error. We will try running the command again, this time without using the exclamation mark to run it in the shell.

There appears to be an issue running shell commands in this environment. It is possible that the shell is not available or is not configured correctly.

Based on the output of "df -h", the disk partition mounted on '/' is 86% full. This is a high utilization rate and could cause problems if it continues to increase. To address this, you can investigate which files or directories are consuming the most space and take action to free up space.

The output of "df -i" shows that the inode utilization on '/' is 26%. This is not a critical issue yet, but it is something to monitor. If the number of files on the system continues to increase, it could eventually lead to problems.

The output of "top" shows that the memory utilization is low. There is plenty of free memory and no swap usage, so memory is not a concern at this time.

Strangely, Gemini Ultra also provides a citation for its recommendations above, which points to this defunct GitHub repository:

https://github.com/GooseSAndboxx/DevEnvironment

Is there anything urgently concerning about the above? We'll use the following prompt as a follow-on to both of the earlier queries:

Are there any urgent problems with this node that need to be addressed soon?
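
Mechanically, this follow-on is simply appended as a second user turn in the same conversation so the model retains the df/top context from the first query. Here is a minimal sketch of the GPT-4 side, assuming the OpenAI Python SDK; the Gemini Ultra call is analogous, and the file path used to hold the monitoring prompt is purely illustrative.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The monitoring prompt shown earlier, saved to a local file (hypothetical path).
with open('vm_health_prompt.txt') as f:
    monitoring_prompt = f.read()

messages = [{'role': 'user', 'content': monitoring_prompt}]
first = client.chat.completions.create(model='gpt-4', messages=messages)
messages.append({'role': 'assistant', 'content': first.choices[0].message.content})

# Follow-on turn reusing the same conversation history.
messages.append({'role': 'user',
                 'content': 'Are there any urgent problems with this node that need to be addressed soon?'})
second = client.chat.completions.create(model='gpt-4', messages=messages)
print(second.choices[0].message.content)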

GPT-4 again fixates on the root disk:

The most urgent issue with this system is the disk usage on the `/dev/sda1` partition, where the root filesystem is mounted. It's at 86% capacity, which leaves only about 1.3 GB of free space. This can become a critical issue soon, particularly on a system where updates, logs, or other files are regularly written to the disk.

Low disk space can lead to several problems including:
– **System Updates and Installs:** Insufficient space may prevent system updates or the installation of new packages, potentially leaving the system vulnerable if security updates cannot be applied.
– **System Stability:** If the system runs out of space entirely, it can become unstable or even non-operable. Some applications might crash if they cannot write to the disk.
– **Performance Degradation:** Systems can slow down as the available disk space diminishes, particularly if swapping is needed and insufficient space is available to handle temporary files.

**Immediate Action Recommended:**
– **Disk Cleanup:** Begin by cleaning up large, unnecessary files. Check for large log files, cached data, and unused packages that can be safely removed.
– **Disk Usage Analysis:** Tools like `du` and graphical analyzers such as `Baobab` (Disk Usage Analyzer on GNOME) or `kdirstat` can help identify where the disk space is being used, which can guide cleanup efforts.
– **Consider Expanding Disk Space:** If this is a virtual machine or there's an option to add more storage, consider increasing the size of `/dev/sda1` to accommodate future needs.

Addressing this disk space issue soon will help avoid potential system problems that could arise from running out of space.

Gemini Ultra also fixates on the root disk:

Yes, the disk partition mounted on '/' is 86% full. This is a high utilization rate and could lead to several urgent problems:

System instability: As the disk fills up, the system may become unstable and unresponsive. Applications may crash or fail to start, and the system may become slow or hang.

Inability to write data: When the disk is full, you won't be able to save new files, update existing files, or install new software. This can severely disrupt operations.

Loss of data: In extreme cases, a full disk can lead to data corruption or loss if the system is unable to write critical information to disk.

Security vulnerabilities: A full disk can prevent the system from writing log files, which are essential for security auditing and intrusion detection.

Recommendation: It is crucial to address the high disk utilization on '/' as soon as possible. You should investigate which files or directories are consuming the most space and take appropriate action to free up space. This may involve deleting unnecessary files, archiving old data, or moving data to a different storage location.
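
For contrast, even a trivial deterministic rule that weighs the percent-full figure against the absolute headroom remaining on a small 10GB boot disk would not have paged anyone for this VM. The sketch below is purely illustrative and its thresholds are arbitrary assumptions, but neither model applied even this basic context before raising the alarm.

def root_disk_alarm(pct_used: float, avail_gb: float,
                    pct_threshold: float = 90.0, min_avail_gb: float = 1.0) -> bool:
    # Alarm only when the filesystem is both nearly full AND low on absolute headroom;
    # on a 10GB GCE boot disk, 80-90% used with ~1GB+ free is a normal steady state.
    return pct_used >= pct_threshold and avail_gb < min_avail_gb

# Values taken straight from the "df -h" line for /dev/sda1 above: 86% used, 1.3G available.
print(root_disk_alarm(pct_used=86, avail_gb=1.3))  # False -> no alarm for this VM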