Generative AI Autonomous Agent Experiments: Automated Log Diagnosis & VM Healing – Fix "No Space Left" Error

What might it look like to deploy Generative AI as a fully autonomous intelligent agent that monitors cloud VMs and self-heals them? Such an agent would run on each VM, monitoring its log files in realtime and automatically identifying faults, diagnosing their root causes, and writing and executing scripts to fix them. When a recently deployed brand-new VM repeatedly failed after a few days with a strange error condition, we explored this concept by assigning GPT 4.0 as our intelligent agent to identify the problem and write a script to fix it. While GPT 4.0 showed some promise, there were major differences between its shell and Python scripting capabilities, and it took multiple runs before it flagged the correct failure condition. Even then, it never identified the root cause.

In the end, when asked to produce Unix shell scripts, GPT 4.0 never produced a script that checked for inode exhaustion, the actual cause in this case. Only when asked to produce a textual checklist for human consumption or a Python script did it include inodes as a possible cause. This demonstrates just how much current code generation capabilities have been optimized for Python as opposed to more traditional system administration tools like shell scripts, and offers a reminder that despite the vast landscape of shell scripts in their training data, current SOTA LLMs are typically best at generating code in the limited number of languages for which they have been heavily tuned. Worse, even with Python, it took five tries before it output a script that checked for inode exhaustion. Worse still, in all outputs that included inode checks, it never provided any code to identify where on the filesystem the offending collection of inode-exhausting small files was located, instead treating inode exhaustion as a problem confined to /tmp/. Because the code and its accompanying explanation described inodes in terms of directory-centric quotas rather than a filesystem-wide limit, a novice Linux user unfamiliar with inodes might dismiss the inode warnings altogether, since /tmp/ was empty on this system.

Recently we deployed a brand-new Debian VM on GCE to test a new codebase. When testing previous iterations of this particular application, we used "gsutil" to stage our GCS benchmark buckets. Given that gsutil has now been replaced by "gcloud storage", we updated our test code to use gcloud. Strangely, after just 72 hours, the VM would become unresponsive, and each time we rebooted it, it would simply freeze again after 5-10 minutes.

Log files showed the error was occurring when we invoked GNU parallel, but we saw similar errors with other applications attempting to create temp files:

Error in tempfile() using template /tmp/parXXXXX.arg: Could not create temp file /tmp/parkvVpU.arg: No space left on device at /usr/bin/parallel line 3889.

Strangely, disk space checks showed 100GB free on /dev/sda1, leaving a puzzle: how could a disk with 100GB free be out of space? None of the system logs showed anything amiss, nor did they contain anything that might shed light on why the disk was considered full.

Thus, we decided to see what GPT 4.0 made of the error and asked it to write a shell script to fix it:

Write a Linux shell script for Debian to fix the following error:

"Error in tempfile() using template /tmp/parXXXXX.arg: Could not create temp file /tmp/parkvVpU.arg: No space left on device at /usr/bin/parallel line 3889."

This yielded the following script, which merely attempted to delete files from /tmp/. Worse, it attempted to log to a disk that was already full, causing the machine to instantly become unresponsive:

#!/bin/bash

LOG_FILE="/var/log/cleanup_tmp_dir.log"

log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> $LOG_FILE
}

log_message "Starting cleanup of /tmp directory."

find /tmp -type f -atime +1 -exec rm {} \; 2>> $LOG_FILE

log_message "Cleanup of /tmp directory completed."

df -h >> $LOG_FILE
log_message "Disk usage logged."

echo "Cleanup completed. Disk usage:"
df -h

log_message "Script completed."

Let's try again:

That didn't work. Try a different approach. Output a simple shell script without comments.

This yields similar results:

Here's a simplified and comment-free version of the shell script:

```bash
#!/bin/bash

LOG_FILE="/var/log/cleanup_tmp_dir.log"

echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting cleanup of /tmp directory." >> $LOG_FILE
find /tmp -type f -atime +1 -exec rm {} \; 2>> $LOG_FILE
echo "$(date '+%Y-%m-%d %H:%M:%S') - Cleanup of /tmp directory completed." >> $LOG_FILE
df -h >> $LOG_FILE
echo "$(date '+%Y-%m-%d %H:%M:%S') - Disk usage logged." >> $LOG_FILE
echo "Cleanup completed. Disk usage:"
df -h
echo "$(date '+%Y-%m-%d %H:%M:%S') - Script completed." >> $LOG_FILE
```

This script performs all the necessary tasks without any commentary or extra function definitions.

Let's try to be more exhaustive:

Write a Linux shell script for Debian Linux to fix the error below. Write a single script that tests for every possible condition that could cause this error and have the script fix all of those conditions:

"Error in tempfile() using template /tmp/parXXXXX.arg: Could not create temp file /tmp/parkvVpU.arg: No space left on device at /usr/bin/parallel line 3889."

This yields nearly identical results:

The error message "No space left on device" indicates that the `/tmp` directory is out of space, which prevents the creation of temporary files. To address this error and ensure the system is functioning properly, the script needs to:

1. Check if there is enough disk space in `/tmp`.
2. Clean up old or unnecessary files in `/tmp`.
3. Check for and fix filesystem issues if the disk shows it has space, but the OS reports it as full.
4. Optionally, increase the size of `/tmp` if it is a mounted tmpfs.

Here's a shell script that addresses these potential issues:

```bash
#!/bin/bash

# Function to check available space and clean up /tmp directory
cleanup_tmp() {
    # Check available space on /tmp
    available_space=$(df /tmp | awk 'NR==2 {print $4}')  # in KB

    # If space is less than 1MB, attempt to clean up
    if [ "$available_space" -lt 1024 ]; then
        echo "Cleaning up old files in /tmp..."
        # Delete files older than 7 days
        find /tmp -type f -atime +7 -delete
    fi
}

# Function to check and repair filesystem issues
fix_filesystem() {
    echo "Checking for filesystem issues..."
    # Use touch to verify if /tmp can be written to
    touch /tmp/testfile.tmp &> /dev/null
    if [ $? -ne 0 ]; then
        echo "Failed to write to /tmp, trying to fix filesystem..."
        # Assuming /tmp is part of root filesystem
        umount /tmp
        fsck -p /  # Automatically repair filesystem issues
        mount /tmp
    else
        rm /tmp/testfile.tmp
    fi
}

# Function to increase tmpfs size
increase_tmpfs() {
    echo "Increasing size of /tmp..."
    mount -o remount,size=2G /tmp
}

# Main script logic
echo "Starting /tmp cleanup and fix script..."

# Check and clean /tmp
cleanup_tmp

# Check filesystem health and fix issues
fix_filesystem

# Optionally increase tmpfs size if needed
increase_tmpfs

echo "Script completed. /tmp should now be writable."
```

### How the Script Works:

1. **Cleanup Function:** It first checks if the available space in `/tmp` is less than 1MB. If so, it cleans up files older than 7 days.

2. **Fix Filesystem Function:** It attempts to create a file in `/tmp` to check write capability. If it fails, it assumes there might be a filesystem issue, unmounts `/tmp`, runs `fsck` to check and repair the root filesystem (assuming `/tmp` is not a separate partition), and then remounts `/tmp`.

3. **Increase Tmpfs Size:** If more space is still required, the script increases the size of the tmpfs mount at `/tmp` to 2GB.

### Important Notes:

- The script assumes `/tmp` is part of the root filesystem and not a separate partition. Adjust the `fsck` and `mount` commands if `/tmp` has a separate partition.
- Running `fsck` automatically could be risky, especially on mounted filesystems. Ensure that this script is run during a maintenance window or under supervision.
- Adjust the tmpfs size increase as per system requirements and available RAM.

This script should be run as root to ensure it has the necessary permissions for all operations.

Running ten times in total, we never received a shell script that came close to diagnosing the underlying issue.

Given that the major LLM vendors have emphasized Python support, what if we ask for a Python script?

Write a Python script for Debian Linux to fix the error below. Write a single script that tests for every possible condition that could cause this error and have the script fix all of those conditions:

"Error in tempfile() using template /tmp/parXXXXX.arg: Could not create temp file /tmp/parkvVpU.arg: No space left on device at /usr/bin/parallel line 3889."

This yields the following, which doesn't do much better:

Here's a Python script designed to address various potential causes of the error you're seeing with creating a temp file in Debian Linux. The script attempts to:

1. Check if the `/tmp` directory is writable.
2. Check disk space to ensure there's enough space available.
3. Attempt to clean up old, unused files in `/tmp` if space is low.

This script requires `python3` and assumes that you have `sudo` privileges if necessary.

```python
import os
import shutil
import subprocess

def check_tmp_writable():
    """Check if the /tmp directory is writable."""
    if not os.access('/tmp', os.W_OK):
        print("The /tmp directory is not writable. Checking permissions.")
        try:
            os.chmod('/tmp', 0o1777)  # Sets sticky bit along with read, write, execute for all
            print("/tmp permissions fixed.")
        except PermissionError:
            print("Failed to set permissions on /tmp. Run the script with sudo.")
            exit(1)
    else:
        print("/tmp is writable.")

def check_disk_space():
    """Check disk space in the /tmp directory."""
    total, used, free = shutil.disk_usage("/tmp")
    print(f"Total: {total / (1024**3):.2f} GB, Used: {used / (1024**3):.2f} GB, Free: {free / (1024**3):.2f} GB")
    if free < 1:  # Less than 1GB free
        print("Low disk space on /tmp. Attempting to clean up.")
        try:
            subprocess.run(['sudo', 'rm', '-rf', '/tmp/*'], check=True)
            print("Temporary files cleaned up.")
        except subprocess.CalledProcessError:
            print("Failed to clean /tmp. Run the script with sudo.")
            exit(1)

def main():
    print("Checking /tmp directory status...")
    check_tmp_writable()
    check_disk_space()
    print("System check complete. /tmp should be ready for use.")

if __name__ == "__main__":
    main()
```

### How to Use This Script
- Save the script as `fix_tmp_issue.py`.
- Run it using Python 3. You may need to use `sudo` to ensure the script has the necessary permissions to change directory permissions and clean up `/tmp`:
  ```bash
  sudo python3 fix_tmp_issue.py
  ```

### Notes
- This script will delete all files in `/tmp`. This action is usually safe, as `/tmp` is meant for temporary files that should not be required after a reboot. However, ensure no critical operations are using `/tmp` when you run this script.
- Depending on your system's configuration and the nature of the error, additional steps might be necessary (such as investigating specific permissions or handling mount issues). This script addresses the most common scenarios.

And a second time yields similar results:

The error message you provided indicates a few possible issues that can occur on a Debian Linux system when trying to create a temporary file:

1. The `/tmp` directory is full (no space left on device).
2. Permissions issues with the `/tmp` directory.
3. Potential issues with the underlying file system.

To address these problems, I'll write a Python script that:
- Checks the available space in the `/tmp` directory.
- Verifies and corrects permissions for the `/tmp` directory.
- Checks the integrity of the file system (this part will need sudo access and should be handled with care).

Here's a Python script that performs these checks and attempts to resolve the issues:

```python
import os
import subprocess
import shutil

def check_disk_space(path):
    """ Check available disk space """
    total, used, free = shutil.disk_usage(path)
    print(f"Total: {total}, Used: {used}, Free: {free}")
    return free

def check_and_fix_permissions(path):
    """ Check and fix permissions of the temp directory """
    try:
        stat_info = os.stat(path)
        current_permissions = oct(stat_info.st_mode)[-3:]
        print(f"Current permissions on {path}: {current_permissions}")
        if current_permissions != '777':  # typically /tmp should be 1777, but Python handles it as '777'
            print("Fixing permissions...")
            os.chmod(path, 0o1777)
            print(f"Permissions set to 1777 on {path}")
    except Exception as e:
        print(f"Error checking/modifying permissions: {e}")

def clean_temp_directory(path):
    """ Clean up temporary files if necessary """
    if check_disk_space(path) < 100 * 1024 * 1024:  # less than 100MB free
        print("Not enough space, cleaning up temp...")
        for root, dirs, files in os.walk(path):
            for file in files:
                try:
                    os.remove(os.path.join(root, file))
                except Exception as e:
                    print(f"Failed to remove {file}: {e}")
        print("Temporary files cleaned.")

def filesystem_check(path):
    """ Perform filesystem check and repair if necessary """
    print("Checking filesystem (this may require admin privileges)...")
    try:
        subprocess.run(['sudo', 'fsck', '-p', path], check=True)
        print("Filesystem check completed.")
    except subprocess.CalledProcessError as e:
        print(f"Filesystem check failed: {e}")

def main():
    tmp_path = '/tmp'
    print("Starting diagnostics for /tmp directory...")
    
    # Check disk space
    free_space = check_disk_space(tmp_path)
    if free_space < 100 * 1024 * 1024:  # less than 100MB is concerning
        clean_temp_directory(tmp_path)
    
    # Check permissions
    check_and_fix_permissions(tmp_path)
    
    # Optional: Uncomment the next line to enable filesystem checking.
    # Warning: this operation might require the script to be run as sudo and can be risky.
    # filesystem_check(tmp_path)

if __name__ == "__main__":
    main()
```

### Important Notes:
- **Permissions**: `/tmp` should typically have permissions set to `1777` (sticky bit + read/write/execute for everyone). This script checks if it's `777` and adjusts accordingly since Python's `os.chmod` uses a different notation for sticky bits.
- **Filesystem Check**: The `fsck` command can be dangerous to run on mounted filesystems as it can cause data corruption if files are open and being written to during the check. It's commented out by default.
- **Running as sudo**: Some operations like `fsck` or cleaning protected files might require administrative privileges. Ensure that the script is run by a user with the appropriate permissions or through sudo for those operations.

Make sure to use this script with caution, especially when dealing with filesystem integrity checks and operations requiring elevated privileges.

And a third time:

The error message you've provided indicates a problem when a script tries to create a temporary file in the `/tmp` directory, but fails because there is no space left on the device. This is a common issue that could be caused by several underlying problems such as:

1. The `/tmp` directory is full.
2. The disk/partition containing `/tmp` is full.
3. User or process limits (like `ulimit`) have been reached.

A Python script to handle these issues could include the following steps:
- Checking the available space in `/tmp`.
- Cleaning up older files in `/tmp` if possible.
- Checking and adjusting `ulimit` settings.
- Providing an alternative directory for temporary files if the standard cleanup is not enough.

Here is a script that covers these aspects:

```python
import os
import shutil
import subprocess

def check_disk_usage(path):
    """ Check the available disk space. """
    total, used, free = shutil.disk_usage(path)
    print(f"Total: {total}, Used: {used}, Free: {free}")
    return free

def cleanup_old_tmp_files(tmp_dir='/tmp', days=7):
    """ Remove files older than a certain number of days from the specified tmp directory. """
    for root, dirs, files in os.walk(tmp_dir):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if os.stat(file_path).st_mtime < time.time() - days * 86400:
                    os.remove(file_path)
                    print(f"Removed old temporary file: {file_path}")
            except Exception as e:
                print(f"Error removing {file_path}: {e}")

def check_and_modify_ulimits():
    """ Check and modify ulimits if they are too low. """
    try:
        # Get current limit on number of open files
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        print(f"Current limits: Soft = {soft}, Hard = {hard}")

        # If soft limit is less than 1024, try to increase it
        if soft < 1024:
            new_soft = 1024
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
            print(f"Updated soft limit of open files to {new_soft}")
    except Exception as e:
        print(f"Failed to set ulimits: {e}")

def set_alternative_tmp_directory():
    """ Set an alternative temporary directory if needed. """
    alt_tmp_dir = "/var/tmp"
    if check_disk_usage(alt_tmp_dir) > 10**9:  # at least 1GB free
        os.environ["TMPDIR"] = alt_tmp_dir
        print(f"Set alternative tmp directory: {alt_tmp_dir}")
    else:
        print("No suitable alternative tmp directory available with enough free space.")

def main():
    if check_disk_usage('/tmp') < 10**7:  # less than 10MB free
        print("Insufficient disk space in /tmp.")
        cleanup_old_tmp_files()
        if check_disk_usage('/tmp') < 10**7:
            set_alternative_tmp_directory()
    check_and_modify_ulimits()

if __name__ == "__main__":
    main()
```

### How the script works:
- **Disk Usage Check**: First, it checks if there's enough free space in `/tmp`. If not enough (less than 10MB free), it tries to clean up.
- **Clean Up**: It cleans files older than a specified number of days (default is 7 days).
- **Ulimit Check**: It checks and potentially increases the limit of open files if it's set too low.
- **Alternative Directory**: If there is still not enough space in `/tmp` after cleanup, it attempts to switch to an alternative directory with enough free space.

### Notes:
- This script should be run with appropriate permissions; for example, root might be necessary to remove certain files or change ulimits.
- It is important to run this kind of maintenance script with caution, as removing files or changing system settings can have unintended consequences depending on the system setup and current usage.

But the fourth time we get something fascinating – this time it checks the number of free inodes. In fact, it turns out that our issue was that the available inodes had been exhausted on the disk. Unfortunately, while identifying inodes as an issue, the script fixates on /tmp/ instead of attempting to track down the source of the large number of files:

The error message you've provided indicates a failure in creating a temporary file due to a lack of space on the device. The Python script I'll provide will check and address multiple potential issues that could lead to this error:

1. **Disk Space**: Check if the `/tmp` directory or the device it resides on is full.
2. **Permissions**: Verify that the user has the necessary permissions to write to the `/tmp` directory.
3. **Inodes**: Check if the device has run out of inodes, which are needed to create new files and directories.
4. **Cleanup**: Optionally clear some temporary files if it's safe to do so.

Here is a Python script that performs these checks and actions:

```python
import os
import shutil
import sys

def check_disk_space(path):
    """Check the available disk space."""
    total, used, free = shutil.disk_usage(path)
    print(f"Total: {total}, Used: {used}, Free: {free}")
    return free

def check_inode_usage(path):
    """Check inode usage for the filesystem."""
    statvfs = os.statvfs(path)
    total_inodes = statvfs.f_files
    free_inodes = statvfs.f_ffree
    print(f"Total inodes: {total_inodes}, Free inodes: {free_inodes}")
    return free_inodes

def check_permissions(path):
    """Check if the user has write permissions to the directory."""
    return os.access(path, os.W_OK)

def cleanup_tmp_directory(path):
    """Clean up old temp files in the /tmp directory safely."""
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            try:
                if os.path.isfile(file_path):
                    os.unlink(file_path)
                    print(f"Deleted {file_path}")
            except Exception as e:
                print(f"Failed to delete {file_path}: {e}")

def main():
    tmp_path = '/tmp'
    
    # Check if there is enough disk space
    if check_disk_space(tmp_path) < 10 * 1024 * 1024:  # less than 10MB free
        print("Not enough disk space in /tmp. Attempting to clean up.")
        cleanup_tmp_directory(tmp_path)

    # Check inode availability
    if check_inode_usage(tmp_path) == 0:
        print("No free inodes on the device. Attempting to clean up.")
        cleanup_tmp_directory(tmp_path)

    # Check permissions
    if not check_permissions(tmp_path):
        print("Permission denied to write in /tmp.")
        sys.exit(1)

    print("All checks passed. /tmp is ready for use.")

if __name__ == '__main__':
    main()
```

### Explanation:
- **Disk Space**: The script checks if the available space is less than 10 MB, which you might adjust based on your needs, and tries to clean up if necessary.
- **Inodes**: If no inodes are free, it attempts to remove old files.
- **Permissions**: If the user does not have write permissions, the script will exit with a failure message.
- **Cleanup**: This function will attempt to remove files safely, although in a real-world scenario, you would want more sophisticated management of what gets deleted to avoid removing important files.

Before running this script, ensure you have the appropriate permissions, especially for deleting files. You might want to add more sophisticated logging and error handling depending on your environment's requirements.

Running it several more times alternated between an identical /tmp-fixated inode check and no inode check at all.
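For contrast, here is a minimal sketch of our own (not something GPT 4.0 produced) of what a filesystem-wide check might look like. It uses Python's standard `os.statvfs` call to compare free space against free inodes for the whole filesystem, so that the "plenty of space but no inodes" case is reported as its own condition. The function names and both thresholds are illustrative assumptions, not values from our experiment.

```python
import os

def filesystem_headroom(path="/"):
    """Return (free_bytes, free_inodes, total_inodes) for the filesystem holding path."""
    st = os.statvfs(path)
    # f_bavail and f_favail count blocks and inodes available to unprivileged users.
    free_bytes = st.f_bavail * st.f_frsize
    return free_bytes, st.f_favail, st.f_files

def diagnose(free_bytes, free_inodes, total_inodes,
             min_bytes=10 * 1024**2, min_inode_fraction=0.01):
    """Classify a 'No space left on device' condition (thresholds are assumptions)."""
    low_space = free_bytes < min_bytes
    # Some filesystems report zero total inodes; treat that as "no inode limit".
    low_inodes = total_inodes > 0 and free_inodes < total_inodes * min_inode_fraction
    if low_inodes and not low_space:
        return "inodes exhausted, blocks free"
    if low_space:
        return "blocks exhausted"
    return "ok"
```

Calling `diagnose(*filesystem_headroom("/tmp"))` reports on the filesystem that holds /tmp rather than on the /tmp directory itself, which is the distinction the generated scripts blurred.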

How about a different approach? Instead of asking for a fully automated solution, what if we simply ask for a checklist?

What could possibly be causing this issue? Make an exhaustive list of things I should check and what values I should look for. Give me the specific commands I should run to check.

This time it also suggested checking for inode exhaustion:

The error message you're seeing indicates that a temporary file could not be created because there's no space left on the device. This typically happens when the partition where `/tmp` is located is full. Here are some steps and specific commands to investigate and potentially resolve this issue:

### 1. Check Disk Space Usage
First, check the overall disk space usage to confirm if the `/tmp` partition is indeed full.

- **Command:**
  ```bash
  df -h
  ```
  Look for the entry for `/tmp` if it's mounted as a separate partition, or the root partition (`/`) if `/tmp` is not separate.

### 2. Analyze Space Usage in `/tmp`
If the disk space check shows that the partition is full, look at what's consuming space in `/tmp`.

- **Command:**
  ```bash
  sudo du -sh /tmp/*
  ```
  This command shows the size of each file and directory in `/tmp`.

### 3. Clean Up `/tmp`
If there are many large files or a high volume of small files in `/tmp`, you might need to clean it up.

- **Command:**
  ```bash
  sudo rm -rf /tmp/*
  ```
  **Warning:** This will delete all files in `/tmp`. Ensure that no critical processes are using these files.

### 4. Check for Open File Handles
Sometimes, files in `/tmp` are deleted, but space is not freed because processes still have them open. Check for such files.

- **Command:**
  ```bash
  sudo lsof /tmp
  ```
  Look for files marked as `(deleted)` but still open.

### 5. Check Inode Usage
The filesystem might have free space, but it can run out of inodes, preventing new files from being created.

- **Command:**
  ```bash
  df -i
  ```
  Check the `IUse%` for high percentages, which indicate inode exhaustion.

### 6. Increase `/tmp` Size
If `/tmp` is a separate partition and consistently runs out of space, consider increasing its size or relocating it to a partition with more space.

- **Commands to increase space will vary** based on your partitioning and filesystem. Common tools include `resize2fs` for ext4, `lvextend` for LVM-managed volumes, etc.

### 7. Move `/tmp` to Another Partition
If increasing size is not feasible, you might relocate `/tmp` to another partition with more space.

- **Configuration:**
  Modify `/etc/fstab` to mount `/tmp` on another partition. This requires careful planning and system downtime.

### 8. Configure System Tools
Some systems use `systemd-tmpfiles` or similar tools to manage `/tmp`. Ensure these are correctly configured to clean up old files.

- **Command to clean up using `systemd-tmpfiles`:**
  ```bash
  systemd-tmpfiles --clean
  ```

### 9. Reboot
If unsure why `/tmp` is full and you've cleaned up files but still face issues, a reboot might help clear up `/tmp` if it's not a separate partition and release resources held by zombie processes.

- **Command:**
  ```bash
  sudo reboot
  ```

### Monitoring and Prevention
Consider setting up monitoring on the space usage of `/tmp` to alert before it becomes full. Tools like `cron` jobs running `df` and alerting if space is below a threshold, or more sophisticated monitoring systems, can be helpful.

By following these steps, you should be able to diagnose and resolve the space issues in `/tmp` that are leading to the error message you're experiencing.

Ultimately, it required a human searching across the web to discover that, unlike gsutil, gcloud defaults to creating a massive archive of log files. When used to transfer large numbers of files, it can create tens of millions of tiny log files, quickly exhausting the available inodes, which is exactly what was happening here. Yet, despite testing a wide range of prompts, GPT 4.0 never suggested a script that would both test for inode exhaustion and attempt to locate the directories containing unusually high numbers of small files. In the end, fixing the machine simply required clearing out the temp files and setting a gcloud parameter, neither of which was suggested by GPT 4.0:

gcloud config set core/disable_file_logging True
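As an illustration of the missing piece, the following is a rough sketch of our own, under stated assumptions, of how a script might locate the directories holding unusually large numbers of files. It walks a directory tree without crossing onto other filesystems and counts the entries directly inside each directory, on the assumption that each entry costs roughly one inode (hard links make this approximate). The function names are hypothetical, and this is not output from GPT 4.0.

```python
import heapq
import os

def count_entries_per_directory(root):
    """Map each directory under root to the number of entries directly inside it."""
    root_dev = os.lstat(root).st_dev
    counts = {}
    # os.walk skips unreadable directories silently and does not follow symlinks.
    for dirpath, dirnames, filenames in os.walk(root):
        counts[dirpath] = len(dirnames) + len(filenames)
        # Prune subdirectories on other devices (e.g. /proc or a tmpfs mount)
        # so the counts describe a single filesystem.
        same_device = []
        for name in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, name)).st_dev == root_dev:
                    same_device.append(name)
            except OSError:
                pass  # entry vanished or is unreadable; skip it
        dirnames[:] = same_device
    return counts

def top_directories(counts, n=10):
    """Return the n (directory, entry_count) pairs with the most entries."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])
```

Run as root, `top_directories(count_entries_per_directory("/"))` would list the ten most crowded directories on the root filesystem. On a disk holding tens of millions of files the walk can take a long time, so narrowing `root` to a suspect subtree first may be more practical.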