Generative AI Autonomous Agent Experiments: Automated Log Diagnosis & VM Healing – "Martian Source" Error

As we continue to explore the use of fully autonomous generative AI agents to manage our global fleet of VMs, around the same time as our previous example of a failing VM we encountered another VM that had been running steadily for weeks but suddenly began becoming unresponsive within minutes of boot every single time it was rebooted. The only clue was a flood of error messages in the system log each time about a networking issue involving a "martian source". Could GPT-4 or Gemini Ultra help us track down the root cause and resolve it?

Both GPT-4 and Gemini Ultra fixated on the network-related nature of the error, and each produced a range of potential causes with fixes in the form of shell and Python scripts. Unfortunately, none of the recommended solutions resolved the issue. The actual culprit turned out to be our switch weeks earlier from "gsutil" to "gcloud storage" in a key diagnostic script: gcloud's verbose file logging had steadily accumulated until it exhausted the available inodes on the 100GB root disk, so routine filesystem checks continued to show plenty of free disk space even as the system was unable to create new files. The ultimate fix was simply to delete all of gcloud's accumulated log files and configure it to no longer generate its verbose logs:

gcloud config set core/disable_file_logging True
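In hindsight, a simple comparison of block usage against inode usage would have surfaced the real problem immediately. The sketch below illustrates the idea; note that the gcloud log directory path is the default location on Linux and is an assumption about any particular installation:

```bash
#!/bin/bash
# Inode exhaustion shows up as 100% IUse% here even while `df -h`
# reports plenty of free space on the same filesystem.
df -h /
df -i /

# ASSUMPTION: ~/.config/gcloud/logs is gcloud's default log location on Linux;
# adjust the path if your installation logs elsewhere.
GCLOUD_LOG_DIR="$HOME/.config/gcloud/logs"
echo "gcloud log files currently consuming inodes:"
find "$GCLOUD_LOG_DIR" -type f | wc -l

# Reclaim the inodes and stop gcloud from writing a log file per invocation.
rm -rf "${GCLOUD_LOG_DIR:?}"/*
gcloud config set core/disable_file_logging True
```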

Unfortunately, this once again demonstrates the limitations of LLMs as intelligent system administrator agents: failures on production systems rarely have cut-and-dried solutions, and the error that surfaces often traces back to something else entirely on the system, requiring creativity and thinking outside the box.

Each time the VM became unresponsive, we would see a flood of the following error messages in the system log (actual IPs replaced in the output below):

Apr 21 01:29:33 node kernel: [460681.175652] IPv4: martian source YYY.YYY.YYY.YYY from XXX.XXX.XXX.XXX, on dev ens4
Apr 21 01:29:33 node kernel: [460681.183170] ll header: 00000000: XYZ
Apr 21 01:29:33 node kernel: [460681.183178] IPv4: martian source YYY.YYY.YYY.YYY from XXX.XXX.XXX.XXX, on dev ens4
Apr 21 01:29:33 node kernel: [460681.190625] ll header: 00000000: XYZ
Apr 21 01:29:33 node kernel: [460681.190630] IPv4: martian source YYY.YYY.YYY.YYY from XXX.XXX.XXX.XXX, on dev ens4
Apr 21 01:29:33 node kernel: [460681.190631] ll header: 00000000: XYZ
Apr 21 01:29:33 node kernel: [460681.190633] IPv4: martian source YYY.YYY.YYY.YYY from XXX.XXX.XXX.XXX, on dev ens4
Apr 21 01:29:33 node kernel: [460681.190634] ll header: 00000000: XYZ
Apr 21 01:29:33 node kernel: [460681.190635] IPv4: martian source YYY.YYY.YYY.YYY from XXX.XXX.XXX.XXX, on dev ens4
Apr 21 01:29:33 node kernel: [460681.190636] ll header: 00000000: XYZ
Apr 21 01:29:33 node kernel: [460681.190637] IPv4: martian source YYY.YYY.YYY.YYY from XXX.XXX.XXX.XXX, on dev ens4
Apr 21 01:29:33 node kernel: [460681.190638] ll header: 00000000: XYZ
Apr 21 01:29:33 node kernel: [460681.190639] IPv4: martian source YYY.YYY.YYY.YYY from XXX.XXX.XXX.XXX, on dev ens4
Apr 21 01:29:33 node kernel: [460681.190643] ll header: 00000000: XYZ
Apr 21 01:29:33 node kernel: [460681.197925] ll header: 00000000: XYZ
Apr 21 01:29:38 node kernel: [460686.143782] net_ratelimit: 3312 callbacks suppressed
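For context, the kernel emits these "martian source" warnings only when martian logging is enabled on the interface, and the final net_ratelimit line indicates how many additional messages were suppressed. A quick, generic check of the relevant kernel settings (not something either model suggested) might look like the following:

```bash
# Check whether martian-source logging and reverse-path filtering are
# enabled (1 = enabled); these sysctls govern the warnings seen above.
sysctl net.ipv4.conf.all.log_martians
sysctl net.ipv4.conf.ens4.log_martians
sysctl net.ipv4.conf.all.rp_filter

# Count how many martian warnings are currently in the kernel ring buffer.
dmesg | grep -c "martian source"
```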

What does GPT-4 make of the following prompt coupled with the log excerpt above?

My Debian Linux VM is becoming unresponsive a few minutes after reboot. Occurs every time I reboot it. The following entries appear in /var/log/messages. Come up with a list of potential root causes and write a shell or Python script to fix each.

Unsurprisingly, all of GPT-4's solutions involve networking-related fixes, none of which resolved the issue:

The log entries you've provided indicate repeated "martian source" warnings in the kernel log. These are generated when a Linux system receives packets with source addresses that are unexpected or bogus according to the routing table, often suggesting a misconfiguration or a security issue. Below are potential root causes and corresponding remediation scripts:

### Potential Root Causes

1. **Misconfigured Network Settings**: The most common cause is incorrect network settings, such as a wrong subnet mask or gateway configured on your interfaces.

2. **Security Breach or Network Scan**: Unexpected packets could be a sign of a security breach or a network scan from an external source.

3. **Routing Table Issues**: The routing table may be incorrectly set up, directing traffic in an unexpected manner which causes these logs.

4. **IP Spoofing or Misbehaving Device**: Either an external device is spoofing its IP address, or a misconfigured device on the network is sending malformed packets.

### Remediation Scripts

Each script targets one of the potential root causes above. Before running these scripts, you should have administrative access and back up current settings.

#### 1. Check and Correct Network Configuration

This shell script checks your current IP configuration and shows you how to manually correct it:

```bash
#!/bin/bash
echo "Current IP configuration:"
ip addr show
echo "Check the above output for any misconfiguration or unexpected IP addresses."
echo "Manually correct the network settings if needed, based on your network's requirements."
```

#### 2. Scan for Malicious Traffic

This Python script uses `tcpdump` to monitor incoming traffic that might be causing the martian source logs. Adjust the interface and possibly the conditions based on actual network traffic.

```python
#!/usr/bin/env python3
import subprocess

def monitor_suspicious_traffic(interface="ens4"):
    print(f"Monitoring incoming traffic on {interface}…")
    try:
        subprocess.run(["tcpdump", "-n", "-i", interface, "not src net (your valid subnet) and not dst net (your valid subnet)"])
    except KeyboardInterrupt:
        print("Stopped monitoring.")

if __name__ == "__main__":
    monitor_suspicious_traffic()
```

Replace `(your valid subnet)` with your actual subnet.

#### 3. Check Routing Table

This shell script prints out the routing table and suggests checking for any unexpected routes.

```bash
#!/bin/bash
echo "Current routing table:"
ip route show
echo "Review the routing table for any incorrect or unexpected entries."
```

#### 4. Block Suspicious IPs

If you identify IPs that are consistently causing issues, you can block them using `iptables`. This script adds a firewall rule to block an IP:

```bash
#!/bin/bash
suspicious_ip="XXX.XXX.XXX.XXX" # Replace with the actual suspicious IP

echo "Blocking suspicious IP: $suspicious_ip"
sudo iptables -A INPUT -s $suspicious_ip -j DROP
echo "$suspicious_ip is now blocked from accessing this machine."
```

**Note:** Always review and test scripts in a controlled environment before deploying on a production system to avoid unintended network outages or configuration issues. Each script serves as a starting point and may need customization based on your specific environment and network setup.

Gemini Ultra similarly fixates on networking-focused solutions that fail to resolve the issue:

Absolutely! Let's break down this Debian VM issue and provide potential fixes.

Potential Root Causes

The log messages indicate a "martian source" and "net_ratelimit" error. This strongly suggests a network issue. Here are the likely culprits:

    1. Incorrect Network Configuration: The VM might be misconfigured with an IP address or subnet mask that isn't valid on your network. The "martian source" error indicates the kernel is receiving packets from unexpected IP addresses, potentially due to a configuration mismatch.

    2. Network Interface Issues: The physical network interface (ens4) could be faulty or have driver problems. If the interface isn't working correctly, it could lead to the "martian source" error and the flood of messages that triggers the net_ratelimit mechanism.

    3. Network Loops or Broadcast Storms: There might be a network loop or a broadcast storm on the network segment where your VM is connected. These situations can cause a flood of packets, leading to the net_ratelimit error and potentially causing the VM to become unresponsive.

    4. Misconfigured Firewall Rules: Overly restrictive firewall rules on the VM or the network could be blocking legitimate traffic and causing network instability.

Scripts to Address Potential Root Causes

Here are shell and Python scripts to address each potential cause:

1. Check and Fix Network Configuration (Shell)

```bash
#!/bin/bash

# Check current network configuration
echo "Current network configuration for ens4:"
ip addr show ens4

# Ask user for correct configuration
read -p "Enter correct IP address: " ip_address
read -p "Enter correct subnet mask: " subnet_mask
read -p "Enter correct gateway: " gateway

# Update network configuration
echo "Updating network configuration..."
ip addr flush dev ens4
ip addr add $ip_address/$subnet_mask dev ens4
ip route add default via $gateway

# Restart network interface
systemctl restart networking

echo "Network configuration updated and restarted."
```

2. Check Network Interface (Shell)

```bash
#!/bin/bash

# Check interface status
echo "Checking ens4 interface status:"
ip link show ens4
ethtool ens4

# If the interface is down, try to bring it up
if [ "$(ip link show ens4 | grep 'state DOWN')" ]; then
    echo "Bringing up ens4 interface..."
    ip link set ens4 up
fi
```

3. Detect Network Loops (Python)

```python
#!/usr/bin/env python3
import subprocess

def detect_network_loops():
    result = subprocess.run(["bridge", "fdb", "show"], capture_output=True, text=True)
    if result.returncode == 0:
        entries = result.stdout.splitlines()
        for entry in entries:
            if entry.count(":") > 1:  # Check for multiple colons in MAC addresses
                print("Potential network loop detected:", entry)

if __name__ == "__main__":
    detect_network_loops()
```

Note: Resolving network loops or broadcast storms usually involves fixing physical network cabling or switch configuration, which is outside the scope of a script.

4. Check Firewall Rules (Shell)

```bash
#!/bin/bash

# Check firewall status
echo "Checking firewall status:"
systemctl status ufw  # Or iptables, depending on your firewall

# List current firewall rules
echo "Current firewall rules:"
ufw status numbered  # Or iptables -L, depending on your firewall

# Instructions for adjusting rules (example for ufw)
echo "To allow traffic, use 'ufw allow [port/protocol]'."
echo "To deny traffic, use 'ufw deny [port/protocol]'."
```

Important Notes:

    • Replace Placeholders: Replace the XXX.XXX.XXX.XXX, YYY.YYY.YYY.YYY, and XYZ placeholders with the actual values from your log messages.
    • Script Safety: Always review and test scripts in a safe environment before running them on your production VM.
    • Further Investigation: If these scripts don't resolve the issue, you might need to investigate more complex network issues like routing problems or deeper hardware problems.