Kubernetes Homelab Rescue: Troubleshooting with AI (and the Lessons Learned)
Recently I woke up to a fun set of issues with my homelab. In an effort to make more use of LLMs, I turned to Claude for troubleshooting assistance. It did help, but it also reminded me once again of the risk of following AI instructions without the knowledge to tell which suggestions are dangerous.

The VPS in my Kubernetes cluster, jlpks8888, had powered itself off in the early morning without any warning or apparent reason. Trying to power it back on through the admin console had no effect, so I opened a ticket.

Looking at my cluster status to make sure everything had failed over, I found a single failing pod, diun1. It had failed over to arthur, but its PVC appeared to be failing to mount, so it kept crashing. Diun being in a failed state was not an urgent problem: it runs on a schedule, so any missed container version updates would be caught on the next run. What caught my eye was that any pod was in a failed state at all. Arthur had some pending updates, so I ran them and rebooted; I probably would have done so in the next 24-36 hours anyway.

Blackstaff, my Kubernetes master, had somehow lost track of the Tailscale DNS server, so it couldn’t resolve any of my other machines. Running sudo systemctl restart tailscaled fixed this.

Arthur finished rebooting, and I was confronted with every single Longhorn-related pod stuck in CrashLoopBackOff. Time to wake up fully and actually figure out what was going on. ...
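
As a rough sketch of the "is anything in a failed state?" check described above, not necessarily the exact commands used here, the small filter below lists every pod whose status is something other than Running or Completed. The helper name failing_pods is made up for illustration, and it assumes the default column layout of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/sh
# failing_pods: hypothetical helper, not from the original post.
# Reads the output of `kubectl get pods -A --no-headers` on stdin and
# prints "namespace/name: STATUS" for each pod that is not healthy.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
failing_pods() {
    awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" {
        print $1 "/" $2 ": " $4
    }'
}

# Usage against a live cluster (needs kubectl and cluster access):
#   kubectl get pods -A --no-headers | failing_pods
```

Filtering on the STATUS column rather than on the pod phase is deliberate: a pod stuck in CrashLoopBackOff can still report a Running phase, so a phase-based selector may miss exactly the pods that matter here.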