Recently I woke up to a fun set of issues with my homelab. In an effort to make more use of LLMs I turned to Claude for troubleshooting assistance, which did help, but also reminded me once again of the risks of following AI instructions without enough knowledge to tell which suggestions are dangerous.

  1. The VPS in my Kubernetes cluster, jlpks8888, had powered itself off in the early morning without any warning or reason. Trying to power it back on through the admin console had no effect so a ticket was opened.
  2. Looking at my cluster status to ensure everything had failed over, I had a single failing pod, diun¹. It had failed over to arthur, but it looked as though the PVC was failing to mount, so it kept crashing. Diun being in a failed state was not an issue right that moment; it acts on a schedule, so any missed container version updates would be caught on the next run. The fact that any pod was in a failed state was what caught my eye. Arthur had some pending updates so I ran them and rebooted; I probably would have done so in the next 24-36 hours anyway.
  3. Blackstaff, my Kubernetes master, had somehow lost track of the tailscale DNS server, so it couldn’t resolve any of my other machines. sudo systemctl restart tailscaled fixed this issue (a quick way to confirm this symptom is sketched just after this list).
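
For the DNS symptom specifically, this is roughly how I confirm it before (and after) restarting anything. A sketch under my setup’s assumptions: the nodes use systemd-resolved, and arthur is just one of my node names.

tailscale status                  # is the node even connected to the tailnet?
resolvectl status tailscale0      # is the MagicDNS resolver (100.100.100.100) still attached?
resolvectl query arthur           # does a tailnet hostname actually resolve?
sudo systemctl restart tailscaled # the fix when it doesn't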

Arthur finished rebooting and I’m confronted with every single Longhorn-related pod stuck in Crash Loop Backoff. Time to wake up fully and actually figure out what is going on.

The Troubleshooting

Any logs or error messages I post are partial since I am writing this after the fact.

  1. Start with pod logs on the longhorn-csi-plugin pod. It is what actually handles the connection between Longhorn and the host system (at least to my understanding) and if it fails everything else crashes.
    longhorn-liveness-probe I0703 13:16:36.853516       1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
    longhorn-csi-plugin time="2025-07-03T13:16:38Z" level=info msg="CSI Driver: driver.longhorn.io version: v1.7.2, manager URL http://longhorn-backend:9500/"
    node-driver-registrar I0703 13:16:26.752163       1 main.go:150] "Version" version="v1.12.0"
    node-driver-registrar I0703 13:16:26.752234       1 main.go:151] "Running node-driver-registrar" mode=""
    node-driver-registrar I0703 13:16:26.752241       1 main.go:172] "Attempting to open a gRPC connection" csiAddress="/csi/csi.sock"
    node-driver-registrar I0703 13:16:36.752479       1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
    stream closed EOF for longhorn/longhorn-csi-plugin-6hng4 (longhorn-csi-plugin)
    node-driver-registrar I0703 13:16:46.753248       1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
    longhorn-liveness-probe I0703 13:16:46.853347       1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
    
  2. Check that the various *scsi* services are running without error. These services allow Longhorn to create the appropriate file-system abstractions on top of the underlying hardware so that it can be used to create PersistentVolumes. The scsi services are the OS-level counterpart to the CSI-related errors shown above (the commands for this step and the previous one are sketched just after this list).
    systemctl status \*scsi\*
    
    Everything is up and running.
  3. Turn to Claude with the issue and the various errors and use it as a sounding board to see what might be failing.
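
For reference, steps 1 and 2 boil down to roughly these commands. This is a sketch from memory: the pod name comes from the log output above, longhorn is the namespace on my cluster, and iscsid is the iSCSI daemon unit on my nodes (yours may be named differently).

kubectl -n longhorn get pods -o wide | grep csi-plugin
kubectl -n longhorn logs longhorn-csi-plugin-6hng4 --all-containers --tail=50
kubectl -n longhorn describe pod longhorn-csi-plugin-6hng4
systemctl status \*scsi\*
journalctl -u iscsid --since "1 hour ago"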

Claude sent me through a gamut of possible troubleshooting steps. Some of it I ignored or skipped because it didn’t seem applicable; other parts I took partial context from.

  • Check that the csi socket exists and has no issues.

  • Delete the various longhorn pods to reset everything.

    I deleted the ones specifically related to arthur but didn’t touch the rest. They were working so resetting them seemed overkill.

  • Check for OS update issues.

    Update logs show nothing of relevance. There were the usual application and kernel updates but they were minor version updates so should not have resulted in this massive a failure.

  • Restart all the services sequentially.

    Done, also rebooted for good measure; no change.

  • Delete the pods again, restart the services again, check the pod logs again.

    If I do it often enough maybe it will start working?

  • Connect into the failing pod and run commands to make sure the CSI socket is writable and works.

    I had to point out to Claude that if a pod is in an Error state (or Crash Loop Backoff) I can’t exec into it to run commands.

  • Check the csidriver and pod spec definitions for volume mounts.

    This took a couple of tries because Claude told me to look for the wrong VolumeMount and then declared it misconfigured. When I found the right VolumeMount, everything was configured correctly. Claude still wanted me to recreate the DaemonSet and/or update the configuration (which had not changed in over three months) because it thought it was misconfigured.

  • Look at longhorn-manager pod and see what error it gets.

    This was the first real hint at the actual issue: the longhorn-conversion-webhook service could not be reached. That service runs on the longhorn-manager pod, so the pod failing to reach its own service pointed to a network issue.

  • Execute various commands from an erroring/crashing pod (again), but also check iptables. Claude also included a few other “temporarily disable XYZ” options.

    iptables sparked a memory so I went with that. Claude tells me there are deny/blocking policies in place that are preventing non-ICMP traffic. I have my doubts because none of the recent updates should have touched that and it was working.

  • Look for Kubernetes network policies and kube-router configurations that could be blocking traffic.

    No policies set and nothing changed on kube-router; Claude suggests creating a new allow-all network policy and figuring out why kube-router is suddenly in a DENY-by-default mode. (The non-destructive checks I actually ran are sketched just after this list.)
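
The non-destructive subset of those suggestions, roughly as I ran them. Another after-the-fact sketch: the longhorn-manager pod name is a placeholder, and kube-router chain names vary by version, so I just grep for them.

kubectl -n longhorn get pods -o wide | grep longhorn-manager
kubectl -n longhorn logs <longhorn-manager-pod-on-arthur> --tail=100 | grep -i webhook
kubectl get networkpolicy -A                 # confirmed: no policies defined anywhere
sudo iptables -S | grep -iE "drop|reject"    # look for unexpected deny rules
sudo iptables -L -n | grep -i kube-router    # which kube-router chains exist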

At this point I finally woke up enough to remember why iptables rang a bell, and I was able to get everything fixed. The solution is below, but first a few thoughts on using Claude to assist with troubleshooting.

AI-Assisted Troubleshooting

Using Claude (or another LLM) to help with troubleshooting does save time if done properly. It can find and consolidate information quickly and point you towards possible issues. Unfortunately, it can also point you towards destructive solutions just as quickly, or suggest the same thing again in different terms.

Given the actual root cause, it likely would never have arrived at the correct answer here because it was missing context. However, I have seen the same behavior in previous cases where it did have all the relevant context at hand. It will suggest deleting or modifying known-good configuration in an attempt to fix the issue, even when told that nothing in that area has changed. In this case k3s had not been updated and the Kubernetes resource definitions and system firewall rules had not been touched, yet it wanted me to rebuild them to see if that changed anything.

It suggested deleting the entire longhorn-csi-plugin DaemonSet multiple times to force the cluster to recreate the pods with the proper modules. Even when I asked it afterwards about that suggestion, it was of the opinion that some other Longhorn component would have noticed and recreated the missing resource. It has also (in other troubleshooting contexts) hallucinated its way into resource specs, API versions and shell commands that do not exist (or serve different purposes). Simply following the actions it recommends without consideration would have led to a much more significant problem.

I also asked Claude to review all of its suggestions and flag the destructive ones as well as the repeated ones. Its self-review showed:

  • ~10 potentially destructive actions, 5 to 8 of which it classified as significant damage/security risks that would likely have resulted in a full reinstall.
  • ~15-20 non-destructive diagnostic actions.

In addition, it acknowledged that several suggestions were repeated even after I’d shown results, whether the direct output of the requested command or a statement that I had done something serving the same purpose, such as rebooting the machine rather than restarting one or two services. It also escalated rapidly towards more destructive actions relative to purely diagnostic ones.

Even when changing from one focus to another it did not go back to a purely diagnostic mode; it immediately included more destructive actions (delete service endpoints, modify kube-router, …).

The Solution

With the reminder (or hint) to look at cluster networking I remembered having something similar happen previously. Inter-node machine-level DNS had failed, and even though the Kubernetes control plane could interact with every node, the actual pods had issues. With that in mind I went back to my blackstaff SSH session, ran updates and rebooted the machine. I probably could have gotten away with just restarting the k3s service, but I decided to run the updates now rather than having to do so in a day or so. The machine came back up, and within a minute or two the entire cluster was back to normal, the longest delay being the few pods that needed to wait out their back-off periods.
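
For completeness, "back to normal" just meant watching the pods settle with something along these lines (longhorn being the namespace on my cluster):

kubectl get nodes
kubectl get pods -A | grep -vE "Running|Completed"   # anything still unhealthy?
kubectl -n longhorn get pods -w                      # watch the Longhorn pods recover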

The Underlying Cause

I actually already knew this was an issue, but troubleshooting first thing in the morning was not conducive to remembering it. When tailscaled restarts it changes the network stack state, and restarting k3s afterwards makes it re-learn all the routes based on that new state. Since I forgot to restart the second service, blackstaff was no longer properly routing to and from arthur, so the pods ran into issues.
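
If you want to see what k3s has to re-learn, the state in question is visible from the host. A sketch assuming the k3s default pod CIDR of 10.42.0.0/16; adjust if your cluster uses something else.

ip addr show tailscale0   # the interface tailscaled tears down and recreates on restart
ip route | grep 10.42     # the pod-network routes that k3s rebuilds after its own restart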

My agent nodes have the ‘correct’ command in their shell history so if the tailscale DNS issue occurs I fix everything at once:

sudo systemctl restart tailscaled k3s-agent

Going forward I’m going to have to make sure to run the equivalent on blackstaff.

sudo systemctl restart tailscaled k3s

  1. I’m running diun as a way to watch for version updates of my self-hosted apps. ↩︎