Recently I woke up to a fun set of issues with my homelab. In an effort to make more use of LLMs I turned to Claude for troubleshooting assistance. It did help, but it also reminded me once again of the risk of following AI instructions without enough knowledge to tell which suggestions are dangerous.
- The VPS in my Kubernetes cluster, jlpks8888, had powered itself off in the early morning without any warning or reason. Trying to power it back on through the admin console had no effect so a ticket was opened.
- Looking at my cluster status to ensure everything had failed over, I had a single failing pod, diun¹. It had failed over to arthur, but it looked as though the PVC was failing to mount, so it kept crashing. Diun being in a failed state was not an issue right that moment; it acts on a schedule, so any missed container version updates would be caught on the next run. The fact that any pod was in a failed state was what caught my eye. The machine had some pending updates so I ran them and rebooted; I probably would have done so in the next 24-36 hours anyway. The kind of status checks I ran are sketched after this list.
- Blackstaff, my Kubernetes master, had somehow lost track of the tailscale DNS server, so it couldn’t resolve any of my other machines. `sudo systemctl restart tailscaled` fixed this issue.
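For anyone following along, those morning checks were nothing fancier than the usual status commands. A minimal sketch (the node name is from my cluster, and the namespace/pod placeholders are just that, placeholders):

```bash
# Which nodes survived the night and which did not
kubectl get nodes -o wide

# Anything that is not Running/Completed stands out quickly
kubectl get pods -A | grep -vE 'Running|Completed'

# The Events section of a crashing pod shows PVC attach/mount failures
kubectl -n <namespace> describe pod <pod-name>
```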
Arthur finished rebooting and I’m confronted with every single Longhorn-related pod stuck in `CrashLoopBackOff`. Time to wake up fully and actually figure out what is going on.
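To see the damage concretely, listing the Longhorn pods is enough. A quick sketch, assuming Longhorn lives in its default longhorn-system namespace:

```bash
# Every Longhorn pod, with the node it is scheduled on
kubectl -n longhorn-system get pods -o wide

# Narrow it down to the freshly rebooted node
kubectl -n longhorn-system get pods --field-selector spec.nodeName=arthur
```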
The Troubleshooting
Any logs or error messages I post are partial since I am writing this after the fact.
- Start with pod logs on the `longhorn-csi-plugin` pod. It is what actually handles the connection between Longhorn and the host system (at least to my understanding), and if it fails everything else crashes. (The commands I used to pull these logs are sketched after this list.)

  ```
  longhorn-liveness-probe I0703 13:16:36.853516 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
  longhorn-csi-plugin time="2025-07-03T13:16:38Z" level=info msg="CSI Driver: driver.longhorn.io version: v1.7.2, manager URL http://longhorn-backend:9500/
  node-driver-registrar I0703 13:16:26.752163 1 main.go:150] "Version" version="v1.12.0"
  node-driver-registrar I0703 13:16:26.752234 1 main.go:151] "Running node-driver-registrar" mode=""
  node-driver-registrar I0703 13:16:26.752241 1 main.go:172] "Attempting to open a gRPC connection" csiAddress="/csi/csi.sock"
  node-driver-registrar I0703 13:16:36.752479 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
  stream closed EOF for longhorn/longhorn-csi-plugin-6hng4 (longhorn-csi-plugin)
  node-driver-registrar I0703 13:16:46.753248 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
  longhorn-liveness-probe I0703 13:16:46.853347 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
  ```
- Check that the various `*scsi*` services are running without error. These services allow Longhorn to create the appropriate file-system abstractions on top of the underlying hardware so that it can be used to create PersistentVolumes. The scsi services are the OS-level interface underneath the CSI components throwing the errors shown above. Everything is up and running.

  ```bash
  systemctl status *scsi*
  ```
- Turn to Claude with the issue and the various errors and use it as a sounding board to see what might be failing.
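Before getting into that: the log pull and service check from the first two steps above were just the standard commands. A sketch only; the pod name is the one from my cluster and the namespace is the Longhorn default:

```bash
# All containers in the CSI plugin pod, each line prefixed with its container name
kubectl -n longhorn-system logs longhorn-csi-plugin-6hng4 --all-containers --prefix

# Or one container at a time
kubectl -n longhorn-system logs longhorn-csi-plugin-6hng4 -c node-driver-registrar

# The host-side iscsi/scsi units Longhorn depends on
systemctl status *scsi*
```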
Claude sent me through a gamut of possible troubleshooting steps. Some of them I ignored or skipped because they didn’t seem applicable; from others I only took part of the suggestion.
- Check that the csi socket exists and has no issues.
- Delete the various longhorn pods to reset everything. I deleted the ones specifically related to arthur but didn’t touch the rest. They were working, so resetting them seemed overkill.
- Check for OS update issues. Update logs show nothing of relevance. There were the usual application and kernel updates, but they were minor version updates and should not have resulted in a failure this massive.
- Restart all the services sequentially. Done; I also rebooted for good measure and saw no change.
- Delete the pods again, restart the services again, check the pod logs again. If I do it often enough maybe it will start working?
- Connect into the failing pod and run commands to make sure the CSI socket is writable and works. I had to point out to Claude that if a pod is in an `Error` state (or `CrashLoopBackOff`) I can’t use it.
- Check the csidriver and pod spec definitions for volume mounts. This took a couple of tries because Claude told me to look for the wrong VolumeMount and then told me it was misconfigured. When I found the right VolumeMount, everything was configured correctly. Claude wanted me to recreate the DaemonSet and/or update the configuration (which had not changed in over 3 months) because it thought it was misconfigured.
- Look at the `longhorn-manager` pod and see what error it gets. This was the first real hint at the actual issue: the `longhorn-conversion-webhook` service could not be reached. That service runs on the `longhorn-manager` pod itself, so the pod failing to reach itself pointed to a network issue.
- Execute various commands from an erroring/crashing pod (again), but also check `iptables`. Claude also included a few other “temporarily disable XYZ” options. `iptables` sparked a memory, so I went with that. Claude tells me there are deny/blocking policies in place that are preventing non-ICMP traffic. I have my doubts because none of the recent updates should have touched that, and it had been working.
- Look for Kubernetes network policies and `kube-router` configurations that are blocking traffic. No policies are set and nothing has changed on `kube-router`; Claude suggests creating a new allow-all network policy to see why `kube-router` is suddenly in a deny-by-default mode. The network checks I actually ran are sketched after this list.
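For completeness, the network-side checks I actually ran (as opposed to the disable-everything suggestions) were along these lines. A sketch under my setup’s assumptions: Longhorn in longhorn-system, k3s with kube-router handling network policy:

```bash
# Does the conversion webhook service exist, and does it have endpoints?
kubectl -n longhorn-system get svc longhorn-conversion-webhook -o wide
kubectl -n longhorn-system get endpoints longhorn-conversion-webhook

# Any network policies that could be blocking pod-to-pod traffic?
kubectl get networkpolicies -A

# Look for unexpected DROP/REJECT rules on the node itself
sudo iptables -L -n -v | less
```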
At this point I finally wake up enough to remember why `iptables` rang a bell, and I was able to get everything fixed. The solution is below, but first a few thoughts on using Claude to assist with troubleshooting.
AI-Assisted Troubleshooting
Using Claude (or another LLM) to help with troubleshooting does save time if done properly. It can find and consolidate information quickly and point you towards possible issues. Unfortunately, it can also point you towards destructive solutions just as quickly, or suggest the same thing again in different terms.
Given the actual root cause, it likely would never have arrived at the correct answer here because it was missing context. However, I have seen the same behaviour in previous cases when it did have all the relevant context at hand. It will suggest deleting or modifying known-good configuration in an attempt to fix the issue, even when told there were no changes to that aspect. In this case `k3s` had not been updated, and the Kubernetes resource definitions and system firewall rules had not been touched, but it wanted me to rebuild them to see if that changed anything.
It suggested deleting the entire `longhorn-csi-plugin` DaemonSet multiple times to force the cluster to recreate the pods with the proper modules. Even when asked afterwards about that suggestion, it was of the opinion that some other Longhorn component would have noticed and recreated the missing resource. It has also (in other troubleshooting contexts) hallucinated its way into resource specs, API versions and shell commands that do not exist (or have different purposes). Simply following the actions it recommends without consideration would have led to a much more significant problem.
I also asked Claude to review all of its suggestions for destructive ones as well as repeated ones. The self-review showed:
- ~10 potentially destructive actions, 5 to 8 of which it classified as significant damage/security risks that would likely have resulted in a full reinstall.
- ~15-20 non-destructive diagnostic actions.
In addition, it acknowledged that several suggestions were repeated even after I’d shown results, either the direct output of the request or a statement that I had done something serving the same purpose, such as rebooting the machine rather than restarting one or two services. It also escalated rapidly to more and more destructive actions relative to purely diagnostic ones.
Even when changing from one focus to another it did not go back to a purely diagnostic mode; it immediately included more destructive actions (delete service endpoints, modify kube-router, …).
The Solution
With the reminder (or hint) to look at cluster networking, I remembered having something similar happen previously. Inter-node machine-level DNS had failed, and even though the Kubernetes control plane could interact with every node, the actual pods had issues. With that in mind I went back to my blackstaff ssh session, ran updates and rebooted the machine. I probably could have gotten away with just restarting the `k3s` service, but I decided to run updates so I would not have to do so in a day or so. The machine came back up and within a minute or two the entire cluster was back to normal, the longest delay being the few pods that needed to wait out their back-off periods.
The Underlying Cause
I actually already knew this was an issue, but troubleshooting first thing in the morning was not conducive to my remembering it. When `tailscaled` restarted, it changed the state of the network stack. Restarting `k3s` afterwards makes it re-learn all of its routes based on the new network state. Since I forgot to restart that second service, blackstaff was no longer properly routing to and from arthur, so the pods ran into issues.
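A quick way to confirm this state (and that the restarts actually fixed it) is to check the tailscale side directly. A sketch, assuming systemd-resolved and MagicDNS:

```bash
# Is the node up on the tailnet and seeing its peers?
tailscale status

# Is the MagicDNS resolver (100.100.100.100) attached to the tailscale interface?
resolvectl status tailscale0

# Do the short machine names resolve again?
ping -c 3 arthur
```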
My agent nodes have the ‘correct’ command in their shell history, so if the tailscale DNS issue occurs I fix everything at once:

```bash
sudo systemctl restart tailscaled k3s-agent
```
Going forward I’m going to have to make sure to run the equivalent on blackstaff:

```bash
sudo systemctl restart tailscaled k3s
```
1. I’m running `diun` as a way to watch for version updates of my self-hosted apps. ↩︎