A few weeks ago my Kubernetes-based homelab suffered a catastrophic failure. The internal routing no longer worked, nor did DNS resolution for in-cluster services. Since I had almost everything defined in YAML files and all my persistent data stored in Longhorn Volumes, which were themselves backed up, I felt safe destroying and rebuilding the cluster.

Unfortunately, while doing so, the Longhorn volumes turned out either to be missing (only 9 of roughly 17 were found) or to be orphans that refused to attach to Kubernetes. I was able to find the remaining volumes in a backup, so I started restoring, but when I tried to mount them individually using a Longhorn Docker container the contents were corrupt. Working through my Borg backups of the volumes only turned up more corruption, except for my PGDump backups, which luckily contained the vast majority of the data I would hate to lose.

In hindsight my approach had flaws. I should have tried to mount the Longhorn Volumes that were found before overwriting them with a backup, or better yet copied the data off the volumes before I destroyed the cluster. I likely could have recovered more of the persistent data in less time had I been more careful, but then I would not have discovered that the backups were corrupt until more critical data was at stake.

I thought I had it right

Even though I’ve been working as a SysAdmin and Cloud Architect for over a decade, I’ve never been the one responsible for data recovery during disaster recovery situations. But I do know the 3-2-1 rule for backups (3 copies of the data, on 2 different types of storage media, with 1 copy offsite) and the importance of verifying that the backups work. I’d only had the homelab set up and in a state I considered functional for a week or two, so I hadn’t yet tried to restore these specific backups.

  1. Longhorn Persistent Volumes had 2 replicas, one of which was supposed to be on a dedicated disk on one node so that backups could be taken consistently (roughly the intent sketched after this list).
  2. Backups were being stored in 2 locations: locally as well as on a remote backup server. To me this satisfied 3-2-1, since for all content I had one offsite Borg copy, one on-site Borg copy, and a Longhorn replica (either the one being backed up by Borg or the second node’s replica).
  3. Other Borg backups had been validated in the past, so I knew the backup process worked.
  4. Scheduled backups sent notifications via NTFY if they failed.
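
In Longhorn terms, the intent behind point 1 maps roughly onto a StorageClass like the one below. This is a sketch rather than my exact manifest, and the “backup” disk tag is illustrative: diskSelector restricts replicas to disks carrying that tag, so the dedicated disk (plus a tagged disk on the second node for the other replica) has to be tagged to match.

```yaml
# Sketch of the intended setup: two replicas per volume, restricted to disks
# tagged "backup" so one copy ends up on the dedicated backup disk.
# The "backup" tag is illustrative and must match a tag set on the disks
# themselves in Longhorn's node/disk configuration.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-backed-up
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  diskSelector: "backup"
```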

Everything involved was working. I had occasional Borg failures on the backups, but looking at the logs they were “File xyz has changed on filesystem” warnings caused by logs being rotated or updated while the backup was running. I’d seen those before on my other backups and knew they didn’t actually corrupt the data.

What I’m doing now

After getting everything up and running I went through and improved my backup process. While future failures are still possible, at this point I’m not sure there is much I can do within the constraints of a homelab.

  1. No more Borg backups of the Longhorn Volumes directly. This obviously didn’t work, so there’s no point continuing.
  2. Fix my Kubernetes StorageClass definitions to try and ensure I actually do have a copy of each Longhorn Volume on the expected disk.
  3. Enable BTRFS timeline snapshots[1] of the Longhorn Volumes. Since my data is on BTRFS I can use snapshots to get point-in-time state. I only keep 2 weekly, 5 daily, and 10 hourly copies since this is meant for quick rollbacks.
  4. Add Minio to my list of self-hosted applications with HostPath storage and configure Longhorn to use this as the S3 backup endpoint.
  5. Set up BTRFS snapshots on the Minio HostPath for quick rollbacks.
  6. Set up Borg backups on the Minio HostPath to replace the old Borg jobs that backed up the Longhorn Volumes directly.
  7. Set up scheduled backup jobs in Longhorn to ship all volumes to Minio (a sketch of the recurring job follows this list).
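
For the last point, Longhorn’s RecurringJob resource handles the schedule. A minimal sketch, assuming the backup target has already been pointed at the Minio bucket through Longhorn’s backup-target and backup-target-credential-secret settings; the name, schedule, and retention below are examples, not my exact values.

```yaml
# Sketch of a Longhorn recurring backup job: back up every volume in the
# "default" group (i.e. volumes with no more specific job assigned) to the
# configured backup target (Minio, in my case) every night, keeping the
# last 14 backups.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  name: nightly-backup
  task: backup            # "backup" ships to the S3 backup target; "snapshot" stays local
  cron: "0 2 * * *"       # 02:00 every night
  groups:
    - default
  retain: 14
  concurrency: 2
  labels: {}
```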

So now I have Longhorn backups taken with Longhorn’s own backup process, which are then backed up again both by BTRFS snapshots and by Borg to local and remote targets. I still have to figure out a manual process (or at least how often to perform it) both for restoring an S3 backup to Longhorn and for restoring a Borg backup to Minio and from there to Longhorn.
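
For the S3-to-Longhorn direction, the option I’m eyeing is the fromBackup StorageClass parameter, which as far as I can tell from the Longhorn docs provisions a new volume pre-populated from a backup URL. A rough sketch with placeholder values; the real backup URL would come from the backup listing in the Longhorn UI.

```yaml
# Rough sketch of restoring a Longhorn backup into a new volume via a
# one-off StorageClass. The fromBackup URL below is a placeholder; the
# actual value comes from the backup listing in the Longhorn UI.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-restore-myvolume    # throwaway class just for this restore
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fromBackup: "s3://longhorn-backups@us-east-1/?backup=backup-xxxx&volume=pvc-xxxx"
```

A PVC created against that class should come back with the backup’s contents, though I still need to test this end to end.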

Postmortem: What I think went wrong

Longhorn Volumes are not meant to be accessed directly. When I was backing up the volumes they were actively mounted and accessible as SCSI devices[2], either on the machine being backed up or on the remote node that held the in-use replica. I wasn’t backing up the SCSI device itself, but rather the underlying data on disk while it was in use.

The only recoverable data was from the volumes used for my PGDump backups, which were only mounted for the few minutes it took to take the database dump. By the time Borg backed them up the filesystem was no longer mounted and 'in use'.

The missing Longhorn Volumes weren’t necessarily missing; they just weren’t storing replicas on the disk I thought they were. I have 2 disk definitions on that node, one in the default location (/var/lib/longhorn) and the other on a separate storage array (/data/longhorn), and without a disk selector to pin them, replicas could be scheduled onto either disk.
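
For reference, that disk layout lives in Longhorn’s Node resource and looks something like the trimmed sketch below (the disk names and tags are illustrative, not my actual config). A replica only gets pinned to a particular disk if the StorageClass diskSelector matches one of that disk’s tags; otherwise Longhorn will happily schedule it onto any disk that allows scheduling.

```yaml
# Trimmed sketch of a node's disk configuration in Longhorn. Disk names and
# tags are illustrative. Replicas land on a specific disk only when the
# StorageClass diskSelector matches one of that disk's tags; otherwise any
# disk with allowScheduling: true is fair game.
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
  name: node1
  namespace: longhorn-system
spec:
  disks:
    default-disk:
      path: /var/lib/longhorn
      allowScheduling: true
      tags: []
    data-disk:
      path: /data/longhorn
      allowScheduling: true
      tags: ["backup"]      # has to match the diskSelector in the StorageClass
```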


  1. BTRFS allows for point-in-time snapshots to be taken of a BTRFS subvolume. Snapper automates the process by taking hourly snapshots and handling rotation to only keep the desired number.

  2. The contents are treated as high-speed physical disks rather than just another folder.