Now that I’ve covered how I keep my homelab connected and configured, I can finally start talking about application workload orchestration. This will be spread over multiple posts to allow for more depth.

Kubernetes offers a convenient, albeit complex, platform for running containerized1 applications. Deploying a Kubernetes cluster with a cloud provider (Azure, Google, Amazon, etc.) is fairly straightforward since they take care of configuring the underlying compute, network and storage. Deploying a full Kubernetes cluster manually is significantly more complex since you need to configure every required component yourself. A team of coworkers tried to do so 5-6 years ago and ran into numerous issues due to unclear documentation or unsupported configurations. Many of the improvements to Kubernetes in the intervening years have made things more straightforward, but it is still more complex than I would like for a homelab.

Luckily there is a fully certified Kubernetes distribution, k3s, that provides sane defaults for setting up a new Kubernetes cluster, with a simple CLI tool to set up server nodes and add agent nodes to the cluster. There is also k3d, a k3s wrapper that lets you create an arbitrary number of nodes as Docker containers for experimenting.

In my case I run a 4-node cluster, with blackstaff as the server node and arthur, jlpks8888 and dresden as the agents. If I add more nodes in the future I will probably end up with 3 server nodes to allow for quorum and redundancy, and hopefully retire dresden by obtaining a GPU to dedicate to LLM workloads.

Since I’m using Salt, all of the configuration is laid down by it; the only manual step was running the installer. I could probably do that last step through Salt as well, but kept it manual so I could validate everything. Common configuration for both server and agent nodes is kept in the top-level /etc/rancher/k3s/config.yaml, while /etc/rancher/k3s/config.yaml.d/server.yaml and /etc/rancher/k3s/config.yaml.d/agent.yaml cover the type-specific configuration. Individual node settings (labels and taints) are populated to /etc/rancher/k3s/config.yaml.d/<hostname>.yaml.
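As a rough illustration of how that hangs together (a minimal sketch, not my actual state tree; the salt:// source paths and file names are placeholders), the Salt state is essentially a handful of file.managed entries, with the templated files rendered through Jinja so they can pull the hostname from grains and the join token from the mine:

    # k3s/config.sls: a minimal sketch, assuming templates under salt://k3s/files/
    /etc/rancher/k3s/config.yaml:
      file.managed:
        - source: salt://k3s/files/config.yaml
        - makedirs: True

    # agent.yaml.j2 would template in the join token from the Salt mine
    # (in practice this entry is only applied to agent nodes)
    /etc/rancher/k3s/config.yaml.d/agent.yaml:
      file.managed:
        - source: salt://k3s/files/agent.yaml.j2
        - template: jinja
        - makedirs: True

    # Per-node labels and taints, keyed off the minion's short hostname
    /etc/rancher/k3s/config.yaml.d/{{ grains['host'] }}.yaml:
      file.managed:
        - source: salt://k3s/files/node.yaml.j2
        - template: jinja
        - makedirs: True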

  • /etc/rancher/k3s/config.yaml

    I already have tailscale configured, so I can just use the default flannel network plugin and associate it with the tailscale0 interface. I also use the k3s bundled binaries rather than OS binaries, since I use multiple OSes that could end up with library differences between systems and cause breakages.

    flannel-iface: "tailscale0"
    prefer-bundled-bin: true
    
  • /etc/rancher/k3s/config.yaml.d/agent.yaml

    Re-running my Salt state after deploying k3s to blackstaff adds the cluster join token to my Salt mine. This allows every agent node to auto-join on install without my ever having to look up the token.

    server: "https://blackstaff:6443"
    token: "[token redacted]"
    
  • /etc/rancher/k3s/config.yaml.d/server.yaml

    The server setup on blackstaff ensures secrets are encrypted in etcd and that the cluster initializes. It also disables the default traefik installation so that I can install the newer 3.0 version and use the Kubernetes Gateway API.

    cluster-init: true
    write-kubeconfig-mode: 644
    secrets-encryption: true
    disable:
      - traefik
    kube-apiserver-arg:
      - "audit-log-maxage=30"
    
  • Node labels

    I also include potentially useful machine metadata as node labels on each node. I am not currently using them for anything, but they would allow more specific workload targeting if I ever found a need to do so (see the sketch after this list).

    node-label:
      - homelab.leechpepin.com/os=EndeavourOS
      - homelab.leechpepin.com/os_family=Arch
      - homelab.leechpepin.com/type=server
      - homelab.leechpepin.com/location=homelab
    
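If I ever did need that targeting, it would just be a nodeSelector (or a nodeAffinity rule for a softer preference) on the workload. A minimal sketch using one of the labels above, with a hypothetical pod that should only land on my Arch-based machines:

    # Hypothetical example, not a workload I actually run
    apiVersion: v1
    kind: Pod
    metadata:
      name: arch-only-example
    spec:
      nodeSelector:
        homelab.leechpepin.com/os_family: Arch
      containers:
        - name: app
          image: busybox
          command: ["sleep", "infinity"]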

Controlling Node Workloads with Taints

I use two taints2 to control my workload deployments. The first is public=true:NoSchedule on jlpks8888, which by default prevents anything from running on the VPS. I override that (and even add an affinity to prefer the VPS) for my monitoring stack, since running it publicly makes sense, while workloads more likely to hold sensitive data will never run there.
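The pod spec changes for that are small. A minimal sketch of the fragment that goes into the monitoring pod template, assuming the VPS is targeted by its kubernetes.io/hostname label (my actual selector may differ):

    # Tolerate the VPS taint and prefer (but don't require) scheduling there
    tolerations:
      - key: public
        operator: Equal
        value: "true"
        effect: NoSchedule
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - jlpks8888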

The second is gpu=true:NoSchedule on dresden, combined with the label nvidia.com/gpu.present=true, to allow LLM-related workloads (specifically Ollama) to run on my desktop, since they work best with a GPU.
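The Ollama side looks much the same, just with a hard requirement instead of a preference (a sketch of the pod template fragment, not my full manifest):

    # Only schedule where the GPU actually is, and tolerate its taint
    nodeSelector:
      nvidia.com/gpu.present: "true"
    tolerations:
      - key: gpu
        operator: Equal
        value: "true"
        effect: NoSchedule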

Securing secrets with Infisical

Most of the applications I run need some sort of secrets, be it passwords, API keys or other sensitive data. Keeping these secrets purely in Kubernetes is an obvious point of failure: they would be lost in the case of a cluster failure or a mis-typed delete command. Instead I store them in Infisical and then use the Infisical Secret resource to map them to standard Kubernetes secrets.

The secrets within the cluster are fairly secure since they are encrypted within etcd as part of the cluster configuration. Storing the secrets in a third-party vault, in this case Infisical, means that they can all be recovered after a failure without me having to keep unencrypted copies locally. The one exception is the bootstrap secret that allows the cluster to talk to Infisical; I use a shell script to insert it into the cluster prior to deploying anything else.
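That bootstrap step boils down to creating one ordinary Kubernetes Secret holding the Infisical credentials before anything else is deployed. A minimal sketch of what the script applies (the name, namespace and key names here are hypothetical, not my real ones):

    # Hypothetical bootstrap secret; real values come from the shell script,
    # never from anything committed to git
    apiVersion: v1
    kind: Secret
    metadata:
      name: infisical-machine-identity
      namespace: infisical
    type: Opaque
    stringData:
      clientId: "<machine identity client id>"
      clientSecret: "<machine identity client secret>"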

Handling Certificates with Cert-Manager

Originally I was going to skip adding cert-manager to my cluster since it is not publicly exposed. However, I use Authentik for authentication so I don’t have to create local accounts in applications, and it throws errors if not accessed over HTTPS. I could use a self-signed certificate, but then I would need to load and/or trust the certificate on each of my machines that connect over tailscale. Instead I’m installing cert-manager using the helm chart and requesting wildcard certificates for *.leechpepin.com, so my direct connections are just as secure as my external ones since all the certificates are provided by Let’s Encrypt.
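For reference, the two cert-manager resources involved are an ACME issuer and the wildcard Certificate. A minimal sketch, assuming a DNS-01 solver (wildcards require one) backed by Cloudflare; the DNS provider, issuer name and secret names here are assumptions and placeholders rather than my exact setup:

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-dns
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: certs@example.com               # placeholder contact address
        privateKeySecretRef:
          name: letsencrypt-account-key
        solvers:
          - dns01:
              cloudflare:                      # assumed DNS provider
                apiTokenSecretRef:
                  name: cloudflare-api-token
                  key: api-token
    ---
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: wildcard-leechpepin-com
    spec:
      secretName: wildcard-leechpepin-com-tls
      dnsNames:
        - "*.leechpepin.com"
      issuerRef:
        name: letsencrypt-dns
        kind: ClusterIssuer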

I could use individual certificates rather than a wildcard (which is what my public reverse proxy does); however, since this is purely for internal access, the wildcard simplifies certificate renewal and ensures I don’t accidentally end up with invalid or missing certificates.


  1. The best known example is a Docker container. A Docker container encapsulates all the code and configuration to run the application so that it can be run anywhere in the exact same way. ↩︎

  2. Kubernetes uses taints to mark a node so that containers are not scheduled onto it by default. A container needs a matching toleration for the scheduler to place it on a tainted node. ↩︎