GPU servers are expensive; a cheaper option is to build your own GPU machine. If you prefer working on a different machine, or you want to offer the GPU capabilities as a server, you can deploy it with Docker or include it in a cluster with k3s or Kubernetes. I encountered some difficulties in the process of setting up my GPU cluster, and some advice found on the internet and from LLMs is outdated, so here is a guide on what worked for me.


Running a local GPU server with Docker

To run computations on the GPU, you most likely need CUDA (strictly speaking, there are other options). Install the latest CUDA toolkit from here on each GPU node. Confirm the installation by running nvidia-smi and checking that it lists your GPU.
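
A quick sanity check on each node might look like this (nvidia-smi ships with the driver, nvcc with the toolkit):

nvidia-smi        # should list the GPU(s) and the driver version
nvcc --version    # should print the installed CUDA toolkit release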

Docker needs a special backend, so the toolkit should also install nvidia-docker for you. There is no latest tag for the NVIDIA images, so test that it works by running a specific version, e.g.

nvidia-docker run --rm nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi
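
Depending on your version of the NVIDIA Container Toolkit, the plain Docker CLI with the --gpus flag may also work instead of the nvidia-docker wrapper:

docker run --rm --gpus all nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi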

You can also expose the Docker daemon on your local network. This way you can use Docker as if it were installed locally, while images build and containers run on your server. On the client, set the env var with the default socket port, e.g.: export DOCKER_HOST=tcp://hostname:2375. Be aware that port 2375 is unauthenticated and unencrypted, so only expose it on a trusted network. Note that this uses the default Docker runtime and not nvidia-docker. To change that, modify the default runtime in a config file on the server: in /etc/docker/daemon.json add

"default-runtime": "nvidia"

and restart docker with sudo systemctl restart docker.
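
For context, a minimal sketch of what the whole /etc/docker/daemon.json might look like; the runtimes entry is usually written for you by nvidia-ctk runtime configure --runtime=docker:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

To expose the daemon over TCP as described above, one common way is a systemd drop-in that adds the TCP socket alongside the local one (a sketch; adapt the dockerd path to your distribution):

# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock -H tcp://0.0.0.0:2375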

Note that CUDA is not available during the image build, only at run time. That should not be a problem; in my case, I could safely ignore a warning that CUDA could not be found.

Using a home GPU server in a k3s cluster

To join a node, you need the token from cat /var/lib/rancher/k3s/server/node-token on the host server; then install the agent with curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s -

According to the documentation, the systemd unit (where the startup flags live) resides in /etc/systemd/system/k3s.service on the server. For the agent it is /etc/systemd/system/k3s-agent.service instead.

The NVIDIA Container Toolkit did not detect the containerd runtime that k3s uses by default, so I switched the k3s runtime to Docker by adding the --docker flag (see the unit-file sketch after the test pod below). You can test it with

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
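
For reference, adding the flag only requires a small change to the agent's systemd unit; a sketch of the relevant ExecStart line (the default k3s install places the binary at /usr/local/bin/k3s):

ExecStart=/usr/local/bin/k3s \
    agent \
    --docker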

Tainting the server is not a good idea because it breaks a lot of helm installs. Instead, you should taint the agent (GPU) node so that normal deployments don't end up on it. You can do this manually or automatically via --node-taint=key=value:NoSchedule during registration; for security reasons, the flag cannot be changed after registration. See docs.
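
Done manually, the taint matching the registration example above would look like this (the node name is a placeholder):

kubectl taint nodes my-gpu-node key=value:NoSchedule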

Tailscale for the VPN

The k3s documentation says the Tailscale integration is experimental; however, I managed to make it work. I later figured out that since switching my ISP from Telekom to M-net, my home network is served via DS-Lite. That means I only have an IPv6 address, and the IPv4 address I get is the exit node of a tunnel; it cannot be used to reach me directly, as it is shared with multiple customers via CGNAT. I also noticed later that my IPv6 DNS server was misconfigured. From experience I know that IPv6 is sometimes not supported by applications, but the WireGuard setup described below might indeed work over IPv6. Tailscale gives you the pain-free solution.
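
A sketch of joining an agent over Tailscale, assuming the experimental --vpn-auth integration from the k3s docs (the auth key is a placeholder you generate in the Tailscale admin console):

curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s - --vpn-auth="name=tailscale,joinKey=tskey-auth-XXXX"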

Using WireGuard

You can also use a VPN directly with WireGuard. As explained above, I could not make it work with my ISP, but my notes might still be handy.

By default, k3s uses VXLAN as the flannel backend, but VXLAN is not encrypted and hence not suitable for traffic over the internet. When running with k3s, change this on the server: --node-external-ip=<SERVER_EXTERNAL_IP> --flannel-backend=wireguard-native --flannel-external-ip. See the docs for the parameters.

These flags can also go into the config file at /etc/rancher/k3s/config.yaml. However, I could not find that directory on my installation, so I added the flags to the ExecStart line in /etc/systemd/system/k3s.service instead, e.g. --node-external-ip=212.114.180.142
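
If the config file route works for you, a sketch of the equivalent /etc/rancher/k3s/config.yaml (flag names map to YAML keys; the IP is a placeholder):

node-external-ip: <SERVER_EXTERNAL_IP>
flannel-backend: wireguard-native
flannel-external-ip: true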

You will most likely need to add port forwarding on your NAT router if your server is not reachable from the public internet, and disable the firewall on the node. The catch is that the external IP address is probably dynamic, so after every reboot you must update the flag; my address also changes every day. This script updates it daily:

#!/bin/bash
# Fetch the current external IPv4 address.
get_external_ip() {
    curl -s https://api.ipify.org
}

EXTERNAL_IP=$(get_external_ip)
# Rewrite the flag in the unit file (assumes it is the last argument on its line).
sudo sed -i "s/--node-external-ip=.*/--node-external-ip=$EXTERNAL_IP/" /etc/systemd/system/k3s-agent.service
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent
echo "Updated k3s agent with new external IP: $EXTERNAL_IP"

To test whether the agent can use DNS and reach the internet, run a pod with this config (adapt the hostname in the nodeSelector first):

apiVersion: v1
kind: Pod
metadata:
  name: internet-check
spec:
  nodeSelector:
    kubernetes.io/hostname: ubuntufortress
  containers:
  - name: internet-check
    image: busybox
    command: 
      - "/bin/sh"
      - "-c"
      - "ping -c 4 google.com && wget -q --spider http://google.com"
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  restartPolicy: Never
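
Apply the manifest and inspect the result (the file name is just an example):

kubectl apply -f internet-check.yaml
kubectl logs internet-check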

You might need to install WireGuard (sudo apt install wireguard) on both systems.
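
Once the cluster is up, you can check that the wireguard-native backend actually established a tunnel on each node (flannel typically names the interface flannel-wg):

sudo wg show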