
Running a local GPU server with Docker

I ran into some difficulties while setting up my GPU cluster, and a lot of the advice on the internet and from LLMs is outdated, so here is a guide to what worked for me.

For running computations on the GPU, you likely need CUDA. Install the latest CUDA Toolkit from here on each GPU node. Confirm the installation by running nvidia-smi and checking that it prints your driver version and GPUs.
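A quick sanity check on each node (nvcc is only present if the full toolkit is installed, not just the driver):

# Driver and GPUs visible?
nvidia-smi
# Toolkit compiler installed?
nvcc --version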

Docker needs a special backend, so the toolkit should also install nvidia-docker for you. There is no latest tag for the nvidia/cuda images, so you can test whether it works by running a specific version, e.g.

nvidia-docker run --rm nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi
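On newer Docker versions, where the nvidia-docker wrapper is deprecated, the equivalent invocation should be (assuming the NVIDIA Container Toolkit is installed):

docker run --rm --gpus all nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi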

You can also expose the Docker daemon on your local network. This way, you can use Docker as if it were installed locally, while builds and containers actually run on your server. On the client, point the CLI at the server by setting an environment variable, e.g.: export DOCKER_HOST=tcp://hostname:2375. Note that this uses the default Docker runtime (runc) and not nvidia-docker. To change that, you can modify the default runtime in a config on the server. In the file /etc/docker/daemon.json add

"default-runtime": "nvidia"

and restart docker with sudo systemctl restart docker.
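For reference, here is roughly what the complete file looks like (a sketch; the runtimes entry is normally written for you by the toolkit, e.g. via nvidia-ctk runtime configure --runtime=docker):

# Write /etc/docker/daemon.json with nvidia as the default runtime,
# then restart Docker as above.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF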

Note that CUDA is not available at build time, only when the container runs. That should not be a problem: in my case I could safely ignore a build-time warning that CUDA could not be found.

Using a home GPU server in a k3s cluster

To register an agent node, you need the token from /var/lib/rancher/k3s/server/node-token on the server; then install the agent with curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s -
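Once the install script finishes, you can verify from the server that the node joined:

# The new agent should show up with STATUS Ready after a minute or so.
kubectl get nodes -o wide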

According to the docs, the service configuration resides in /etc/systemd/system/k3s.service on the server. For the agent it is /etc/systemd/system/k3s-agent.service instead.

The NVIDIA Container Toolkit did not detect the OCI runtime (containerd) that k3s uses by default, so I switched the k3s runtime to Docker by adding the --docker flag. You can test it with

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
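If everything is wired up, the sample pod runs a small vector addition on the GPU; its logs should end with something like Test PASSED:

kubectl logs gpu-pod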

Tainting the server is not a good idea because it prevents a lot of Helm installs. Instead, taint the agent node so that normal deployments don't end up on the GPU machine. You can do this manually or automatically via --node-taint=key=value:NoSchedule during registration; for security reasons, an agent cannot change its own taints after registration. See the docs.
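Done manually, a taint that matches the toleration in the pod spec above would look like this (substitute your agent's node name):

# Keep ordinary pods off the GPU node; only pods tolerating nvidia.com/gpu land there.
kubectl taint nodes <agent-node-name> nvidia.com/gpu=present:NoSchedule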

Tailscale for the VPN

The k3s documentation says the Tailscale integration is experimental. However, it was the only thing I managed to get working.
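For the record, the integration boils down to a --vpn-auth flag at install time; a minimal sketch based on the k3s docs, assuming you have generated a Tailscale auth key in the admin console:

# On the agent: join the server over Tailscale instead of the raw internet.
curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken \
  sh -s - --vpn-auth="name=tailscale,joinKey=<tailscale-auth-key>"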

Using WireGuard

You can also build the VPN directly with WireGuard. I did not manage to make it work, but my notes might be handy.

By default k3s uses VXLAN as the flannel backend, but VXLAN is not encrypted and hence not suitable for traffic over the public internet. Change this on the server by starting k3s with --node-external-ip=<SERVER_EXTERNAL_IP> --flannel-backend=wireguard-native --flannel-external-ip. See the docs for the parameters.

These flags can go into the config at /etc/rancher/k3s/config.yaml. However, I could not find the k3s directory on my installation, so I added --node-external-ip=212.114.180.142 to the ExecStart line in /etc/systemd/system/k3s.service.
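For reference, the config-file variant would look roughly like this (a sketch; the CLI flags map one-to-one to YAML keys):

# Create the config directory and file that k3s reads on startup.
sudo mkdir -p /etc/rancher/k3s
sudo tee /etc/rancher/k3s/config.yaml <<'EOF'
node-external-ip: <SERVER_EXTERNAL_IP>
flannel-backend: wireguard-native
flannel-external-ip: true
EOF
sudo systemctl restart k3s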

You will most likely need to add port forwarding on your NAT router if your server is not reachable from the public internet. Also, disable the firewall on the node. The problem is that the external IP address is probably dynamic, so you must update it after every reboot; mine also changes every day. This script updates it daily:

#!/bin/bash
# Look up the current external IP via a public echo service.
get_external_ip() {
    curl -s https://api.ipify.org
}

EXTERNAL_IP=$(get_external_ip)

# Rewrite the flag in the unit file, then reload and restart the agent.
sudo sed -i "s|--node-external-ip=[^ ]*|--node-external-ip=$EXTERNAL_IP|" /etc/systemd/system/k3s-agent.service
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent
echo "Updated k3s agent with new external IP: $EXTERNAL_IP"

To test whether the agent can resolve DNS and reach the internet, run a pod with this config (adapt the hostname first):

apiVersion: v1
kind: Pod
metadata:
  name: internet-check
spec:
  nodeSelector:
    kubernetes.io/hostname: ubuntufortress
  containers:
  - name: internet-check
    image: busybox
    command: 
      - "/bin/sh"
      - "-c"
      - "ping -c 4 google.com && wget -q --spider http://google.com"
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  restartPolicy: Never
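Apply it and check the logs; both the ping and the wget must succeed (assuming you saved the manifest as internet-check.yaml):

kubectl apply -f internet-check.yaml
# Wait for the pod to complete, then inspect its output.
kubectl logs internet-check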

You might need to install WireGuard (sudo apt install wireguard) on both systems.