Running a local GPU server with Docker
I ran into some difficulties while setting up my GPU cluster, and much of the advice on the internet and from LLMs is outdated, so here is a guide to what worked for me.
For running computations on the GPU, you likely need CUDA.
Install the latest CUDA Toolkit from here on each GPU node. Confirm the installation by running nvidia-smi
and checking that it prints your driver version and GPUs.
Docker needs a special runtime to access the GPU, so the toolkit should also install nvidia-docker for you.
There is no latest tag for the nvidia/cuda images, so you can test that it works by running a specific version, e.g.:
nvidia-docker run --rm nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi
You can also expose the Docker daemon on your local network. This way you can use Docker as if it were installed locally, while builds and containers actually run on the server. On the client, point the CLI at the server by setting an environment variable, e.g.:
export DOCKER_HOST=tcp://hostname:2375
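If the daemon does not already listen on TCP, one way to expose it is a systemd drop-in on the server. This is a sketch, assuming a default systemd-based install; note that port 2375 is unencrypted and unauthenticated, so only expose it on a trusted network:
# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375
Then reload and restart: sudo systemctl daemon-reload && sudo systemctl restart docker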
Note that this runs the default Docker runtime and not nvidia-docker. To change that, modify the default runtime in a config file on the server: in /etc/docker/daemon.json
add
"default-runtime": "nvidia"
and restart Docker with sudo systemctl restart docker.
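For reference, here is a minimal sketch of what /etc/docker/daemon.json typically looks like after the change; the runtimes block is what the NVIDIA toolkit installer usually writes, and the runtime path may differ on your system:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}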
Note that CUDA is not available in the runtime during the build, only when the container runs. That should not be a problem; in my case I could safely ignore a build-time warning that CUDA could not be found.
Using a home GPU server in a k3s cluster
For installing an agent node you need the token from cat /var/lib/rancher/k3s/server/node-token
on the host server; then install it with:
curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s -
According to the docs, the config should reside in /etc/systemd/system/k3s.service on the server; for the agent it is /etc/systemd/system/k3s-agent.service instead.
The NVIDIA container toolkit did not detect the containerd OCI runtime that k3s uses by default, so I switched the k3s runtime to Docker by adding the --docker flag.
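One way to add the flag, assuming a default systemd install (a sketch; paths may differ on your system):
# append the flag to the ExecStart line in /etc/systemd/system/k3s.service
ExecStart=/usr/local/bin/k3s server --docker
# then reload and restart
sudo systemctl daemon-reload && sudo systemctl restart k3s
# alternatively, pass it to the installer
curl -sfL https://get.k3s.io | sh -s - --docker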
You can test it with:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
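If everything works, kubectl logs gpu-pod should end with a Test PASSED line from the vector-add sample.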
Tainting the server is not a good idea because it prevents a lot of Helm installs from scheduling.
You should taint the agent node instead, so that normal deployments don't end up on the GPU machine. You can do this manually or automatically via --node-taint=key=value:NoSchedule during registration.
For security reasons this is not allowed after registration; see the docs.
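For example (a sketch; the node name ubuntufortress and the taint key are assumptions, chosen to match the toleration in the GPU pod above):
# manually, after the agent has joined
kubectl taint nodes ubuntufortress nvidia.com/gpu=present:NoSchedule
# or automatically at registration
curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s - --node-taint nvidia.com/gpu=present:NoSchedule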
Tailscale for the VPN
The k3s documentation says Tailscale support is experimental. However, it was the only thing I managed to make work.
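For reference, my understanding from the k3s docs is that the integration is enabled with the --vpn-auth flag on the server and agents (a sketch; you generate the auth key in the Tailscale admin console):
curl -sfL https://get.k3s.io | sh -s - --vpn-auth="name=tailscale,joinKey=<AUTH-KEY>"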
Using WireGuard
You can also set up the VPN directly with WireGuard. I did not manage to make it work, but my notes might be handy.
By default k3s uses VXLAN as the flannel backend, but VXLAN is not encrypted and hence not suitable for traffic over the internet.
To change this, run the k3s server with these flags:
--node-external-ip=<SERVER_EXTERNAL_IP> --flannel-backend=wireguard-native --flannel-external-ip
The docs describe these parameters.
These flags can also go into the config at /etc/rancher/k3s/config.yaml. However, I could not find the k3s directory on my installation, so I added the flag to the ExecStart line in /etc/systemd/system/k3s.service instead:
--node-external-ip=212.114.180.142
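If the config file route works for you, the flags translate directly into YAML keys; a sketch (create the directory if it does not exist):
# /etc/rancher/k3s/config.yaml
node-external-ip: <SERVER_EXTERNAL_IP>
flannel-backend: wireguard-native
flannel-external-ip: true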
You will most likely need to add port forwarding on your NAT router if your server is not reachable from the public internet. Also, disable the firewall on the node. The remaining problem is that the external IP address is probably dynamic, so it must be updated after every boot; mine also changes every day. The following script fetches the current address and patches the agent unit:
#!/bin/bash
# Fetch the current external IP from a public service
get_external_ip() {
  curl -s https://api.ipify.org
}

EXTERNAL_IP=$(get_external_ip)

# Patch the --node-external-ip flag in the agent's systemd unit
# ([^ ]* avoids eating any flags that follow on the same line)
sudo sed -i "s/--node-external-ip=[^ ]*/--node-external-ip=$EXTERNAL_IP/" /etc/systemd/system/k3s-agent.service
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent
echo "Updated k3s agent with new external IP: $EXTERNAL_IP"
To test whether the agent has working DNS and internet access, run a pod with this config (adapt the hostname first):
apiVersion: v1
kind: Pod
metadata:
  name: internet-check
spec:
  nodeSelector:
    kubernetes.io/hostname: ubuntufortress
  containers:
    - name: internet-check
      image: busybox
      command:
        - "/bin/sh"
        - "-c"
        - "ping -c 4 google.com && wget -q --spider http://google.com"
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
  restartPolicy: Never
You might need to install WireGuard on both systems:
sudo apt install wireguard