GPU servers are expensive, and a cheaper option is to build your own GPU machine. If you prefer working on a different machine, or you want to offer the GPU capabilities as a server, you can deploy it with Docker or include it in a cluster with k3s or Kubernetes. I encountered some difficulties in the process of setting up my GPU cluster, and some advice found on the internet and from LLMs is outdated, so here is a guide on what worked for me.
Running a local GPU server with docker
To run computations on the GPU, you likely need CUDA (strictly speaking, there are other options).
Install the latest CUDA toolkit from here on each GPU node. Confirm the installation by running nvidia-smi and checking that it gives you some output.
Docker needs a special back-end, so the toolkit should also install nvidia-docker for you.
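If the NVIDIA runtime does not get registered with Docker automatically, the container toolkit ships a helper command that writes the Docker config for you; a minimal sketch, assuming a recent NVIDIA Container Toolkit:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker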
There is no latest tag for the NVIDIA CUDA images, so you can test whether it works by running a particular version, e.g.:
nvidia-docker run --rm nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi
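If you prefer plain Docker over the nvidia-docker wrapper, the --gpus flag should give the same result once the NVIDIA runtime is set up:
docker run --rm --gpus all nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi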
You can expose the Docker daemon on your local network. This way, you can use Docker as if it were installed locally, but it builds and runs on your server. On the client, point the environment variable at the server, using the default socket port, e.g.:
export DOCKER_HOST=tcp://hostname:2375
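For this to work, the Docker daemon on the server has to listen on TCP. One way to do that is a systemd drop-in; a sketch, assuming the standard dockerd path, and note that port 2375 is unauthenticated, so only expose it inside a trusted network:
# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375
Afterwards run sudo systemctl daemon-reload and sudo systemctl restart docker.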
Note that this runs the default Docker runtime and not nvidia-docker. To change that, you can set the default runtime in a config file on the server: in /etc/docker/daemon.json add "default-runtime": "nvidia" and restart Docker with sudo systemctl restart docker.
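For reference, the resulting /etc/docker/daemon.json looks roughly like this; the runtimes entry is what the NVIDIA toolkit normally writes, and the exact binary path may differ on your install:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}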
Note that CUDA is not available during the image build, only at run time. That should not be a problem; in my case I could safely ignore a warning that CUDA could not be found.
Using a home GPU server in a k3s cluster
To register a new node you need the token from cat /var/lib/rancher/k3s/server/node-token on the host server, and you install the agent with
curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s -
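You can verify on the server that the new node has joined and shows up as Ready:
kubectl get nodes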
According to the documentation, the config should reside in /etc/systemd/system/k3s.service on the server. For the agent it is in /etc/systemd/system/k3s-agent.service instead.
The NVIDIA Container Toolkit did not detect the OCI runtime that k3s uses by default, so I switched the k3s runtime to Docker by adding the --docker flag.
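Depending on your k3s version, the flag can be passed directly to the install script at registration time; a sketch reusing the URL and token from above:
curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s - --docker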
You can test it with
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
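If scheduling works, the sample container runs a CUDA vector addition; check its logs once the pod has completed (the output should end with something like "Test PASSED"):
kubectl logs gpu-pod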
Tainting the server node is not a good idea because it prevents a lot of Helm installs from scheduling.
You should instead taint the GPU agent node so that normal deployments don't end up on it. You can do this manually or automatically via --node-taint=key=value:NoSchedule during registration. For security reasons this cannot be added from the agent after it has registered. See docs.
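As a sketch, the taint can be set while registering the agent; the key and value below are just placeholders that happen to match the toleration used in the test pod above:
curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s - --node-taint nvidia.com/gpu=present:NoSchedule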
Tailscale for the VPN
The k3s documentation says Tailscale support is experimental; however, I managed to make it work. I later figured out that since switching my ISP from Telekom to M-net my home network is served via DS-Lite. That means I only have an IPv6 address, and the IPv4 address I get is the exit node of a tunnel; it cannot be used to reach me directly because it is shared with multiple customers via CGNAT. Furthermore, I later noticed that my IPv6 DNS server was misconfigured. From experience I have learned that IPv6 is sometimes not supported by applications, although the WireGuard setup might indeed work over IPv6. Tailscale gives you the pain-free solution.
Using Wireguard
You can use a VPN directly by using WireGuard. As explained above, I could not make it work with my ISP, but my notes might still be handy.
By default k3s uses VXLAN, but this is not encrypted and hence not suitable for traffic over the internet.
When running with k3s, change this on the server with:
--node-external-ip=<SERVER_EXTERNAL_IP> --flannel-backend=wireguard-native --flannel-external-ip
Docs for the parameters
We can put this in the config file /etc/rancher/k3s/config.yaml. However, I could not find that directory on my installation, so I added the flag to /etc/systemd/system/k3s.service instead:
--node-external-ip=212.114.180.142
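If the directory does exist on your system, the same flags map one-to-one to keys in /etc/rancher/k3s/config.yaml; a sketch with placeholder values:
node-external-ip: "<SERVER_EXTERNAL_IP>"
flannel-backend: "wireguard-native"
flannel-external-ip: true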
You will most likely need to add port forwarding on your NAT if your server is not reachable from the public internet, and you may have to disable the firewall on the node. The problem is that the external IP address is probably dynamic, so you must update it every time you boot; my address also changes every day. This script updates it daily:
#!/bin/bash
# Fetch the current external IP and patch it into the k3s agent unit.
get_external_ip() {
  curl -s https://api.ipify.org
}

EXTERNAL_IP=$(get_external_ip)
# Assumes --node-external-ip is the last argument on its line in the unit file.
sudo sed -i "s/--node-external-ip=.*/--node-external-ip=$EXTERNAL_IP/" /etc/systemd/system/k3s-agent.service
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent
echo "Updated k3s agent with new external IP: $EXTERNAL_IP"
To test whether pods on the agent can reach the internet and resolve DNS, run a pod with this config (adapt the hostname first):
apiVersion: v1
kind: Pod
metadata:
  name: internet-check
spec:
  nodeSelector:
    kubernetes.io/hostname: ubuntufortress
  containers:
    - name: internet-check
      image: busybox
      command:
        - "/bin/sh"
        - "-c"
        - "ping -c 4 google.com && wget -q --spider http://google.com"
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
  restartPolicy: Never
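Save the manifest to a file (the name is arbitrary), apply it, and read the pod's output once it has completed:
kubectl apply -f internet-check.yaml
kubectl logs internet-check
kubectl delete pod internet-check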
You might need to install WireGuard on both systems:
sudo apt install wireguard