GPU servers are expensive, and a cheaper option is to build your own GPU machine. If you prefer working on a different machine, or you want to offer the GPU capabilities as a server, you can deploy it with Docker or include it in a cluster with k3s or Kubernetes. I encountered some difficulties while setting up my GPU cluster, and some of the advice found on the internet and from LLMs is outdated, so here is a guide on what worked for me.
Running a local GPU server with docker
To run computations on the GPU, you most likely need CUDA (strictly speaking, there are other options).
Install the latest CUDA toolkit from here on each GPU node. Confirm the installation by running nvidia-smi and checking that it produces output.
Docker needs a special backend, so the toolkit should also install nvidia-docker for you.
There is no latest tag for the NVIDIA CUDA images, so you can test whether it works by running a particular version, e.g.
nvidia-docker run --rm nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi
You can expose the Docker daemon on your local network. This way, you can use Docker as if it were installed locally, but everything builds and runs on your server. To use it from your client, set the env var, e.g.:
export DOCKER_HOST=tcp://hostname:2375
using the default unencrypted daemon port.
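For this to work, dockerd has to listen on a TCP socket in addition to the local Unix socket. A minimal sketch of how I would do it on a systemd-based server (the override path is a convention; note that port 2375 is unauthenticated and unencrypted, so only expose it on a trusted local network):
# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375
Then reload and restart the daemon with sudo systemctl daemon-reload && sudo systemctl restart docker.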
Note that this runs the default Docker runtime and not nvidia-docker. To change that, modify the default runtime in a config file on the server: in /etc/docker/daemon.json add
"default-runtime": "nvidia"
and restart Docker with sudo systemctl restart docker.
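For reference, my daemon.json ended up looking roughly like this (a sketch; the runtimes entry is normally written for you when you run sudo nvidia-ctk runtime configure --runtime=docker, so check the generated file rather than copying this verbatim):
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}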
Note that CUDA is not available during the build, only at run time. That should not be a problem: in my case, I could simply ignore a warning that CUDA could not be found.
Using a home GPU server in a k3s cluster
To add an agent node to your k3s cluster, you need the token from the server, which you can read with cat /var/lib/rancher/k3s/server/node-token on the host server. Then install the agent with
curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s -
on the agent node.
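Once the installer finishes, you can check on the server that the new agent has joined and is Ready (the node name will be the agent's hostname):
kubectl get nodes -o wide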
According to the documentation, the k3s launch configuration resides in /etc/systemd/system/k3s.service on the server. For the agent it is in /etc/systemd/system/k3s-agent.service instead.
To enable GPUs in Kubernetes, you also need the NVIDIA device plugin, which advertises the available GPU resources in the node capacity.
I found that I had installed multiple conflicting helm charts in parallel. After uninstalling them, this guide to installing the GPU operator with a simple helm install worked; the operator includes the device plugin. Before that, I had installed the device plugin on its own, which can be done with
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
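From memory, the GPU operator install from the linked guide boils down to something like the following; treat it as a sketch and check the current NVIDIA docs for the exact chart options:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator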
For my installation, the NVIDIA Container Toolkit did not detect the OCI runtime that k3s uses by default, so I switched the k3s runtime to docker by adding the --docker flag to the config.
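Concretely, that meant adding --docker to the agent's launch arguments in /etc/systemd/system/k3s-agent.service and restarting the service. A sketch (your generated unit wraps the ExecStart over several lines and will look slightly different):
# /etc/systemd/system/k3s-agent.service (excerpt)
ExecStart=/usr/local/bin/k3s agent --docker
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent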
You can test the full setup with
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
EOF
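If everything is wired up correctly, the pod should be scheduled on the GPU node and the vectoradd sample should report a successful run in its logs (as far as I remember it prints a "Test PASSED" line):
kubectl get pod gpu-pod
kubectl logs gpu-pod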
How are pods placed on the right node now? Tainting the server is not ideal because it prevents a lot of helm installs. You could instead taint the agent node so that normal deployments don't end up on the GPU node. You can do this manually or automatically via --node-taint=key=value:NoSchedule during registration; for security reasons this is not allowed later, see the docs. Because of the problems caused by taints, you might consider using node affinities instead. Tainting caused me a lot of frustration: on a cluster upgrade the GPU node was not upgraded, and then some flannel-related networking failed, effectively breaking the whole node.
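As an alternative to taints, a node affinity on the GPU workload steers it to the GPU node without affecting anything else. A minimal sketch of the relevant part of a pod spec, assuming the GPU agent's hostname is ubuntufortress (as in the test pod below); with the GPU operator installed you could also match on one of the labels it sets instead of the hostname:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["ubuntufortress"]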
Tailscale for the VPN
The k3s documentation says the Tailscale integration is experimental; however, I managed to make it work. I later figured out that since switching my ISP from Telekom to M-net, my home network is served with DS-Lite. That means I only have an IPv6 address, and the IPv4 address I get is the exit node of a tunnel. That address cannot be used to reach me directly, as it is shared with multiple customers via CGNAT. Furthermore, I later noticed that my IPv6 DNS server was misconfigured. The WireGuard setup might indeed work, but from experience with IPv6 I learned that it is sometimes not supported by applications. Tailscale gives you the pain-free solution.
You need to provide an auth key via the Tailscale settings in the k3s launch settings. The local Tailscale installation will be logged out, as k3s manages it.
We can put the config in /etc/rancher/k3s/config.yaml; it was not there by default. Another option is to add the settings to /etc/systemd/system/k3s.service as launch arguments.
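As a sketch, assuming the current syntax of the experimental k3s Tailscale integration (check the k3s VPN docs, this may change), the config.yaml on a node could look roughly like this, with the auth key generated in the Tailscale admin console:
# /etc/rancher/k3s/config.yaml
vpn-auth: "name=tailscale,joinKey=<your-tailscale-auth-key>"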
Using WireGuard
You can also use a VPN directly with WireGuard. As explained above, I could not make it work with my ISP, but my notes might still be handy.
By default k3s uses VXLAN, but this is not encrypted and hence not suitable for traffic over the internet.
When running with k3s, set these flags on the server:
--node-external-ip=<SERVER_EXTERNAL_IP> --flannel-backend=wireguard-native --flannel-external-ip
Docs for the parameters
Most likely, you need to add port forwarding on your NAT router if your server is not reachable from the public internet. Also, disable the firewall on the node. The problem is that the external IP address is probably dynamic, so you must update it after every reboot; my address also changes every day. This script updates it daily:
#!/bin/bash
# Update the k3s agent with the current external IP address.
get_external_ip() {
  curl -s https://api.ipify.org
}
EXTERNAL_IP=$(get_external_ip)
# Rewrite the --node-external-ip flag in the agent's systemd unit.
sudo sed -i "s/--node-external-ip=.*/--node-external-ip=$EXTERNAL_IP/" /etc/systemd/system/k3s-agent.service
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent
echo "Updated k3s agent with new external IP: $EXTERNAL_IP"
To test whether the agent can use DNS and reach the internet, run a pod with this config (adapt the hostname first).
apiVersion: v1
kind: Pod
metadata:
  name: internet-check
spec:
  nodeSelector:
    kubernetes.io/hostname: ubuntufortress
  containers:
    - name: internet-check
      image: busybox
      command:
        - "/bin/sh"
        - "-c"
        - "ping -c 4 google.com && wget -q --spider http://google.com"
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  restartPolicy: Never
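Once the pod has completed, the ping and wget output in its logs tells you whether DNS resolution and outbound traffic from the GPU node work:
kubectl logs internet-check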
You might need to install WireGuard with
sudo apt install wireguard
on both systems.