TL;DR: I built a virtual kubelet that lets Kubernetes offload GPU jobs to RunPod.io, useful for burst-scaling ML workloads without needing full-time cloud GPUs.

This project came out of a need while working on an internal ML-based SaaS (which didn’t pan out). Initially, we used the RunPod API directly in the application, as RunPod had the most affordable GPU pricing at the time. But I also had a GPU server at home and wanted to run experiments even cheaper. Since I had good experiences with Kubernetes jobs (for CPU workloads), I installed k3s and made the home GPU node part of the cluster.

The idea was simple: use the local GPU when possible, and burst to RunPod when needed. The app logic would stay clean. Kubernetes would handle the infrastructure decisions. Ideally, the same infra would scale from dev experiments to production workloads.

What Didn't Work

My first attempt was a custom controller written in Go, monitoring jobs and scheduling them on RunPod. I avoided CRDs to stay compatible with the native Job API. Go was the natural choice given its strong Kubernetes ecosystem.

The problem was that overwriting pod values and creating virtual pods meant constantly fighting the Kubernetes scheduler. Reconciliation with RunPod and failed jobs led to problems like retry loops. I also considered queuing stalled jobs and triggering scale-out logic, which increased the complexity further, and it became a mess. I wrote thousands of lines of Go and never got it stable.

What Worked

The proper way to do this is with a virtual kubelet. I used the CNCF sandbox project virtual-kubelet, which registers as a node in the cluster. The normal scheduler can then use taints, tolerations, and node selectors to place pods. When a pod is placed on the virtual node, the provider provisions it through a third-party API, in this case RunPod's.
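For illustration, a Job can be steered onto the virtual node with a node selector plus a toleration for the taint that virtual-kubelet nodes typically carry. This is a sketch: the node name `runpod` and the image are assumptions, and the taint key may differ depending on how the provider registers the node.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-train
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: runpod      # assumed name of the virtual node
      tolerations:
        - key: virtual-kubelet.io/provider  # common taint on virtual-kubelet nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: train
          image: your-registry/train:latest # placeholder image
```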

Current Status

The source code and Helm chart are available here: GitHub. It’s source-available under a non-commercial license for now; I’d love to turn this into something sustainable.

I’m not affiliated with RunPod. I shared the project with RunPod, and their Head of Engineering reached out to discuss potential collaboration. We had an initial meeting, and there was interest in continuing the conversation. They asked to schedule a follow-up, but my follow-ups went unanswered. These things happen; people get busy or priorities shift. Regardless, I’m glad the project sparked interest, and I’m open to revisiting it in the future.

Everyone is currently racing to add AI to their applications. The Model Context Protocol (MCP) solves the problem of writing an integration for each service by defining a protocol that can be plugged into any system. It is gaining serious traction, but it’s still the Wild West.

The problem? The organization behind MCP (Anthropic) doesn't even implement its own specification properly. The specification itself is also a vibe-coded mess. Writing good protocol specs can take many years — consider hardware protocols like USB or software protocols like the medical standard FHIR — but Anthropic is speed-running this.

Why Everyone's Confused About MCP

The documentation reads like someone wrote the implementation first and then let an LLM write the documentation. Many people seem to be confused by the introduction, which presents the terminology of host, client, and server. The terminology is reasonable, but it could probably be communicated better. Here is what is actually happening: your application bundles various clients that connect to external servers. Think of it like your app connecting to multiple SQL databases. You have multiple database clients, but from the user's perspective, it's just your app talking to databases. When used for information retrieval, it is not even far off from a database.

So this is not a novel concept. The novel part is merely presenting that aspect as something special.

Anthropic Doesn't Follow Their Own Rules

Anthropic is pushing this protocol, yet they do not fully implement the spec in either Claude Desktop or Claude Code.

HTTP Transport? Not supported.

Transport via HTTP is not supported, despite being part of the core specification. It is surprising that the organization that built this is not supporting it properly. With billions pouring into AI, I would expect there to be enough money to hire a technical writer.

The transport design has been criticized before by Rasmus Holm in this great post.

For HTTP support in Claude Desktop there is a workaround: use other tools to act as a proxy.

e.g. in claude_desktop_config.json you configure

"mcpServers": {
  "your tool": {
    "command": "npx",
    "args": [
      "mcp-remote",
      "http://yourmcp.local/mcp",
      "--allow-http"
    ]
  }
}

On a side note about this config file: there is no documentation for claude_desktop_config.json. It is application-specific configuration and should live in the Claude documentation, not in the MCP spec. However, there is some random documentation scattered around, like here, where it tells you how to add env variables.

MCP Resources are not implemented

I wanted to retrieve simple values into the chat. According to the documentation this can be done with MCP resources. They work like a REST interface where you can pass identifiers in the URI. Claude doesn't support resources either. Everything gets forced through tool implementations instead, defeating the purpose of having resources as a separate concept.

MCP for Robotics and IoT: Early But Promising

[Image: MCP sketch]

There are no official completed SDKs for C++ or Rust, languages that are commonly used for embedded development. Without an SDK, the spec is too complex to quickly build yourself per project. What you can do instead, as of now, is run a bridge MCP server on your computer that then talks to the device over REST.

The ESP32 is a popular microcontroller with Bluetooth and WiFi support and GPIO pins. Since the async framework tokio, used by the MCP Rust SDK, can run on an ESP32, you could run an MCP server directly on the device. Smart homes could indeed get smart, with a central LLM intelligence managing the home. The privacy aspect of using this in intimate spaces remains challenging, as local LLM hosts are expensive and energy-intensive, and LLM scaling benefits like KV caching and disaggregated serving cannot be exploited in low-request environments. For now, economics force a combination of cloud computing and edge processing.
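As a sketch of the bridge idea, here is a minimal, stdlib-only dispatcher that forwards a simplified `tools/call`-style request to a REST endpoint on the device. Everything here is an assumption for illustration — the `DEVICE_URL`, the `set_desk_height` tool, and the flattened request shape; a real MCP server would speak full JSON-RPC 2.0 via an SDK:

```python
import json
import urllib.request

# Hypothetical REST endpoint exposed by the microcontroller.
DEVICE_URL = "http://esp32.local/desk"

def handle_tool_call(request: dict) -> dict:
    """Dispatch a simplified 'tools/call'-style request to the device's REST API.

    This is a toy dispatcher, not a spec-compliant MCP server.
    """
    params = request.get("params", {})
    if params.get("name") == "set_desk_height":
        height = params["arguments"]["height_cm"]
        body = json.dumps({"height_cm": height}).encode()
        req = urllib.request.Request(
            DEVICE_URL, data=body,
            headers={"Content-Type": "application/json"})
        # urllib.request.urlopen(req)  # uncomment with a real device on the network
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "result": {"content": [{"type": "text",
                                        "text": f"Desk moved to {height} cm"}]}}
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": -32601, "message": "Unknown tool"}}
```

The host-side process then only needs to translate between the MCP transport and this dispatcher, keeping the embedded firmware a plain REST server.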

For a proof of concept, I built an integration to let an agentic system control my standing desk via chat messages. I am currently waiting for some additional electronics components to solder the prototype into a complete installation.

What Needs to Happen Next

MCP represents a critical standardization opportunity, but execution must match the vision:

For Anthropic: Lead by example. Fully implement your own specification in Claude, especially HTTP transport and resources. Hire a technical writer.

For the Community: The IoT opportunity is real, but it needs proper SDK support. Native embedded implementations will unlock applications we haven't even imagined yet. We need to write the software libraries.

The idea behind MCP is brilliant. The execution needs serious work.


Have you tried implementing MCP in your projects? I'd love to hear about your experience, especially if you've encountered the limitations I've described.

The AI revolution is coming home in the form of open "Source" AI models. What began with hackers experimenting with open-source AI models is evolving into a more mature ecosystem of software and hardware solutions that make local AI deployment increasingly accessible. This growing interest led me to explore practical, cost-effective solutions for building a local AI workstation. It is still early days, as most online resources focus on SaaS AI solutions. In the following I will talk about NPUs, since AI workloads are not graphics tasks but matrix multiplications, which are accelerated by tensor cores (TPUs) or neural processing units (NPUs).

Apple Silicon has been my spot of interest for a while, as it offers low-power, affordable devices with a lot of unified memory (memory that can be used by both CPU and NPU), not seen before with traditional NPUs. DeepSeek's model releases have also challenged the narrative that you need cloud compute. Let's take a look at what we can get on a consumer budget. The entry point for capable consumer AI hardware still sits in the four-digit range, placing the following solutions firmly in the mid-range category, but the performance-to-price ratio continues to improve.

NVIDIA's recent announcement of their DIGITS system has reignited my interest in comparing performance across platforms. However, making meaningful comparisons in the AI hardware space presents unique challenges. Traditional benchmarks like Passmark scores or gaming FPS prove inadequate for AI workloads, while manufacturers often withhold or selectively publish detailed performance metrics across different data types. Recent work showed that the data type FP8 or INT8 does not matter much, so this makes it easier to use raw TOPS or FLOPS interchangebly to compare expected performance. Fortunately, emerging benchmarking tools like Geekbench's AI test suite are beginning to fill the gap of comparable numbers, offering more relevant metrics for AI applications. The most meaningful comparison would be for token/s for a model that runs on all devices. A list for Macs can be found here. We will use TOPS (Trillion Operations Per Second) in this article as rough estimate. When talking about AI in 2025, however, mostly LLMs are meant. LLMs are mostly memory bound.

Here is my selection of computers and NPUs that I consider relevant. When I talk about an NPU, I mean a neural processing unit, either as dedicated hardware or other hardware that can be used to accelerate neural networks. We also take a look at the best-rated LLM we can run on each device. This depends on memory: more memory allows more model weights at higher precision. To run a best-in-class SOTA model like ChatGPT-4o you need a 671B-parameter model; with 4-bit quantization this needs around 336 GB of NPU memory.
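As a sanity check, the memory figure follows directly from parameter count times bytes per weight. This ignores the KV cache and activations, so treat it as a lower bound:

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough memory needed for the weights alone (no KV cache or activations)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

model_memory_gb(671, 4)  # ~335.5 GB for a 4-bit 671B model
model_memory_gb(3, 8)    # ~3 GB for an 8-bit 3B model
```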

Mac Mini M4 Pro

Source

max specs: Apple M4 Pro Chip with 14‑core CPU, 20‑core GPU, 16‑core Neural Engine

  • Price for 1TB build: 2800 €, $ 2199
  • GPU (FP32): 8.1 T(FL)OPS
  • NPU (INT8): 38 TOPS
  • NPU Memory: 64 GB
  • best possible LLM model: 100B, quantized

My setup: Mac Mini M2

Not a good choice if you are looking to buy something, but I included it here for completeness and for myself. 8-core CPU, 10-core GPU, 16-core Neural Engine

  • Current Price: 650€, $ 700
  • GPU (FP32): 3.41 T(FL)OPS
  • NPU: 15.8 TOPS (numbers found online, chip config not specified)
  • NPU memory: 16 GB
  • best possible LLM model: Llama 3B 8bit quant.

Mac Studio

The upgrade from M2 to M3 is here: expensive, and slower than the M4 Pro, but its large unified memory and reasonable memory bandwidth allow big models.

M3 Ultra: 32-core CPU, 80-core GPU, 32-core Neural Engine

  • Price: 11.874 € or $9,499
  • NPU (INT8): >31.6 TOPS (found online for previous M2)
  • based on benchmarks provided here and the apple list above I would estimate the M3 to be at 54 TOPS maximum.
  • NPU Memory (unified memory): 512 GB, 819.2 GB/s memory bandwidth
  • best LLM model: 671 B (deepseek, quantized)

Comparing this with the Geekbench scores, this seems about accurate.

When I derive it myself using a clock speed of 1 GHz and 8 ops per cycle per core, I get only 0.256 TOPS (TOPS = (ops per cycle) × (cores) × (frequency) / 10^12 = 8 × 32 × 10^9 / 10^12 = 0.256), far below the published figures — so either each core executes far more operations per cycle than assumed, the clock speed differs from online sources, or my calculation has a mistake. To make it more complicated, you would get different numbers depending on whether calculations are done on the CPU or GPU.
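Running the formula numerically makes the gap explicit. Note that the 1,584 MACs-per-cycle figure below is purely an illustrative assumption chosen to reproduce Apple's published 38 TOPS for a 16-core NPU at 1.5 GHz, not a documented spec:

```python
def tops(ops_per_cycle: int, cores: int, freq_hz: float) -> float:
    """Peak throughput in TOPS = total operations per second / 1e12."""
    return ops_per_cycle * cores * freq_hz / 1e12

tops(8, 32, 1e9)       # 0.256 TOPS with the naive inputs from the text
tops(1584, 16, 1.5e9)  # ~38 TOPS: matching published numbers requires
                       # thousands of MACs per core per cycle
```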

Custom Nvidia RTX PC

  • Price: ~500 € for the card new; it offers a good performance/price deal for some ML training and gaming.

I own a build with an RTX 3060, bought used for 240 € on eBay in 2023. The market price has risen since then; new, the card goes for about 500 €. In addition you need the rest of the PC, so a whole build will also end up in the four-digit range.

  • GPU (FP32): 12.74 T(FL)OPS
  • NPU Memory: 12 GB
  • best possible LLM model: Llama-2 7B 4bit quantization

RTX 4090

  • Price: $ 2300
  • NPU: ~125 TOPS

The Geekbench comparison shows that the enormous number of 1.3 PetaOPS (1321 TOPS) is only true for a single precision mode. On transformer workloads at FP32 and FP16 it is comparable to the Mac Studio M2 with its 31 TOPS. This device gets you high speed for little money, but small models only. The stated 125 TOPS for the NPU seems questionable when compared with the performance in this benchmark's task. The limiting factor cannot be explained by memory bandwidth, as the card has a 1 TB/s transfer rate enabled by GDDR6X, compared to Apple's chip with 800 GB/s of LPDDR5. Indeed, you can get 44 tokens/s output on a 3090 running gemma2:27b (Source). Software is an important factor in performance, which could explain the discrepancies here.

  • NPU memory: 24 GB
  • best model: 27B

Tinybox

A reader informed me about this computer. It bundles six GPUs, such as the AMD Radeon 7900 XTX or RTX 4090, in one machine, so the total price lands more in the high-end range.

  • Price: $ 15,000
  • NPU (FP16): 738 TFLOPS
  • disk: 4 TB
  • NPU memory: 144GB
  • best possible LLM model: 200B

Nvidia DGX Spark (DIGITS)

NVIDIA Project Digits

Source

The performance is only given for FP4 using "sparsity". The system is aimed mostly at inference, so the numbers for training performance will be lower. It has a low memory bandwidth of 273 GB/s. You can connect two units with ConnectX. It will be available starting March this year.

  • Price: 3.689€
  • NPU (FP4): 1 PetaFlop (1000 TOPS)
  • disk: 4 TB
  • NPU (unified) memory: 128 GB
  • best possible LLM model: 200B

Summary

[Screenshot: summary comparison table of the devices above]

The NVIDIA DIGITS system stands out with its impressive inference performance and substantial memory capacity. While its price point slightly exceeds the Mac Mini M4 Pro, it compensates with significantly more memory and faster inference speeds. Note that the RTX 4090 has a big difference in FP32 performance over quantized performance. It's important to approach these performance metrics cautiously – the published TOPS numbers represent peak performance under ideal conditions rather than sustained computational workloads, and manufacturers often don't provide detailed testing methodologies for their benchmarks.

Further reading: HN-Comment thread

Drowning in events? Here's how to build a scalable rule engine that handles 500,000+ events daily without breaking a sweat.

Why fast reaction is essential

In modern business environments, processing and reacting to events in real time has become crucial. Whether you're dealing with IoT sensor data, user interactions, or business transactions, the ability to automatically apply business logic to incoming events at scale is essential. Each business has unique data sources and its own business logic, and this solution is flexible enough to deal with them all.

What Exactly is a Rule Engine?

A rule engine is a system that evaluates conditions against incoming events and executes predefined actions when those conditions are met. Think of it as an enterprise-scale version of email filtering rules, but with far more sophisticated capabilities and scalability requirements.

For example, when a temperature sensor reports a value above 30°C, the rule engine automatically:

  1. Sends an alert to the facility manager
  2. Activates additional cooling systems
  3. Logs the incident for compliance reporting

All of this happens automatically without human intervention.

Types of Inference

Rule engines typically use one of two main inference strategies. In the example above we encountered the more common type.

Forward Inference (Forward Chaining) starts with incoming data/events and applies rules to reach conclusions. This is like a production line: events come in, rules process them, and actions are triggered. For example, when a temperature sensor reports high values, the system triggers cooling system rules.

Backward Inference (Backward Chaining) works in reverse - it starts with a desired goal and works backward to determine what conditions would prove it, like a detective solving a case. While powerful for diagnostic systems and AI applications, backward chaining requires complex state management and isn't typically needed for event processing systems.

This guide focuses on forward inference, as it's sufficient for most event-driven use cases and provides better performance for real-time processing.

Key Components:

architecture

  • Event producers: Systems that generate events (sensors, applications, user interactions)
  • Event consumers: Ingestion services that put incoming events into a queue
  • Event workers: Services that process events and apply rules
  • Rule definitions: Conditions and actions specified in a structured format
  • Processing infrastructure: Distributed system for handling events at scale, with priority-based job processing (e.g., Huey)

Architecture Design

Infrastructure Overview

The recommended architecture leverages managed Kubernetes (K8s) for scalability and operational efficiency. Here are the main components:

Event Ingestion

  • Load-balanced pods accept events via webhooks (REST API) or WebSocket connections
  • Events are buffered in Redis/RedPanda to handle high-volume scenarios
  • Horizontal pod autoscaling manages processing capacity

Rule Processing

  • Distributed processing nodes evaluate rules against events
  • Redis/RedPanda acts as a distributed cache and message queue
  • Kubernetes manages pod lifecycle and scaling

Storage and Logging

  • Persistent storage for rule definitions
  • Audit logs for compliance and troubleshooting
  • Performance metrics and execution history

Frontend Management Interface

The system requires a management interface for defining and managing rules. While Django is suggested for its rapid development capabilities, any full-stack framework can work. Key considerations include:

  • Integration with Identity and Access Management (IAM)
  • Rule creation and editing interface
  • Monitoring and troubleshooting dashboards
  • Audit log viewer

Building the Rule Definition System

Domain Specific Language (DSL)

rule

To illustrate the core concepts, let's start with a simplified rule format. While real-world implementations often require more complex structures, this basic format already takes us far:

  1. Conditions: A basic condition can be expressed as:

    [condition]
    variable = "x"
    operator = ">"
    value = 2

    This is enough to evaluate a value in an incoming event with an operator. So here we check if "x > 2".

  2. Actions: A simple action definition might be defined as:

    [action]
    type = "sendPN"
    sendTo = "${event.user}"
    message = "x value reached ${event.x}"

    Note the use of basic templating for dynamic values. When executed, this action sends a push notification to the user specified in the event, with a message referencing a value from the event.

We can combine them in a rule. Combined, this means: whenever the value x is above 2, send the user a push notification.
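To make the semantics concrete, here is a minimal sketch of how such a condition/action pair could be evaluated. This is illustrative, not the engine's actual code, and push delivery is stubbed out as a callback since it is deployment-specific:

```python
import operator
import re

OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq,
             ">=": operator.ge, "<=": operator.le}

def check_condition(condition: dict, event: dict) -> bool:
    """Evaluate a condition like {'variable': 'x', 'operator': '>', 'value': 2}."""
    op = OPERATORS[condition["operator"]]
    return op(event[condition["variable"]], condition["value"])

def render_template(text: str, event: dict) -> str:
    """Expand ${event.field} placeholders with values from the event."""
    return re.sub(r"\$\{event\.(\w+)\}", lambda m: str(event[m.group(1)]), text)

def run_rule(rule: dict, event: dict, send_pn) -> bool:
    """Run one rule against one event; returns True if the action fired."""
    if not check_condition(rule["condition"], event):
        return False
    action = rule["action"]
    if action["type"] == "sendPN":
        send_pn(render_template(action["sendTo"], event),
                render_template(action["message"], event))
    return True
```

With the TOML rule from above parsed into dicts, an event `{"x": 5, "user": "alice"}` fires the action, while `{"x": 1, ...}` does not.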

Keeping track on business logic: Change Management

While the rule engine itself uses a database to store and evaluate rules, I strongly recommend managing changes through a Version Control System (VCS):

  • Store configurations as TOML files in Git (better readability than JSON, fewer special characters to escape, and simpler syntax for the shallow hierarchies typical in rule definitions)
  • Use branches to represent different environments (dev, staging, prod)
  • Manage changes through pull requests between these branches
  • Leverage automated testing and deployment pipelines
  • Start with a text-based UI, then build domain-specific interfaces for business teams
  • Use a bot user to automatically commit UI-based changes to Git

This approach aligns with modern GitOps practices - when changes are merged into an environment branch, automated pipelines deploy the updated configurations to that environment's database. Business users can modify rules through the UI while the system maintains proper version control behind the scenes.

Advanced Rule Capabilities

Single-condition rules quickly hit their limits in real applications. Business logic often requires combinations of conditions or time-based triggers. For example, you might need to alert when sensor readings exceed a threshold AND maintenance is due, or when multiple related events occur within a time window.

  1. Composite Conditions
    • Combine multiple conditions using logical operators
    • Support for nested conditions: any(all(A, B), C)
    • Complex event pattern matching
  2. Periodic Rules
    • Cron-based scheduling
    • Kubernetes CronJobs for execution
    • Time-based condition evaluation
  3. Delayed Conditions
    • Evaluate conditions with a time delay (e.g., check if value remains high after 5 minutes)
    • Pull current state at the time of delayed evaluation
    • Useful for confirming persistent conditions rather than reacting to momentary changes
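The nested any(all(A, B), C) form from item 1 can be evaluated with a small recursive function. This is an illustrative sketch of the idea, assuming conditions are parsed into nested dicts:

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

def eval_composite(node: dict, event: dict) -> bool:
    """Recursively evaluate a nested condition tree.

    Leaves look like {"variable": "x", "operator": ">", "value": 2};
    inner nodes are {"any": [...]} or {"all": [...]}.
    """
    if "any" in node:
        return any(eval_composite(child, event) for child in node["any"])
    if "all" in node:
        return all(eval_composite(child, event) for child in node["all"])
    return OPS[node["operator"]](event[node["variable"]], node["value"])

# any(all(A, B), C) maps to:
rule = {"any": [
    {"all": [{"variable": "x", "operator": ">", "value": 2},
             {"variable": "y", "operator": "==", "value": 1}]},
    {"variable": "z", "operator": "<", "value": 0},
]}
```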

Real-time performance monitoring

While tools like Grafana are popular for monitoring, they work best with pre-computed metrics. For rule engine performance analysis, implement a custom endpoint that gives you a graph of processing duration over time for each event. This can be implemented by putting every incoming object into Redis with its timestamp in a sorted set (ZSET), and doing the same when the computation completes. This is very useful for troubleshooting the rule engine as well as the producers.

The Secret Sauce: Three Advanced Techniques

The base rule engine can be extended with several advanced features that help handle real-world complexities.

Handling Downtime

Even with a robust system, you need to account for potential downtime. Implement a periodic reconciliation process [1] that:

  1. Queries event sources for a specific time range
  2. Compares received event IDs against processed event records
  3. Processes any missed events

The event source must provide an API to query events by time range. To optimize this process, maintain identifiers of processed events to quickly detect gaps. Add some time offset so the queried ranges overlap, since some events may still sit in a buffer.
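The steps above can be sketched as a small reconciliation function. `query_source(start, end)` here is a stand-in for the event source's query-by-time-range API, and the window sizes are arbitrary illustrative defaults:

```python
from datetime import datetime, timedelta

def reconcile(query_source, processed_ids, process,
              window=timedelta(hours=1), overlap=timedelta(minutes=5),
              now=None):
    """Re-query the source for a recent window (plus overlap for buffered
    events) and process anything whose ID was never recorded."""
    now = now if now is not None else datetime.utcnow()
    start = now - window - overlap
    # Compare received event IDs against processed records to find gaps.
    missed = [e for e in query_source(start, now)
              if e["id"] not in processed_ids]
    for event in missed:
        process(event)
        processed_ids.add(event["id"])
    return missed
```

Run it periodically (e.g., as a Kubernetes CronJob) so downtime gaps are closed automatically.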

[1]: The term reconciliation service is not something I have encountered often in IT; it is often also called synchronization.

Processing Lanes

Rather than processing all events through the same path, consider implementing hardcoded processing lanes that handle events differently. These lanes provide optimization opportunities and help manage different types of events efficiently. Common examples include:

  • Fast lane: skips logging for high-throughput, low-risk events
  • Void lane: certain events can be safely ignored
  • Syncing lane: propagates changes to another system

Distributed Debouncing

When dealing with event floods, particularly during batch synchronizations, you need a way to detect this behavior across a distributed system. For example, during a data sync, you might transition from "no data syncing" to "data is being synced" states. The challenge is detecting these state changes without triggering on single events.

I introduced a novel solution to this problem that draws inspiration from spiking neural networks. Using Redis as a distributed state store:

  1. Calculate an "activation level" for each incoming event based on timing
  2. Store these activation timestamps in a Redis sorted set (ZSET) to maintain ordering and avoid race conditions
  3. When the activation, calculated by integrating the action potentials minus the elapsed time, reaches a threshold, the "neuron fires", triggering the debouncer event
  4. For subsequent incoming events, check whether enough time has passed without events to switch back to the original state, then clear the entries from the set

This approach effectively helps manage opening and closing "flood gates" in a distributed environment while maintaining system consistency.
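Here is a simplified, in-memory sketch of the idea. It approximates the activation integral by counting spikes inside a decay window, and a plain Python list stands in for the Redis ZSET — in production you would use ZADD/ZREMRANGEBYSCORE so multiple workers share state; the threshold and window values are arbitrary:

```python
import time

class SpikingDebouncer:
    """Toy spiking-neuron debouncer: fires when enough events arrive
    within a decay window, closes again after a quiet period."""

    def __init__(self, threshold=5, decay_window=10.0):
        self.threshold = threshold        # spikes needed to "fire"
        self.decay_window = decay_window  # seconds a spike contributes
        self.spikes = []                  # timestamps (ZSET stand-in)
        self.open = False                 # True while the flood gate is open

    def record_event(self, now=None):
        now = now if now is not None else time.time()
        # Drop spikes older than the decay window (ZREMRANGEBYSCORE in Redis).
        self.spikes = [t for t in self.spikes if now - t <= self.decay_window]
        self.spikes.append(now)
        # Activation = recent spike count; fire when it crosses the threshold.
        if not self.open and len(self.spikes) >= self.threshold:
            self.open = True   # flood detected: e.g. "data is being synced"
        return self.open

    def maybe_close(self, now=None):
        now = now if now is not None else time.time()
        self.spikes = [t for t in self.spikes if now - t <= self.decay_window]
        if self.open and not self.spikes:
            self.open = False  # quiet period passed: back to normal state
        return self.open
```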

Medical Applications with FHIR

Rule engines are particularly valuable in healthcare applications using the standard Fast Healthcare Interoperability Resources (FHIR).

The rule engine can integrate with FHIR systems through the native Subscription model. The Subscription framework already specifies a basic rule mechanism.

Note that this is FHIR-server-implementation dependent. HAPI FHIR works with the REST API; the Google implementation is less advanced and its implementation has been delayed. To save network overhead, you want each event to include the payload in the body, not only a change notification that you then have to fetch from the server. The payload is the actual resource, e.g. an Observation.

GPU servers are expensive, and a cheaper option is to build your own GPU machine. If you prefer working on a different machine, or want to offer the GPU capabilities as a server, you can deploy it with Docker or include it in a cluster with k3s or Kubernetes. I encountered some difficulties in the process of setting up my GPU cluster, and some advice found on the internet and from LLMs is outdated, so here is a guide to what worked for me.


Running a local GPU server with docker

First you need the NVIDIA drivers installed. On Ubuntu you can get the available ones with sudo ubuntu-drivers install --gpgpu. I had to install them with dpkg, not via ubuntu-drivers install, to make them fully work. I also had to load them manually (sudo modprobe nvidia), or you can reboot. Confirm by running nvidia-smi and checking that it gives you some output. To run computations on the GPU from source code, you likely need CUDA (strictly speaking there are other options). Install the latest CUDA toolkit from here on each GPU node. This is not required if you only use containers.

Docker needs a special back-end, the NVIDIA Container Toolkit (sudo apt-get install nvidia-container-toolkit); it should also install nvidia-docker for you. When running containers you either need to run nvidia-docker or docker run --gpus all. There is no latest tag for the NVIDIA images, so you can test if it works by running a particular version, e.g.

nvidia-docker run --rm nvidia/cuda:12.5.0-devel-ubuntu22.04 nvidia-smi

Next, you can expose Docker on your local network. This way, you can use Docker as if it were installed locally, but it builds and runs on your server. To use it on your client, set the env var, e.g. export DOCKER_HOST=tcp://hostname:2375, using the default socket port. Note that this runs the default Docker runtime and not nvidia-docker. To change that, modify the default runtime in a config on the server: in /etc/docker/daemon.json add

"default-runtime": "nvidia",
"runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }

and restart Docker with sudo systemctl restart docker. Just setting the default runtime was not enough in a later installation, where I also had to run sudo nvidia-ctk runtime configure --runtime=docker to get the above output.

Note that CUDA is not available in the runtime during the build, which should not be a problem. In my case, I could simply ignore a warning that CUDA could not be found.

Using a home GPU server in a k3s cluster

To install an agent node in your k3s cluster, you need the token from the server, found in /var/lib/rancher/k3s/server/node-token on the host server, and install it with curl -sfL https://get.k3s.io | K3S_URL=https://portraittogo.com:6443 K3S_TOKEN=mynodetoken sh -s - on the agent node.

According to the documentation, the k3s config resides in /etc/systemd/system/k3s.service on the server; for the agent it is /etc/systemd/system/k3s-agent.service instead.

To enable GPU scheduling in Kubernetes you also need the NVIDIA device plugin, which sets the available GPU resources in the node capacity. I found that I had installed multiple conflicting helm charts in parallel; after uninstalling them, this guide to install the GPU operator with a simple helm install worked. The operator includes the device plugin, which I previously had installed standalone via kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml.

For my installation, the NVIDIA container toolkit did not detect the OCI runtime used by k3s by default, so I switched the k3s runtime to Docker by adding the --docker flag to the config. You can test the full setup with

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
EOF

How are pods placed correctly now? I would advise against tainting the server node because it prevents a lot of helm installs. You could taint the agent node instead, so that normal deployments don't end up on the GPU server. You can do this manually or automatically via --node-taint=key=value:NoSchedule during registration; for security reasons this is not allowed later, after the start (see docs). Because of the problems with taints, you might consider using node affinities instead. Tainting caused me a lot of frustration: on a cluster upgrade it did not upgrade the GPU node, and then some flannel-related networking failed, effectively breaking the whole node.
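If you go the affinity route, a pod can be pinned to the GPU node with a node-affinity term instead. This is a sketch; the hostname label value is an assumption about your node name, and if the node is also tainted you would still need a matching toleration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - ubuntufortress   # assumed GPU node hostname
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
```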

Tailscale for the VPN

The k3s documentation says Tailscale support is experimental; however, I managed to make it work. I later figured out that since switching my ISP from Telekom to M-net, my home network is served via DS-Lite. That means I only have an IPv6 address, and the IPv4 address I get is the exit node of a tunnel shared with multiple customers via CGNAT, so it cannot be used to reach me directly. Furthermore, I later noticed that my IPv6 DNS server was misconfigured. The WireGuard setup might indeed work, but from experience with IPv6 I learned that it is sometimes not supported by applications. Tailscale gives you the pain-free solution.

You need to provide an auth key via the tailscale settings in the k3s launch settings. The local tailscale installation will be logged out, as k3s manages it.

We can put the config in /etc/rancher/k3s/config.yaml (it was not there by default). Another option is to add the settings as launch arguments in /etc/systemd/system/k3s.service.

Using Wireguard

You can build the VPN directly with WireGuard. As explained above, I could not make it work with my ISP, but my notes might be handy.

By default k3s uses VXLAN, but this is not encrypted and hence not suitable for traffic over the internet. When running k3s, change this on the server: --node-external-ip=<SERVER_EXTERNAL_IP> --flannel-backend=wireguard-native --flannel-external-ip. Docs for the parameters

Most likely, you need to add port forwarding to your NAT if your server is not reachable from the public internet. Also, disable the firewall on the node. The problem with that setup is that the IP address is probably dynamic, so you must update it every time you boot; my address also changes every day. This script updates it daily:

#!/bin/bash
get_external_ip() {
    curl -s https://api.ipify.org
}
EXTERNAL_IP=$(get_external_ip)
# Replace only the IP argument so the rest of the ExecStart line is preserved
sudo sed -i "s/--node-external-ip=[^ ]*/--node-external-ip=$EXTERNAL_IP/" /etc/systemd/system/k3s-agent.service
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent
echo "Updated k3s agent with new external IP: $EXTERNAL_IP"

To test whether the agent's pods can reach the internet and resolve DNS, run a pod with this config (adapt the hostname first).

apiVersion: v1
kind: Pod
metadata:
  name: internet-check
spec:
  nodeSelector:
    kubernetes.io/hostname: ubuntufortress
  containers:
  - name: internet-check
    image: busybox
    command: 
      - "/bin/sh"
      - "-c"
      - "ping -c 4 google.com && wget -q --spider http://google.com"
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  restartPolicy: Never

You might need to install WireGuard on both systems: sudo apt install wireguard.