I recently got a 3D printer. It is fascinating that you can just imagine things and turn them into real physical objects. I had always dreamed of this, and a recent cleanup of my small big-city flat brought back the idea of getting a new tool.

The fumes these printers produce are rather toxic, so I looked for a model with an air filtration system. I got a used 5M Pro model for around 250€. In addition to the printer you need various filaments. For ecological reasons I mostly pick PLA, which is made from plant-based raw materials.

Which tool to use in the age of AI?

I looked into zoo.dev. I was once convinced that this is the future of manufacturing with AI. They are still rather early in their product and have run into performance issues. From a user's perspective, the problem with their approach is that it runs on their servers and the programming language is closed source. So even though I have a powerful computer, I am limited by the performance of their servers. Their AI has a lot of potential, but at the moment it is very slow and takes hours to turn a prompt into an object.

I also tried OpenSCAD. Claude is familiar with it, but it could not resolve some issues I had with the model. On the upside, OpenSCAD is faster and runs locally.

I also tried using a Python script to modify the export that I got out of Zoo Design Studio. In the end I let Claude write a Python script that writes the coordinates into an .obj file directly.
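Writing an .obj file directly is feasible because the format is plain text: `v x y z` lines for vertices and `f i j k ...` lines for faces, with 1-based vertex indices. A minimal sketch of such a script for a simple box follows. The function names and dimensions are illustrative assumptions, not the script Claude generated for me.

```python
from pathlib import Path


def box_obj(width: float, depth: float, height: float) -> str:
    """Return Wavefront OBJ text for a box with one corner at the origin."""
    # Eight corners; the list position is 4*x + 2*y + z for corner bits x, y, z.
    vertices = [
        (x * width, y * depth, z * height)
        for x in (0, 1) for y in (0, 1) for z in (0, 1)
    ]
    # Six quad faces, counter-clockwise when seen from outside (1-based indices).
    faces = [
        (1, 2, 4, 3),  # x = 0
        (5, 7, 8, 6),  # x = width
        (1, 5, 6, 2),  # y = 0
        (3, 4, 8, 7),  # y = depth
        (1, 3, 7, 5),  # z = 0
        (2, 6, 8, 4),  # z = height
    ]
    lines = ["# box written by a script"]
    lines += [f"v {x:.3f} {y:.3f} {z:.3f}" for x, y, z in vertices]
    lines += ["f " + " ".join(str(i) for i in face) for face in faces]
    return "\n".join(lines) + "\n"


def write_box(path: str, width: float, depth: float, height: float) -> None:
    """Write the box to disk so a slicer can import it."""
    Path(path).write_text(box_obj(width, depth, height))
```

Calling `write_box("box.obj", 20.0, 10.0, 5.0)` would produce a file with 8 vertex lines and 6 face lines.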

TL;DR: I built a virtual kubelet that lets Kubernetes offload GPU jobs to RunPod.io. It is useful for burst-scaling ML workloads without needing full-time cloud GPUs.

This project came out of a need while working on an internal ML-based SaaS (which didn’t pan out). Initially, we used the RunPod API directly in the application, as RunPod had the most affordable GPU pricing at the time. But I also had a GPU server at home and wanted to run experiments even cheaper. Since I had good experiences with Kubernetes jobs (for CPU workloads), I installed k3s and made the home GPU node part of the cluster.

The idea was simple: use the local GPU when possible, and burst to RunPod when needed. The app logic would stay clean. Kubernetes would handle the infrastructure decisions. Ideally, the same infra would scale from dev experiments to production workloads.

What Didn't Work

My first attempt was a custom controller written in Go, monitoring jobs and scheduling them on RunPod. I avoided CRDs to stay compatible with the native Job API. Go was the natural choice given its strong Kubernetes ecosystem.

The problem was that overwriting pod values and creating virtual pods fought the Kubernetes scheduler constantly. Reconciling state with RunPod and handling failed jobs led to problems like loops. I also considered queuing stalled jobs and triggering scale-out logic, but that increased the complexity further and it became a mess. I wrote thousands of lines of Go and never got it stable.

What worked

The proper way to do this is with the virtual kubelet. I used the CNCF sandbox project virtual-kubelet, which registers as a node in the cluster. Then the normal scheduler can use taints, tolerations, and node selectors to place pods. When a pod is placed on the virtual node, the controller provisions it using a third-party API, in this case, RunPod's.
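From the application side, steering a Job onto the virtual node then only takes a node selector and a toleration. The sketch below builds such a Job manifest as a plain dictionary. The node label and the taint key, value, and effect are assumptions for illustration; the real values depend on how the virtual node registers itself and can be checked with `kubectl describe node`.

```python
import json

# Assumed label and taint of the virtual node (hypothetical values).
VIRTUAL_NODE_LABEL = {"type": "virtual-kubelet"}
VIRTUAL_NODE_TAINT = {
    "key": "virtual-kubelet.io/provider",
    "value": "runpod",
    "effect": "NoSchedule",
}


def gpu_job(name: str, image: str, burst: bool) -> dict:
    """Build a batch/v1 Job; with burst=True it may only land on the virtual node."""
    pod_spec = {
        "restartPolicy": "Never",
        "containers": [{"name": name, "image": image}],
    }
    if burst:
        # The selector pins the pod to the virtual node,
        # the toleration allows it to be scheduled despite the taint.
        pod_spec["nodeSelector"] = dict(VIRTUAL_NODE_LABEL)
        pod_spec["tolerations"] = [{
            "key": VIRTUAL_NODE_TAINT["key"],
            "operator": "Equal",
            "value": VIRTUAL_NODE_TAINT["value"],
            "effect": VIRTUAL_NODE_TAINT["effect"],
        }]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {"backoffLimit": 0, "template": {"spec": pod_spec}},
    }


def to_manifest(job: dict) -> str:
    """Serialize to JSON, which kubectl accepts just like YAML."""
    return json.dumps(job, indent=2)
```

The application code stays the same for local and remote runs; only the `burst` flag (or a default set by an admission policy) changes where the pod ends up.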

Current Status

The source code and Helm chart are available here: GitHub. It's source-available under a non-commercial license for now. I'd love to turn this into something sustainable.

I'm not affiliated with RunPod. I shared the project with RunPod, and their Head of Engineering reached out to discuss potential collaboration. We had an initial meeting, and there was interest in continuing the conversation. They asked to schedule a follow-up, but I didn't hear back after my follow-up messages. These things happen; people get busy or priorities shift. Regardless, I'm glad the project sparked interest and I'm open to revisiting it in the future.

Everyone is currently racing to add AI to their applications. The Model Context Protocol (MCP) solves the problem of writing an integration for each service by using a protocol that can be plugged into any system. It is gaining serious traction, but it is still the wild west.

The problem? The organization behind MCP (Anthropic) doesn't even implement its own specification properly. Also, the specification is a vibe-coded mess. Writing good protocol specs can take many years, as with hardware protocols like USB or software protocols like the medical standard FHIR, but Anthropic is speed-running this.

Why Everyone's Confused About MCP

The documentation reads like someone wrote the implementation in code first and then let an LLM write the documentation. First, many people seem to be confused by the introduction, which introduces the terminology of host, client and server. The terminology is reasonable, but it could probably be communicated better. Here is what is actually happening: your application (the host) bundles various clients that connect to external servers. Think of it like your app connecting to multiple SQL databases. You have multiple database clients, but from the user's perspective, it's just your app talking to databases. When used for information retrieval, it is not even far off from a database.

So this is not a novel concept. The novel part is merely presenting that aspect as something special.

Anthropic Doesn't Follow Their Own Rules

Anthropic is pushing this protocol; however, they do not fully implement the spec in either Claude Desktop or Claude Code.

HTTP Transport? Not supported.

Transport via HTTP is not supported, despite being part of the core specification. It is surprising that the organization that built this does not really support it properly itself. With billions pouring into AI, I would expect there to be enough money to hire a technical writer.

The transport design has been criticized before by Rasmus Holm in this great post.

For HTTP support in Claude Desktop there is a workaround: use another tool as a proxy.

For example, in claude_desktop_config.json you configure:

{
  "mcpServers": {
    "your tool": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "http://yourmcp.local/mcp",
        "--allow-http"
      ]
    }
  }
}

On a side note about this config file: there is no central documentation for claude_desktop_config.json. After all, it is application-specific configuration, so it should not be part of the MCP spec but of the Claude documentation. However, there is some documentation scattered around, like here, where it tells you how to add env variables.

MCP Resources are not implemented

I wanted to retrieve simple values into the chat. According to the documentation, this can be done with MCP resources. They work like a REST interface where you can pass identifiers in the URI. Claude doesn't support resources either. Everything gets forced through tool implementations instead, defeating the purpose of having resources as a separate concept.
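To make the REST analogy concrete, here is a small standard-library sketch of the idea: a URI template with placeholders is matched against a requested URI, and the extracted identifiers are passed to a handler. This is not the MCP SDK; the `ResourceRegistry` class, the `desk://` scheme, and the values are made up for illustration.

```python
import re


class ResourceRegistry:
    """Maps URI templates such as 'desk://{desk_id}/height' to handlers."""

    def __init__(self):
        self._routes = []

    def resource(self, template):
        # re.split with one capture group alternates literal text and names.
        parts = re.split(r"\{(\w+)\}", template)
        regex = "".join(
            re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+)"
            for i, part in enumerate(parts)
        )
        pattern = re.compile(regex)

        def decorator(func):
            self._routes.append((pattern, func))
            return func

        return decorator

    def read(self, uri):
        """Return the value of the first resource whose template matches."""
        for pattern, func in self._routes:
            match = pattern.fullmatch(uri)
            if match:
                return func(**match.groupdict())
        raise KeyError(f"no resource matches {uri!r}")


registry = ResourceRegistry()


@registry.resource("desk://{desk_id}/height")
def desk_height(desk_id):
    heights_cm = {"office": 72, "lab": 110}  # made-up values
    return f"{heights_cm[desk_id]} cm"
```

`registry.read("desk://office/height")` then returns the value for that one desk, which is the kind of simple lookup I wanted to expose without wrapping it in a tool.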

MCP for Robotics and IoT: Early But Promising

There are no official, completed SDKs for C++ or Rust, languages that are commonly used for embedded development. Without an SDK, the spec is too complex to quickly build it yourself for each project. What can be done instead for now is to run a bridge MCP server on your computer that then talks to the device via REST.

The ESP32 is a popular microcontroller with Bluetooth and Wi-Fi support and GPIO pins. Since you can run the async framework tokio, which is used by the MCP Rust SDK, on an ESP32, you could run an MCP server on an ESP32. Smart homes could indeed get smart, with a central LLM intelligence managing the home. The privacy aspect of using this in intimate spaces remains challenging, as local LLM hosts are very expensive and energy-intensive, and LLM scaling benefits like KV caching and disaggregated serving cannot be used in low-request environments. For now, economics force a combination of cloud computing and edge processing.

For a proof of concept, I built an integration to let an agentic system control my standing desk via chat messages. I am currently waiting for some additional electronics components to solder the prototype into a complete installation.

What Needs to Happen Next

MCP represents a critical standardization opportunity, but execution must match the vision:

For Anthropic: Lead by example. Fully implement your own specification in Claude, especially HTTP transport and resources. Hire a technical writer.

For the Community: The IoT opportunity is real, but it needs proper SDK support. Native embedded implementations will unlock applications we haven't even imagined yet. We need to write the software libraries.

The idea behind MCP is brilliant. The execution needs serious work.


Have you tried implementing MCP in your projects? I'd love to hear about your experience, especially if you've encountered the limitations I've described.

The AI revolution is coming home in the form of open "source" AI models. What began with hackers experimenting with open-source AI models is evolving into a more mature ecosystem of software and hardware solutions that make local AI deployment increasingly accessible. This growing interest led me to explore practical, cost-effective solutions for building a local AI workstation. It is still early days, as most online resources focus on SaaS AI solutions. In the following I will talk about NPUs, because AI tasks are not graphics tasks but matrix multiplications, for which graphics hardware has been replaced with tensor cores (TPU) or neural processing units (NPU).

Apple Silicon has been on my radar for a while, as it offers low-power, cheap devices with a lot of unified memory (memory that can be used by both CPU and NPU), something not seen before with traditional NPUs. DeepSeek's model release has also questioned the narrative that you need cloud compute. Let's take a look at what we can get on a consumer budget. The entry point for capable consumer AI hardware still sits in the four-digit range, placing the following solutions firmly in the mid-range category, but the performance-to-price ratio continues to improve.

NVIDIA's recent announcement of their DIGITS system has reignited my interest in comparing performance across platforms. However, making meaningful comparisons in the AI hardware space presents unique challenges. Traditional benchmarks like Passmark scores or gaming FPS prove inadequate for AI workloads, while manufacturers often withhold or selectively publish detailed performance metrics across different data types. Recent work showed that the choice between the data types FP8 and INT8 does not matter much, which makes it easier to use raw TOPS and FLOPS interchangeably to compare expected performance. Fortunately, emerging benchmarking tools like Geekbench's AI test suite are beginning to fill the gap of comparable numbers, offering more relevant metrics for AI applications. The most meaningful comparison would be token/s for a model that runs on all devices. A list for Macs can be found here. We will use TOPS (Trillion Operations Per Second) in this article as a rough estimate. When talking about AI in 2025, however, people mostly mean LLMs, and LLMs are mostly memory bound.
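What "memory bound" means in practice can be sketched with a rule of thumb: to generate one token, a dense model has to read all of its weights once, so memory bandwidth divided by model size gives a rough upper bound on token/s. The sketch below assumes exactly that and ignores the KV cache, compute limits, and software overhead. The 35 GB model size is a hypothetical example (roughly a 70B model at 4 bit); the bandwidth figures are the ones quoted later in this article.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound: each token requires one full read of the weights."""
    return bandwidth_gb_s / model_size_gb


# Hypothetical 35 GB model on two of the devices discussed in this article.
for device, bandwidth in [("Mac Studio M3 Ultra", 819.2), ("DGX Spark", 273.0)]:
    bound = max_tokens_per_second(bandwidth, 35.0)
    print(f"{device}: at most ~{bound:.1f} token/s")
```

Real numbers will be lower, but the ratio between devices shows why bandwidth can matter more than TOPS for LLM inference.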

Here is my selection of computers or NPUs that I consider relevant. When I talk about an NPU I mean a neural processing unit, either as dedicated hardware or as other hardware that can be used to accelerate neural networks. We also take a look at the best-rated LLM we can run on each device. This depends on the memory: more memory allows more model weights at higher precision. To run a best-in-class model on the level of ChatGPT 4o you need a 671B-parameter model; with 4-bit quantization this needs around 336 GB of NPU memory.
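The memory figure follows from simple arithmetic: number of parameters times bits per weight, divided by 8 bits per byte. A small sketch, counting only the weights (the KV cache and activations come on top and are ignored here):

```python
def weights_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"671B parameters at {bits} bit: {weights_memory_gb(671, bits):.1f} GB")
```

For 671B parameters this gives 1342 GB at 16 bit, 671 GB at 8 bit, and 335.5 GB at 4 bit, which is where the roughly 336 GB comes from.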

Mac Mini M4 Pro

Source

max specs: Apple M4 Pro Chip with 14‑core CPU, 20‑core GPU, 16‑core Neural Engine

  • Price for 1TB build: 2800 €, $ 2199
  • GPU (FP32): 8.1 T(FL)OPS
  • NPU (INT8): 38 TOPS
  • NPU Memory: 64 GB
  • best possible LLM model: 100B, quantized

My setup: Mac Mini M2

Not a good choice if you are looking to buy something, but I included it here for completeness and for my own reference. 8-core CPU, 10-core GPU, 16-core Neural Engine

  • Current Price: 650€, $ 700
  • GPU (FP32): 3.41 T(FL)OPS
  • NPU: 15.8 TOPS (numbers found online, chip config not specified)
  • NPU memory: 16 GB
  • best possible LLM model: Llama 3B 8bit quant.

Mac Studio

The upgrade from M2 to M3 is here. It is expensive and slower than the M4 Pro, but its large unified memory and reasonable memory bandwidth allow big models.

M3 Ultra: 32-core CPU, 80-core GPU, 32-core Neural Engine

  • Price: 11.874 € or $9,499
  • NPU (INT8): >31.6 TOPS (found online for previous M2)
  • based on benchmarks provided here and the Apple list above, I would estimate the M3 at 54 TOPS maximum.
  • NPU Memory (unified memory): 512 GB, 819.2 GB/s memory bandwidth
  • best LLM model: 671 B (deepseek, quantized)

Comparing this with the Geekbench scores, this seems to be about accurate.

When I derive it myself using a clock speed of 1 GHz and 8 operations per cycle per core, I get only 0.256 TOPS: TOPS = (ops per cycle) × (cores) × (frequency) / 10^12 = 8 × 32 × 1,000,000,000 / 10^12 = 0.256. Reaching the estimated 54 TOPS at 1 GHz would take roughly 1,700 operations per cycle per core, so either the NPU clock speed differs from online sources or my assumed operations per cycle are far too low. To make it more complicated, you would get different numbers depending on whether calculations are done on the CPU or GPU.

Custom Nvidia RTX PC

  • Price: This card is rather cheap and offers a good performance/price deal for some ML training and gaming.

I own a build with an RTX 3060. I bought it used for 240€ on eBay in 2023. The market price has risen since then, and new it goes for 500€. In addition you need the rest of the PC, so for a whole build you will also end up in the four-digit range.

  • GPU (FP32): 12.74 T(FL)OPS
  • NPU Memory: 12 GB
  • best possible LLM model: Llama-2 7B 4bit quantization

RTX 4090

  • Price: $ 2300
  • NPU: ~125 TOPS

The Geekbench comparison shows that the enormous number of 1.3 PetaOPS (1321 TOPS) is only true for single precision. On transformer workloads for FP32 and FP16 it is comparable to the Mac Studio M2 with its 31 TOPS. This device gets you high speed for little money, but for small models only. The figure of 125 TOPS for the NPU seems questionable when compared with the performance in this benchmark's task. The limiting factor cannot be explained by the memory bandwidth, as the card has a 1 TB/s transfer rate enabled by GDDR6X, compared to Apple's chip with 800 GB/s powered by LPDDR5. Indeed, you can get 44 token/s output on a 3090 running gemma2:27b (Source). The software is an important factor in the performance, which could explain the discrepancies here.

  • NPU memory: 24 GB
  • best model: 27B

Tinybox

A reader informed me about this computer. They bundle six GPUs like AMD Radeon 7900XTX or RTX 4090 in one computer. So total price is more in the high-end range.

  • Price: $ 15,000
  • NPU (FP16): 738 FP16 TFLOPS
  • disk: 4 TB
  • NPU memory: 144GB
  • best possible LLM model: 200B

Nvidia DGX Spark (DIGITS)

NVIDIA Project Digits

Source

The performance is only given for FP4 using "sparsity". It is mostly used for inference, so the numbers on training performance will be lower. It has a low memory bandwidth of 273 GB/s. You can connect two units with ConnectX. It will be available starting March this year.

  • Price: 3.689€
  • NPU (FP4): 1 PetaFlop (1000 TOPS)
  • disk: 4 TB
  • NPU (unified) memory: 128 GB
  • best possible LLM model: 200B

Summary

The NVIDIA DIGITS system stands out with its impressive inference performance and substantial memory capacity. While its price point slightly exceeds the Mac Mini M4 Pro, it compensates with significantly more memory and faster inference speeds. Note that the RTX 4090 has a big difference in FP32 performance over quantized performance. It's important to approach these performance metrics cautiously – the published TOPS numbers represent peak performance under ideal conditions rather than sustained computational workloads, and manufacturers often don't provide detailed testing methodologies for their benchmarks.

Further reading: HN-Comment thread

Drowning in events? Here's how to build a scalable rule engine that handles 500,000+ events daily without breaking a sweat.

Why fast reaction is essential

In modern business environments, processing and reacting to events in real-time has become crucial. Whether you're dealing with IoT sensor data, user interactions, or business transactions, the ability to automatically apply business logic to incoming events at scale is essential. Every business has its own data sources and its own business logic, and this solution is flexible enough to deal with them all.

What Exactly is a Rule Engine?

A rule engine is a system that evaluates conditions against incoming events and executes predefined actions when those conditions are met. Think of it as an enterprise-scale version of email filtering rules, but with far more sophisticated capabilities and scalability requirements.

For example, when a temperature sensor reports a value above 30°C, the rule engine automatically:

  1. Sends an alert to the facility manager
  2. Activates additional cooling systems
  3. Logs the incident for compliance reporting

All of this happens automatically without human intervention.
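The temperature example can be sketched in a few lines. The `Rule` class and the action functions are illustrative assumptions; in a real system the actions would call a notification service, a building controller, and a log store instead of returning strings.

```python
from dataclasses import dataclass
from typing import Callable

Event = dict


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]
    actions: list


def alert_manager(event: Event) -> str:
    return f"alert facility manager: {event['celsius']} °C"


def activate_cooling(event: Event) -> str:
    return "activate additional cooling"


def log_incident(event: Event) -> str:
    return "log incident for compliance"


RULES = [
    Rule(
        name="overheating",
        condition=lambda e: e["type"] == "temperature" and e["celsius"] > 30,
        actions=[alert_manager, activate_cooling, log_incident],
    )
]


def process(event: Event, rules: list) -> list:
    """Evaluate every rule against the event and run the actions of matches."""
    results = []
    for rule in rules:
        if rule.condition(event):
            results.extend(action(event) for action in rule.actions)
    return results
```

An event such as `{"type": "temperature", "celsius": 31.5}` triggers all three actions, while a reading of 25 °C triggers none.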

Types of Inference

Rule engines typically use one of two main inference strategies. In the example above we encountered the more common type.

Forward Inference (Forward Chaining) starts with incoming data/events and applies rules to reach conclusions. This is like a production line: events come in, rules process them, and actions are triggered. For example, when a temperature sensor reports high values, the system triggers cooling system rules.

Backward Inference (Backward Chaining) works in reverse - it starts with a desired goal and works backward to determine what conditions would prove it, like a detective solving a case. While powerful for diagnostic systems and AI applications, backward chaining requires complex state management and isn't typically needed for event processing systems.
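For contrast, a minimal backward chaining sketch: it starts from a goal and recursively checks whether known facts can prove it. The rule base and fact names are hypothetical and only serve to show the direction of the reasoning.

```python
# Hypothetical knowledge base: goal -> alternative lists of premises.
RULES = {
    "cooling_failure": [["temperature_high", "fan_stopped"]],
    "temperature_high": [["sensor_above_30"]],
}


def prove(goal, facts, rules, seen=frozenset()):
    """Return True if the goal is a known fact or derivable from the rules."""
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rule definitions
        return False
    return any(
        all(prove(premise, facts, rules, seen | {goal}) for premise in premises)
        for premises in rules.get(goal, [])
    )
```

With the facts `{"sensor_above_30", "fan_stopped"}` the goal `cooling_failure` can be proven; with only `{"sensor_above_30"}` it cannot.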

This guide focuses on forward inference, as it's sufficient for most event-driven use cases and provides better performance for real-time processing.

Key Components:

  • Event producers: Systems that generate events (sensors, applications, user interactions)
  • Event consumers: Services that receive events and put them into a queue
  • Event workers: Services that process events and apply rules
  • Rule definitions: Conditions and actions specified in a structured format
  • Processing infrastructure: Distributed system for handling events at scale, with priority-based job processing using a task queue like Huey

Architecture Design

Infrastructure Overview

The recommended architecture leverages managed Kubernetes (K8s) for scalability and operational efficiency. Here are the main components:

Event Ingestion

  • Load-balanced pods accept events via webhooks (REST API) or WebSocket connections
  • Events are buffered in Redis/RedPanda to handle high-volume scenarios
  • Horizontal pod autoscaling manages processing capacity

Rule Processing

  • Distributed processing nodes evaluate rules against events
  • Redis/RedPanda acts as a distributed cache and message queue
  • Kubernetes manages pod lifecycle and scaling

Storage and Logging

  • Persistent storage for rule definitions
  • Audit logs for compliance and troubleshooting
  • Performance metrics and execution history

Frontend Management Interface

The system requires a management interface for defining and managing rules. While Django is suggested for its rapid development capabilities, any full-stack framework can work. Key considerations include:

  • Integration with Identity and Access Management (IAM)
  • Rule creation and editing interface
  • Monitoring and troubleshooting dashboards
  • Audit log viewer

Building the Rule Definition System

Domain Specific Language (DSL)

To illustrate the core concepts, let's start with a simplified rule format. While real-world implementations often require more complex structures, this basic format already gets us far:

  1. Conditions: A basic condition can be expressed as:

    [condition]
    variable = "x"
    operator = ">"
    value = 2

    This is enough to evaluate a value in an incoming event with an operator. So here we check if "x > 2".

  2. Actions: A simple action definition might be defined as:

    [action]
    type = "sendPN"
    sendTo = "${event.user}"
    message = "x value reached ${event.x}"

    Note the use of basic templating for dynamic values. When this action is executed, it sends a push notification to the user specified in the event, with a message that refers to a value from the event.

We can combine them in a rule. Combined, this means that whenever a value x is above 2, we send the user a push notification.

Keeping track of business logic: Change Management

While the rule engine itself uses a database to store and evaluate rules, I strongly recommend managing changes through a Version Control System (VCS):

  • Store configurations as TOML files in Git (better readability than JSON, fewer special characters to escape, and simpler syntax for the shallow hierarchies typical in rule definitions)
  • Use branches to represent different environments (dev, staging, prod)
  • Manage changes through pull requests between these branches
  • Leverage automated testing and deployment pipelines
  • Start with a text-based UI, then build domain-specific interfaces for business teams
  • Use a bot user to automatically commit UI-based changes to Git

This approach aligns with modern GitOps practices - when changes are merged into an environment branch, automated pipelines deploy the updated configurations to that environment's database. Business users can modify rules through the UI while the system maintains proper version control behind the scenes.

Advanced Rule Capabilities

Single-condition rules quickly hit their limits in real applications. Business logic often requires combinations of conditions or time-based triggers. For example, you might need to alert when sensor readings exceed a threshold AND maintenance is due, or when multiple related events occur within a time window.

  1. Composite Conditions
    • Combine multiple conditions using logical operators
    • Support for nested conditions: any(all(A, B), C)
    • Complex event pattern matching
  2. Periodic Rules
    • Cron-based scheduling
    • Kubernetes CronJobs for execution
    • Time-based condition evaluation
  3. Delayed Conditions
    • Evaluate conditions with a time delay (e.g., check if value remains high after 5 minutes)
    • Pull current state at the time of delayed evaluation
    • Useful for confirming persistent conditions rather than reacting to momentary changes
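Composite conditions such as any(all(A, B), C) map naturally onto a small tree that is evaluated recursively. A minimal sketch follows; the dictionary layout with `kind`, `of`, `variable`, `operator`, and `value` is an assumption for illustration, not a fixed format.

```python
import operator

OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def evaluate(node: dict, event: dict) -> bool:
    """Recursively evaluate a nested condition tree against an event."""
    kind = node["kind"]
    if kind in ("all", "any"):
        combine = all if kind == "all" else any
        return combine(evaluate(child, event) for child in node["of"])
    if kind == "cmp":
        compare = OPERATORS[node["operator"]]
        return compare(event[node["variable"]], node["value"])
    raise ValueError(f"unknown condition kind: {kind!r}")


# any(all(A, B), C) with three hypothetical leaf conditions.
A = {"kind": "cmp", "variable": "temperature", "operator": ">", "value": 30}
B = {"kind": "cmp", "variable": "maintenance_due", "operator": "==", "value": True}
C = {"kind": "cmp", "variable": "smoke", "operator": "==", "value": True}
CONDITION = {"kind": "any", "of": [{"kind": "all", "of": [A, B]}, C]}
```

Because the tree is plain data, it can be stored in the same configuration files as single-condition rules.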

Real-time performance monitoring

While tools like Grafana are popular for monitoring, they work best with pre-computed metrics. For rule engine performance analysis, implement a custom endpoint that gives you a graph of processing duration over time for each event. This can be implemented by putting every incoming object into Redis with its timestamp in a sorted set (ZSET). The same is then done when the computation is completed. This is very useful for troubleshooting the rule engine as well as the producers.
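The bookkeeping behind such an endpoint can be sketched as follows. To keep the example runnable without a Redis server, it uses a small in-memory stand-in whose method names mirror the sorted-set commands ZADD, ZSCORE, and ZRANGEBYSCORE. The key names and helper functions are assumptions, and swapping in a real Redis client is untested here.

```python
class InMemorySortedSets:
    """Tiny stand-in for Redis sorted sets: member -> score per key."""

    def __init__(self):
        self._sets = {}

    def zadd(self, name, mapping):
        self._sets.setdefault(name, {}).update(mapping)

    def zscore(self, name, member):
        return self._sets.get(name, {}).get(member)

    def zrangebyscore(self, name, min_score, max_score):
        items = sorted(self._sets.get(name, {}).items(), key=lambda kv: kv[1])
        return [m for m, score in items if min_score <= score <= max_score]


def mark_received(store, event_id, timestamp):
    store.zadd("events:received", {event_id: timestamp})


def mark_completed(store, event_id, timestamp):
    store.zadd("events:completed", {event_id: timestamp})


def durations(store, start, end):
    """Return (received_at, duration) for completed events received in the range."""
    points = []
    for event_id in store.zrangebyscore("events:received", start, end):
        received = store.zscore("events:received", event_id)
        completed = store.zscore("events:completed", event_id)
        if completed is not None:
            points.append((received, completed - received))
    return points
```

Events that show up in the received set but never in the completed set are exactly the ones worth investigating first.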

The Secret Sauce: Three Advanced Techniques

The base rule engine can be extended with several advanced features that help handle real-world complexities.

Handling Downtime

Even with a robust system, you need to account for potential downtime. Implement a periodic reconciliation process [1] that:

  1. Queries event sources for a specific time range
  2. Compares received event IDs against processed event records
  3. Processes any missed events

The event source must provide an API to query events by time range. To optimize this process, maintain identifiers of processed events to quickly detect gaps. Add a time offset so that consecutive time ranges overlap, because some events may still sit in a buffer when a reconciliation run happens.

[1]: The term reconciliation service is not something I have encountered often in IT. It is often also called synchronization.
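The reconciliation steps can be sketched as a single function. `fetch_events` stands for the event source's time-range API and `process` for the regular event handler; both are hypothetical callables passed in by the caller, and the overlap default is an arbitrary example value.

```python
def reconcile(fetch_events, processed_ids, process, start, end, overlap_seconds=60):
    """Re-process events from [start - overlap, end] that were never handled.

    fetch_events(start, end) -> list of dicts with an "id" key (assumed API).
    processed_ids is a mutable set of already handled event IDs.
    Returns the IDs of the events that had been missed.
    """
    window_start = start - overlap_seconds  # overlap with the previous run
    missed = [
        event for event in fetch_events(window_start, end)
        if event["id"] not in processed_ids
    ]
    for event in missed:
        process(event)
        processed_ids.add(event["id"])
    return [event["id"] for event in missed]
```

Keeping `processed_ids` in a fast store makes the gap detection a cheap set lookup instead of a full comparison of payloads.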

Processing Lanes

Rather than processing all events through the same path, consider implementing hardcoded processing lanes that handle events differently. These lanes provide optimization opportunities and help manage different types of events efficiently. Common examples include:

  • Fast lane: Skips logging for high-throughput, low-risk events
  • Void lane: Safely ignores certain events
  • Syncing lane: Propagates changes to another system
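A hardcoded lane dispatcher can be as simple as the following sketch. The event types assigned to each lane are made-up examples, and `apply_rules` is a placeholder for the regular rule evaluation.

```python
# Hypothetical assignment of event types to lanes.
VOID_TYPES = {"heartbeat"}
SYNC_TYPES = {"profile_updated"}
FAST_TYPES = {"sensor_reading"}


def choose_lane(event: dict) -> str:
    if event["type"] in VOID_TYPES:
        return "void"
    if event["type"] in SYNC_TYPES:
        return "sync"
    if event["type"] in FAST_TYPES:
        return "fast"
    return "default"


def handle(event: dict, apply_rules, audit_log: list, sync_outbox: list) -> str:
    """Route one event through its lane and return the lane name."""
    lane = choose_lane(event)
    if lane == "void":
        return lane  # dropped on purpose
    if lane == "sync":
        sync_outbox.append(event)  # propagate to the other system
        return lane
    if lane != "fast":
        audit_log.append(event["id"])  # the fast lane skips logging
    apply_rules(event)
    return lane
```

Because the lanes are decided before any rule is evaluated, the cheap cases never touch the rule database at all.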

Distributed Debouncing

When dealing with event floods, particularly during batch synchronizations, you need a way to detect this behavior across a distributed system. For example, during a data sync, you might transition from "no data syncing" to "data is being synced" states. The challenge is detecting these state changes without triggering on single events.

I introduced a novel solution to this problem that draws inspiration from spiking neural networks. Using Redis as a distributed state store:

  1. Calculate an "activation level" for each incoming event based on timing
  2. Store these activation timestamps in a Redis sorted set (ZSET) to maintain ordering and avoid race conditions
  3. When the activation reaches a threshold, the "neuron fires" and triggers the debouncer event. The activation is calculated by integrating the action potentials from the list, each reduced according to the time that has passed since it was recorded.
  4. For subsequent incoming events, check if enough time has passed without events to switch back to the original state. Clear events from the list.

This approach effectively helps manage the opening and closing of "flood gates" in a distributed environment while maintaining system consistency.
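The steps above can be sketched as a single-process class. This is one possible reading of the idea, not the production implementation: each event contributes a potential of 1.0 that decays linearly to zero, a plain list stands in for the Redis ZSET, and the parameter values are arbitrary assumptions.

```python
import bisect


class SpikeDebouncer:
    """Leaky integrate-and-fire style flood detector (in-memory sketch)."""

    def __init__(self, threshold=3.0, decay_seconds=10.0, quiet_seconds=30.0):
        self.threshold = threshold
        self.decay_seconds = decay_seconds
        self.quiet_seconds = quiet_seconds
        self.timestamps = []  # stand-in for the ZSET, score = timestamp
        self.flooding = False

    def activation(self, now):
        # Sum of potentials, each reduced by the time passed since its event.
        return sum(
            max(0.0, 1.0 - (now - ts) / self.decay_seconds)
            for ts in self.timestamps
        )

    def on_event(self, now):
        """Record an event; return "flood_started", "flood_ended", or None."""
        transition = None
        last = self.timestamps[-1] if self.timestamps else None
        if self.flooding and last is not None and now - last >= self.quiet_seconds:
            # Enough quiet time has passed: close the flood gate and reset.
            self.flooding = False
            self.timestamps.clear()
            transition = "flood_ended"
        bisect.insort(self.timestamps, now)
        # Fully decayed events no longer contribute, so drop them.
        self.timestamps = [
            ts for ts in self.timestamps if now - ts < self.decay_seconds
        ]
        if not self.flooding and self.activation(now) >= self.threshold:
            self.flooding = True
            transition = "flood_started"
        return transition
```

With the default parameters, events at seconds 0, 1, 2, and 3 push the activation to 3.4 and the debouncer fires on the fourth event, while isolated events never cross the threshold. In the distributed version, the list operations would be replaced by commands against the shared sorted set.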

Medical Applications with FHIR

Rule engines are particularly valuable in healthcare applications using the standard Fast Healthcare Interoperability Resources (FHIR).

The rule engine can integrate with FHIR systems through the native Subscription model. The Subscription framework already specifies some rule engine functionality of its own.

Note that this depends on the FHIR server implementation. HAPI FHIR works with the REST API. The Google implementation is not as advanced, and its implementation is delayed. To save network overhead, you want each event to carry the payload in its body, rather than only a change notification that makes you fetch the resource from the server afterwards. The payload is the actual value of, e.g., an Observation.
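As an illustration, a FHIR R4 Subscription that asks the server to deliver the resource in the notification body could be built like this. The endpoint URL and the search criteria are placeholders, and whether a given server honours the payload setting depends on the implementation, as noted above.

```python
import json


def observation_subscription(endpoint: str, criteria: str = "Observation?status=final") -> dict:
    """Build an R4 Subscription resource for a rest-hook channel."""
    return {
        "resourceType": "Subscription",
        "status": "requested",
        "reason": "Forward Observations to the rule engine",
        "criteria": criteria,
        "channel": {
            "type": "rest-hook",
            "endpoint": endpoint,
            # With a payload MIME type the matching resource is sent in the
            # body; without it the hook is only an empty notification.
            "payload": "application/fhir+json",
        },
    }


subscription = observation_subscription("https://rules.example.org/fhir-hook")
body = json.dumps(subscription, indent=2)  # would be POSTed to [base]/Subscription
```

The rule engine's webhook then receives the Observation itself and can evaluate rules without a second round trip to the FHIR server.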