The AI revolution is coming home in the form of open "source" AI models. What began with hackers experimenting with open-source AI models is evolving into a more mature ecosystem of software and hardware solutions that make local AI deployment increasingly accessible. This growing interest led me to explore practical, cost-effective solutions for building a local AI workstation. It is still early days, as most online resources focus on SaaS AI solutions. In the following I will talk about NPUs, because the core AI workload is not graphics but matrix multiplication, which is accelerated by tensor cores (TPUs) or neural processing units (NPUs).
Apple Silicon has been my spot of interest for a while, as it offers low-power, cheap devices with a lot of unified memory (memory that can be used by both CPU and NPU), something not seen before with traditional NPUs. DeepSeek's model release has also questioned the narrative that you need cloud compute. Let's take a look at what we can get on a consumer budget. The entry point for capable consumer AI hardware still sits in the four-digit range, placing the following solutions firmly in the mid-range category, but the performance-to-price ratio continues to improve.
NVIDIA's recent announcement of their DIGITS system has reignited my interest in comparing performance across platforms. However, making meaningful comparisons in the AI hardware space presents unique challenges. Traditional benchmarks like Passmark scores or gaming FPS prove inadequate for AI workloads, while manufacturers often withhold or selectively publish detailed performance metrics across different data types. Recent work showed that the data type, FP8 or INT8, does not matter much, which makes it easier to use raw TOPS or FLOPS interchangeably to compare expected performance. Fortunately, emerging benchmarking tools like Geekbench's AI test suite are beginning to fill the gap, offering comparable and more relevant metrics for AI applications. The most meaningful comparison would be tokens/s for a model that runs on all devices. A list for Macs can be found here. We will use TOPS (Trillion Operations Per Second) in this article as a rough estimate.
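If you want to produce such tokens/s numbers yourself, here is a minimal sketch that queries a local Ollama server, which reports the generated token count and the generation time in its response. The model name and prompt are placeholders for whatever fits your device:

```python
# Minimal sketch: measure tokens/s against a local Ollama server (default
# port 11434). Assumes the model has already been pulled, e.g. with
# `ollama pull llama3.2`.
import requests

def tokens_per_second(model: str, prompt: str) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = number of generated tokens,
    # eval_duration = generation time in nanoseconds
    return data["eval_count"] / data["eval_duration"] * 1e9

print(tokens_per_second("llama3.2", "Explain unified memory in one sentence."))
```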
Here is my selection of computers and NPUs that I consider relevant. When I talk about an NPU I mean a neural processing unit, either as dedicated hardware or as other hardware that can be used to accelerate neural networks. We also take a look at the best-rated model we can run on each device. This depends on the memory: more memory allows more model weights at higher precision. To run best-in-class SOTA models like ChatGPT-4o you need 671B-parameter models; with 4-bit quantization this needs around 336 GB of NPU memory.
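The arithmetic behind that 336 GB figure is simple: parameters × bits per weight / 8 gives the bytes needed for the weights alone. A minimal sketch that also covers the per-device model claims below:

```python
# Rough lower bound: memory needed for the weights of an LLM at a given
# quantization. Real deployments need extra headroom for the KV cache
# and activations on top of this.
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(671, 4))  # ~335.5 GB -> a 671B SOTA-class model
print(weight_memory_gb(100, 4))  # ~50 GB    -> fits the 64 GB Mac Mini M4 Pro
print(weight_memory_gb(3, 8))    # ~3 GB     -> fits the 16 GB Mac Mini M2
```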
Mac Mini M4 Pro
max specs: Apple M4 Pro Chip with 14‑core CPU, 20‑core GPU, 16‑core Neural Engine
- Price for 1TB build: 2800€, 2400$
- GPU (FP32): 8.1 T(FL)OPS
- NPU (INT8): 38 TOPS
- NPU Memory: 64 GB
- best possible LLM model: 100B, quantized
My setup: Mac Mini M2
8-core CPU, 10‑core GPU, 16-core Neural Engine
- Current Price: 650€
- GPU (FP32): 3.41 T(FL)OPS
- NPU: 15.8 TOPS (numbers found online, chip config not specified)
- NPU memory: 16 GB
- best possible LLM model: Llama 3B, 8-bit quantization
Mac Studio
Expensive, and slower than the M4 Pro since the M2 chip is outdated, but it allows big models thanks to the high unified memory. Benchmarks show that the huge memory bandwidth allows higher LLM speeds. Look out for an upgrade soon.
M2 Ultra: 24-core CPU, 76-core GPU, 32-core Neural Engine
- Price: 7,789 €
- NPU (INT8): 31.6 TOPS (found online)
- NPU Memory (unified memory): 192 GB, 819.2 GB/s memory bandwidth
- best LLM model: 200B
Comparing this with the Geekbench scores, this seems to be about accurate.
When I derive it myself using a clock speed of 1 GHz and 8 ops per cycle per core, I get TOPS = (ops per cycle) × (cores) × (frequency) / 10^12 = 8 × 32 × 10^9 / 10^12 = 0.256 TOPS, far below the reported 31.6 TOPS, so either the per-core throughput and clock speed differ considerably from my assumptions or the online numbers are off.
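Here is the same derivation as a small script; the 8 ops per cycle per core and the 1 GHz clock are my assumptions rather than published specs:

```python
# Peak-throughput formula from the paragraph above. The ops/cycle and
# clock values are assumptions, not published Apple specs.
def peak_tops(ops_per_cycle: int, cores: int, clock_hz: float) -> float:
    return ops_per_cycle * cores * clock_hz / 1e12

print(peak_tops(8, 32, 1e9))   # 0.256 TOPS with my assumptions
# Working backwards: to hit the reported 31.6 TOPS at 1 GHz, each of the
# 32 cores would need ~988 ops per cycle, i.e. a wide MAC array.
print(31.6e12 / (32 * 1e9))    # ~987.5 ops per cycle per core
```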
Custom Nvidia RTX PC
- Price: rather cheap; a good performance/price deal for some ML training and gaming.
I own a build with an RTX 3060, bought used for 240€ on eBay in 2023. The market price has risen since then, and new it goes for around 500€. In addition you need the rest of the PC, so a whole build will also end up in the four-digit range.
- GPU (FP32): 12.74 T(FL)OPS
- NPU Memory: 12 GB
- best possible LLM model: Llama-2 7B, 4-bit quantization
RTX 4090
- Price: 2300$
- NPU: ~125 TOPS
The Geekbench comparison shows that the enormous figure of 1.3 PetaOPS (1321 TOPS) only holds for a single precision. On transformer workloads in FP32 and FP16 it is comparable to the Mac Studio with its 31 TOPS. This device gets you high speed for little money, but small models only. The 125 TOPS figure for the NPU seems questionable when compared with the performance in this benchmark's tasks. The limiting factor cannot be explained by memory bandwidth, as the card has a 1 TB/s transfer rate enabled by GDDR6X, compared to Apple's chip with 800 GB/s of LPDDR5. Indeed you can get 44 tokens/s output on a 3090 running gemma2:27b (Source); a bandwidth-bound estimate of that number follows the list below. Software is an important factor in performance, which could explain the discrepancies here.
- NPU memory: 24 GB
- best model: 27B
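Token generation is typically memory-bandwidth-bound: every generated token streams the full set of weights once, so bandwidth divided by model size gives a rough ceiling on tokens/s. A back-of-the-envelope sketch, assuming roughly 936 GB/s for the 3090 and about 16 GB for gemma2:27b at 4-bit quantization:

```python
# Bandwidth-bound rule of thumb: each generated token reads all weights
# once, so tokens/s <= memory bandwidth / model size. Both numbers below
# are assumptions (3090 bandwidth, gemma2:27b 4-bit footprint).
def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

print(max_tokens_per_s(936, 16))  # ~58 tokens/s ceiling vs. the measured 44
```

The measured 44 tokens/s sits plausibly below this ceiling, which supports the point that software efficiency, not raw compute, often decides real throughput.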
Tinybox
A reader informed me about this computer. It bundles six GPUs, such as the AMD Radeon 7900 XTX or the RTX 4090, in one machine, so the total price lands more in the high-end range.
- Price: 15,000 $
- NPU (FP16): 738 TFLOPS
- disk: 4 TB
- NPU memory: 144 GB
- best possible LLM model: 200B
Nvidia DIGITS
The performance is only given for FP4, a data type mostly used for inference, so the numbers for training performance will be lower. You can connect two units with ConnectX. It will be available starting in March this year. A rough scaling sketch follows the spec list below.
- Price: 3000$
- NPU (FP4): 1 PetaFLOP (1000 TOPS)
- disk: 4 TB
- NPU memory: 128 GB
- best possible LLM model: 200B
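NVIDIA has not published FP8 or FP16 numbers for DIGITS. As a hedged rule of thumb, tensor-core peak throughput tends to halve each time the data-type width doubles; applying that assumption to the advertised FP4 figure gives a rough feel for higher-precision performance:

```python
# Hedged estimate, NOT a published spec: assume peak throughput halves
# for each doubling of data-type width (FP4 -> FP8 -> FP16).
fp4_tflops = 1000           # advertised: 1 PetaFLOP at FP4
print(fp4_tflops / 2)       # ~500 TFLOPS at FP8 (assumption)
print(fp4_tflops / 4)       # ~250 TFLOPS at FP16 (assumption)
```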
Summary
The NVIDIA DIGITS system stands out with its impressive inference performance and substantial memory capacity. While its price point slightly exceeds the Mac Mini M4 Pro's, it compensates with significantly more memory and faster inference speeds. Note that the RTX 4090 shows a big gap between its quantized peak numbers and its FP32 performance. It's important to approach these performance metrics cautiously: the published TOPS numbers represent peak performance under ideal conditions rather than sustained computational workloads, and manufacturers often don't provide detailed testing methodologies for their benchmarks.
Further reading: HN-Comment thread