Structured Engineering Case Studies

Linux & Virtualization Engineering Portfolio

Current role Linux & Virtualization Engineer Deutsche Pfandbriefbank AG · Madrid
Published posts 49 Case-study driven technical notes
Last update Archive-first publishing flow

At a Glance

Full Profile →

Production Spotlight

App Store →
● Live on Google Play & Web

📱 IntelliFlow: AI Budget Tracker

A production-grade personal finance application serving real users. Features an AI-powered financial coach, offline-first architecture, and cross-platform syncing.

Domains

Browse all

Featured Projects

View GitHub →

llamacpp-workbench

Local LLM inference workbench for RK3588 and edge devices

Python · JavaScript

IntelliFlow

AI-powered personal finance app with offline-first architecture

Flutter · Production

Ansible Playbooks

Infrastructure automation for enterprise Linux environments

Ansible · YAML

Recent Case Studies

All posts
5 min read

Optimizing DeepSeek KV Cache for Serverless AI Pipelines

How splitting a monolithic system prompt into static and per-session layers improved estimated KV cache hit rates from ~42% to ~76% and reduced input costs by an estimated 57% on a Firebase Functions app running DeepSeek V4 Flash.

AI Kotlin
LLMDeepSeekFirebaseOptimization
11 min read

RX 7800 XT 16GB: Running 35B MoE at 128K Context with llama.cpp + ROCm

Full benchmark data on running MoE and dense LLMs on AMD consumer hardware — quantization comparison, power cap analysis, KV cache tuning, and context limits on 16GB VRAM.

Local AI Infrastructure
Issue Consumer GPUs have hard VRAM ceilings. Running 23-35B parameter models on 16GB requires aggressive quantization, KV cache compression, and precise build flags. The noise-to-signal ratio in online benchmarking is high — most people test on NVIDIA, not AMD RDNA3, and few test MoE architectures with context windows above 32K.
Solution Systematically benchmarked 8+ models across 5 quantization levels, swept GPU power caps from 30W to 190W, tested 3 KV cache configurations, and pushed context limits to 256K. Documented the exact llama.cpp build flags and runtime parameters that make 128K inference on 16GB VRAM stable and fast.
local-aillama.cpprocmamd
8 min read

Git Branch Splitting: Untangling Mixed Feature Branches

A practical guide to splitting an oversized Git PR into clean, topic-focused branches using path-based checkout from a fresh branch off main.

Automation Infrastructure
Issue Mixed branches make PRs unreviewable, increase blast radius, and risk dragging unrelated changes into production. When one branch contains role code, host variables, certificate files, and inventory updates together, reviewers cannot isolate what changed or why.
Solution Split the oversized branch into multiple clean, topic-focused branches by checking out only the relevant paths from the mixed branch into new branches created fresh off main.
gitdevopsansibleworkflow
10 min read

14 Models Benchmarked on RK3588: The Definitive CPU vs NPU Ranking

Benchmarked every viable local LLM (350M to 26B, CPU and NPU) through a live Discord agent pipeline on RK3588. Found NPU beats CPU at same quality, code is solved at any size, and 4B+ models are slower AND worse than 2B on this board.

Local AI
Issue Previous benchmarks measured raw llama.cpp throughput but not real quality through the agent pipeline. Models that looked fast synthetically failed at reasoning, refused tool calls, or got intercepted by workspace routing before reaching the model.
Solution Built a 14-test, 6-dimension benchmark harness that tests every model through the live Discord pipeline with quality validation: reasoning, factual accuracy, code generation, instruction following, tool calling, and math. Tested 14 models (9 CPU GGUF + 3 NPU RKLLM + 2 large MoE) with BENCHMARK_MODE to isolate pure model performance.
rk3588radxarock-5b-plusllama.cpp
6 min read

llamacpp-workbench: Remote llama.cpp Control and REAP Model Serving on RK3588

Publishing a practical local-AI control plane for llama.cpp: remote model loading, runtime tuning, streaming chat, and real REAP model serving on a Radxa ROCK 5B+.

Local AI
Issue Most local model UIs either abstract away the runtime details that actually matter on constrained hardware or assume desktop-class GPUs. On RK3588, that makes it harder to tune context, KV cache quantization, reasoning behavior, and model selection credibly.
Solution Built and published `llamacpp-workbench`, a remote llama.cpp workbench with explicit runtime controls, model presets, markdown chat rendering, streaming responses, and benchmark-backed defaults for REAP and dense GGUF models.
llama.cpprk3588radxarock-5b-plus
15 min read

Qwen3.5 on RK3588 with llama.cpp: Real Benchmarks from a Radxa ROCK 5B+

An advanced benchmark report for running Qwen3.5 locally on RK3588 with source-built llama.cpp: prefill speed, decode speed, stable context, tool-calling behavior, and the practical model choices that actually work on a Radxa ROCK 5B+.

Local AI
Issue The usual local-AI advice overemphasizes parameter count and underexplains bandwidth, context budget, KV cache policy, and interactive latency. On RK3588, that leads to bad defaults: models that technically load but feel broken in real chat and tool-calling workloads.
Solution I ran a corrected Qwen3.5 sweep on RK3588 using source-built llama.cpp, quantized KV cache, and task-pass validation. Then I compared prefill, decode, stable context, average latency, and tool-calling behavior to determine the right model for each workload.
rk3588radxarock-5b-plusllama.cpp