Structured Engineering Case Studies

Linux & Virtualization Engineering Portfolio

Current role Linux & Virtualization Engineer Deutsche Pfandbriefbank AG · Madrid

Published posts 49 Case-study driven technical notes

Last update Jun 18, 2026 Archive-first publishing flow

At a Glance

Linux and virtualization engineer documenting real delivery patterns: clear issue statements, implementation choices, and production outcomes.

🐧 24 posts

Linux Infrastructure

RHEL lifecycle management, kernel tuning, Satellite, enterprise Linux operations

Infrastructure Automation

⚡ 13 posts

Platform Automation

Ansible playbooks, IaC, Git workflows, CI/CD pipelines, VMware automation

Automation Cloud

🤖 17 posts

Applied AI & Edge

Azure AI Foundry, local LLM inference, RK3588 edge deployment, document intelligence

AI Local AI

Production Spotlight

App Store →

● Live on Google Play & Web

📱 IntelliFlow: AI Budget Tracker

A production-grade personal finance application serving real users. Features an AI-powered financial coach, offline-first architecture, and cross-platform syncing.

Get it on Google Play Open Web App

Domains

Browse all

Infrastructure 24

RHEL lifecycle, automation, virtualization, and production operations.

Automation 13

Ansible playbooks, task automation, and configuration management.

AI 17

Applied AI across cloud services, local inference, and practical delivery lessons.

Cloud 2

Azure architecture, infrastructure design, and delivery practices.

Local AI 9

Running models on local hardware with privacy-first workflows.

Kotlin 7

Kotlin projects, notes, and engineering experiments.

Snippets 4

Quick commands and reusable building blocks for day-to-day work.

Featured Projects

View GitHub →

llamacpp-workbench

Local LLM inference workbench for RK3588 and edge devices

Python · JavaScript

IntelliFlow

AI-powered personal finance app with offline-first architecture

Flutter · Production

Ansible Playbooks

Infrastructure automation for enterprise Linux environments

Ansible · YAML

Recent Case Studies

All posts

Jun 18, 2026 • 5 min read

Optimizing DeepSeek KV Cache for Serverless AI Pipelines

How splitting a monolithic system prompt into static and per-session layers improved estimated KV cache hit rates from ~42% to ~76% and reduced input costs by an estimated 57% on a Firebase Functions app running DeepSeek V4 Flash.

AI Kotlin

LLMDeepSeekFirebaseOptimization

Jun 10, 2026 • 11 min read

RX 7800 XT 16GB: Running 35B MoE at 128K Context with llama.cpp + ROCm

Full benchmark data on running MoE and dense LLMs on AMD consumer hardware — quantization comparison, power cap analysis, KV cache tuning, and context limits on 16GB VRAM.

Local AI Infrastructure

Issue Consumer GPUs have hard VRAM ceilings. Running 23-35B parameter models on 16GB requires aggressive quantization, KV cache compression, and precise build flags. The noise-to-signal ratio in online benchmarking is high — most people test on NVIDIA, not AMD RDNA3, and few test MoE architectures with context windows above 32K.

Solution Systematically benchmarked 8+ models across 5 quantization levels, swept GPU power caps from 30W to 190W, tested 3 KV cache configurations, and pushed context limits to 256K. Documented the exact llama.cpp build flags and runtime parameters that make 128K inference on 16GB VRAM stable and fast.

local-aillama.cpprocmamd

Apr 16, 2026 • 8 min read

Git Branch Splitting: Untangling Mixed Feature Branches

A practical guide to splitting an oversized Git PR into clean, topic-focused branches using path-based checkout from a fresh branch off main.

Automation Infrastructure

Issue Mixed branches make PRs unreviewable, increase blast radius, and risk dragging unrelated changes into production. When one branch contains role code, host variables, certificate files, and inventory updates together, reviewers cannot isolate what changed or why.

Solution Split the oversized branch into multiple clean, topic-focused branches by checking out only the relevant paths from the mixed branch into new branches created fresh off main.

gitdevopsansibleworkflow

Apr 11, 2026 • 10 min read

14 Models Benchmarked on RK3588: The Definitive CPU vs NPU Ranking

Benchmarked every viable local LLM (350M to 26B, CPU and NPU) through a live Discord agent pipeline on RK3588. Found NPU beats CPU at same quality, code is solved at any size, and 4B+ models are slower AND worse than 2B on this board.

Local AI

Issue Previous benchmarks measured raw llama.cpp throughput but not real quality through the agent pipeline. Models that looked fast synthetically failed at reasoning, refused tool calls, or got intercepted by workspace routing before reaching the model.

Solution Built a 14-test, 6-dimension benchmark harness that tests every model through the live Discord pipeline with quality validation: reasoning, factual accuracy, code generation, instruction following, tool calling, and math. Tested 14 models (9 CPU GGUF + 3 NPU RKLLM + 2 large MoE) with BENCHMARK_MODE to isolate pure model performance.

rk3588radxarock-5b-plusllama.cpp

Mar 30, 2026 • 6 min read

llamacpp-workbench: Remote llama.cpp Control and REAP Model Serving on RK3588

Publishing a practical local-AI control plane for llama.cpp: remote model loading, runtime tuning, streaming chat, and real REAP model serving on a Radxa ROCK 5B+.

Local AI

Issue Most local model UIs either abstract away the runtime details that actually matter on constrained hardware or assume desktop-class GPUs. On RK3588, that makes it harder to tune context, KV cache quantization, reasoning behavior, and model selection credibly.

Solution Built and published `llamacpp-workbench`, a remote llama.cpp workbench with explicit runtime controls, model presets, markdown chat rendering, streaming responses, and benchmark-backed defaults for REAP and dense GGUF models.

llama.cpprk3588radxarock-5b-plus

Mar 28, 2026 • 15 min read

Qwen3.5 on RK3588 with llama.cpp: Real Benchmarks from a Radxa ROCK 5B+

An advanced benchmark report for running Qwen3.5 locally on RK3588 with source-built llama.cpp: prefill speed, decode speed, stable context, tool-calling behavior, and the practical model choices that actually work on a Radxa ROCK 5B+.

Local AI

Issue The usual local-AI advice overemphasizes parameter count and underexplains bandwidth, context budget, KV cache policy, and interactive latency. On RK3588, that leads to bad defaults: models that technically load but feel broken in real chat and tool-calling workloads.

Solution I ran a corrected Qwen3.5 sweep on RK3588 using source-built llama.cpp, quantized KV cache, and task-pass validation. Then I compared prefill, decode, stable context, average latency, and tool-calling behavior to determine the right model for each workload.

rk3588radxarock-5b-plusllama.cpp