Stefan Wagen

Projects

Things I’ve built

Personal agentic-AI build cluster from April 2026. Four systems to shipped-or-near-shipped state in roughly two months: a multi-user learning platform serving a live cohort, a tournament-management app, a stringer/client app, and the multi-tenant platform the others now run on. The pace is the signal — sustained build under a roster, not a single end-state snapshot.

One operating mode runs across the cluster: product owner, business representative, domain-expert requirements source, content reviewer. The rosters build the systems; the editorial and architectural calls are mine.

Four learnings across the cluster.

  1. The pushback is where the design happens. Business-analyst and system-architect agents challenge requirements they think don’t hold. Skip the loop and you build the wrong thing.
  2. The discipline scales beyond engineering. Content authoring under a roster (curriculum + pedagogy + platform agents) is its own production system — same hiring, scope-boundary, and review-loop disciplines as a human team.
  3. Manual testing was the gap — and I closed it. Two mechanisms: simulated-user agents that walk a system as a novice (one ran across all 126 days of the learning curriculum), and a real test framework plus an agent-authored verification harness on the stringer/client app — agents derive the workflows-under-test from the spec, not me. See the Racket Book entry.
  4. Look at the cluster holistically. Cost, substrate, deploys, and tenant onboarding sit in the same operating envelope as application code. The third project broke the cost line; the fourth absorbed all the others.

The cards below are the artefacts.

The cluster

Agentic-AI Learning Platform

Production multi-agent learning platform. Eighteen-week curriculum from LLM fundamentals through MCP authoring, content authored under a curriculum / pedagogy / platform agent roster, content gaps closed by a simulated-learner agent run across all 126 days. Started one-user; now serves a live cohort, ten-plus users onboarded.

Production-grade learning platform built end-to-end under an agent roster. I sat as product owner, business representative, and content reviewer; the roster authored the curriculum, the platform, and the companion agent.

Content authoring as a production system.

  • Eighteen-week structured curriculum written from scratch by a roster of role-specialised agents — curriculum-design, pedagogy, platform-engineering — each with a written profile and a scoped mandate.
  • Same hire-through-one-channel, propose-with-reasoning, scope-boundary disciplines as a human team.
  • Output: 126 daily modules, mini-projects at the week-3 and week-12 inflections, free-tier-only constraint enforced at design time, DE-primary with an EN mirror.

Closing the manual-testing gap with a simulated-learner agent. Standard “review everything” prompts missed real user friction — content/video mismatches, incomplete EN translations, unintroduced prerequisites. The fix: a simulated-learner agent with a fixed persona (Swiss HR director, mid-forties, no programming background, ChatGPT at “rewrite this email” level).

  • Walks the curriculum one day at a time as a novice would; flags any term used before it’s introduced; routes each friction point to the right specialist (content → curriculum agent, pacing/tone → pedagogy, broken links → platform engineer).
  • Each pass is a fresh instance carrying only the cumulative knowledge profile from prior days — no memory of last critique, so unfixed friction is caught again. Loop runs until a pass comes back empty, then the day locks.
  • Encoded as a runnable orchestrator skill (dispatch → verdict → route → revise → fresh pass → validate → commit → next day). Ran across all 126 days.
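A minimal sketch of the fresh-pass loop described in the bullets above, not the actual orchestrator skill. Every name here (`Friction`, `ROUTES`, `run_day`, the `review` and `revise` callbacks, the `max_passes` cap) is hypothetical; the source names only the stages and the routing targets.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical routing table, following the content / pacing-tone / links split.
ROUTES = {"content": "curriculum", "pacing": "pedagogy", "tone": "pedagogy", "link": "platform"}


@dataclass
class Friction:
    kind: str   # "content", "pacing", "tone" or "link"
    note: str


@dataclass
class DayResult:
    day: int
    passes: int
    locked: bool


def run_day(day: int,
            known_terms: frozenset[str],
            review: Callable[[int, frozenset[str]], list[Friction]],
            revise: Callable[[str, Friction], None],
            max_passes: int = 10) -> DayResult:
    """Repeat fresh review passes until one comes back empty, then lock the day."""
    for n in range(1, max_passes + 1):
        # Each pass is a fresh reviewer: it receives only the day number and the
        # cumulative knowledge profile, never the previous critique.
        findings = review(day, known_terms)
        if not findings:
            return DayResult(day, n, locked=True)
        for friction in findings:
            # Route each friction point to the specialist that owns it.
            revise(ROUTES[friction.kind], friction)
    # The cap is an assumption so the sketch always terminates.
    return DayResult(day, max_passes, locked=False)
```

Because the reviewer carries no memory between passes, a friction point that `revise` failed to fix is simply reported again on the next pass, which is the property the loop relies on.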

One half of how I closed the manual-testing gap across the cluster; the test framework and agent-runnable harness on Racket Book are the other.

Platform.

  • Next.js 15 App Router, server-side auth, per-user progress tracking, gamification (XP, badges, streaks), offline-capable service worker, admin progress view, multi-device sync.
  • Dedicated Postgres with row-level security, raw-SQL migration ladder via a separate migrator container.
  • In-site companion agent with its own API routes and a written motivation playbook.
  • Originally on its own VPS + managed database; later migrated onto the application platform — see Keystone — to bring the cost line down.

Scope evolution. Designed and shipped for a single primary user; multi-user shape came as the population grew. Today a live cohort, ten-plus users onboarded.

Application Platform — Cost-Driven Re-Architecture

Multi-tenant application platform built when the cluster’s cost line broke. Pre-migration: three Supabase Pro projects + two dedicated VPSes + GitLab Premium = CHF 76–86/month. Post-migration: one Ubuntu host with shared Postgres, self-hosted GitLab Runner, Caddy ingress, per-tenant deploy contract = ~CHF 18/month. Saving: CHF 58–68/month. Existing tenants are migrated; a small management tool sits alongside.

The platform under the rest of the build cluster, referred to as Keystone in the other entries.

Hard call — mitigate at the substrate, not the project. The cost line broke at the third project. Cheap fix: trim each app’s bill in isolation. The call under constraint: collapse three independently-billed stacks onto one shared host with a per-tenant contract, so the next tenant inherits the saving at no extra cost instead of re-solving cost three more times. That’s what made it a platform, not a one-off cost cut.

The cost case. Pre-migration CHF 76–86/month — three Supabase Pro projects (~CHF 40), GitLab Premium (~CHF 26, held for small-team CI), two per-app VPSes (CHF 10–20). Post-migration ~CHF 18/month — one Hetzner host (CHF 14.50) + 1 TB backup (CHF 3.10). Saving CHF 58–68/month. The Premium subscription was refunded — the self-hosted Runner made its CI features redundant.
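The arithmetic behind the cost case, as a small worked check. The figures are the ones stated above; the approximate items (~CHF 40, ~CHF 26) are taken at face value, and the dictionary keys are just labels.

```python
# Monthly figures in CHF as stated above; each pre-migration entry is (low, high).
pre = {
    "three Supabase Pro projects": (40.0, 40.0),
    "GitLab Premium": (26.0, 26.0),
    "two per-app VPSes": (10.0, 20.0),
}
post = {"Hetzner host": 14.50, "1 TB backup": 3.10}

pre_low = sum(low for low, _ in pre.values())
pre_high = sum(high for _, high in pre.values())
post_total = sum(post.values())

saving_low = pre_low - post_total
saving_high = pre_high - post_total

print(f"pre:    CHF {pre_low:.2f} to {pre_high:.2f}")
print(f"post:   CHF {post_total:.2f}")
print(f"saving: CHF {saving_low:.2f} to {saving_high:.2f}")
```

The exact post-migration total is CHF 17.60, quoted above as ~CHF 18. The CHF 58–68 saving uses the rounded figure; against the exact total the range is CHF 58.40 to 68.40.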

The discipline. The agent surface lets you look at the cluster holistically — one infrastructure problem with one mitigation shape, not N application bills. Planned the re-architecture with the roster, modelled the steady-state, executed. Substrate decisions, cost modelling, and infra planning sit in the same operating envelope as application code.

Publishing in the open. The repo you read from is a curated public snapshot, not the live operational history. Each release runs through two scoped tools: a security-sweep gating disclosure-readiness (secrets, infra, operational detail), and a publish-snapshot step that squashes the release-tagged state into one commit on the public mirror, strips the deploy CI, and prepends a provenance prelude. The one-commit history is deliberate.

Platform contract per tenant.

  • Each tenant: assigned port, UID, GitLab slug, Caddy snippet wiring <tenant>.wagen.io + <tenant>-test.wagen.io to host loopback; per-tenant Postgres role with RLS, optional realtime channel.
  • Onboarding and deploys run via separate scripted contracts. Each tenant listens on ${PORT}, serves /healthz returning 200 ok, and runs under its assigned UID. The contract is named in every tenant repo so apps stay portable.
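A minimal sketch of the tenant side of that contract, using only the Python standard library. It covers the ${PORT} listen and the /healthz check; the UID assignment is a host-side concern and is not shown. `TenantHandler` and `make_server` are hypothetical names, and the plain-text `ok` body is an assumption about what "200 ok" means on the wire.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional


class TenantHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port: Optional[int] = None) -> ThreadingHTTPServer:
    """Build a server on the contract port; falls back to the PORT env var."""
    if port is None:
        port = int(os.environ["PORT"])
    # Loopback only: the ingress proxies the public hostnames to host loopback.
    return ThreadingHTTPServer(("127.0.0.1", port), TenantHandler)
```

Calling `make_server().serve_forever()` with PORT set runs the tenant. Passing an explicit port (0 picks a free one) keeps the same handler usable in tests.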

Platform internals.

  • Bootstrap, hardening, ingress, shared Postgres, monitoring, backups, self-hosted CI runner, scheduled cron timers — each a numbered idempotent script that re-runs safely against an existing host.
  • CI lint gates on tenant promotion, disk garbage collection, nightly backups to a separate storage box.

Tenants. AI Learning Journey and Courtside Desk migrated from per-app VPSes; Racket Book was the first net-new tenant built against the contract from a clean codebase; this site sits on the platform too. Each migration improved the platform; each improvement made the next tenant cleaner.

Management tool. A small companion tool for tenant onboarding and operational tasks, surfacing the same contracts in a usable shape.

Courtside Desk — Tournament Management App

The first project of the cluster. Tennis-tournament management web app — the build I always wanted, having shipped a basic Python version a year prior at roughly ten percent of the scope. Restarted with an agent roster: I brought the requirements as an experienced tournament organiser and competitive player; the business-analyst and system-architect agents pulled scope and design out of me.

First project of the build cluster. Twelve months earlier I’d shipped a basic Python version — roughly ten percent of the functionality I had in mind. Restarting under an agent roster changed the operating shape.

Domain expertise was the input.

  • Years organising tennis tournaments as a club official; competitive player since 2018. I know the formats, the draw shapes, the day-of-tournament workflow, where existing tools fail.
  • The roster didn’t need to learn tennis — it needed to extract a shippable specification from me.

BA and architect agents pulled the scope out.

  • The loop I now run on every project: describe the goal, let the BA agent challenge the requirements, let the architect challenge the system shape, push back where they don’t land, accept where they do.
  • The friction is the work — a “load-bearing” requirement turns out not to be; an “obvious” journey needs three more screens; an unnoticed constraint drives the model. Skip the loop and you build the wrong thing.

Hard call — the scoring domain took iteration, not one shot.

  • Load-bearing design problem: the scoring model — multiple match formats and the tiebreak logic each carries. Didn’t land on the first pass.
  • A format rule I stated as obvious under-specified the tiebreak; an edge case I’d never had to write down as an organiser forced a rebuild around it. Ended up as its own shared TypeScript package — the part that had to be right before anything above it could be trusted.
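To illustrate why a format rule can under-specify the tiebreak: "play a tiebreak" only becomes computable once the target and the margin are written down. The sketch below uses the standard tennis rule (first to the target, win by two) and is in Python for illustration only; the actual package is TypeScript, and the source does not say which formats or edge cases it encodes. `tiebreak_winner` is a hypothetical name.

```python
from typing import Optional


def tiebreak_winner(points_a: int, points_b: int, target: int = 7) -> Optional[str]:
    """Return "A" or "B" once a tiebreak is decided, else None.

    Standard rule: first to `target` points with a margin of at least two.
    target=7 is the usual set tiebreak; target=10 the common match tiebreak.
    The function does not check that the score was reachable in real play.
    """
    if points_a < 0 or points_b < 0:
        raise ValueError("points cannot be negative")
    lead = points_a - points_b
    if max(points_a, points_b) >= target and abs(lead) >= 2:
        return "A" if lead > 0 else "B"
    return None
```

The same score can be decided under one format and still live under another: 7 to 5 ends a set tiebreak but not a 10-point match tiebreak, which is why the target has to be part of the format definition rather than assumed.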

Stack chosen by the architect agent, not by me.

  • Asked for an architecture + deployment proposal under constraints (free-tier substrate where possible, modern frontend, type-safe shared domain). Reviewed and approved.
  • Turborepo monorepo, Next.js 15 App Router frontend, Hono + tRPC API in the same Next.js process, three shared TypeScript packages (scoring domain, DB schema, tRPC router definitions).

Status. Substantial real work, deployed on the application platform (see Keystone — one of two tenants migrated from a dedicated VPS in the cost re-architecture). Still in active development.

Racket Book — Stringer/Client app

Server-rendered stringer/client app — clients, rackets, stringing jobs, payments, history. First net-new tenant onboarded onto Keystone. Roughly eighty percent built, in active development, not yet in production. Carries the test framework and agent-runnable verification harness that closed the cluster's manual-testing gap.

Server-rendered FastAPI app for the stringer’s working loop: clients, rackets, stringing jobs, payments, history. Both sides of the stringer/client relationship sit on one record — operational view and client-facing slice.

Status. ~80% built, in active development, not yet in production.

Where it fits. Greenfield when the application platform came up — see the Keystone entry — so it was the natural first net-new tenant: no migration baggage, deploy contract (port, UID, healthcheck, DB role, CI) built against a clean codebase from the first commit.

Testing — where the manual-testing gap got closed.

  • Real test framework: pytest with pytest-asyncio, ~1,138 test functions across 94 files. Real-Postgres fixtures (no mocks), per-test transactional isolation, tenancy-regression enumeration.
  • Agent-authored walkthrough harness (.claude/skills/walkthrough): the agents derive the workflows-under-test from the requirements and spec themselves — I authored none of them. Mode A in-process ASGI against the working tree, Mode B Playwright browser against the deployed env. Asserts structural invariants — locale binding, auth header on every authed page, HTMX-fragment correctness, no dead links.
  • Playwright E2E suite, CI-gated: the e2e job gates prod deploys.
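One of the structural invariants named above, "no dead links", can be sketched as a pure function over rendered pages. This is an illustration under assumptions, not the harness itself: `LinkCollector` and `dead_links` are hypothetical names, and the input is assumed to be a mapping from route path to rendered HTML, however the harness obtains it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def dead_links(pages: dict[str, str]) -> list[tuple[str, str]]:
    """Return (page, href) pairs whose internal target is not a known page."""
    known = set(pages)
    dead = []
    for path, html in pages.items():
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Ignore fragment and query string when resolving the target.
            target = href.split("#", 1)[0].split("?", 1)[0]
            # Only site-internal absolute paths are checked in this sketch.
            if target.startswith("/") and target not in known:
                dead.append((path, href))
    return dead
```

A walkthrough would then assert `dead_links(pages) == []` after visiting every page in a workflow. External links and relative paths are deliberately out of scope here.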

Personal Operations Bridge — Telegram to Agent Roster

Different flavour, same family. Production personal-ops system that predates much of the cluster: a Python bridge dispatches Telegram messages to a roster of role-specialised Claude Code agents, each with persistent state and scheduled jobs. Voice notes, image generation, calendar sync, time-zone handling — all running as a single systemd-managed service on a home Ubuntu host.

Where the other entries are web apps dispatched through a code-authoring roster, this one runs the same discipline through a chat interface: the front door is Telegram, an operating-model roster is the engine, persistent per-role state is the substrate.

Architecture.

  • Python systemd unit on a home Ubuntu host. One Telegram bot per role; one Claude Code subprocess spawned per incoming message with the matching role.
  • File-system contract per role: inbox/, outbox/, attachments/, state/<domain>/ for long-lived context, monthly-bucketed archive/.
  • Role profiles live as source-of-truth markdown, compiled into the agent-runtime form by a pre-commit hook so source and runtime never drift.
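The per-role file-system contract above can be sketched with pathlib. The directory names come from the bullet; the function names, the root location, and the YYYY-MM naming of the monthly archive buckets are assumptions.

```python
from datetime import date
from pathlib import Path

ROLE_DIRS = ("inbox", "outbox", "attachments", "state", "archive")


def ensure_role_layout(root: Path, role: str, domains: tuple[str, ...] = ()) -> Path:
    """Create the per-role directories; safe to call repeatedly."""
    base = root / role
    for name in ROLE_DIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    for domain in domains:
        # state/<domain>/ holds the long-lived context for one domain.
        (base / "state" / domain).mkdir(exist_ok=True)
    return base


def archive_bucket(base: Path, day: date) -> Path:
    """Monthly bucket under archive/. The YYYY-MM naming is an assumption."""
    return base / "archive" / f"{day:%Y-%m}"
```

Keeping the layout in one function gives every role the same shape, which is what lets the bridge treat roles uniformly while each role's internals stay private to it.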

Operating model.

  • One role orchestrates requests, runs scheduled digests, owns the backlog. A separate role designs new roles — same hire-through-one-channel discipline as the rest of the cluster.
  • Specialist roles for systems work, scheduling, and other domains, each scope-bounded with its own tone profile. No cross-role writes to internals.

Capabilities.

  • Scheduled jobs: morning digest, weekly review, flight-watch with pluggable providers, calendar sync, post-travel time-zone resync.
  • Voice replies via a TTS layer with provider abstraction and per-role voice profile; image generation across multiple providers.
  • Async-dispatch tracing, idle-mode handling, latency reporting — each a separate instrumented module.

Predates most of the others; same operating-model discipline through a chat interface rather than a web app. The architecture is the artefact; the personal-domain content stays private. No repository link for that reason — the body carries the scope.

This Site — The Loop Closer

The CV and portfolio surface you are reading. Authored end-to-end by an agent roster (orchestrator, HR role, career editor, frontend engineer, portfolio curator), each role with a written profile and a scoped mandate. Astro 5 static build, per-locale content collections with the missing-translation-hides invariant enforced at the loader. Built on the application platform once it was ready; the Dashboard tenant followed it.

The site that closes the cluster loop — built on the application platform once it was ready, the surface that makes the rest of the cluster legible to a reader.

Authored end-to-end by an agent roster.

  • Career editor wrote the CV bullets; frontend engineer shaped layout and rendering rules; portfolio curator wrote these entries; the orchestrator dispatches and synthesises; the HR role is the only role permitted to design new roles.
  • Each agent has a written profile in team/ — role, reporting line, scope boundaries, operating instructions, anti-patterns, escalation. The profiles read like employment contracts. I review and approve; the rosters build.

Architecture.

  • Astro 5 static site, Tailwind v4 with CSS-variable theme tokens, MDX content collections under src/content/<section>/<locale>/.
  • Per-locale collections are separate Astro collections, not one collection with a locale discriminator — the missing-translation-hides rule is enforced by the loader itself, not by per-page filters. That closes off the class of bug where one forgotten filter leaks EN content into the DE site.
  • Static-first with a small server-rendered surface (contact endpoint + healthcheck), on the application platform’s tenant contract (port, UID, healthcheck).

Hiring Calibration — Cold Reader vs. Inside Editor

Completed three-round multi-agent calibration exercise on this very CV. Eight role profiles drafted by the career editor; a separate hiring-side agent reads only the public site plus the profiles, scores the candidate cold, and writes the verdict; the career editor then reviews the cold read as a calibration check. Each round closed the gap the previous round named and surfaced the next — existence, then duration, then inspectable depth. Closes with a clean terminal finding: the remaining Stretch on the Senior-IC agentic roles is the intended trade-off of an EM-calibrated CV, not a gap to close.

A completed three-round exercise against the agent roster — round 1 on 2026-05-16, rounds 2 and 3 on 2026-05-29. The question: how does this CV read to a hiring manager who’s never met me, and how does that cold read compare against the inside editor’s grounded view? The re-runs made it an instrument — each round I changed one thing the cold reader asked for, held everything else constant, and re-ran to see whether the verdict moved.

Setup.

  • Roster: orchestrator, career editor, HR/role-scoping, frontend engineer, portfolio curator, and — added here — a hiring-side CV-screening role. HR drafted its profile; the orchestrator dispatched. Each re-run used a fresh hiring-side instance, no memory of the round before.
  • Cold reader ran in strict blind / external review mode. Input lock: the public site, the public repos it links to (/projects repo → links), the eight role profiles. Forbidden: every other internal artefact (sister profiles, source notes, grounding dossier, personal-context stores, private repos). The deliverable listed artefacts read, so mode-bleed would show. It scraped the site, wrote a candidate brief, scored each role in bands (strong / plausible / stretch / no fit) with rationale, closed on a verdict.
  • Eight role profiles drafted by the editor to triangulate the next-rung landscape — Senior-IC AI-shaped, EM, Director-leaning, across financial services and big-tech, mostly Zürich-anchored, one Singapore-shaped, one EU-remote. Archetypes only. Deliberately wider than the CV’s calibrated audience. Held constant across all three rounds — that’s the control.

Round 1. Two strong fits (the wealth-management EM shapes the CV is calibrated for); three stretches on the Senior-IC agentic roles, all turning on one thing — no externally verifiable AI-build artefact on the site; two stretches at Director-leaning scope (the manager-of-managers tour the CV doesn’t show); one no-fit (a tech-native platform-org Director role where the substrate language doesn’t transfer). Highest-leverage edit named: a public link from the AI-build cluster to the shipped code.

Between rounds. Four build-cluster repos published as curated public snapshots; /projects cards gained repo → links. The exact edit round 1 named.

Round 2. The fresh reader followed the links and read the repos. On round 1’s question (does a verifiable artefact exist?) the answer flipped: the cluster read as real, shipped, openly-licensed work. Yet the Senior-IC roles held at Stretch. The reader walked past existence to a new question: “how long has the work existed?” The mirrors are single squash-snapshot commits (publication hygiene, not history), so the build reads months old, not years, and the Senior-IC bar turns on sustained time-on-task. EM shapes held strongest; Director roles honest stretches; tech-native role weakest transfer. Same gap, one layer deeper. Highest-leverage edit: signal time-on-task and density (window duration, work density), not development history. Landed as a sourced density edit on the build-cluster copy.

Round 3. A fresh reader re-ran the eight after the density edit. It worked: this reader accepted the duration as fact and stopped contesting it. Again the Senior-IC roles held at Stretch; the gap moved from “how long?” to “is the depth inspectable?” The reader now wanted rigour shown in code, not asserted in prose: eval harnesses, observability, cost telemetry, legible in the repos. The rest of the gradient held, with the two EM shapes strengthening into the strongest fits across all three rounds.

Terminal finding. The arc is one gap walking down through layers: existence → duration → inspectable depth. At the third layer the question is no longer answerable by an edit, because it’s no longer about the surface. The build is genuinely only a few months old — since April 2026 — and personal-scale; a “2+ years agentic / ≥50% code” bar doesn’t clear on months of time-on-task, no matter how inspectable. By round 3 the surface reads the candidate correctly: the remaining Stretch is not a gap in the artefact but the intended trade-off of an EM-calibrated CV — tuned to land the wealth-management EM shapes, letting the reader infer IC-capability without claiming a Senior-IC agentic bar it was never calibrated to clear. That’s where a calibration instrument stops. The exercise closes here.

The inside editor. After each cold read, the career editor reviewed it as a calibration check: public surface against her grounded read of actual-me, after the cold read, not alongside. Round 1 verdict: the cold read tracked the inside read closely, including on the gaps. The surface under-represented three things (working-business language proof, the density of the current concurrent-role load, external verifiability of the build cluster) and the reader caught all three. Over the full exercise: a gap that walks down two clean layers across three controlled rounds and then bottoms out in design intent is her strongest evidence the instrument has real resolution.

The mode discipline is the load-bearing seam. Both views are of the same person; what makes it useful is that one is input-locked and the other isn’t. The cold reader sees the surface as a surface; the editor sees through it to the candidate she’s been writing for. The re-runs find the next layer of the same gap and tell you when it’s no longer a fix. Her one-line summary: “the cold reader does the thing I can’t — she sees the surface as a surface.”

Cross-model replication. A self-contained prompt is published for any chat model with web access (Gemini, ChatGPT, Claude.ai) — same inputs, same rules, same output shape. The input lock admits the public repos the site links to, matching round 2; every other source stays forbidden. The inside-editor step is deliberately omitted — reproducing it outside the local roster would defeat the comparison. Prompt at /hiring-calibration-prompt.md; copy it into a fresh browsing chat and run the same blind read.

What it is and isn’t. A calibration instrument for one candidate’s public material against a defined audience — not a generic CV-grader, not a product, not advice. The discipline (input-locked cold reader against a separate grounded reviewer, re-run as a control after a named edit) transfers; the specific roster does not.

Dashboard — The Read Surface for Life-Ops

Invite-only multi-user life-ops dashboard for family use. Next.js 16 App Router on the application platform, per-tenant gotrue for password + magic-link auth, Postgres RLS as the isolation mechanism. The read counterpart to the personal-ops bridge: the home-NUC daemon POSTs curated streams over an authenticated ingest API; the dashboard presents them as cards — calendar + weather, agent-curated feed, watchlist, todos, health snapshot.

The read surface paired with the personal-ops bridge. The bridge takes input through Telegram and pushes work into a roster on the home host; the dashboard is the inverse: a web app the bridge writes into through an authenticated ingest API, where curated streams land as collapsible cards. Next tenant onboarded onto the application platform after this site.

Status. In active development. Single-user today, not yet opened to invitees. Substantial roadmap ahead.

Architecture invariants.

  • Next.js 16 App Router. @supabase/ssr used only for the auth flow; data access bypasses the supabase client and goes through pg with a per-tenant Postgres role that does not bypass RLS. auth.uid() resolves via a request.jwt.claim.sub setting applied per transaction by a withUser() helper.
  • Postgres RLS is the isolation mechanism: every user-owned table carries user_id with an auth.uid() = user_id policy, no service-role bypass. Generalises the single-then-multi-user pattern from the Learning Platform to N family-sized users.
  • Invite-only by design: no sign-up UI, signup disabled at the auth-substrate layer too; users provisioned via the gotrue admin API.
  • Ingest auth kept simpler than a JWT exchange: the daemon sends a 32-byte random hex bearer token; the server compares its SHA-256 against the stored hash. Tokens rotatable/revocable from Settings, optional per-user IP allow-list in middleware, every ingest call audited to api_audit_log.
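The ingest-token scheme in the last bullet can be sketched with the Python standard library. The function names are hypothetical, and hashing the hex string (rather than the raw bytes) is an assumption; the source states only a 32-byte random hex token and a SHA-256 comparison against a stored hash.

```python
import hashlib
import hmac
import secrets


def issue_token() -> tuple[str, str]:
    """Return (token, stored_hash). Only the hash is persisted server-side."""
    token = secrets.token_hex(32)  # 32 random bytes, rendered as 64 hex characters
    stored_hash = hashlib.sha256(token.encode()).hexdigest()
    return token, stored_hash


def verify_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented bearer token and compare against the stored hash."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(digest, stored_hash)
```

Because the token is high-entropy random data rather than a user-chosen password, a single unsalted SHA-256 is a reasonable fit: a leaked hash table cannot practically be brute-forced, and rotation amounts to issuing a new token and replacing the stored hash.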

Sections in v1. Today (calendar + Open-Meteo weather), Breaking (unread items flagged by the curator-agent classifier), Feed (topic-filterable, pinned-first, three sort modes), Watchlist (unwatched podcasts + YouTube), Source Approvals (feed-source proposal queue), Todo (Google Tasks round-trip), Health (Garmin daily aggregates — sleep, weight trend, training-load TSB). Per-user collapsed state persisted in user_settings.

Deployment. Tenant on the application platform (see Keystone) — full-substrate (DB + per-tenant gotrue), assigned port and UID. CI runs two web builds per push (prod/test differ on the gotrue host, inlined into NEXT_PUBLIC_* at build time); one migrator image, idempotent SQL ladder on both environments, single manual gate at prod-migrate.

Where it sits. First app on the platform designed as multi-user from the start, generalising the RLS pattern the Learning Platform grew into (single user first, multi-user later) to N family-grade invitees. Roster scoped and built; I sat as product owner.
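The RLS pattern rests on the per-transaction claim setting described under Architecture invariants. A Python analogue of that idea follows, for illustration only: the actual withUser() helper lives in the Next.js app on top of pg, and its code is not shown in the source. `with_user` is a hypothetical name, and `conn` is assumed to be any DB-API 2.0 Postgres connection using the %s parameter style (psycopg, for example).

```python
from contextlib import contextmanager


@contextmanager
def with_user(conn, user_id: str):
    """Run a block in one transaction with the JWT subject set for RLS.

    set_config(name, value, true) scopes the setting to the current
    transaction, so it is gone after commit or rollback and cannot leak
    into the next request that reuses a pooled connection.
    """
    cur = conn.cursor()
    try:
        cur.execute(
            "SELECT set_config('request.jwt.claim.sub', %s, true)",
            (user_id,),
        )
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Queries issued through the yielded cursor run under the tenant role, so a policy of the form auth.uid() = user_id filters rows without any per-query WHERE clause in application code.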