About

I spent 15 years as a cognitive behavioural therapist before I switched to software engineering. The through-line is the same problem: people (and now machines) produce confident, coherent output that is sometimes completely wrong, and the interesting work is building systems that catch it.

I write TypeScript, Python, and Go. I’ve shipped production code at EDITED (retail analytics), Brandwatch (social intelligence), Telesoft (network security), and School Business Services. Five years in engineering, fifteen in psychology.

What I’m building

Sortie — async adversarial multi-model code review. Runs Claude, Codex, and Gemini against your diff in parallel, synthesises their findings through a fourth-model debrief, and gates merges on convergent severity. Built after surveying 40+ tools and 30+ papers. 106 tests, Python.
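The gating idea can be sketched in a few lines. This is a minimal illustration, not Sortie's actual API: the reviewer callables, finding shape, and thresholds are all hypothetical, and it assumes "convergent severity" means an issue flagged at high severity by at least two independent reviewers.

```python
# Hypothetical sketch of parallel multi-model review with a convergence gate.
# Reviewer callables and the findings schema are invented for illustration.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

SEVERITY = {"low": 1, "medium": 2, "high": 3}

def convergent_blockers(findings_per_model, min_models=2, min_severity="high"):
    """Issue ids flagged at >= min_severity by >= min_models reviewers."""
    hits = Counter()
    for findings in findings_per_model:
        seen = {f["id"] for f in findings
                if SEVERITY[f["severity"]] >= SEVERITY[min_severity]}
        hits.update(seen)  # count each model at most once per issue
    return {issue for issue, n in hits.items() if n >= min_models}

def review_diff(diff, reviewers):
    # Each reviewer maps a diff to a list of findings; run them in parallel.
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(lambda review: review(diff), reviewers))
    blockers = convergent_blockers(findings)
    return {"merge_allowed": not blockers, "blockers": sorted(blockers)}
```

The point of the convergence threshold is that a single model's lone high-severity finding does not block the merge; only findings that independent reviewers agree on do.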

Pidgeon — carrier-agnostic shipping rate integration built twice from the same spec. Once with outside-in TDD and 48 cross-model adversarial reviews (18 hours). Once with a parallel agent swarm (1 hour). Case study →

Tells — a field taxonomy of LLM output patterns caught in the wild. Epistemic theatre, paper guardrails, analytical lullabies. 60+ entries with detection heuristics and programmatic detectors.
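To make "programmatic detector" concrete, here is a toy example in the spirit of the taxonomy. The tell, the regex, and the function are invented for this sketch and are not taken from the actual entries.

```python
import re

# Hypothetical detector for a confident-framing-then-retreat pattern:
# an authoritative opener followed shortly by a quiet walk-back.
# Pattern and naming are illustrative only, not from the real taxonomy.
CONFIDENT_RETREAT = re.compile(
    r"\b(it'?s worth noting|to be clear|importantly)\b"
    r".{0,80}?\b(however|but)\b",
    re.IGNORECASE | re.DOTALL,
)

def detect_tell(text: str) -> bool:
    """Flag output where a confident framing is undercut within ~80 chars."""
    return bool(CONFIDENT_RETREAT.search(text))
```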

The Pit — the largest project. A multi-agent evaluation platform: structured contests between agent configurations with observable traces, rubric-based scoring, failure taxonomy, and cost visibility. 1,300+ commits, 1,503 tests, ~28k LOC, 329 architectural decisions. Currently building the run/evaluation engine. The Gauntlet, Darkcat, Tells, and Sortie all originated as process infrastructure for this project. Project page →

oceanheart.ai — this site. Learning in public about what happens when you give LLMs real responsibilities and then hold them to account.

Contact