
Governance, Process, and Enterprise Integration

Step 10 of 11 in Bootcamp II: Agentic Engineering Practices.


Why This Step Exists

Enterprise engineering does not happen in isolation. There are existing workflows, existing review cultures, existing compliance requirements. There are teams of people who have spent years building processes that work. When agentic engineering arrives, it must integrate with those processes or it will be rejected - not because the technology is wrong, but because it violates expectations that exist for good reasons.

Step 6 established that the quality gate is survival. Step 7 established that the human reviewer is the irreducible verification layer. This step addresses the question that follows from both: how do you structure the work so that agents, humans, gates, and existing enterprise processes compose into a system that ships code with confidence?

The failure mode this step equips you to prevent is not a technical failure. It is an organisational one. A team adopts agentic tools, generates code faster, and ships bugs faster. Pull requests grow to 67 files touching 6 unrelated concerns. Reviews degrade into rubber stamps because the reviewer cannot hold the entire diff in working memory. The CI/CD pipeline passes because the tests were generated by the same system that generated the code. The team is moving at machine speed with human-speed verification, and nobody notices the drift until something breaks in production.

The controls in this step are not new inventions. Most of them are established engineering practice: atomic commits, code review, CI/CD gates, Lean value streams, Kanban pull, architecture decision records. What is new is their application to a workflow where a non-human participant generates code at a rate and volume that existing processes were not designed to handle. The engineering loop, the bearing check, the HOTL/HODL spectrum, the muster format, and the named anti-patterns (stowaway commit, review hydra) are operational instruments developed during 200+ sessions of daily agentic engineering practice. They address failure modes specific to human-agent collaboration.

FIELD MATURITY: EMERGING. The component technologies are established: CI/CD quality gates (standard DevOps), atomic commits (standard git practice), PR-based review (GitHub/GitLab standard), CRM readback protocol (Helmreich 1999, 40+ years empirical validation), Kanban pull (Lean, Womack & Jones 1996), architecture decision records (Nygard 2011). Application to agentic engineering is emerging. Novel from this project: the bearing check as a repeatable governance unit with defined checks and cost budget, the HOTL/HODL spectrum as an explicit dial tied to the verifiable/taste-required distinction, stowaway commit and review hydra as named anti-patterns, the muster format for O(1) decision throughput, and the macro workflow as a complete cadence from bearing check through merge. These are engineering instruments, not research findings. If you recognise the failure modes in your own work, the instruments apply. If you do not, they are a vocabulary waiting for experience to activate.

The goal: build a governance discipline that lets teams ship agent-assisted work at machine speed without sacrificing the verification, auditability, and review culture that enterprise engineering requires.


Table of Contents

  1. The Engineering Loop (~30 min)
  2. Atomic Changes (~35 min)
  3. Review Protocols (~40 min)
  4. The HOTL/HODL Spectrum (~40 min)
  5. Integration with Existing CI/CD (~30 min)
  6. Team Adoption Patterns (~35 min)
  7. Change Management: Readback and Decision Tables (~40 min)
  8. Bearing Checks (~30 min)
  9. Pull-Based Review (~25 min)
  10. Challenges (~60-90 min)
  11. Key Takeaways
  12. Recommended Reading
  13. What to Read Next

1. The Engineering Loop

Estimated time: 30 minutes

Every change to a codebase - whether made by a human, an agent, or a human-agent pair - follows the same five-step cycle:

Read -> Verify -> Write -> Execute -> Confirm

This is not a suggestion. It is the governing discipline. Each step has a specific purpose, and skipping a step has specific consequences.

Read

Understand existing code and patterns before changing anything. This applies equally to humans opening a codebase for the first time and agents loading context for a task. An agent that writes code without reading the surrounding implementation is generating from its training distribution, not from the project’s actual patterns. A human who skips reading the existing tests before adding a feature will write tests that duplicate, contradict, or miss the existing test strategy.

For agents, “Read” is the context engineering problem from Step 4. The working set - the minimum context for the current job - must be loaded before the agent writes a single line. If it is not loaded, the agent is in the dumb zone: syntactically valid output, semantically disconnected from the project.

Verify

Confirm assumptions with idempotent commands before acting on them. Do not infer what you can verify. This is the single most important sentence in the loop.

# Instead of assuming a tool is installed:
node --version

# Instead of assuming a directory exists:
ls -la src/lib/auth/

# Instead of assuming a service is running:
curl -sf http://localhost:3000/health || printf 'Service not running\n'

# Instead of assuming the branch is up to date:
git fetch origin && git log --oneline origin/main..HEAD

When an agent skips Verify, it generates based on what it assumes to be true. When the assumption is wrong, the generated code is structurally coherent but factually incorrect - it references a function signature that changed last week, writes to a database table that was renamed, or uses an API endpoint that no longer exists. These defects are invisible to the gate because the gate tests what the code does, not whether the code’s assumptions about its environment are correct.

Write

Implement changes following existing conventions. For agents, this means the generated code should look like it belongs in the codebase - same patterns, same error handling style, same naming conventions, same test structure. The context quality loop from Step 4 applies here: clean code produces better context for future agent runs, which produces cleaner code. Slop introduced at this step compounds.

One critical constraint: commits are atomic with conventional messages. Each commit captures one logical change. This is Section 2 of this step.

Execute

Run the gate. Always. Not “run the tests that seem relevant.” Run the gate.

# The gate in this project:
pnpm run typecheck && pnpm run lint && pnpm run test

# Your project's gate might be:
make gate
# or:
npm run check
# or:
./bin/gate

If the gate fails, the change is not ready. Do not negotiate with the gate. Do not skip a failing test because “it is unrelated.” Do not merge with a plan to fix the test later. The gate is survival; everything else is optimisation.
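
A fail-fast gate script makes "do not negotiate with the gate" mechanical: checks run in a fixed order and the first failure stops everything. A minimal sketch, assuming a POSIX shell and a hypothetical run_gate helper; the echo commands below stand in for real checks like pnpm run typecheck:

```shell
#!/bin/sh
# Hypothetical fail-fast gate runner: run each check in order,
# stop at the first failure, report the failing step.
run_gate() {
  for step in "$@"; do
    printf 'gate: running %s\n' "$step"
    if ! sh -c "$step"; then
      printf 'gate: FAILED at "%s" - change is not ready\n' "$step"
      return 1
    fi
  done
  printf 'gate: PASSED\n'
}

# Example invocation with stand-in commands:
run_gate "echo typecheck ok" "echo lint ok" "echo tests ok"
```

A real bin/gate would pass the project's actual commands; the point is that the script, not the operator, decides when the gate has passed.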

Confirm

Verify the output matches intent. This is the step that separates mechanical execution from engineering. The gate confirms the code is technically correct. Confirm checks whether the code does what was actually intended.

For verifiable tasks, the gate is sufficient confirmation. For taste-required tasks, a human must review the output. This distinction - verifiable vs taste-required - is the governance dial introduced in Step 6 and operationalised in Section 4 of this step.

AGENTIC GROUNDING: The engineering loop applies to agents with a specific asymmetry: agents are good at Write and Execute. They can generate code and run tests at machine speed. They are weak at Read (they depend on whatever context was loaded), weak at Verify (they tend to assume rather than check), and incapable of Confirm for taste-required work. The Operator’s job is to compensate for the weak steps. If you find yourself only reviewing finished output, you have delegated Read, Verify, Write, Execute, and Confirm to the agent - and you have no independent verification of four of those five steps.


2. Atomic Changes

Estimated time: 35 minutes

One PR = one concern. This is standard git practice with decades of history. It exists because atomic changes are individually revertible, individually reviewable, and individually understandable. A commit that touches 5 files in one domain for one reason can be reviewed in minutes. A commit that touches 67 files across 6 domains for 6 reasons cannot be reviewed at all - it can only be approved.

Why Atomic Matters More for Agents

Experienced human developers self-limit scope naturally. After years of practice, a senior developer working on a login feature does not also refactor the payment module because it happens to be open in the editor. The developer has learned, through years of painful reviews and reverted merges, that scope creep kills velocity.

Agents have no such learned constraint. An agent given the instruction “fix the login error handling” may also:

  • Rename variables it considers poorly named
  • Add JSDoc comments to functions that lacked them
  • Update import ordering to match its preferred style
  • Fix a typo it noticed in an unrelated error message
  • Refactor a utility function it called during the fix

Each of these changes may be individually correct. Together, they produce a diff that mixes the intentional fix with five unrelated improvements, making it impossible to review the actual fix in isolation and impossible to revert the fix without also reverting the improvements.

The Stowaway Commit

The stowaway commit is the named anti-pattern for this failure mode: multiple unrelated concerns smuggled into a single commit.

Detection heuristic: three or more comma-separated concerns in the commit message, or 40+ files across unrelated directories.

# Stowaway commit - what you see in git log:
commit a1b2c3d
Author: agent-pipeline
Date:   Mon Mar 10 14:32:17 2026 +0000

    fix: login error handling, update docs, refactor utils, add missing
    types, clean up imports, fix typo in payment error

 67 files changed, 1247 insertions(+), 892 deletions(-)

This commit is unreviewable. It is unrevertible as a unit. It violates every principle of atomic change. And it is the default output of an unconstrained agent given a broad task.
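
The detection heuristic can be turned into a cheap automated check. A sketch in POSIX shell, using a hypothetical check_subject helper; the threshold (three or more comma-separated concerns, i.e. two or more commas in the subject line) is the one stated above:

```shell
# Hypothetical stowaway detector for a commit subject line.
# Threshold: 3+ comma-separated concerns means 2+ commas.
check_subject() {
  commas=$(printf '%s' "$1" | tr -cd ',' | wc -c)
  if [ "$commas" -ge 2 ]; then
    echo "possible stowaway"
  else
    echo "ok"
  fi
}

check_subject "fix: login error handling, update docs, refactor utils"  # flags
check_subject "fix: login error handling"                               # passes
```

In practice you would run this over `git log --format='%s'` for the branch and pair it with a file-count check against the 40+ files threshold.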

The fix is structural, not instructional. Telling the agent “make atomic commits” is a paper guardrail - a rule stated without an enforcement mechanism. The effective fix is to decompose the task before dispatch:

# Task decomposition before dispatch:
tasks:
  - id: login-error-handling
    scope: lib/auth/login.ts, lib/auth/login.test.ts
    constraint: "Touch ONLY these files. No refactoring, no style changes."
  - id: docs-update
    scope: docs/auth/
    constraint: "Update documentation to reflect new error handling."
  - id: utils-refactor
    scope: lib/common/utils.ts, lib/common/utils.test.ts
    constraint: "Separate PR. Do not bundle with other changes."

Each task produces one commit. Each commit is reviewable. Each is revertible. The decomposition is the discipline. The agent is the executor.
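
Scope constraints are only real if something checks them. A minimal sketch of such a check, assuming a hypothetical check_scope helper fed with the allowed-file list from the task and the files the diff actually touched (in practice, the output of `git diff --name-only`):

```shell
# Hypothetical scope check: fail if a diff touches files
# outside the task's declared file list.
check_scope() {
  allowed=$1
  shift
  for f in "$@"; do
    case " $allowed " in
      *" $f "*) ;;  # file is in the allowed list
      *) printf 'out of scope: %s\n' "$f"; return 1 ;;
    esac
  done
  echo "scope ok"
}

# The second file violates the login-error-handling task's scope:
check_scope "lib/auth/login.ts lib/auth/login.test.ts" \
  lib/auth/login.ts lib/common/utils.ts
```

Wired into CI, any output other than "scope ok" rejects the PR before a human spends review time on it.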

Decomposition as a Skill

Breaking a feature into atomic PRs is a skill that improves with practice. The heuristic:

  1. One domain. If the change touches files in lib/auth/ and lib/payments/, it is at least two PRs.
  2. One type of change. Bug fix, feature addition, refactor, documentation update, and dependency upgrade are each their own PR. Mixing types makes review harder and reverts riskier.
  3. One reviewer can hold it. If the diff is too large for a single reviewer to hold in working memory (typically 400 lines of meaningful change), it is too large for one PR.
  4. One sentence describes it. If the PR description requires multiple paragraphs to explain what was done, the PR is doing too many things.
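
The four heuristics can be folded into a single pre-dispatch check. A sketch with a hypothetical atomic_ok helper; the domain and change-type counts would come from your task plan, and the 400-line budget is the one from rule 3:

```shell
# Hypothetical atomicity check: one domain, one change type,
# and a diff a reviewer can hold in working memory.
atomic_ok() {
  domains=$1 types=$2 lines=$3
  if [ "$domains" -eq 1 ] && [ "$types" -eq 1 ] && [ "$lines" -le 400 ]; then
    echo "atomic: dispatch as one PR"
  else
    echo "not atomic: split before dispatch"
  fi
}

atomic_ok 1 1 250   # one domain, one change type, reviewable size
atomic_ok 2 1 250   # two domains: at least two PRs
```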

HISTORY: The atomic commit convention predates git. CVS and Subversion communities documented “one logical change per commit” in the early 2000s. Git made it practical by making branching cheap and rebasing non-destructive. The innovation was not the principle but the tooling that made the principle costless to follow. Agents reverse this: the tooling makes it costless to violate the principle, which means the principle must be enforced externally.

AGENTIC GROUNDING: Before dispatching an agent to implement a feature, decompose the feature into atomic tasks. Each task should name the files it may touch, the type of change (fix, feature, refactor, docs), and a constraint on scope. The agent does not need to agree with the decomposition. The agent needs to execute within it. If the agent produces a PR that violates the scope constraint, the correct response is not to merge it. It is to reject the PR, tighten the constraint, and re-dispatch. The cost of a rejected PR is minutes. The cost of a merged stowaway commit is hours of archaeological review when something breaks.


3. Review Protocols

Estimated time: 40 minutes

Reviewer != author. This is the foundational review principle and it applies to agent-generated code with additional force. When a human writes code and another human reviews it, the reviewer brings different experience, different assumptions, and different blind spots. When an agent writes code and a human reviews it, the asymmetry is even greater - the agent has no understanding of intent, only of pattern, and the human must verify not just correctness but whether the correct-looking output actually serves the intended purpose.

When the same agent that wrote the code also writes the review summary - “I have verified that all tests pass and the implementation matches the specification” - you have zero additional information. The agent’s assessment of its own output has the same statistical properties as the output itself. Self-review is the degenerate case (N=1 ensemble), not an edge case.

The Review Checklist

Every review of agent-generated code should address these questions, in order:

  1. Does it do what it says? Read the PR description. Read the diff. Do they match? An agent may describe its changes inaccurately - not because it is lying, but because the description was generated before or after the code, not from the code.

  2. Does it do anything it does not say? This is where stowaway changes hide. Check git diff --stat for files you do not expect. Check git log --oneline for the commit message matching the actual diff.

# Quick stowaway detection:
git diff --stat main..feature-branch

# If the diff touches directories outside the expected scope
# (--name-only avoids the summary line that --stat appends):
git diff --name-only main..feature-branch | grep -v '^lib/auth/'
# Any output here is unexpected scope

  3. Edge cases? Agent-generated code often handles the happy path and the most common error path. It frequently misses: null/undefined inputs, empty collections, concurrent access, network timeouts, partial failures, and boundary values. Check the code paths that the tests do not cover.

  4. Follows patterns? Does the new code match the existing patterns in the codebase? If every other service uses a specific error handling pattern and the agent invented a new one, that is a defect regardless of whether the new pattern works.

  5. Error handling? Does the code handle errors in a way that produces actionable information? Or does it catch (error) { throw error } - which is syntactically valid and semantically useless?

  6. Architecture and intent? This is the taste-required question. The code may be correct at the function level but wrong at the architecture level. An agent asked to “add caching” may add a cache that works but creates a consistency problem with the database. The reviewer must evaluate architectural fit, not just functional correctness.

The Review Hydra

The review hydra is the anti-pattern that follows from poor review handling. A reviewer identifies 25 issues across 28 files. The agent (or developer) addresses all 25 issues in a single “address review feedback” commit.

Detection heuristic: commit message containing “address” + “review,” combined with 10+ files touching unrelated concerns.

# Review hydra - what you see in git log:
commit d4e5f6a
Author: agent-pipeline
Date:   Mon Mar 10 16:45:22 2026 +0000

    fix: address review feedback

 28 files changed, 534 insertions(+), 312 deletions(-)

This is the stowaway commit in review clothing. The original 25 issues were each a separate concern. Addressing them all in one commit means:

  • You cannot verify that each issue was addressed individually
  • You cannot revert one fix without reverting all of them
  • The reviewer must now re-review 28 files to verify 25 separate fixes
  • If one of the 25 “fixes” introduced a new bug, bisecting will point to this commit, which touches 28 files - you are back to manual archaeology

The fix: triage review findings explicitly. For each finding:

  • Will fix: separate commit per fix, or group by domain
  • Disagree: comment explaining why, with evidence
  • Later: create a backlog item with a reference

Each fix gets its own commit (or at minimum, each domain’s fixes are grouped). The review hydra is replaced by a series of individually verifiable commits.
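
Grouping fixes by domain can be sketched mechanically. A hypothetical group_findings helper that takes "will fix" findings as domain:description pairs and emits one planned commit per domain:

```shell
# Hypothetical triage helper: group review findings by domain
# so each domain's fixes land in one verifiable commit.
group_findings() {
  printf '%s\n' "$@" | cut -d: -f1 | sort -u | while read -r domain; do
    printf 'planned commit: fix(%s): address review findings\n' "$domain"
  done
}

group_findings "auth:null session unhandled" "auth:error copy unclear" \
  "payments:rounding bug"
```

Three findings in two domains produce two planned commits, each reviewable and revertible on its own.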

Findings Before Merge, Not After

Review findings are resolved before merge. Not in follow-up PRs. Not in “tech debt” tickets that never get prioritised. Before merge.

The rationale is simple: a follow-up PR has lower priority than the next feature. It will sit in the backlog until someone notices the bug it was supposed to prevent. The merge is the enforcement point. If findings are not resolved before merge, they are not resolved.

This rule feels expensive in the moment. It saves time over any horizon longer than a week.

AGENTIC GROUNDING: When an agent generates a PR and you leave review comments, the worst thing you can do is ask the agent to “address all review feedback.” You will get a review hydra. Instead, give the agent one finding at a time, or group findings by domain, and require separate commits. The agent will comply - agents are good at following explicit structural constraints. The constraint must come from the reviewer, not from the agent’s initiative.


4. The HOTL/HODL Spectrum

Estimated time: 40 minutes

Step 6 introduced the verifiable/taste-required distinction as the load-bearing governance decision for probabilistic systems. This section operationalises that distinction as a practical dial.

HOTL - Human Out The Loop. Machine speed. Plan, execute, review. The human does not steer mid-execution. The plan is the input, the finished output is what gets reviewed.

HODL - Human grips the wheel. Every step requires human approval. Execution happens at human tempo. The human sees intermediate output and redirects before the agent proceeds.

These are not two options. They are endpoints of a spectrum, and every task sits somewhere on that spectrum. The primary governance decision for each task type is: where on the spectrum does this task belong?

The Decision Rule

HOTL when the gate can verify. HODL when it requires taste.

This is not a heuristic. It is the operational rule. If the task’s correctness can be fully determined by automated checks - type checker, linter, test suite, schema validation - then the human does not need to be in the loop during execution. The human reviews the finished output against the gate results. This is HOTL.

If the task’s correctness requires human judgment - architectural decisions, API design, user-facing copy, security-sensitive logic, anything where “technically correct” is insufficient - then the human must be in the loop at decision points. This is HODL.
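
The decision rule is simple enough to state as code. A sketch with a hypothetical choose_mode helper that maps what the gate can verify to a mode; the three-way split mirrors the spectrum positions in the next section:

```shell
# Hypothetical governance dial: verifiability class in, mode out.
choose_mode() {
  case "$1" in
    full)    echo "HOTL: plan, dispatch, review result" ;;
    partial) echo "HOTL execution, HODL review of the full diff" ;;
    none)    echo "HODL: step-by-step with approval" ;;
    *)       echo "unknown verifiability class"; return 1 ;;
  esac
}

choose_mode full   # e.g. fix a failing unit test
choose_mode none   # e.g. refactor for readability
```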

The Spectrum in Practice

  • Fix a failing unit test. Gate can verify: yes - test passes or it does not. Position: full HOTL. Mode: plan, dispatch, review result.
  • Implement a new API endpoint to spec. Gate can verify: partially - types and tests verify structure, not intent. Position: HOTL with review. Mode: dispatch, review diff before merge.
  • Refactor a module for readability. Gate can verify: no - readability is taste. Position: HODL. Mode: step-by-step with approval.
  • Write user-facing error messages. Gate can verify: no - tone and clarity are taste. Position: HODL. Mode: draft, human edits, final review.
  • Add logging to existing functions. Gate can verify: yes - presence of log statements is verifiable. Position: full HOTL. Mode: plan, dispatch, spot-check.
  • Design a new database schema. Gate can verify: no - schema design involves tradeoffs. Position: HODL. Mode: human designs, agent implements.
  • Update dependencies and run tests. Gate can verify: yes - tests pass or they do not. Position: full HOTL. Mode: automated pipeline.
  • Write a security-sensitive auth flow. Gate can verify: partially - tests verify behaviour, not threat model. Position: HODL with gate assist. Mode: human reviews security model, agent implements, human reviews implementation.
  • Generate migration scripts from schema diff. Gate can verify: yes - migration runs or it does not. Position: HOTL with verification. Mode: dispatch, review SQL, apply.
  • Rename a variable across the codebase. Gate can verify: yes - typecheck verifies consistency. Position: full HOTL. Mode: plan, dispatch, gate confirms.

The middle of the spectrum - “partially verifiable” - is where most real work lives. The gate verifies the mechanical properties (types, tests, lint). The human verifies intent, architecture, and edge cases the tests do not cover. The practical implementation: run in HOTL mode (let the agent execute), then switch to HODL mode for the review (human inspects every file in the diff, not just the summary).

Sliding the Dial

The dial position must be an explicit decision, not a drift. Teams that start at HODL (reviewing every agent action) naturally drift toward HOTL (rubber-stamping agent output) as trust builds. This drift is dangerous because it is invisible - nobody decides to stop reviewing, they just review less carefully because “the agent has been reliable so far.”

HISTORY: The HOTL/HODL distinction maps to Toyota’s jidoka principle (Ohno 1988): automation with a human touch. Jidoka means the machine does the work, but the machine stops itself when it detects an abnormality. The human intervenes only when the machine cannot self-correct. In agentic engineering, the gate is the abnormality detector. When the gate passes, the machine did not detect a problem. When the gate fails, the machine stops. The human decides what happens next. HOTL is jidoka. HODL is manual operation. The choice between them is not about capability - it is about what the gate can verify.

AGENTIC GROUNDING: If you find yourself skipping the review of agent-generated PRs because “the CI passed,” you have drifted from HODL to HOTL without making that decision explicitly. This is the Bainbridge irony in action: the more reliable the agent appears, the less you review, and the less capable you become of catching the errors that slip through. The countermeasure is periodic deep engagement - at least once per week, review an agent-generated PR at the level of a senior developer code review, checking every file in the diff. Extended HOTL without periodic deep engagement degrades the expertise that makes HOTL safe.


5. Integration with Existing CI/CD

Estimated time: 30 minutes

Agents are participants in existing workflows, not replacements for them. The integration boundary is the pull request. The trust boundary is the gate. Everything else follows.

The Pull Request as Integration Boundary

A PR is the unit of work that enters the team’s review process. This is true regardless of whether the PR was written by a human, an agent, or a human-agent pair. The PR triggers:

  • Automated CI/CD checks (build, lint, typecheck, test suite)
  • Code review by a human who is not the author
  • Merge into the protected branch after both pass

Agent-generated PRs go through the same process. No shortcuts. No separate “AI fast lane.” The same checks, the same review, the same merge criteria.

This seems obvious, but the pressure to create shortcuts is real. “The agent generated this and all tests pass, can we skip review?” No. “It is just a dependency update, the pipeline is green.” Review it anyway - dependency updates can introduce breaking changes, security vulnerabilities, or subtle behavior modifications that tests do not cover.

The Gate as Trust Boundary

The gate (typecheck + lint + test suite) is the automated verification layer. It answers a specific question: does this code meet the mechanical correctness criteria? Types consistent? Lint rules satisfied? Tests pass?

The gate does not answer: is this the right approach? Does this match the intended architecture? Are the error messages helpful? Is the security model sound?

For agent-generated code, the gate is necessary but insufficient. It catches the defects that automated tools can detect. It does not catch the defects that are specific to probabilistic generation - code that has the shape of correctness without the substance.

# CI/CD pipeline with agent-aware gates:
pipeline:
  stages:
    - name: build
      commands:
        - pnpm install
        - pnpm run build
      # Same for human and agent PRs

    - name: verify
      commands:
        - pnpm run typecheck
        - pnpm run lint
        - pnpm run test
      # Same for human and agent PRs

    - name: agent-metadata
      commands:
        - git log --format='%an' -1  # Identify if agent-authored
        - git diff --stat             # Scope check
      # Additional context for reviewers, not a gate

    - name: review
      type: manual
      required_reviewers: 1
      rule: "reviewer != author"
      # Same for human and agent PRs

Agent-Generated PRs in Practice

A well-structured agent-generated PR looks like any well-structured PR:

  • Clear title following conventional commit format (fix:, feat:, refactor:)
  • Description stating what changed and why
  • Atomic scope - one concern per PR
  • Tests that verify the change
  • Passing CI before review is requested

The differences are in the review approach, not the process. When reviewing agent-generated PRs:

  1. Check scope first. git diff --stat before reading any code. If the scope is larger than expected, reject before reviewing.
  2. Read tests before implementation. Do the tests verify the requirement, or do they verify the implementation? (The “right answer, wrong work” check from Step 6.)
  3. Verify assumptions. The agent may have generated code based on assumptions about the codebase that are no longer true. Check that referenced functions, types, and modules actually exist and have the signatures the agent assumed.
  4. Check for pattern adherence. Does the new code follow the same patterns as the existing code in the same module? Pattern deviation is a signal that the agent generated from its training distribution rather than from the project’s context.
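
Step 3 of this checklist can be partially automated. A sketch with a hypothetical missing_symbols helper that compares the symbols a diff references against the symbols the codebase still defines; in practice the two lists would come from the diff and from git grep:

```shell
# Hypothetical assumption check: report symbols the agent's diff
# references that the codebase no longer defines.
missing_symbols() {
  referenced=$1 defined=$2
  for sym in $referenced; do
    case " $defined " in
      *" $sym "*) ;;  # still defined, assumption holds
      *) printf 'stale assumption: %s is not defined\n' "$sym" ;;
    esac
  done
}

# hashToken was renamed since the agent's context was loaded:
missing_symbols "loginUser hashToken" "loginUser verifySession"
```

Any output is a signal that the agent generated against a stale picture of the codebase, which the gate may or may not catch.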

AGENTIC GROUNDING: The most common failure in CI/CD integration is not technical - it is social. Teams create an informal “AI lane” where agent-generated PRs get faster review because “the agent is usually right.” This erodes review quality gradually and invisibly. The defence is structural: agent-generated PRs go through the same process, in the same queue, with the same review depth. If the process is too slow for agent velocity, the problem is the process, not the principle. Fix the process without compromising the review.


6. Team Adoption Patterns

Estimated time: 35 minutes

Teams do not adopt agentic engineering in a single step. Trust builds incrementally, and each level of trust requires evidence from the previous level. Skipping levels produces either fear (the team does not trust the tools) or recklessness (the team trusts the tools too much, too fast).

The Four Levels

Level 1: Code Completion (Autocomplete)

The agent completes individual lines or small blocks. The developer writes the intent and the agent predicts the syntax. The developer accepts, rejects, or modifies each suggestion in real time.

  • Risk: Low. Every suggestion is reviewed before acceptance.
  • Verification: The developer’s immediate judgment.
  • HOTL/HODL position: Full HODL. The human approves every action.
  • Adoption resistance: Minimal. Feels like a better autocomplete.
  • Watch for: Over-acceptance. Developers begin accepting suggestions without reading them because “it is usually right.” This is the seed of cognitive deskilling.

Level 2: Single-File Agents (Write a Function to Spec)

The agent generates a complete function, class, or module to a specification. The developer provides the spec (function signature, behavior description, edge cases) and reviews the complete output.

  • Risk: Medium. The output may be plausible but wrong.
  • Verification: Code review + test suite. The gate must be in place before Level 2.
  • HOTL/HODL position: HOTL for generation, HODL for review.
  • Adoption resistance: Moderate. “I could write this faster myself” is common - and is sometimes true. The value is not speed on individual tasks; it is sustained throughput across many tasks.
  • Watch for: Test quality erosion. If the agent writes both the code and the tests, the tests verify the code, not the requirement. Human-written tests or human-reviewed test specifications are the defence at this level.

Level 3: Multi-File Orchestration (Implement a Feature)

The agent generates changes across multiple files to implement a feature. This requires the agent to understand the codebase’s architecture, follow existing patterns, and coordinate changes across module boundaries.

  • Risk: High. Cross-file changes are harder to review and more likely to violate architectural boundaries.
  • Verification: Code review + gate + architectural review. The reviewer must evaluate not just correctness but whether the agent’s approach matches the intended architecture.
  • HOTL/HODL position: HOTL for generation, HODL for review. The task decomposition from Section 2 (one PR = one concern) is mandatory at this level.
  • Adoption resistance: Significant. Senior developers may feel their architectural judgment is being bypassed. This concern is valid - it should be addressed by making senior developers the reviewers, not by removing them from the loop.
  • Watch for: The stowaway commit. Multi-file agents produce stowaway commits by default. Task decomposition before dispatch is required, not optional.

Level 4: Autonomous Pipeline Tasks (CI/CD Agents)

The agent operates as part of the CI/CD pipeline: dependency updates, migration generation, boilerplate scaffolding, documentation generation, automated refactoring with deterministic rules.

  • Risk: Varies by task. Dependency updates are lower risk than schema migrations.
  • Verification: Gate + human spot-check. The gate is the primary verification layer. Human review is periodic, not per-change.
  • HOTL/HODL position: Full HOTL with periodic HODL spot-checks.
  • Adoption resistance: Low for low-risk tasks (dependency updates). High for anything touching data or security.
  • Watch for: Blast radius. An autonomous agent that runs nightly and produces PRs is fine until it produces a PR that touches 200 files across every module. Size limits and scope constraints are mandatory at this level.

Adoption Realities

Most teams will not progress linearly through these levels. Common patterns:

The enthusiast jump. A developer goes from Level 1 to Level 3 because they had a good experience with code completion. They dispatch a multi-file task without the review infrastructure to support it. The first PR is fine. The second introduces a subtle bug that the test suite does not catch. The team blames the tool and retreats to Level 1.

The sceptic wall. A senior developer refuses to move past Level 1 because “I do not trust it.” This is rational, but it prevents the team from building the evidence that would calibrate their trust. The fix: ask the sceptic to be the reviewer for Level 2 tasks. They get to evaluate agent output at the level of rigor they require, and the team builds evidence about what the agent gets right and wrong.

The speed trap. The team measures adoption success by velocity (features per sprint, PRs merged per week). Velocity goes up. Quality goes down - but quality degradation is invisible in the short term. The defects introduced by agent-generated code do not show up as failing tests (the tests were generated alongside the code). They show up as production incidents weeks later. The fix: track defect rate alongside velocity, with a lag window long enough to catch agent-introduced defects.

The governance gap. Management mandates “use AI tools to improve productivity” without providing the review infrastructure, training, or process changes that safe adoption requires. The team adopts at Level 1-2 in the IDE and pretends to management that this constitutes the productivity improvement. Nobody moves to Level 3-4 because nobody has built the supporting processes.

AGENTIC GROUNDING: The adoption level should be an explicit team decision, not an individual one. If different developers operate at different levels without the team agreeing on review expectations for each level, you get inconsistent verification. One developer’s Level 3 PR gets the same review as another developer’s Level 1 suggestion. The team should agree: “We are at Level 2. Level 3 requires architectural review by a senior developer. Level 4 is not yet approved.” This is a governance decision, not a tooling decision.


7. Change Management: Readback and Decision Tables

Estimated time: 40 minutes

When a human gives an instruction to an agent, two things can go wrong: the instruction can be unclear, or the agent can misinterpret a clear instruction. Both produce the same outcome: the agent does something other than what the human intended. The cost of this misalignment is proportional to how long the agent runs before the error is detected.

Two established patterns address this: the readback protocol from Crew Resource Management, and the decision table (muster format) for efficient binary decisions.

The Readback Protocol

The readback protocol comes from aviation CRM (Crew Resource Management). Its empirical validation spans 40+ years of operational use in commercial aviation, where communication errors between cockpit crew members were identified as a primary cause of preventable accidents.

HISTORY: CRM grew out of specific disasters. Tenerife (1977): two Boeing 747s collided on a runway, killing 583 people. The primary cause was a captain who initiated takeoff without clearance while his co-pilot and flight engineer questioned the decision but did not challenge it effectively. United 173 (1978): a DC-8 ran out of fuel while the captain focused on a landing gear problem, ignoring crew members’ warnings about fuel status. These events led to the development of CRM - a structured communication discipline where readback is mandatory, authority gradients are calibrated, and challenging a superior’s decision is not just permitted but expected. Captain Al Haynes, who successfully crash-landed United 232 in 1989, credited CRM directly: “If I hadn’t used [CRM], if we had not let everybody put their input in, it’s a cinch we wouldn’t have made it.” Helmreich et al. (1999) documented the evolution and empirical validation of CRM across commercial aviation, showing measurable reduction in communication-related incidents.

The readback protocol has four steps:

Instruction -> Readback -> Verify -> Act

  1. Instruction. The human tells the agent what to do.
  2. Readback. The agent echoes its understanding of the instruction in its own words.
  3. Verify. The human confirms the readback is correct, or corrects it.
  4. Act. Only after verification does the agent execute.

In aviation, this catches errors that would otherwise propagate silently. “Descend to flight level three-zero-zero” / “Descending to three-zero-zero” / “Correct, three-zero-zero” / (begins descent). Without readback: “Descend to flight level three-zero-zero” / (hears “three-three-zero” and descends to wrong altitude).

Applied to agentic engineering:

# Step 1: Instruction
Human: "Add rate limiting to the /api/bouts endpoint. Use a sliding
window of 100 requests per minute per user. Return 429 with a
Retry-After header when the limit is exceeded."

# Step 2: Readback
Agent: "I understand the task as:
- Add rate limiting to /api/bouts only (not other endpoints)
- Sliding window algorithm, not fixed window
- 100 requests per minute, keyed per authenticated user
- Return HTTP 429 with Retry-After header (seconds until window resets)
- I will need to modify the route handler and add a rate limit
  middleware or utility
Questions: Should the rate limit apply to all HTTP methods on this
endpoint, or only specific ones? Should unauthenticated requests be
rate-limited by IP?"

# Step 3: Verify
Human: "Correct. All HTTP methods. Unauthenticated requests should
return 401, not be rate-limited - they should not reach the endpoint."

# Step 4: Act
Agent: (implements the feature)

The readback costs seconds. The misalignment it prevents costs hours. The agent that generates without readback may implement a fixed window instead of sliding, apply rate limiting globally instead of per-endpoint, or key by session instead of user. Each of these is a plausible interpretation of an ambiguous instruction.

Authority Gradients

CRM identified that authority gradients - the perceived distance in authority between two team members - directly affect communication quality. When the gradient is too steep (the captain is always right), junior crew members do not challenge errors. When the gradient is too flat (everyone’s opinion is equal), decisions are not made.

In human-agent interaction, the authority gradient has a specific failure mode: the agent has zero authority by default. It defers to the human on everything. This sounds safe until you realise that it means the agent will not flag problems it detects. If the human’s instruction contradicts a standing policy or introduces a security vulnerability, the agent may comply without objection because compliance is the path of least resistance in its training distribution.

The corrective is to calibrate the gradient explicitly. The agent should be instructed:

  • If the instruction contradicts a standing policy, surface the contradiction.
  • If the instruction would break existing tests, report this before implementing.
  • If the instruction is ambiguous, ask for clarification rather than choosing an interpretation.

These are not personality instructions. They are operational constraints that make the readback protocol effective.
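One way to make these constraints standing rather than ad hoc is to append them to every dispatch. A minimal sketch; buildDispatchPrompt and the exact constraint wording are illustrative assumptions, not a prescribed API:

```typescript
// Hypothetical standing constraints appended to every agent dispatch.
// They calibrate the authority gradient: the agent must surface problems
// rather than silently comply.
const OPERATIONAL_CONSTRAINTS = [
  "If the instruction contradicts a standing policy, surface the contradiction before acting.",
  "If the instruction would break existing tests, report this before implementing.",
  "If the instruction is ambiguous, ask for clarification rather than choosing an interpretation.",
];

function buildDispatchPrompt(instruction: string): string {
  return [
    instruction,
    "",
    "Operational constraints:",
    ...OPERATIONAL_CONSTRAINTS.map((c) => `- ${c}`),
  ].join("\n");
}
```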

The Muster Format

The readback protocol handles instructions. The muster format handles decisions.

When a task reaches a decision point - multiple valid approaches, unclear requirements, trade-offs that require human judgment - the agent presents a decision table:

MUSTER (3 items):

| # | Question | Default | Your Call |
|---|----------|---------|-----------|
| 1 | Rate limit implementation: Redis vs in-memory? | Redis (persistent across restarts) | _____ |
| 2 | Include rate limit headers (X-RateLimit-*) on all responses? | Yes (standard practice) | _____ |
| 3 | Log rate limit violations to audit trail? | Yes | _____ |

Properties of the muster format:

  • Numbered rows. The human can respond “1: default, 2: no, 3: yes” - O(1) per decision.
  • Defaults provided. The agent states what it will do if the human does not override. This saves time when the default is acceptable.
  • Binary per row. Each row is one decision. Not a paragraph of trade-offs. Not a discussion. Accept the default or specify the alternative.
  • All decisions visible at once. The human sees every pending decision in one table, not spread across multiple messages.

The muster format optimises for the human’s scarcest resource: attention. A human reviewing agent output can process a 5-row muster table in under a minute. The same decisions presented as a paragraph of prose take 5 minutes to parse and carry a higher error rate.
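The O(1)-per-decision response format can also be parsed mechanically. A minimal sketch, assuming responses of the form "1: default, 2: no, 3: yes"; resolveMuster is a hypothetical helper, not part of any real tool:

```typescript
// Hypothetical parser for a muster response. Each row's stated default
// applies unless the human overrides it by number.
type MusterRow = { question: string; default: string };

function resolveMuster(rows: MusterRow[], response: string): string[] {
  const overrides = new Map<number, string>();
  for (const part of response.split(",")) {
    const [num, answer] = part.split(":").map((s) => s.trim());
    overrides.set(Number(num), answer);
  }
  return rows.map((row, i) => {
    const answer = overrides.get(i + 1);
    // "default" (or no entry for this row) accepts the stated default.
    return !answer || answer === "default" ? row.default : answer;
  });
}
```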

AGENTIC GROUNDING: The readback protocol is not just a communication pattern. It is a diagnostic tool. When you read the agent’s readback, you can detect whether the agent has the right context. If the readback reveals a misunderstanding, the misunderstanding was going to propagate into the generated code. If you skip readback and go straight to generation, the misunderstanding still exists - you just do not discover it until the code review. The readback shifts error detection from code review (expensive) to instruction verification (cheap).


8. Bearing Checks

Estimated time: 30 minutes

A bearing check is a repeatable governance unit. It answers one question: are we still headed where we think we are headed?

In agentic engineering, drift is not a risk - it is the default state. Context degrades over long sessions. Specifications evolve but implementation does not track the evolution. Plans describe work that was completed differently than planned. Backlogs accumulate items that are no longer relevant. The bearing check is the instrument that detects drift before it compounds.

When to Run a Bearing Check

Three triggers:

  1. Phase boundary. When one phase of work completes and the next begins. The bearing check confirms that the completed phase is actually complete and the next phase’s assumptions are still valid.

  2. Session start after a break. When a human returns to a project after hours or days away. Context has degraded (the human forgot details) and the project may have changed (other team members pushed commits).

  3. Suspected drift. When something feels wrong but you cannot identify what. The bearing check is the systematic alternative to “something seems off, let me poke around.”

The Five Checks

Each check is a specific, executable procedure. The bearing check is not a meeting, not a retrospective, not a “let’s reflect on how things are going.” It is a diagnostic with five specific tests.

1. Spec drift - Compare the specification against the implementation.

# Open the spec and the implementation side by side.
# For each requirement in the spec, verify it is implemented.
# For each implementation detail, verify it has a spec origin.
# Note divergences.

Questions to answer: Has the implementation drifted from the spec? Has the spec been updated to reflect implementation decisions? Are there features in the code that have no spec origin (scope creep)?

2. Eval validity - Review the success criteria.

Are the evaluation criteria still reachable? Have circumstances changed that make a criterion irrelevant or impossible? Has the definition of done shifted?

3. Plan accuracy - Review the plan.

# Open the plan document.
# Check the completed table: is it up to date?
# Check the remaining tasks: are dependencies still valid?
# Check estimates: are they still realistic given what you know now?

Questions to answer: Does the plan reflect reality? Are completed items marked complete? Are blocked items actually blocked, or has the blocker been resolved?

4. Gate health - Run the gate.

# Run the full gate:
pnpm run typecheck && pnpm run lint && pnpm run test

Questions to answer: Does the gate pass? Are there flaky tests? Are there warnings that were not present at the last check? Have new tests been added for new features?

5. Backlog sync - Review the backlog.

Questions to answer: Are all open items still relevant? Are priorities correct given current knowledge? Are there items that should be closed (completed, superseded, or no longer needed)?
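The output of a bearing check can be captured as a small structured record rather than free-form notes, which makes the pass/fail/drift assessment explicit. A minimal sketch; the CheckResult shape is an illustrative assumption:

```typescript
// Hypothetical record of one bearing check run: five checks, each with a
// finding and, when drift is found, an estimated cost to fix.
type Finding = "pass" | "drift" | "n/a";

interface CheckResult {
  check: "spec" | "eval" | "plan" | "gate" | "backlog";
  finding: Finding;
  note?: string;
  fixCostMinutes?: number; // only meaningful when finding === "drift"
}

function summarise(results: CheckResult[]): string {
  const drifted = results.filter((r) => r.finding === "drift");
  if (drifted.length === 0) return "bearing check: all clear";
  const cost = drifted.reduce((sum, r) => sum + (r.fixCostMinutes ?? 0), 0);
  return `bearing check: ${drifted.length} drift(s), est. ${cost} min to fix`;
}
```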

Cost and Justification

A bearing check costs approximately 15 agent-minutes, or 20-30 human-minutes. This is the price of five specific diagnostics executed in sequence.

The justification is not philosophical. It is mathematical: drift cost is always higher than check cost. A spec divergence caught at the bearing check costs 15 minutes to note and correct. The same divergence caught at the end of a phase costs hours of rework. A divergence caught in production costs days.

The bearing check is cheapest when nothing is wrong. When everything is aligned - spec matches implementation, plan is current, gate is green, backlog is clean - the check completes in 10 minutes and confirms that you can proceed with confidence. The check is most valuable when something IS wrong, because early detection limits blast radius.

HISTORY: The bearing check maps to several established patterns: sprint retrospectives (Agile), configuration audits (systems engineering), and pre-flight checklists (aviation). The specific innovation is codifying it as a repeatable unit with defined checks, defined triggers, and a cost budget - not a meeting with open-ended discussion, but a diagnostic procedure with five specific tests that produce a pass/fail/drift assessment.

AGENTIC GROUNDING: Agents do not drift-check themselves. An agent working on a task will continue executing until the task is complete or the context window dies, whichever comes first. It will not notice that the specification changed since it loaded context. It will not notice that another developer merged a PR that affects its work. It will not notice that the plan it is following was updated two hours ago. The bearing check is the human’s instrument for detecting drift that the agent cannot detect. If you dispatch agents for extended work without periodic bearing checks, you are flying without instruments.


9. Pull-Based Review

Estimated time: 25 minutes

In a Kanban pull system, downstream consumption triggers upstream production. Work flows when the downstream station is ready, not when the upstream station finishes. Applied to human-AI communication: the human controls when agent output is reviewed. Agents do not interrupt.

The Push Problem

The natural model for human-agent interaction is push: the agent finishes a task, notifies the human, and the human reviews. This seems efficient. It is not, because it ignores a fundamental asymmetry: the agent operates at machine speed, and the human operates at human speed.

An agent can generate 5 PRs in the time it takes a human to review one. If the agent pushes notifications after each PR, the human faces a growing queue of interruptions. Each interruption has a context-switching cost - the human must load the PR’s context, review, provide feedback, and then return to whatever they were doing. The cost compounds: 5 interrupts per hour is not just 5 reviews’ worth of work, it is 5 reviews plus 5 context switches.
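The compounding can be made concrete with back-of-envelope arithmetic. A minimal sketch; the review and context-switch durations are illustrative assumptions, not measured values:

```typescript
// Illustrative cost model: each review takes reviewMin minutes, and every
// context switch costs a fixed switchMin overhead. Push pays the switch
// overhead per review; pull pays it once per batch.
function totalCostMinutes(
  reviews: number,
  reviewMin: number,
  switchMin: number,
  batched: boolean,
): number {
  const switches = batched ? 1 : reviews;
  return reviews * reviewMin + switches * switchMin;
}
```

With 5 reviews of 12 minutes each and a 10-minute context switch, the push model costs 110 minutes against 70 for a single batched session - the 40-minute difference is pure switching tax.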

Pull-Based Review in Practice

In a pull model:

  1. The agent completes work and places it in a queue (pull requests on a branch, outputs in a directory, items in a tracking system).
  2. The human reviews the queue at human-chosen intervals.
  3. The human processes work in batches, in the order and grouping that makes sense for review efficiency.
  4. The agent does not send notifications, does not ask for review, does not block on human response. It proceeds to the next task or waits.

# Push model (interrupt-driven):
14:00  Agent finishes PR #1 -> notifies human
14:03  Human switches context, begins review
14:15  Human finishes review, returns to previous work
14:18  Agent finishes PR #2 -> notifies human  (3 minutes after human returned to work)
14:21  Human switches context again...

# Pull model (batch review):
14:00  Agent finishes PR #1 -> adds to queue
14:12  Agent finishes PR #2 -> adds to queue
14:25  Agent finishes PR #3 -> adds to queue
15:00  Human chooses to review -> processes all 3 PRs in sequence
15:45  Human done, returns to other work uninterrupted

The pull model produces the same reviews with fewer context switches, higher review quality (the human is not rushed by the next notification), and better human throughput.
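The pull model reduces to a very small amount of machinery. A minimal sketch of the queue; the class shape is illustrative:

```typescript
// Minimal sketch of a pull-based review queue: the agent enqueues finished
// work; the human drains the queue in a batch at a time of their choosing.
// enqueue() deliberately sends no notification - that is the point.
class ReviewQueue<T> {
  private items: T[] = [];

  enqueue(item: T): void {
    this.items.push(item); // no callback, no notification
  }

  // Human-initiated: take everything currently waiting, in arrival order.
  drain(): T[] {
    const batch = this.items;
    this.items = [];
    return batch;
  }

  get pending(): number {
    return this.items.length;
  }
}
```

The human-side discipline is what makes this work: the queue is read on the human's schedule, and a non-empty queue is a normal state, not an alarm.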

Batch Review Strategies

When reviewing a batch of agent-generated PRs:

  1. Triage first. Scan all PR titles and --stat diffs. Sort by risk: security-touching first, multi-file second, single-file last.
  2. Review related PRs together. If PRs 2 and 3 both touch lib/auth/, review them together to check for interaction effects.
  3. Time-box. Set a review window (e.g., 45 minutes) and process what fits. Remaining items wait for the next window. This prevents review fatigue from degrading quality.
  4. Do not rush to empty the queue. A queue of agent PRs waiting for review is not a problem. A poorly reviewed PR merged to main is.

Why Agents Do Not Interrupt

This is a governance decision, not a technical limitation. Agents can be configured to send notifications, request reviews, and escalate blocked tasks. They should not.

The reason is cognitive: every notification from an agent competes for the human’s attention with every other notification, task, and thought. The human’s attention is the scarcest resource in the system. Protecting it is not a luxury. It is the precondition for effective review.

Pull-based review - the human controls when agent output is reviewed - is Kanban pull applied to human-AI communication. The downstream station (human review) pulls work when ready. The upstream station (agent generation) does not push.

AGENTIC GROUNDING: If you are using an agent system that sends you a notification every time a task completes, turn off the notifications. Set a recurring time to review agent output - once per hour, twice per day, whatever matches your workflow. You will review fewer items per batch but each review will be higher quality, and your overall throughput (including non-review work) will increase because you are not paying the context-switching tax on every interrupt.


10. Challenges

Estimated time: 60-90 minutes total


Challenge 1: Atomic Decomposition

Estimated time: 15 minutes

Goal: Decompose a feature request into atomic PRs.

You receive this feature request:

“Add a user preferences system. Users should be able to set their preferred language, timezone, and notification settings. Preferences should be stored in the database, exposed via a REST API, and reflected in the UI header. Also, the existing email notification system should respect the new notification preferences.”

Decompose this into atomic PRs. For each PR, specify:

  1. A title (conventional commit format)
  2. Which files/directories it touches
  3. What type of change it is (feature, refactor, fix, docs)
  4. Dependencies (which PRs must merge first)

Verification: Check your decomposition against these criteria:

  • Does each PR touch only one domain?
  • Can each PR be reviewed independently?
  • Can each PR be reverted without affecting the others?
  • Is there a logical ordering based on dependencies?
  • Could an agent execute each PR with a clear, single-concern instruction?
Hints

A reasonable decomposition might have 5-7 PRs:

  1. Database schema: add preferences table + migration
  2. Data access: add preference CRUD functions + tests
  3. API: add preference endpoints + route tests
  4. UI: add preferences display in header
  5. Email integration: update email service to read notification preferences + tests
  6. Documentation: update API docs for new endpoints

Each one is independently reviewable, testable, and revertible. The database migration must go first. The email integration depends on the data access layer but not on the API or UI.


Challenge 2: Review Protocol Exercise

Estimated time: 20 minutes

Goal: Apply the review checklist to an agent-generated diff.

Examine the following simulated agent-generated diff and identify issues using the review checklist from Section 3.

commit abc1234
Author: agent-pipeline
Date:   Mon Mar 10 09:30:00 2026 +0000

    feat: add rate limiting to API endpoints

 lib/api/middleware/rate-limit.ts     | 45 +++++++++++++
 lib/api/routes/bouts.ts             |  3 +
 lib/api/routes/credits.ts           |  3 +
 lib/api/routes/users.ts             |  3 +
 lib/common/utils.ts                 | 12 ++--
 lib/common/constants.ts             |  8 +++
 lib/auth/session.ts                 |  5 +-
 tests/api/rate-limit.test.ts        | 28 +++++++++
 docs/api/rate-limiting.md           | 15 +++++
 9 files changed, 118 insertions(+), 8 deletions(-)

Using the review checklist:

  1. Does it do what it says? The commit says “add rate limiting to API endpoints.” Does the diff match?
  2. Does it do anything it does not say? Look at the file list. Which files are unexpected given the commit message?
  3. What questions would you ask before approving?

Write your review findings as you would post them on a PR.

Verification: Your review should identify at least:

  • lib/common/utils.ts changes (12 lines modified) are not described in the commit message
  • lib/auth/session.ts changes (5 lines modified) are not described in the commit message
  • Rate limiting was applied to 3 route files but the commit says “API endpoints” without specifying which ones
  • The test file has 28 lines for 45 lines of implementation - check test coverage depth
Hints

The stowaway changes are in lib/common/utils.ts and lib/auth/session.ts. These are modifications to existing files in unrelated domains. The agent likely “improved” these files while it had them open for reference. This is a stowaway commit - the rate limiting feature is bundled with undocumented changes to utilities and authentication.

The correct review response: request that the PR be split. Rate limiting in one PR. Utility changes in a separate PR with their own description and tests. Auth changes in a third PR with security review.


Challenge 3: HOTL/HODL Classification

Estimated time: 15 minutes

Goal: Classify each of the following tasks on the HOTL/HODL spectrum.

For each task, decide:

  • Is the correctness fully gate-verifiable, partially verifiable, or taste-required?
  • What is the spectrum position (Full HOTL, HOTL with review, HODL with gate assist, Full HODL)?
  • What is the primary risk if the dial is set wrong?

| #  | Task |
|----|------|
| 1  | Rename getUserById to findUserById across the codebase |
| 2  | Write the error messages shown to users when payment fails |
| 3  | Add input validation to all API endpoints using the existing Zod schema pattern |
| 4  | Design the database schema for a new multi-tenant feature |
| 5  | Update all npm dependencies to their latest compatible versions |
| 6  | Refactor the authentication module to use the strategy pattern |
| 7  | Generate OpenAPI documentation from existing route definitions |
| 8  | Write the onboarding email sequence for new users |
| 9  | Add structured logging to all service functions following the existing pattern |
| 10 | Decide whether to use PostgreSQL or MongoDB for a new analytics feature |

Verification: Compare your classifications to these ground truths:

  • Tasks 1, 5, 7, 9 should be Full HOTL or HOTL with review (gate-verifiable)
  • Tasks 2, 8, 10 should be Full HODL (taste-required)
  • Tasks 3, 4, 6 should be in the middle (partially verifiable)
Discussion

Task 3 is the interesting case. “Add input validation using the existing pattern” sounds gate-verifiable - either the validation is present or it is not. But the agent must decide WHICH fields to validate and WHAT the validation rules are. For well-defined schemas, this is mechanical. For complex business logic (e.g., “a bout score must be between 0 and 10 unless the bout type is exhibition”), the validation rules encode business knowledge that the gate cannot verify.

Task 6 is also instructive. Refactoring to a design pattern is partially verifiable - the code compiles and tests pass (gate) - but whether the strategy pattern is the RIGHT pattern for this module is a taste-required judgment. An agent may produce a technically correct implementation of the strategy pattern that makes the code more complex without sufficient benefit.


Challenge 4: Bearing Check Execution

Estimated time: 20 minutes

Goal: Run a bearing check on a real or sample project.

If you have an active project:

  1. Open the project and run each of the five checks:
# Check 4: Gate health
# Run your project's gate:
pnpm run typecheck && pnpm run lint && pnpm run test
# or your equivalent
  2. For each check, document:
    • The finding (pass, drift, or N/A)
    • If drift is detected, the estimated cost to fix
    • Priority: fix now, backlog, or acceptable drift

If you do not have an active project, run the bearing check against this bootcamp:

  1. Spec drift: Does the Table of Contents match the actual sections?
  2. Eval validity: Are the learning objectives in “Why This Step Exists” addressed by the content sections?
  3. Plan accuracy: Does the estimated time in the header match what you have experienced?
  4. Gate health: Are all code examples syntactically valid?
  5. Backlog sync: Are there topics mentioned in the content that are not addressed (items that should be on the backlog)?

Verification: You should have a 5-row table with findings for each check. At least one check should identify drift, even in a healthy project. If all five are “pass,” look harder - the purpose of the exercise is to practice finding drift, not to confirm that everything is fine.

Challenge 5: Readback Protocol Implementation

Estimated time: 20 minutes

Goal: Implement the readback protocol in an agent interaction.

Use an AI agent (Claude, GPT, or any available model) to implement a specific feature. Before the agent begins implementation, use the readback protocol:

  1. Write a clear instruction for a small feature (e.g., “Add a health check endpoint that returns the application version, uptime, and database connection status”).
  2. Ask the agent to readback its understanding before implementing.
  3. Evaluate the readback:
    • Did the agent identify any ambiguities you missed?
    • Did the agent add assumptions that you did not state?
    • Is the readback complete enough that you would approve execution?
  4. Correct any misunderstandings and approve execution.
  5. Compare the output to the readback. Did the implementation match the verified understanding?

Verification:

Document:

  • The original instruction
  • The agent’s readback
  • Your corrections (if any)
  • Whether the final output matched the verified readback
  • At least one case where the readback caught a potential misunderstanding
Hints

Good instructions for testing readback:

  • Include a detail that could be interpreted two ways (“return the database connection status” - does this mean “connected/disconnected” or a full connection pool status with active/idle counts?)
  • Include an implicit constraint (“health check endpoint” - what HTTP method? What path? What response format?)
  • Omit one detail that the agent should ask about (“uptime” - since when? Process start? Last deployment? Last restart?)

If the agent does NOT ask clarifying questions in its readback, the readback is a restatement rather than a genuine comprehension check. A good readback surfaces ambiguities; a sycophantic readback just rephrases your words.


Challenge 6: Adoption Roadmap

Estimated time: 20 minutes

Goal: Design a team adoption plan from Level 1 to Level 4.

You are the tech lead of a team of 6 developers (2 senior, 3 mid-level, 1 junior). The team ships a SaaS product with a TypeScript backend, React frontend, and PostgreSQL database. The team currently uses no AI coding tools. Management has asked you to “adopt AI to improve productivity.”

Design an adoption plan that:

  1. Specifies which level the team starts at
  2. Defines the criteria for advancing to each next level
  3. Identifies the review infrastructure required at each level
  4. Names the specific risks at each level and the mitigations
  5. Provides a realistic timeline (months, not days)
  6. Addresses the senior developer who “does not trust AI tools”
  7. Addresses the junior developer who wants to use agents for everything

Write the plan as you would present it to your team.

Verification: Your plan should:

  • Start at Level 1, not Level 3
  • Take at least 3-6 months to reach Level 3
  • Include explicit advancement criteria (not just time-based)
  • Include the gate as a prerequisite for Level 2+
  • Address the sceptic and the enthusiast as real concerns, not obstacles
  • NOT promise specific productivity improvements (these are uncertain)
Hints

A realistic timeline:

  • Month 1: Level 1 for everyone. Set up IDE completion tools. No process changes. Collect anecdotal data.
  • Month 2: Level 2 for willing developers. Require gate in place. Track: time to implement, defect rate, review feedback volume.
  • Month 3-4: Level 2 for the full team. Assign the senior sceptic as primary reviewer. Their job is to find agent mistakes - this converts scepticism into valuable quality assurance.
  • Month 4-5: Level 3 pilot with one feature. Mandatory task decomposition. Senior developer designs architecture; agent implements to spec; full team reviews.
  • Month 6+: Level 3 for suitable tasks. Level 4 pilot for dependency updates. Evaluate based on data, not enthusiasm.

The junior developer should be told that Level 3+ without the ability to review agent output at a senior level is dangerous - the agent needs a reviewer who can catch its mistakes, and that requires experience the junior is still building.


Key Takeaways

Before moving to Step 11, verify you can answer these questions without looking anything up:

  1. What are the five steps of the engineering loop, and which step do agents perform worst?

  2. What is a stowaway commit? How do you detect one? What is the structural fix (not the instructional fix)?

  3. What is the review hydra? How does it relate to the stowaway commit?

  4. State the HOTL/HODL decision rule in one sentence.

  5. Why is drifting from HODL to HOTL dangerous, and what is the countermeasure?

  6. What are the five checks in a bearing check? What does each one detect?

  7. Why is “drift cost always higher than check cost”?

  8. What is the readback protocol? What aviation-safety programme validated it, and over what timeframe?

  9. What is a muster table? What property makes it efficient for human decision-making?

  10. Why is pull-based review (human controls timing) better than push-based review (agent notifies on completion)?

  11. What are the four levels of team adoption? Why is skipping levels dangerous?

  12. If a team measures adoption success by velocity alone, what failure mode are they missing?


  • Helmreich, R. L., Merritt, A. C., & Wilhelm, J. A. (1999). “The Evolution of Crew Resource Management Training in Commercial Aviation.” International Journal of Aviation Psychology, 9(1), 19-32. The foundational paper on CRM in commercial aviation. Readback, authority gradients, and structured communication.

  • Womack, J. P. & Jones, D. T. (1996). Lean Thinking: Banish Waste and Create Wealth in Your Corporation. Simon & Schuster. Five principles of Lean. Chapter 4 covers pull systems. The value stream concept maps directly to spec-to-commit workflows.

  • Nygard, M. (2011). “Documenting Architecture Decisions.” Cognitect Blog. The original ADR proposal. Short, modular, append-only decision records. The design pattern that the session decision chain extends.

  • Bainbridge, L. (1983). “Ironies of Automation.” Automatica, 19(6), 775-779. The paper that predicted cognitive deskilling from automation 40 years before LLM coding tools. Read this before setting any team’s HOTL/HODL dial.

  • Google SRE Book (2017). Chapter 12 (Effective Troubleshooting), Chapter 14 (Managing Incidents). Available free at sre.google/sre-book/. Postmortem culture, triage frameworks, and the principle that stopping the bleeding comes before root cause analysis.

  • Ohno, T. (1988). Toyota Production System: Beyond Large-Scale Production. The source text for jidoka (automation with a human touch), andon (stop the line), and pull systems. Relevant to HOTL/HODL framing and quality gate design.


Step 11: Cost, Security, Legal, and Compliance - Governance establishes the process. Step 11 establishes the constraints that process must satisfy. It covers token cost modelling and the ROI gate (weigh cost before dispatching), the OWASP Top 10 for LLM applications as a security framework, AI-generated code IP ownership (uncertain and evolving), data residency requirements for regulated industries, and prompt injection as an unsolved problem that requires defence-in-depth rather than a single solution. Every governance decision from this step - HOTL/HODL dial position, review depth, adoption level - has a cost dimension that Step 11 quantifies. The bearing check from Section 8 gets a cost line. The pull-based review from Section 9 gets a throughput model. Process without cost awareness is governance theatre.
