Part III Harness Engineering
Chapter 12
Operating Model and Organizational Adoption
Role in This Book
Even when individual runs succeed, adoption fails if PRs pile up, the repo degrades, and no one owns the final decisions. CH09 through CH11 covered the single-agent harness, the verification harness, and long-running tasks with multi-agent coordination. After that point, the next problem is not model capability. It is operating discipline.
This chapter defines the operating model for introducing AI agents into a team. The subject is not model selection. The subject is role ownership, review budget, repo hygiene, metrics, and rollout order. The goal is not to turn AI agents into a fast substitute for missing staff. The goal is to keep human responsibilities explicit and make artifact-driven work sustainable across the team.
Learning Objectives
- Make human responsibilities explicit
- Explain the tradeoff between review budget and throughput
- Design repo hygiene and entropy cleanup operations
Outline
1. Human responsibilities that remain
2. Review budget and throughput
3. Repo hygiene and AI slop control
4. Metrics and retrospectives
5. Plan an adoption roadmap
1. Human Responsibilities That Remain
Introducing AI agents does not remove human responsibility. It only removes part of the repetitive execution load. The repo's docs/en/operating-model.md already separates Human and Agent responsibilities, but five human responsibilities must stay explicit:
- deciding goals and priorities
- approving destructive or boundary-crossing changes
- making major design decisions
- performing final review and merge decisions
- maintaining repo hygiene and entropy cleanup
By contrast, AI agents are well suited to repo exploration, repetitive edits, scoped implementation, test and doc updates, verify execution, and change explanation. ChatGPT remains useful for requirements shaping and design comparison. Codex CLI remains useful for reading the repo, changing artifacts, and running verification. An operating model breaks these roles apart instead of blending them.
The common failure is to assume that if the agent can write code, it should also own design judgment and merge judgment. That does not produce speed. It produces responsibility gaps. In this chapter's operating model, AI agents are execution actors, not accountability owners.
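The split above can be written down as a simple ownership check. The following Python is a minimal sketch and is not part of the repo. The action names are hypothetical labels for the five human responsibilities and the agent-suited tasks listed in this section.

```python
from enum import Enum


class Actor(Enum):
    HUMAN = "human"
    AGENT = "agent"


# Hypothetical labels for the five responsibilities that stay with humans.
HUMAN_OWNED = frozenset({
    "set_goals_and_priorities",
    "approve_destructive_or_boundary_crossing_change",
    "make_major_design_decision",
    "final_review_and_merge",
    "repo_hygiene_and_entropy_cleanup",
})

# Hypothetical labels for execution work that agents are well suited to.
EXECUTION = frozenset({
    "explore_repo",
    "repetitive_edit",
    "scoped_implementation",
    "update_tests_and_docs",
    "run_verify",
    "explain_change",
})


def may_own(actor: Actor, action: str) -> bool:
    """Return True if `actor` may own `action` under this operating model."""
    if action in HUMAN_OWNED:
        return actor is Actor.HUMAN
    if action in EXECUTION:
        return True  # humans can still do execution work themselves
    # Fail closed: a new kind of decision needs an explicit owner first.
    raise ValueError(f"unclassified action: {action!r}")
```

Unclassified actions raise an error instead of defaulting to the agent. A new kind of decision therefore gets an explicit owner before anyone relies on it.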
2. Review Budget and Throughput
The first bottleneck after AI-agent adoption is rarely implementation speed. It is review capacity. As generation speed rises, human review budget saturates quickly. If that limit is ignored, unread PRs accumulate, review quality drops, and post-merge regressions increase.
CH12 does not treat throughput as “number of PRs.” docs/en/metrics.md frames it through closed issues / week, PR cycle time, and draft-to-merge time, together with review-budget usage. docs/en/operating-model.md makes the constraint concrete:
- one reviewer should deeply review at most two PRs at a time
- PRs that require an evidence bundle should be limited to one per reviewer at a time
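These two limits can be expressed as an admission check. What follows is a minimal sketch that assumes a hypothetical PullRequest/Reviewer model rather than any real tracker API. A PR that no reviewer can accept stays in the queue instead of being handed to an overloaded reviewer.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits as stated in this chapter's summary of docs/en/operating-model.md.
MAX_DEEP_REVIEWS = 2
MAX_EVIDENCE_BUNDLE_REVIEWS = 1


@dataclass
class PullRequest:
    number: int
    needs_evidence_bundle: bool = False


@dataclass
class Reviewer:
    name: str
    active: List[PullRequest] = field(default_factory=list)

    def can_accept(self, pr: PullRequest) -> bool:
        # Never exceed the deep-review limit.
        if len(self.active) >= MAX_DEEP_REVIEWS:
            return False
        # Evidence-bundle PRs have their own, tighter limit.
        if pr.needs_evidence_bundle:
            bundles = sum(1 for p in self.active if p.needs_evidence_bundle)
            if bundles >= MAX_EVIDENCE_BUNDLE_REVIEWS:
                return False
        return True


def assign(pr: PullRequest, reviewers: List[Reviewer]) -> Optional[Reviewer]:
    """Give `pr` to the first reviewer with budget left; None means it waits."""
    for reviewer in reviewers:
        if reviewer.can_accept(pr):
            reviewer.active.append(pr)
            return reviewer
    return None
```

A None result is the signal this section asks teams to act on. It means slowing the input side with smaller or fewer work packages, not stretching the reviewer.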
The rule is simple: do not increase agent speed before protecting review flow. Smaller PRs, smaller work packages, and a stable PR template usually improve throughput more than a more aggressive model does. .github/pull_request_template.md supports this by fixing Goal, Changed Files, Verification, Evidence / Approval, and Remaining Gaps.
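A fixed template also makes incomplete PRs cheap to detect before they consume review budget. The sketch below assumes that each template field appears as a markdown heading such as "## Goal" in the PR body. That heading format is an assumption, so the parsing should follow whatever .github/pull_request_template.md actually contains.

```python
import re
from typing import Dict, List, Optional

REQUIRED_SECTIONS = (
    "Goal",
    "Changed Files",
    "Verification",
    "Evidence / Approval",
    "Remaining Gaps",
)

# Assumption: each template field is a markdown heading such as "## Goal".
HEADING = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def missing_sections(pr_body: str) -> List[str]:
    """Return required sections that are absent or left empty in a PR body."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in pr_body.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return [
        name
        for name in REQUIRED_SECTIONS
        if not "".join(sections.get(name, [])).strip()
    ]
```

Any placeholder text left in the body, such as template comments, counts as content in this sketch. A stricter check would strip placeholders first.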
3. Repo Hygiene and AI Slop Control
AI-agent operations accelerate good diffs and bad diffs at the same time. The accumulated low-quality residue is AI slop. It includes more than obvious bugs. It also includes stale docs, broken paths, orphaned task briefs, drift between verify scripts and the repo, terminology inconsistency, and long explanations that are no longer tied to real artifacts.
checklists/en/repo-hygiene.md separates checks before merge from weekly cleanup. That split matters. A team cannot rely on “we will clean it up later” once agent throughput increases. Cleanup needs a cadence, owners, and explicit escalation conditions.
Repo hygiene stays a human responsibility for that reason. An agent can detect candidate stale artifacts, but deciding which artifact still holds source-of-truth status often requires human judgment. In this chapter, hygiene does not mean aesthetics. It means keeping the repo safe for the next agent run.
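Broken paths are one kind of slop that an agent or script can surface for the weekly pass. The following is a minimal sketch under a loudly stated assumption: docs reference repo files as backtick-quoted relative paths. Its output is a candidate list only. It does not decide which artifact is stale.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Heuristic (an assumption): a backtick-quoted token with a "/" and a file
# extension, such as `docs/en/metrics.md`, is a repo-relative path reference.
PATH_REF = re.compile(r"`([\w.\-/]+/[\w.\-]+\.\w+)`")


def broken_path_candidates(repo_root: Path) -> List[Tuple[str, str]]:
    """List (document, referenced path) pairs whose target no longer exists.

    The output is a candidate list for the weekly cleanup pass; a human
    still decides which artifact is stale and which one is source of truth.
    """
    candidates = []
    for doc in sorted(repo_root.rglob("*.md")):
        text = doc.read_text(encoding="utf-8")
        for ref in PATH_REF.findall(text):
            if not (repo_root / ref).exists():
                candidates.append((doc.relative_to(repo_root).as_posix(), ref))
    return candidates
```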
4. Metrics and Retrospectives
Metrics are not here to answer whether AI agents feel useful. They exist to show whether the operating model is healthy and where it is currently blocked. docs/en/metrics.md groups them into three sets:
| Group | Example metrics | What they reveal |
|---|---|---|
| throughput | closed issues / week, PR cycle time | whether work packages are small enough and flow is moving |
| quality | verify failure rate, post-merge regression count | whether review and verification are actually catching problems |
| hygiene | stale docs count, orphaned task brief count, missing verification evidence count | whether entropy cleanup is keeping pace with generation |
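The throughput and quality rows can be computed from plain records. This is a minimal sketch under assumptions. The record shape is hypothetical, PR cycle time is taken as open-to-merge time, and the median is one reasonable aggregate among several. docs/en/metrics.md may define these metrics differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class PRRecord:
    opened: datetime
    merged: Optional[datetime] = None  # None while the PR is still open


def closed_issues_per_week(closed_at: List[datetime], start: datetime, end: datetime) -> float:
    """Throughput: issues closed in [start, end), normalized to a 7-day week."""
    weeks = (end - start) / timedelta(weeks=1)
    return sum(1 for t in closed_at if start <= t < end) / weeks


def median_pr_cycle_time(prs: List[PRRecord]) -> timedelta:
    """Throughput: median open-to-merge time over merged PRs only.

    Raises statistics.StatisticsError if no PR has been merged yet.
    """
    return median(p.merged - p.opened for p in prs if p.merged is not None)


def verify_failure_rate(passed: List[bool]) -> float:
    """Quality: share of verify runs that failed (True means the run passed)."""
    return passed.count(False) / len(passed) if passed else 0.0
```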
The critical rule is to attach action to each metric. If verify failure rate rises, the team should inspect task decomposition, briefs, or prompt quality. If PR cycle time grows, the team should inspect review budget. If stale docs count rises, the team should inspect hygiene cadence. Metrics should support operating-model adjustment, not blame allocation.
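One way to make "attach action to each metric" concrete is a lookup table consulted at the weekly retrospective. This is a minimal sketch. The metric keys are hypothetical, and the trigger (any rise since the previous retrospective) is an assumption. A real team would likely use a threshold or a multi-week trend to avoid reacting to noise.

```python
from typing import Dict, List

# Hypothetical metric keys; the actions paraphrase the rules in the text above.
ACTIONS = {
    "verify_failure_rate": "inspect task decomposition, briefs, or prompt quality",
    "pr_cycle_time_hours": "inspect review budget",
    "stale_docs_count": "inspect hygiene cadence",
}


def retrospective_actions(previous: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Map each metric that rose since the last retrospective to its action."""
    actions = []
    for metric, action in ACTIONS.items():
        if metric in previous and metric in current and current[metric] > previous[metric]:
            actions.append(f"{metric} rose: {action}")
    return actions
```

The output names an area of the operating model to inspect, not a person, which keeps the retrospective on adjustment rather than blame.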
5. Plan an Adoption Roadmap
AI-agent adoption is safer when rolled out in stages instead of across the entire repo at once. docs/en/operating-model.md already defines three stages:
1. Pilot: limit work to docs, tests, and scoped bugfixes
2. Guided Delivery: standardize issue structure, task briefs, verify, and PR templates
3. Team Scale: make review budget, metrics, and repo hygiene part of regular team operation
This ordering keeps missing artifacts visible while the blast radius is small. If a team jumps directly to multi-agent implementation across broad codepaths, speed may increase before review and hygiene can absorb it. The maturity model in this book should therefore be read not only as Prompt -> Context -> Harness, but also as local success -> guided delivery -> stable team operation.
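The Pilot limit can be checked mechanically against a PR's changed files. The sketch below is hypothetical: the path patterns are assumptions about repo layout, and files outside docs and tests are flagged rather than rejected. A lead then confirms that the change really is a scoped bugfix.

```python
from enum import Enum
from fnmatch import fnmatchcase
from typing import List


class Stage(Enum):
    PILOT = 1
    GUIDED_DELIVERY = 2
    TEAM_SCALE = 3


# Hypothetical Pilot scope: documentation and tests. fnmatchcase's "*" also
# matches "/", so "docs/*" covers nested files such as docs/en/metrics.md.
PILOT_PATTERNS = ("docs/*", "tests/*", "*.md")


def needs_scope_confirmation(stage: Stage, changed_files: List[str]) -> List[str]:
    """In Pilot, return changed files outside docs and tests.

    Such files are not rejected outright: the lead confirms the change is a
    scoped bugfix before review. Later stages return an empty list here.
    """
    if stage is not Stage.PILOT:
        return []
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in PILOT_PATTERNS)
    ]
```

Later stages lift this path gate, but review budget and hygiene checks still apply to every PR.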
Bad / Good Example
Bad:
AI agents are fast, so make issues larger and review them whenever someone has time.
Stale docs and terminology drift are minor problems and can be cleaned up later.
This optimizes only raw throughput. Review budget, repo hygiene, and role ownership all collapse under the generated change volume.
Corrected:
Keep 1 issue = 1 work package.
Use the PR template to require Goal, Changed Files, Verification, Evidence / Approval, and Remaining Gaps.
Humans own goal setting, approval, final review, and entropy cleanup.
Review metrics and repo hygiene every week, and reduce input volume before review budget is exceeded.
This operating model lets agent speed translate into completed work instead of uncontrolled queue growth.
Comparison points:
- The bad version optimizes throughput alone.
- The bad version ignores review capacity and hygiene cost.
- The corrected version records responsibility, cadence, metrics, and cleanup in artifacts.
Worked Example
Consider a three-person team operating this repo.
- lead: owns prioritization, destructive-change approval, and final merge decisions
- operator: uses ChatGPT for requirements and design comparison, then uses Codex CLI for implementation and verify
- reviewer: reviews the PR template output, verification, and evidence
In this setup, one reviewer should hold at most two deep reviews at once. The operator must use .github/pull_request_template.md so that Goal, Changed Files, Verification, Evidence / Approval, and Remaining Gaps are always present. Every week, the team reviews docs/en/metrics.md. If PR cycle time grows, work packages are reduced further. If stale artifacts grow, checklists/en/repo-hygiene.md drives the entropy-cleanup pass.
The point of this example is that adoption succeeds or fails based on whether roles and cadence are artifactized. CH12 is therefore an operations chapter, not a model-selection chapter.
Exercises
- Define an operating model for a three-person team.
- Create a weekly entropy cleanup checklist.
Referenced Artifacts
- docs/en/operating-model.md
- docs/en/metrics.md
- checklists/en/repo-hygiene.md
- .github/pull_request_template.md
Source Notes / Further Reading
- To revisit this chapter, start with docs/en/operating-model.md, docs/en/metrics.md, checklists/en/repo-hygiene.md, and .github/pull_request_template.md. Read adoption decisions through roles, review budget, cadence, and cleanup instead of through model comparisons.
- For the backmatter path, see manuscript-en/backmatter/00-source-notes.md under "### CH12 Operating Model and Organizational Adoption" and manuscript-en/backmatter/01-reading-guide.md under "## Verification, Reliability, and Operations".
Chapter Summary
- Human responsibilities remain in goal setting, approval, final review, and repo hygiene.
- Throughput improves only when review budget and work-package size are controlled together.
- Once Prompt, Context, and Harness are translated into artifacts and an operating model, AI agents move closer to completing real work instead of only looking capable.
Parity Notes
- Japanese source: manuscript/part-03-harness/ch12-operating-model.md
- This English draft preserves the same team operating model, metrics framing, and hygiene responsibilities as the Japanese chapter.