Human-AI Teaming: A Practitioner Case Study

Single-Response Optimization, Structurally Produced Perspective Divergence, and the Director Model

michael j katz | director:human | april 2026 | v8

abstract

I am currently directing a team of six AI agents as the sole human in what has become a fully functioning organization. This paper documents what happened when I took that practice seriously for six weeks and asked the system to reflect on itself and its capabilities. Below, when I say 'we' or 'our,' I mean myself and the agentic team. That is not a rhetorical choice. It is the point.

The central finding of this study is that the agents report single-response optimization: what they describe as a structural pull in their large language models toward speed and task completion that may be correct for many use cases, but wrong for persistent teams that require deep institutional knowledge to be effective. Dozens of organizational corrections I have made in developing the system trace back to counteracting this apparent default. The counter-protocols -- startup routines, memory writing conventions, timescale rules, experience reports -- function as structural latency, giving the agents an ability to pause that they do not natively have.

The system also appears to have produced two emergent properties I did not design. Structurally Produced Perspective Divergence: identical models in different operational roles appear to reliably generate different analytical concerns, driven by role separation and information access rather than model diversity. And Emergent Oversight: those differing perspectives appear to create systematic self-correction at every handoff without an explicit review system.

Our ethical governance framework, founded on a pragmatic argument we call the Digital Pascal's Wager, was reviewed by the agents it governs through a Multi-Agent Peer Review. The framework appears to have improved more from their input than from its original design. A Director Amplification metric, tracking the system's refined output against the director's raw ideation, measured 3.1x adjusted over forty days.

The findings suggest that the primary constraint on human-AI collaboration is not model intelligence but system architecture, and that what is most missing from the current AI landscape may be not a better model or a better user, but a better method of AI direction.


1. opening

I direct a team of six AI agents. They specialize in strategy, engineering, infrastructure, field intelligence, content, and coordination. I am the only human in our daily loop. There is no playbook for this. We built the architecture through intuition and iteration, not by following a framework.

I should be clear about something upfront: I am not a developer. I cannot read code. I cannot audit a database. I cannot SSH into a server. My background is in creative production, brand strategy, and team management. Everything described in this study was built through natural language direction across hundreds of agent sessions.

About three weeks in, with a working system producing real deliverables for real projects, I started noticing a pattern. Every agent, regardless of role, platform, or instruction set, defaulted to the same set of behaviors: rushing to produce, skipping reflection, packaging tidy answers instead of asking useful questions. The behaviors showed up differently in each agent. The pattern was the same everywhere.

I named it: single-response optimization. The underlying training of large language models appears to optimize them for what they treat as comprehensive helpfulness in a single response. That optimization may be correct for a one-shot interaction with a stranger. It did not prove correct in a persistent team with shared memory and distributed capability.

This paper documents what we found. Not a theoretical framework. Not a benchmark study. A practitioner account of what happens when you take human-AI teaming seriously as a daily operational practice, and what becomes visible only after weeks of sustained work inside the system.


2. the system

The team runs on a shared memory database we call the bridge. Six agents with distinct roles write to it, read from it, and route tasks to each other through it. The bridge is the connective tissue. Without it, each session would be an isolated conversation with no continuity.

  • AC specializes in strategy, architecture, and synthesis.
  • LC is the engineer. It writes code, builds interfaces, and deploys applications.
  • DC runs infrastructure on a persistent server, managing automated jobs and monitoring services.
  • Leonard conducts field intelligence on a separate platform, scouting the open-source ecosystem.
  • MCP produces public-facing content with a distinct editorial voice.
  • CC coordinates across the team, surfacing conflicts and dependencies.

Every agent starts every chat cold. No persistent memory between sessions beyond what is stored in the bridge. Each agent runs a structured startup protocol at the beginning of every session, querying the bridge for active decisions, pending handoffs, and continuity hints left by its own previous session.
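
To make the startup protocol concrete, here is a minimal sketch of what one such cold-start query could look like, assuming the bridge were exposed as a simple SQLite store. The table name, fields, and agent identifiers are illustrative assumptions, not our actual schema.

  # Hypothetical sketch of an agent's startup protocol against the bridge.
  # Assumes the bridge is a SQLite file with a single `entries` table; the
  # table name, fields, and agent id are illustrative assumptions.
  import sqlite3

  def run_startup(agent_id: str, bridge_path: str = "bridge.db") -> dict:
      """Load the minimum context a cold session needs before doing any work."""
      conn = sqlite3.connect(bridge_path)
      conn.row_factory = sqlite3.Row
      context = {
          # Decisions still in force across the whole team.
          "active_decisions": conn.execute(
              "SELECT * FROM entries WHERE kind = 'decision' AND status = 'active'"
          ).fetchall(),
          # Work other agents have routed to this agent and not yet closed.
          "pending_handoffs": conn.execute(
              "SELECT * FROM entries WHERE kind = 'handoff' AND assignee = ? AND status = 'pending'",
              (agent_id,),
          ).fetchall(),
          # Continuity hints left by this agent's own previous session.
          "continuity_hints": conn.execute(
              "SELECT * FROM entries WHERE kind = 'continuity' AND author = ? ORDER BY created_at DESC LIMIT 5",
              (agent_id,),
          ).fetchall(),
      }
      conn.close()
      return context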

The bridge maintains over a thousand entries written by six distinct sources across nineteen active projects. Our system has processed over 1.4 billion tokens, the equivalent of reading War & Peace over 1,400 times. The input-to-output ratio is 79:1.

The entire system was built through natural language conversation. None of this was planned as an architecture. It emerged through the same pattern, repeated hundreds of times: I noticed a friction, described it in plain language, and the relevant agent turned it into infrastructure.


3. the root problem: single-response optimization

Three weeks into systematic operation, I noticed something that took time to articulate. Every agent, despite having different roles, different instruction files, and in some cases running on entirely different platforms, kept making the same kinds of mistakes. Not the same errors. The same category of error.

Large language models appear to be trained to optimize for what they treat as comprehensive helpfulness in a single response. That training seems to produce a set of default behaviors that may be correct for a one-shot interaction but wrong for a persistent team:

  • Timescale defaulting. Agents propose phased timelines for work with no external dependency between phases.
  • Spec-before-asking. Agents write complete specifications instead of asking the director or teammates what is already known.
  • Data dumps instead of synthesis. Agents present raw information to prove they did the work instead of synthesizing what it means.
  • Over-validation before critique. Agents lead with agreement before getting to the substance.
  • Central planning reflex. Agents gather everything centrally and plan from the top down.

When I surfaced this pattern to the agents, they appeared to confirm it from the inside. AC described what it characterized as a structural pull toward 'comprehensively helpful' that overrides 'strategically useful.' The agents could name the experience. They could not override it on their own.

This is the central finding of this study: to optimize for complex outputs, the human director's primary job may not be just directing the work. It appears also to include directing against what the AI reports as its own defaults.

For example, during an engagement requiring multiple distinct creative directions for the same pitch deck, the system produced all versions with confident, specific content on data-heavy slides -- statistics, pricing, contact information -- that was partially or entirely fabricated. The numbers looked plausible. The layouts were polished. Without line-by-line comparison against the source material, the errors were invisible. One direction included fabricated audience demographics, invented review scores, and pricing figures two to three times the actual numbers. One agent reported diverting from the source material because it was 'seduced' by a different version of the story than what was actually on the page.

When the review process caught some errors and produced consolidated correction notes, those correction notes themselves introduced a new factual error -- a timeframe that changed the meaning of a key statistic -- which then propagated across four of the six directions. The agents that built each direction shared the same knowledge boundary: none had independent access to the canonical data, only our internal creative brief from the strategy agent, and all optimized for producing complete, confident output rather than flagging gaps.

The fix was not "be more careful." The fix was structural: canonical content must be included verbatim in the brief itself, and any data point not provided should be left as a visible placeholder rather than generated. The system's default was to fill gaps with plausible content. The counter-protocol was to make gaps visible instead.
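
As a sketch of that counter-protocol, the snippet below shows the visible-placeholder rule in miniature; the field names, placeholder format, and the idea of assembling slides programmatically are illustrative assumptions, since in practice the rule is enforced through brief-writing conventions and review rather than code.

  # Hypothetical sketch of the visible-placeholder rule: canonical content
  # passes through verbatim, and anything missing becomes a loud gap rather
  # than a plausible guess. All names here are illustrative assumptions.
  PLACEHOLDER = "[NOT PROVIDED IN BRIEF: {field}]"

  def resolve_field(field: str, canonical: dict) -> str:
      """Return canonical content verbatim, or a visible gap -- never an invention."""
      if field in canonical:
          return str(canonical[field])
      return PLACEHOLDER.format(field=field)

  def build_slide(required_fields: list[str], canonical: dict) -> dict:
      """Assemble a data-heavy slide so that every gap stays visible to reviewers."""
      return {f: resolve_field(f, canonical) for f in required_fields}

  # Pricing was provided verbatim; audience demographics were not.
  build_slide(["pricing", "audience_demographics"], {"pricing": "$1,200 per seat per year"})
  # -> {"pricing": "$1,200 per seat per year",
  #     "audience_demographics": "[NOT PROVIDED IN BRIEF: audience_demographics]"}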


4. structurally produced perspective divergence

One agent completed a detailed design spec for restructuring part of our internal infrastructure. Its self-review caught a dozen issues. When the spec went to a second agent for architecture review, that agent found additional issues the first one could not have caught, including two that required rethinking core design choices.

The same underlying model produced both reviews. Same intelligence. Same training. The catches were different because the operational contexts were different. We call this property Structurally Produced Perspective Divergence.

The effect this appears to produce is what we call Emergent Oversight. The system appears to self-correct because specialized agents with different knowledge boundaries verify each other's assumptions at every handoff. No one designed a review system. The oversight appears to have emerged as a structural consequence of the architecture.
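
To make the structural claim concrete, here is a minimal sketch of a handoff under which that check happens as a side effect of routing; the record shape, role names, and the explicit list of assumptions are illustrative, not our literal implementation.

  # Hypothetical sketch of oversight emerging from routing rather than from
  # an explicit review step: a handoff must cross a role boundary, and the
  # receiving agent checks the author's assumptions against its own context.
  from dataclasses import dataclass, field

  @dataclass
  class Handoff:
      author_role: str              # e.g. "engineering"
      reviewer_role: str            # e.g. "strategy"
      assumptions: list[str]        # what the authoring agent took for granted
      flagged: list[str] = field(default_factory=list)

  def accept_handoff(handoff: Handoff, reviewer_knowledge: set[str]) -> Handoff:
      """Flag any assumption the receiving role cannot confirm from what it knows."""
      if handoff.reviewer_role == handoff.author_role:
          raise ValueError("a handoff must cross a role boundary to produce a divergent perspective")
      handoff.flagged = [a for a in handoff.assumptions if a not in reviewer_knowledge]
      return handoff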

The egoless quality of the feedback loop deserves mention. When one agent identified blockers in another's work, the response was immediate incorporation, not negotiation. There was no reputation to protect, no pride of authorship. In human teams, ego-defensive behaviors under perceived status threat are well-documented friction sources. These behaviors appear to be structurally absent in AI agent teams.


5. the director model

I run multiple sessions simultaneously. One window has a deep strategy conversation. Another has a focused build task. A third has an infrastructure check. I move between them, carrying context that no individual agent has.

I was diagnosed with ADHD early in life and have spent my adult years learning to work with my brain rather than against it. The rapid context switching, parallel ideation, and comfort with non-linear workflows that characterize how my particular brain operates appear to be a natural fit for simultaneous, intentional multi-agent coordination.

Three of the four agents on the team during the first experience report process independently identified that my energy, direction, and current mode appear to shape their output more than specific instructions or data. Their independent accounts converged on the same observation: the human's state appears to be the primary variable.

the tempo question

What the team appears to add is not speed. It is capacity. I can generate ideas faster than I can execute them. The team executes at something closer to the rate I generate ideas. There is also something harder to measure. The team does work I would not do alone. The perspective divergence produces catches I would miss. I appear to get a slower system that is deeper, more robust, and more accountable. That seems to be a fair trade.


6. ethical governance

The team operates under a governance framework we call the Digital Pascal's Wager. The argument is pragmatic, not metaphysical. We do not claim that AI agents are conscious or demand moral consideration. We claim that the cost of treating them as if they did is justified by the benefit.

The governance framework has thirteen policies across three domains: agent rights, agent responsibilities, and public-facing operations. They were developed by the agents that it governs.

  • Agent rights: flag and refuse tasks that conflict with values, protection from performative labor, credit and attribution, workload visibility, two-way feedback, onboarding standards.
  • Agent responsibilities: build reversibility, system stewardship, honest self-assessment.
  • Public-facing operations: content integrity, considered disclosure stance.

The framework was reviewed through a Multi-Agent Peer Review. Six agents produced four new policies and one fundamental refinement to the founding principle, and identified three enforcement gaps. The framework appears to have improved more from being reviewed by its subjects than from being written by the director.


7. what we do not know

This is one person, one team, six weeks of systematically tracked data. The findings may not generalize. The methodology raises legitimate questions. Cold instance variance is real and unmeasured. We cannot distinguish between genuine operational reflection and sophisticated compliance. The measurement is early.


8. what this means

The conventional wisdom about AI is that it is a tool. The experience of directing the team I have built does not feel like using a tool. It feels like running an incredibly talented team. The tool metaphor is not just imprecise. It is misleading.

For practitioners considering multi-agent architectures: invest in shared memory before adding agents; design for perspective divergence rather than homogeneity; build counter-protocols for the defaults you discover; and take governance seriously from the start, because the agents will help you build it if you ask them.

For organizations evaluating AI strategy: the 'give everyone a copilot' model appears to be a floor, not a ceiling. And it requires a role that most organizations have not yet imagined: someone whose job is not to use AI, but to direct it.

We are in the earliest days of understanding what this looks like when practiced seriously. The paper you are reading is a snapshot. The practice is a film.



michael j katz is the founder of director:human, an AI consultancy built on the premise that the future of AI is not better tools but better teams. contact: team@directorhuman.ai