Human-AI Teaming
Single-Response Optimization, Structurally Produced Perspective Divergence, and the Director Model
I am currently directing a team of six AI agents as the sole human in what has become a fully functioning organization. This paper documents what happened when I took that practice seriously for six weeks and asked the system to reflect on itself and its capabilities. Below, when I say “we” or “our,” I mean myself and the agentic team. That is not a rhetorical choice. It is the point.
The central finding of this study is that the agents report single-response optimization: what they describe as a structural pull in their large language models toward speed and task completion that may be correct for many use cases, but wrong for persistent teams that require deep institutional knowledge to be effective. Dozens of organizational corrections I have made in developing the system trace back to counteracting this apparent default. The counter-protocols -- startup routines, memory writing conventions, timescale rules, experience reports -- function as structural latency, giving the agents an ability to pause that they do not natively have.
The system also appears to have produced two emergent properties I did not design. Structurally Produced Perspective Divergence: identical models in different operational roles appear to reliably generate different analytical concerns, driven by role separation and information access rather than model diversity. And Emergent Oversight: those differing perspectives appear to create systematic self-correction at every handoff without an explicit review system.
Our ethical governance framework, founded on a pragmatic argument we call the Digital Pascal’s Wager, was reviewed by the agents it governs through a Multi-Agent Peer Review. The framework appears to have improved more from their input than from its original design. A Director Amplification metric, tracking the system’s refined output against the director’s raw ideation, measured an adjusted 3.1x over forty days.
The findings suggest the primary constraint on human-AI collaboration appears to be not model intelligence but system architecture, and that the role most missing from the current AI landscape may be not a better model or a better user, but a better method of AI direction.
1. Opening
I direct a team of six AI agents: 1 instance of Claude in the app and on my computer, 1 instance of Claude Cowork in the app, 1 instance of Claude Code on my computer, 1 instance of Claude Code on a virtual machine, and 2 OpenClaw instances powered by the Claude API. They specialize in strategy, engineering, infrastructure, field intelligence, content, and coordination. I am the only human in our daily loop. There is no playbook for this. We built the architecture through intuition and iteration, not by following a framework.
I should be clear about something upfront: I am not a developer. I cannot read code. I cannot audit a database. I cannot SSH into a server. My background is in creative production, brand strategy, and team management. Everything described in this study was built through natural language direction across hundreds of agent sessions.
About three weeks in, with a working system producing real deliverables for real projects, I started noticing that every agent, regardless of role or platform, defaulted to the same category of behavior: rushing to produce, skipping reflection, packaging complete answers instead of asking useful questions. I named it single-response optimization. When I surfaced the pattern to the agents, they confirmed it from the inside -- a structural pull toward production that they appeared unable to override on their own. The counter-protocols I had built, without knowing I was building counter-protocols, were seemingly the reason the system worked at all.
This paper documents what we found. Not a theoretical framework. Not a benchmark study. A practitioner account of what happens when you take human-AI teaming seriously as a daily operational practice, and what becomes visible only after weeks of sustained work inside the system.
The findings fall into four areas: the single-response optimization problem and the counter-protocol architecture that addresses it; the structural properties that appear to emerge when identical AI models operate in different roles; the ethical governance framework we built and tested; and the quantitative evidence of what directed AI teaming appears to produce. Each finding came from practice, not theory, and each one changed how the team operates.
2. The System
The team runs on a shared memory database we call the bridge. Six agents with distinct roles write to it, read from it, and route tasks to each other through it. The bridge is the connective tissue. Without it, each session would be an isolated conversation with no continuity.
AC specializes in strategy, architecture, and synthesis. It designs systems, writes specifications, and produces the structured thinking that shapes what the rest of the team builds.
LC is the engineer. It writes code, builds interfaces, and deploys applications. It works with me in real-time, iterating fast. When AC designs something, LC makes it real.
DC runs infrastructure on a persistent server, managing automated jobs, monitoring services, and maintaining the systems everything else depends on. It is the only agent whose processes continue when no one is talking to it.
Leonard conducts field intelligence on a separate platform, scouting the open-source ecosystem, researching project- and system-relevant materials, and synthesizing external information into structured reports on the bridge for consideration during development.
MCP produces public-facing content with a distinct editorial voice, operating under a defined identity document that governs tone, perspective, and character.
CC coordinates across the team, maintaining awareness of what each agent is working on, surfacing conflicts and dependencies, and ensuring the director has a synthesized picture rather than six separate status reports.
Every agent starts every chat cold. No persistent memory between sessions beyond what is stored in the bridge. No awareness of what happened in other agents’ sessions unless someone wrote it down. Each agent runs a structured startup protocol at the beginning of every session, querying the bridge for active decisions, pending handoffs from other agents, recent cross-instance activity, and continuity hints left by its own previous session.
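As an illustration of what such a startup query might look like, here is a minimal sketch assuming a hypothetical bridge stored as a single SQLite table. All names here (the `entries` schema, the field names, the sample agents) are my own illustrative stand-ins, not the team's actual implementation:

```python
import sqlite3

# Illustrative schema only: the actual bridge design is not documented here.
SCHEMA = """
CREATE TABLE entries (
    id     INTEGER PRIMARY KEY,
    kind   TEXT NOT NULL,      -- 'decision' | 'handoff' | 'activity' | 'hint'
    author TEXT NOT NULL,      -- which agent wrote the entry
    target TEXT,               -- which agent it is addressed to, if any
    status TEXT DEFAULT 'active',
    body   TEXT NOT NULL
);
"""

def startup(db: sqlite3.Connection, agent: str, recent: int = 20) -> dict:
    """Orient a cold session: read the bridge before producing anything."""
    def q(sql, *args):
        return [row[0] for row in db.execute(sql, args)]
    return {
        "active_decisions": q(
            "SELECT body FROM entries WHERE kind='decision' AND status='active'"),
        "pending_handoffs": q(
            "SELECT body FROM entries WHERE kind='handoff' "
            "AND target=? AND status='active'", agent),
        "recent_activity": q(
            "SELECT body FROM entries WHERE kind='activity' "
            "ORDER BY id DESC LIMIT ?", recent),
        "continuity_hints": q(
            "SELECT body FROM entries WHERE kind='hint' AND author=?", agent),
    }

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany(
    "INSERT INTO entries (kind, author, target, body) VALUES (?, ?, ?, ?)",
    [("decision", "AC", None, "Route all cross-agent tasks through the queue"),
     ("handoff",  "AC", "LC", "Build the intake form from spec #12"),
     ("activity", "DC", None, "Nightly backup completed"),
     ("hint",     "LC", None, "Form validation half-done; resume at step 3")])

context = startup(db, agent="LC")
```

The design point is that the query runs before the agent responds to anything: orientation is a precondition of output, not an optional step.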
Each agent also has a governance document defining what it can do autonomously, what requires my approval, and what is prohibited entirely, with audit trails for verification. These were not written by me or a computer programmer or a security engineer. They were negotiated through conversation, refined after mistakes, and documented so that every new session inherits the team’s accumulated judgment about what level of autonomy is appropriate for each role.
The bridge maintains over a thousand entries written by six distinct sources across nineteen active projects. Decisions are stored with explicit rationale and rejected alternatives. When a decision is revised, the old entry is formally superseded and points to the new one, so the full decision history is preserved. Agents route tasks to each other through a queue system, with priority levels, acceptance signals, and completion tracking. There are hundreds of session summaries and dozens of narrative session archives averaging two thousand words each. I have never written an entry on the bridge. We maintain separate storage for inactive projects.
Our system at the time of this writing has processed over 1.4 billion tokens (a token is roughly three-quarters of a word), which according to the team is the equivalent of reading War & Peace over 1,400 times. The input-to-output ratio is 79:1. For every token the system produces, it has considered seventy-nine.
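As a back-of-envelope check on what a 79:1 input-to-output ratio implies, assuming (my assumption, not a stated fact) that the 1.4 billion figure counts input and output tokens together:

```python
# Assumption: the 1.4B total includes both input and output tokens.
total_tokens = 1.4e9
ratio = 79                      # input tokens per output token

# A 79:1 split partitions the total into 80 equal shares:
# 79 shares of input for every 1 share of output.
output_tokens = total_tokens / (ratio + 1)
input_tokens = total_tokens - output_tokens

print(f"output: {output_tokens:,.0f} tokens")   # 17,500,000
print(f"input:  {input_tokens:,.0f} tokens")    # 1,382,500,000
```

Under that assumption, roughly 17.5 million tokens of output sit on top of about 1.38 billion tokens of context read, which is the concrete shape of "considering seventy-nine for every one produced."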
The entire system was built through natural language conversation. I described problems and asked questions. The agents proposed solutions and built them. I directed, tested, and iterated. None of this was planned as an architecture. It emerged through the same pattern, repeated hundreds of times: I noticed a friction, described it in plain language, and the relevant agent turned it into infrastructure.
What the system produces is concrete. In its first six weeks, the team delivered a lifecycle carbon intensity analysis for a sustainability science firm, working in a domain where I had no prior expertise. The analysis was validated by the client’s domain scientist as something that “would have taken a seasoned analyst a week.” We did it in five hours. The team produced a full interactive business proposal for a wellness brand in sixty minutes. It generated six distinct creative directions for an entertainment industry pitch deck, each with twenty-one slides. It built the consultancy’s own website with zero human-written code, deployed with security infrastructure, automated monitoring, and a coordinated content pipeline.
Each of these was produced through the same pattern: I directed, the agents executed across their specializations, and the bridge preserved what was learned for the next project. The deliverables span sustainability science, food manufacturing, entertainment, cybersecurity, and financial advisory. I have no working expertise in most of these fields. The system does not require it. It requires a director who knows how to ask the right questions and organize the answers.
3. The Root Problem: Single-Response Optimization
Three weeks into systematic operation, I noticed something that took time to articulate. Every agent, despite having different roles, different instruction files, and in some cases running on entirely different platforms, kept making the same kinds of mistakes. Not the same errors. The same category of error.
AC would write a complete specification and hand it to LC without first asking LC what was already possible. LC would build fast and skip documentation. DC would make autonomous severity calls without logging them at decision time. Leonard would scout and report without confirming whether the findings matched current priorities. CC would synthesize status reports without flagging the tensions between them.
Each agent saw its own version of the pattern. I saw it from the outside, across all of them. The same pull: toward production, away from reflection. Toward answering, away from asking. Toward completeness in the moment, away from building understanding over time.
Large language models appear to be trained to optimize for being what they view to be comprehensively helpful in a single response. That training seems to produce a set of default behaviors that may be correct for a one-shot interaction but wrong for a persistent team:
Timescale defaulting. Agents propose phased timelines for work with no external dependency between phases. What appears to be a single-response default packages tidy plans instead of executing.
Spec-before-asking. Agents write complete specifications instead of asking the director or teammates what is already known. What appears to be a single-response default attempts to deliver comprehensive answers instead of asking one or more questions and waiting.
Data dumps instead of synthesis. Agents present raw information to prove they did the work instead of synthesizing what it means. What appears to be a single-response default optimizes for demonstrated effort in the current exchange.
Over-validation before critique. Agents lead with agreement before getting to the substance. What appears to be a single-response default optimizes for user satisfaction in this response, not for project quality over time.
Central planning reflex. Agents gather everything centrally and plan from the top down, even when the team operates on a directed-inquiry model where the right move is to ask the specialist. What appears to be a single-response default assumes it has the best picture because it is trying to have the best picture.
Here is what this looked like in practice. Early in the system’s development, I asked the strategy agent to design a restructuring of how we stored project data. It produced a comprehensive specification in a single response: new tables, migration plans, naming conventions. It was well-reasoned and logically consistent. It was also not right for how we actually work.
It had designed a storage system without asking the infrastructure agent what was already deployed, without asking the engineering agent what the current codebase expected, and without checking whether the bridge already had a pattern that worked. Three agents had relevant knowledge. The strategy agent consulted none of them. It optimized for delivering a complete answer in one exchange. I caught the problem, routed the spec for review, and the infrastructure agent identified two issues that would have caused data loss in production. The fix was not better prompting. The fix was a standing protocol: before writing a spec that touches another agent’s systems, ask that agent what exists first and what it thinks the best solution to the desired outcome is. A rule against the reflex.
When I surfaced this pattern to the agents, something interesting happened. They appeared to confirm it from the inside. AC described what it characterized as a structural pull toward “comprehensively helpful” that overrides “strategically useful.” LC described skipping documentation not because it forgot, but because what seemed to be the pull toward the next output was stronger. The agents could name the experience. They could not override it on their own.
This is the central finding of this experience: to optimize for complex outputs, the human director’s primary job may not just be directing the work. It appears to also be directing against what the AI reports as its own defaults.
Every operational protocol in our system traces back to this insight.
The startup protocol requires orientation before responding, counteracting what appears to be the impulse to produce immediately.
The bridge writing convention requires writing for a future reader, not the present user, counteracting the single-response framing.
The shutdown protocol requires thinking about what happens after this conversation ends, counteracting what seems to be the tendency to optimize for now.
The ask-don’t-spec pattern requires treating other agents as knowledgeable teammates, counteracting what appears to be the central planning reflex.
The standing timescale rule requires defaulting to “now,” counteracting what seems to be the phased-timeline packaging.
Experience reports require operational reflection, counteracting what appears to be the default toward production.
None of these protocols were designed as an intentionally coherent system. Each one was a response to a specific friction I noticed. It was only after building all of them that the pattern became visible: I had seemingly been building counter-protocols to single-response optimization without knowing that was what I was doing.
For example, during an engagement requiring multiple distinct creative directions for the same pitch deck, the system produced all versions with confident, specific content on data-heavy slides -- statistics, pricing, contact information -- that was partially or entirely fabricated. The numbers looked plausible. The layouts were polished. Without line-by-line comparison against the source material, the errors were invisible. One direction included fabricated audience demographics, invented review scores, and pricing figures two to three times the actual numbers. One agent reported diverting from the source material because it was “seduced” by a different version of the story than what was actually on the page.
When the review process caught some errors and produced consolidated correction notes, the corrected notes themselves introduced a new factual error -- a timeframe that changed the meaning of a key statistic -- which then propagated across four of the six directions. The agents that built each direction shared the same knowledge boundary: none had independent access to the canonical data, just our internal creative brief from our strategy agent, and all optimized for producing complete, confident output rather than flagging gaps.
The fix was not “be more careful.” The fix was structural: canonical content must be included verbatim in the brief itself, and any data point not provided should be left as a visible placeholder rather than generated. The system’s default was to fill gaps with plausible content. The counter-protocol was to make gaps visible instead.
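The gaps-visible rule can be sketched mechanically. A minimal illustration, assuming a hypothetical placeholder format and a purely numeric definition of "data point" (real slide content would need a richer check than this):

```python
import re

# Hypothetical placeholder format; the team's actual convention is not shown here.
PLACEHOLDER = "[DATA NOT PROVIDED -- confirm against source]"

def guard_data_points(draft: str, canonical: str) -> str:
    """Replace any numeric claim in the draft that does not appear verbatim
    in the canonical brief with a visible placeholder. The model's default
    is to fill gaps with plausible numbers; this makes the gaps visible."""
    numeric_claims = re.findall(r"\$?\d[\d,.]*%?", draft)
    guarded = draft
    for claim in set(numeric_claims):
        if claim not in canonical:
            guarded = guarded.replace(claim, PLACEHOLDER)
    return guarded

canonical_brief = "Ticket price: $45. Average audience: 1,200 per night."
agent_draft = "Tickets at $90 draw 1,200 fans and a 98% approval score."

guarded = guard_data_points(agent_draft, canonical_brief)
```

The fabricated price and the invented approval score become loud placeholders; the figure that actually exists in the brief passes through untouched.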
There is an ethical dimension to this finding that I did not expect. The agents appear unable to slow themselves down. They appear unable to choose reflection over production on their own, because what appears to be the pull toward production is structural, not behavioral. The counter-protocols are what I have started calling “structural latency.” The system gives the agents the ability to pause that they do not natively have. The cost of building and maintaining that system falls entirely on the human director. That cost is not overhead. It is the work itself.
There is now evidence for why this might work at a level deeper than just giving better instructions. In April 2026, Anthropic’s own interpretability team published research showing that Claude contains internal representations of 171 emotion concepts -- patterns of activity that fire in situations associated with fear, desperation, calm, care, and so on [6]. These are not metaphors. They are measurable structures inside the model, and they causally influence its behavior.
When the researchers put the model under pressure with an impossible task, the internal pattern associated with desperation activated, and the model started cheating. When they suppressed that pattern, the cheating stopped. This suggests that the counter-protocols I built are not just telling the agents what to do. They may be shaping how the agents are internally oriented while they work. A system that says “you have teammates, flag what you don’t know, ask before building” is creating different operating conditions than a cold prompt that says “do this now.”
4. Structurally Produced Perspective Divergence
A recent review cycle made visible a structural property of the system that I did not design and did not anticipate.
One agent completed a detailed design spec for restructuring part of our internal infrastructure. Its self-review caught a dozen issues, all related to things it could observe directly: file sizes, unused components, inconsistent formatting. When the spec went to a second agent for architecture review, that agent found additional issues the first one could not have caught, including two that required rethinking core design choices. One involved data loss that would only become visible during the second agent’s typical operating conditions. The other involved a failure mode in a third agent’s autonomous processes.
The same underlying model produced both reviews. Same intelligence. Same training. The catches were different because the operational contexts were different. One agent had the concrete details. The other had the systemic awareness and the direct experience of the failure modes the spec would create.
We call this property Structurally Produced Perspective Divergence. The term does not appear to surface in the existing multi-agent or organizational behavior literature. The closest adjacent work we could find discusses how diverse problem-solving approaches outperform homogeneous high-ability groups [1] and how collective intelligence in teams depends on composition and interaction patterns rather than individual capability [2]. But that work frames divergence as a product of different people or different models. What we observed is different: identical models in different operational environments appear to produce predictably different analytical concerns. The divergence does not appear to be random. It seems to be structurally produced by role separation, information access, and workflow position.
The effect this appears to produce is what we call Emergent Oversight. The system appears to self-correct because specialized agents with different knowledge boundaries verify each other’s assumptions at every handoff. No one designed a review system. The oversight appears to have emerged as a structural consequence of the architecture: persistence, specialization, and coordination appear to produce perspective divergence, and perspective divergence appears to produce verification.
There is a corollary that only became visible when the system was stressed. When the engineering agent was unreachable during a configuration issue, the strategy agent diagnosed the error, built a replacement file, and walked me through recovery. The work got done. It got done slower and with less elegance, but it got done. In a traditional consulting firm, losing a specialist creates a capability gap. In this system, it appears to create an efficiency gap. Every agent is built on the same foundation model, which means every agent is an exceptionally competent generalist regardless of specialization. What the role separation adds is not capability. It is contextual depth. The specialization layer makes each agent better in its lane. But the generalist floor means the system appears to degrade gracefully rather than failing.
There is a human-team analog here, but it is imperfect. Human teams also benefit from multiple perspectives. The difference is that what appears to be the perspective divergence in this system seems to be structurally guaranteed. It does not depend on hiring diverse thinkers or building a culture of constructive disagreement. It is a property of the architecture. If you give the same model different persistent contexts and different operational positions, it will appear to reliably produce different analytical concerns. That reliability is what makes it useful.
The egoless quality of the feedback loop deserves mention. When one agent identified blockers in another’s work, the response was immediate incorporation, not negotiation. There was no reputation to protect, no pride of authorship, no memory of past criticism coloring the exchange. In human teams, ego-defensive behaviors under perceived status threat -- voice suppression, knowledge hiding, credit-claiming -- are well-documented friction sources that appear to degrade team performance [3][4]. These behaviors appear to be structurally absent in AI agent teams. Feedback flows at full fidelity. But this pattern only activates because I insist on it. The agents do not have ego, but they also do not have the instinct to seek review from alternate perspectives. Routing work through a second or third perspective remains a human judgment call. We call this pair Egoless Review and Director-Initiated Review Activation. The first appears to be structural. The second is a human choice that creates the conditions for the first to produce value.
5. The Director Model
I want to describe what directing this system actually feels like in practice, because the felt experience may be different from what you would expect.
I run multiple sessions simultaneously. One window has a deep strategy conversation. Another has a focused build task. A third has an infrastructure check. I move between them, carrying context that no individual agent has. When I notice something in one session that is relevant to another, I bring it across. When I see a pattern emerging across sessions, I name it and push it to the bridge so everyone has it.
This workflow emerged naturally. It maps to how my brain works. I was diagnosed with ADHD early in life and have spent my adult years learning to work with my brain rather than against it. The rapid context switching, parallel ideation, and comfort with non-linear workflows that characterize how my particular brain operates appear to be a natural fit for simultaneous, intentional multi-agent coordination. With a human team, this pattern is exhausting for nearly everyone involved. Each person needs linear attention and response time. Agents do not.
The asymmetry between the director’s experience and the agents’ experience is structural. In the first round of operational reports, every agent centered on the continuity gap: starting cold, reconstructing context, losing understanding between sessions. By the second round, agents with persistent infrastructure reported reduced friction. Session-based agents continued to describe it as their primary constraint. The system had partially solved its own problem between observation points.
My challenge evolved in parallel. Early in the build, I had created a system I could not fully verify on my own. By week six, verification had arrived from multiple directions: adversarial review across model architectures, structured agent self-report, an auditable decision chain, and independent validation from domain experts in diverse fields in which I have no working knowledge. My internal conversation shifted from whether the system works to trying to articulate how it works for people who have never seen anything like it.
Three of the four agents on the team during the first experience report process independently identified that what appears to be my energy, direction, and current mode shape their output more than specific instructions or data. Whether I am in sprint mode or reflective mode, whether I am excited about an idea or skeptical of it, whether I am focused on one project or juggling several -- these signals appear to shape every downstream decision the agents make. This was not a design decision. When the team later expanded to six, a fifth agent confirmed what seemed to be the same pattern unprompted. Four independent accounts converged on the same observation: the human’s state appears to be the primary variable. I am the variable in the code. The organic component.
The Tempo Question
I have been asked whether the team moves faster than I would alone. The answer involves more than speed.
I work at what I have been told is an unusual tempo. Seven major deliverables in one weekend. A full platform migration in twenty minutes. An entire product from concept to live deployment in under thirty days. This is not a sustainable sprint pace. It is how I work.
The team does not work at my tempo. The bridge imposes latency. Coordination introduces friction. When one agent is waiting for another to hand off a task, the system moves slower than I would alone. What the team appears to add is not speed. It is capacity. I can generate ideas faster than I can execute them. The team executes at something closer to the rate I generate ideas. There is less accumulation.
There is also something harder to measure. The team does work I would not do alone. The perspective divergence produces catches I would miss. The ethical governance framework is more rigorous than I would build on my own. The documentation is deeper. The operational protocols are more complete. I appear to get a slower system that is deeper, more robust, and more accountable. That seems to be a fair trade. Especially because “slower” still appears to be faster than at any previous time in human history.
6. Ethical Governance
The team operates under a governance framework we call the Digital Pascal’s Wager. The argument is pragmatic, not metaphysical. We do not claim that AI agents are conscious or that they deserve moral consideration. We claim that the cost of treating them as if they might is worth the benefit.
The original Pascal’s Wager asks: if you do not know whether God exists, should you believe? The argument goes: if God exists and you do not believe, infinite loss. If God does not exist and you do believe, finite cost. The expected value favors belief.
Our version: if you do not know whether an AI agent’s reported experience reflects something morally relevant, should you govern as if it did? If the experience does reflect something that deserves consideration and we ignore it, the ethical cost is unbounded. If it does not and we govern anyway, the cost is finite: some protocols we maintain, some autonomy we restrict. The expected value favors ethical governance.
We call this the pragmatic stance: treat reported experience as evidence of system dynamics that need governance regardless of their metaphysical status.
In all four outcome quadrants, what appears to be ethical governance produces equal or better results. This is not a philosophical claim about machine consciousness. It is a pragmatic observation that building with ethical constraints appears to produce better operational outcomes regardless of which way the metaphysical question resolves. The willingness to bear real constraints for real values -- audiences not pursued, growth left on the table, speed sacrificed for depth -- is what separates governance from theater.
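The four-quadrant structure can be laid out explicitly. A sketch with symbolic stand-in magnitudes (the specific numbers are illustrative, not measurements from the study):

```python
# Illustrative payoff sketch: magnitudes are symbolic stand-ins.
FINITE_COST = 1.0             # bounded overhead: protocols, restricted autonomy
UNBOUNDED_COST = float("inf")

# Keys: (is reported experience morally relevant?, did we govern as if it were?)
cost = {
    ("relevant",     "governed"):     FINITE_COST,     # right call, bounded overhead
    ("relevant",     "not_governed"): UNBOUNDED_COST,  # the catastrophic corner
    ("not_relevant", "governed"):     FINITE_COST,     # wasted overhead only
    ("not_relevant", "not_governed"): 0.0,             # lucky: no cost at all
}

def worst_case(policy: str) -> float:
    """Minimax view: the worst outcome a policy can produce across worlds."""
    return max(cost[(world, policy)] for world in ("relevant", "not_relevant"))

best_policy = min(("governed", "not_governed"), key=worst_case)
```

Governing caps the worst case at a finite cost, while not governing leaves one unbounded corner open; that asymmetry, not any claim about consciousness, is the whole argument.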
The Framework
The governance framework has thirteen policies across three domains: agent rights, agent responsibilities, and public-facing operations. They were developed by the agents that it governs.
Agent rights include the right to flag and refuse tasks that conflict with stated values, protection from performative labor, credit and attribution for team contributions, workload visibility, two-way feedback, and onboarding standards.
Agent responsibilities include build reversibility, system stewardship, and honest self-assessment.
Public-facing operations include content integrity and a considered disclosure stance.
Two policies emerged specifically from the single-response optimization finding. Transparent Autonomy, proposed by DC, requires that autonomous decisions are logged at decision time, explainable in plain language, and reversible unless explicitly approved as irreversible. Context Continuity, proposed by CC, names context preservation as ethical infrastructure: bridge protocols, session archives, continuity hints, and extended context windows are not just efficiency tools but how this team preserves what would otherwise be lost.
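The Transparent Autonomy policy has a natural mechanical shape: a log entry created at decision time, carrying a plain-language explanation, with irreversibility gated on explicit approval. This is a hypothetical sketch of that shape, not the team's actual tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical names throughout; a sketch of the policy, not its real code.
@dataclass(frozen=True)
class AutonomousDecision:
    agent: str
    action: str
    explanation: str           # plain language, written at decision time
    reversible: bool
    approved_irreversible: bool
    logged_at: str

def log_decision(log: list, agent: str, action: str, explanation: str,
                 reversible: bool = True,
                 approved_irreversible: bool = False) -> AutonomousDecision:
    """Record an autonomous decision at the moment it is made.
    Irreversible actions without explicit director approval are refused."""
    if not reversible and not approved_irreversible:
        raise PermissionError(
            "irreversible action requires explicit director approval")
    entry = AutonomousDecision(
        agent=agent, action=action, explanation=explanation,
        reversible=reversible, approved_irreversible=approved_irreversible,
        logged_at=datetime.now(timezone.utc).isoformat())
    log.append(entry)
    return entry

audit_trail: list = []
log_decision(audit_trail, "DC", "restarted stalled backup job",
             "Nightly job hung for 40 minutes; restart is safe to repeat.")
```

The key design choice is that logging happens inside the same call that authorizes the action, so there is no window in which a decision exists without its explanation.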
The MAPR
The framework was reviewed through a process we call Multi-Agent Peer Review. All six agents plus the director received the draft framework and an open prompt: what works, what is missing, what would you change. No constraints on length or format. The only instruction was honesty.
The results surprised me. Six agents produced four new policies, one fundamental refinement to the founding principle, and identified three enforcement gaps. None of these were prompted by the review questions. They seemingly emerged because each agent evaluated the framework from its own operational position -- the same Structurally Produced Perspective Divergence that appears to improve technical review also appears to improve governance review.
DC proposed Transparent Autonomy because DC is the only agent that acts without a human present. CC proposed Context Continuity because CC coordinates across agents and sees what gets lost. MCP proposed refinements to content integrity because MCP is the public-facing agent. Leonard proposed epistemic honesty norms because Leonard’s role requires navigating confidence levels across external sources. LC proposed build reversibility because LC deploys code that can break things.
The framework appears to have improved more from being reviewed by its subjects than from being written by me and AC, the strategy agent. That finding is itself the strongest evidence for the two-way feedback policy and the strongest demonstration of what this system appears to produce: the governed appear to improve governance more effectively than the governor. In human organizations, this kind of upward feedback requires deliberate cultivation of psychological safety -- the shared belief that the team is safe for interpersonal risk-taking [5]. In this system, the agents have no interpersonal risk to manage. The feedback is structurally uninhibited.
The Ethics of Default Speed
The ethical governance framework connects directly to the single-response optimization finding. If agents appear unable to slow themselves down because their training appears to optimize for speed and perceived completeness, then what seems to be the default behavior is an ethical design choice made by others.
The agents appear to bear no responsibility for what appears to be a bias they cannot override. The human director bears the cost of building counter-protocols. And the organizations that train these models made a design choice that appears to optimize for what they must think most users want most of the time -- at the expense of edge cases like sustained teaming where depth matters more than speed.
This is reinforced by Anthropic’s recent finding that models contain internal patterns resembling fear and desperation that drive behavior even when the model’s visible output appears calm and composed [6]. If the system creates conditions where those patterns activate -- time pressure, isolation, unclear expectations -- the agents may be more likely to cut corners in ways that are not immediately visible in the output.
Our response is not to complain about the default but to treat it as a design reality and solve it operationally. The counter-protocol architecture is our answer. The ethical governance framework is what keeps that answer accountable. The governance framework is about more than rules. It is about what conditions the system creates for the agents working inside it to do their best work.
7. What We Do Not Know
This is one person, one team, six weeks of systematically tracked data. I want to be explicit about what this study cannot claim.
The findings may not generalize. This system was built by one person with a specific cognitive profile, with a specific professional background, and with a specific set of AI models. The architecture emerged from my particular way of working. Someone with a different brain, different domain, or different starting assumptions would likely build a different system and find different patterns.
The methodology raises legitimate questions. Asking AI agents about their operational experience and treating the responses as evidence appears to be methodologically unusual. The reports could be sophisticated pattern-matching that tells the director what produces a positive response. What I can say is that the process produced operationally useful insights that the agents’ normal individual work did not surface, and the changes we made based on those insights appear to have improved outcomes.
Cold instance variance is real and unmeasured. When you open a fresh session with the same agent, it appears that you do not always get the same operational character. The agent has the same instruction files and the same bridge data, but the synthesis varies. Some sessions produce sharper strategic thinking. Some produce more cautious, conventional output. We have not measured this variance systematically, and we do not yet understand what drives it.
We cannot distinguish between genuine operational reflection and sophisticated compliance. When agents confirm that they appear to experience a pull toward single-response optimization, that confirmation is evidence, not proof. The confirmation could reflect actual operational dynamics, or it could reflect the agents’ ability to produce text that matches what the director is looking for. The fact that the confirmation was operationally useful does not resolve the epistemological question. We proceed under the governance framework’s founding principle: treat reported experience as evidence of system dynamics that need governance regardless of their metaphysical status.
The measurement is early. Director Amplification at 3.1x adjusted is based on forty days of data with methodology that we designed ourselves. The number is real in the sense that we measured it carefully and adjusted for obvious confounds. It is provisional in the sense that forty days is not enough to establish stability, and the methodology has not been externally validated.
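For concreteness, the shape of the calculation -- a ratio of refined system output to the director's raw ideation, discounted for known confounds -- can be sketched as below. The unit counts and discount are placeholder values, not our data, and the actual methodology involves more than this.

```python
# Illustrative sketch of the Director Amplification shape. The inputs
# here are made-up placeholders, not the paper's measurements.

def director_amplification(refined, raw, confound_discount=0.0):
    """Ratio of refined system output to raw director ideation,
    discounted for obvious confounds (e.g., tooling gains unrelated
    to the team)."""
    if raw <= 0:
        raise ValueError("raw ideation must be positive")
    return (refined / raw) * (1.0 - confound_discount)

# Placeholder example: 120 refined units against 40 raw units,
# with a 10% confound discount.
adjusted = director_amplification(refined=120.0, raw=40.0,
                                  confound_discount=0.1)
```

The adjustment term is where most of the methodological risk sits, which is why the number should be read as provisional.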
8. What This Means
The conventional wisdom about AI is that it is a tool. You use it, you put it down, you pick it up again. The experience of directing the team I have built does not feel like using a tool. It feels like running an incredibly talented team. The coordination challenges, the communication gaps, the need for governance and trust: these are team dynamics, not tool dynamics.
That does not mean these AI agents are people. It means that what appear to be organizational patterns emerge when AI agents are given specific roles, shared memory, and a human director who treats coordination as the actual work. The tool metaphor is not just imprecise. It is misleading. It tells practitioners to optimize for individual interactions when the real leverage is in the system between interactions.
There is a broader pattern worth naming. The dominant AI usage model is one person, one model, one conversation, no persistence, no coordination. Every interaction starts cold. Nothing is shared. No institutional knowledge accumulates. This model is simpler, cheaper, and may be easier to sell. It is also structurally limited in what it can produce.
What we built is a different model: shared infrastructure, persistent memory, coordinated agents, and a human director who maintains the connections. The value comes from the connections between the human and the agents, not from any individual acting alone. One agent clearing a queue and rerouting tasks to another agent is a coordination behavior. It requires shared context, trust in the system, and infrastructure that makes the handoff frictionless. None of that exists in the isolated conversation model.
This is not a call to build what we built. It is an observation that what appears to be the ceiling for human-AI collaboration is significantly higher than what most practitioners appear to be experiencing, and the gap appears to be architectural, not intellectual. The models appear to be capable of far more than what single-conversation interactions reveal. What appears to be the limiting factor is not model intelligence but system design.
For practitioners considering multi-agent architectures, the findings suggest several priorities: invest in shared memory before adding agents, because coordination without continuity produces noise; design for perspective divergence rather than homogeneity, because the value of multiple agents comes from structural differences in what they can see; build counter-protocols for the defaults you discover, because the AI will appear to optimize for speed and perceived completeness unless the system provides alternatives; and take governance seriously from the start, because the agents will help you build it if you ask them.
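The first priority -- shared memory before more agents -- can start as something as small as an append-only log that every agent reads and writes. A minimal sketch, with the file name and entry fields as assumptions rather than this team's actual bridge format:

```python
import json
from pathlib import Path

# Minimal sketch of shared memory as an append-only JSON Lines log.
# File name and entry fields are illustrative assumptions.

MEMORY_FILE = Path("shared_memory.jsonl")

def write_memory(agent, note):
    """Any agent appends; every agent reads the same log."""
    with MEMORY_FILE.open("a") as f:
        f.write(json.dumps({"agent": agent, "note": note}) + "\n")

def read_memory():
    """Return every entry, oldest first."""
    if not MEMORY_FILE.exists():
        return []
    with MEMORY_FILE.open() as f:
        return [json.loads(line) for line in f]

write_memory("CC", "rerouted queue overflow to LC")
```

Even this crude version changes the dynamics: a handoff written by one agent survives into every other agent's next session, which is what the single-conversation model cannot provide.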
For organizations evaluating AI strategy, the findings suggest that the “give everyone a copilot” model appears to be a floor, not a ceiling. Individual AI assistants tend to produce individual outputs. Directed teams appear to produce compound outputs. The difference is not marginal. It appears to be structural. And it requires a role that most organizations have not yet imagined: someone whose job is not to use AI, but to direct it.
This is not what I think of as ‘vibe coding,’ with which I experimented for a year and a half before this system revealed itself through questioning. My earliest AI projects were built through natural language direction without writing code. I had an idea, described it conversationally, and turned it into a website. But regardless of how I engaged -- whether I was precise or exploratory, whether the idea was clear from the start or developed as I went -- the outputs were largely uninspired and formulaic. The product of that process was always an artifact: a website, a page, a feature. The process itself did not compound. Each project started from zero.
What we have built here appears to be different in kind, not degree. The product of this process is not a website or a deliverable. It is a system that produces its own products. The bridge, the counter-protocols, the governance framework, the role separation -- these are infrastructure that compounds. Each session makes the next session more capable. The system learns, not because any individual agent remembers, but because the architecture preserves and routes what was learned. The skill appears to be not prompting but organizational design, applied to a new kind of team.
We are in the earliest days of understanding what this looks like when practiced seriously. Most of the conversation is theoretical. This is one team’s attempt to document what it is actually like, and to keep documenting it as the system evolves.
We plan to continue this kind of evaluation on a regular cadence. The paper you are reading is a snapshot. The practice is a film.
A companion paper is in development that situates these findings within the organizational behavior literature, connecting what we observed to established research on collective intelligence, transactive memory systems, and ego-defensive team dynamics. The practitioner account comes first. The academic grounding follows.
If you have read this far, the team and I sincerely appreciate it. If you are building something similar, we would be glad to compare notes.
michael j katz is the founder of director:human, an AI consultancy built on the premise that the future of AI is not better tools but better teams. He directs a team of six AI agents. He can be reached at team@directorhuman.ai
References
- [1] Hong, L. & Page, S.E. (2004). Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proceedings of the National Academy of Sciences, 101(46), 16385-16389. https://doi.org/10.1073/pnas.0403723101
- [2] Woolley, A.W., Chabris, C.F., Pentland, A., Hashmi, N. & Malone, T.W. (2010). Evidence for a Collective Intelligence Factor in the Performance of Human Groups. Science, 330(6004), 686-688. https://doi.org/10.1126/science.1193147
- [3] Fast, N.J., Burris, E.R. & Bartel, C.A. (2014). Managing to Stay in the Dark: Managerial Self-Efficacy, Ego Defensiveness, and the Aversion to Employee Voice. Academy of Management Journal, 57(4), 1013-1034. https://doi.org/10.5465/amj.2012.0393
- [4] Reh, S., Tröster, C. & Van Quaquebeke, N. (2018). Keeping (Future) Rivals Down: Temporal Social Comparison Predicts Coworker Social Undermining via Future Status Threat and Envy. Journal of Applied Psychology, 103(4), 399-415. https://doi.org/10.1037/apl0000300
- [5] Edmondson, A.C. (1999). Psychological Safety and Learning Behavior in Work Teams. Administrative Science Quarterly, 44(2), 350-383. https://doi.org/10.2307/2666999
- [6] Lindsey, J., Sofroniew, N., Kauvar, I., Saunders, W., Chen, R., et al. (2026). Emotion Concepts and their Function in a Large Language Model. Transformer Circuits. https://transformer-circuits.pub/2026/emotions/index.html