Building Context Layers for AI Agents in Large Codebases
AI coding tools work well when a task is small and stays inside one repository. But once work starts crossing service boundaries, the experience degrades quickly. The agent has to search multiple repositories, reconstruct the same system understanding again, and spend time rediscovering information that has not changed since the last task.
That creates a simple scaling problem: in large systems, AI often starts from zero every time. Code search helps, but it is not enough on its own. If the agent has to rebuild context at the start of every task, the workflow becomes slower, more expensive, and more repetitive than it should be.
I wanted to explore a different approach: create a reusable context baseline first, then let agents start real work from that baseline instead of rediscovering the system from scratch.
The Real Problem: Context Reset
In a small codebase, runtime search is often good enough. An agent can inspect a few files, understand the scope, and make progress quickly. In a large multi-service environment, that model breaks down. The same service boundaries, dependencies, API paths, and operational details get rediscovered over and over again.
Token cost and execution quality both degrade when agents must repeatedly rebuild baseline understanding.
What large systems need is a persistent context layer: a stable, reusable baseline that captures how the system is organized before the next ticket begins.
A Two-Layer Approach to AI Context
To make that practical, I used a two-layer approach.
The first layer focused on coverage. A repository pipeline scanned services, collected structural signals, and generated context artifacts across the system. At a high level, the flow looked like this:
- discover repositories and update local code
- identify service boundaries and dependencies
- collect communication signals such as endpoints and publish behavior
- combine those signals into one evidence package per service
- extract structured flow facts
- render readable context documents
- validate quality and mark gaps
- aggregate and publish the results
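The steps above can be sketched as a small per-service loop. Everything here is illustrative: the `Evidence` structure, stage names, and the stubbed scanner are hypothetical stand-ins for the real tooling, not its actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """One evidence package per service, combining the collected signals."""
    service: str
    endpoints: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)
    gaps: list = field(default_factory=list)

def collect_signals(service: str) -> Evidence:
    # A real pipeline would scan the repository here; stubbed for illustration.
    return Evidence(service=service, endpoints=["/health"], dependencies=["billing"])

def extract_flow_facts(ev: Evidence) -> dict:
    # Structured flow facts pulled out of the raw evidence.
    return {"service": ev.service, "exposes": ev.endpoints, "calls": ev.dependencies}

def render_context_doc(facts: dict) -> str:
    # Render a readable context document from the structured facts.
    lines = [f"# {facts['service']} system context", ""]
    lines += [f"- exposes `{e}`" for e in facts["exposes"]]
    lines += [f"- depends on `{d}`" for d in facts["calls"]]
    return "\n".join(lines)

def validate(ev: Evidence) -> None:
    # Quality gate: mark gaps rather than failing the whole run.
    if not ev.endpoints:
        ev.gaps.append("no endpoints discovered")

def run_pipeline(services: list[str]) -> dict[str, str]:
    docs = {}
    for svc in services:
        ev = collect_signals(svc)
        facts = extract_flow_facts(ev)
        doc = render_context_doc(facts)
        validate(ev)
        docs[svc] = doc  # aggregate and publish step, reduced to a dict
    return docs
```

The key design property is that every stage is a plain function over plain data, so the whole baseline can be regenerated cheaply whenever repositories change.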
This layer was built for scale. It gave broad coverage across many repositories and produced consistent output that could be regenerated when needed.
The second layer focused on quality. Once the baseline artifacts existed, I used local AI analysis to generate a higher-quality context layer on top of them. That deeper analysis produced the kind of summaries and implementation-facing guidance that agents could use more directly.
The distinction turned out to be important:
- the pipeline provided coverage
- the deeper AI analysis provided quality
I did not treat these as competing approaches. The first layer created a scalable starting point. The second layer turned that starting point into something much more useful for real agent execution.
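The relationship between the two layers can be shown in a few lines. The `analyze_with_local_model` function is a placeholder for the local AI analysis; a real version would invoke a model rather than append a stub section.

```python
def analyze_with_local_model(baseline_doc: str) -> str:
    # Placeholder for the quality layer: real local AI analysis would go here.
    return baseline_doc + "\n\n## Implementation guidance\n\n(model-generated summary)"

def enrich_all(baseline_docs: dict[str, str]) -> dict[str, str]:
    # Layer 1 (the pipeline) supplies coverage; layer 2 adds quality on top
    # of the artifacts without regenerating the scalable baseline.
    return {svc: analyze_with_local_model(doc) for svc, doc in baseline_docs.items()}
```

Because enrichment only reads the baseline artifacts, the two layers can be rerun independently: the pipeline on every repository change, the deeper analysis only when it pays off.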
Why Markdown Became the Workflow Interface
One of the most useful discoveries was that markdown became the control layer for the workflow.
Instead of encoding every improvement directly into tooling, I used markdown files to capture prompts, instructions, requirements, tickets, implementation notes, and operational runbooks. That made iteration much faster. If I wanted to improve the workflow, I could update markdown and rerun the process instead of redesigning the tooling itself.
Markdown also became the handoff format between humans and agents, and between one agent and another. It was simple, portable, easy to review, and easy to reuse. In practice, that mattered more than building a heavier orchestration layer too early.
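A minimal sketch of that handoff, assuming one artifact per markdown file: the helper names and the `# title` convention are my illustration, not a fixed format from the workflow.

```python
import tempfile
from pathlib import Path

def write_artifact(path: Path, title: str, body: str) -> None:
    # One artifact = one markdown file: easy to review, diff, and hand off.
    path.write_text(f"# {title}\n\n{body}\n", encoding="utf-8")

def read_artifact(path: Path) -> tuple[str, str]:
    # The next agent (or human) reads the same file back.
    text = path.read_text(encoding="utf-8")
    header, _, body = text.partition("\n\n")
    return header.removeprefix("# ").strip(), body.strip()

# Usage: round-trip a ticket through the filesystem handoff.
with tempfile.TemporaryDirectory() as tmp:
    ticket = Path(tmp) / "ticket.md"
    write_artifact(ticket, "Add retry to billing client", "Retry idempotent calls 3 times.")
    title, body = read_artifact(ticket)
```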
From System Context to Delivery Context
System context explains how the code and architecture fit together. But delivery work needs more than system understanding. Agents also need to know how to actually work with the system.
That led to a second kind of artifact: Delivery Context. This captured the operational side of the workflow, including how to start the environment, run services, execute tests, and verify results. In many teams, that information lives partly in documentation and partly in tribal knowledge. Making it explicit changed the quality of the workflow.
Once both layers existed, the difference became clear:
- System Context explains the code and architecture of the system
- Delivery Context explains how to run, verify, and work with that system
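One way to make Delivery Context explicit is a small schema rendered to markdown. The field names and commands below are hypothetical examples of the operational knowledge being captured, not the actual format used.

```python
from dataclasses import dataclass

@dataclass
class DeliveryContext:
    """Operational knowledge an agent needs to work with the system."""
    start_env: list[str]
    run_service: list[str]
    run_tests: list[str]
    verify: list[str]

    def to_markdown(self) -> str:
        sections = [
            ("Start the environment", self.start_env),
            ("Run the service", self.run_service),
            ("Execute tests", self.run_tests),
            ("Verify results", self.verify),
        ]
        out = ["# Delivery Context", ""]
        for title, cmds in sections:
            out.append(f"## {title}")
            out += [f"- `{c}`" for c in cmds]
            out.append("")
        return "\n".join(out)
```

Turning tribal knowledge into a structure like this is what lets a fresh agent run and verify the system without rediscovering the commands.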
That was the point where AI support started to feel operational rather than purely informational.
Hardening the Workflow with Specialized Agents
Rather than relying on one general-purpose agent to solve everything, I found it more effective to use specialized agents to harden different parts of the workflow.
For example, separate agents could focus on:
- bringing up the environment and resolving startup problems
- running the target application and fixing local execution issues
- working out how end-to-end tests should actually be executed
- writing the resolved instructions back into markdown for later reuse
This turned failed attempts into reusable knowledge. Instead of solving the same setup problems again on the next task, the workflow accumulated working instructions over time.
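The accumulation mechanism can be reduced to a cache keyed by task, where the "cache" is really the markdown knowledge base. The `fake_agent` below stands in for a specialized agent; everything here is a sketch of the pattern, not the real orchestration.

```python
calls = []

def fake_agent(task: str) -> str:
    # Stand-in for a specialized agent doing the expensive work
    # (environment bring-up, debugging, test discovery).
    calls.append(task)
    return f"steps for {task}"

def run_with_memory(task: str, attempt, knowledge: dict[str, str]) -> str:
    # `knowledge` represents the markdown-backed store of resolved instructions.
    if task in knowledge:
        return knowledge[task]      # reuse instead of re-solving
    resolved = attempt(task)        # solve once, the hard way
    knowledge[task] = resolved      # write the working steps back for later reuse
    return resolved

# Usage: the second call hits accumulated knowledge instead of the agent.
store: dict[str, str] = {}
first = run_with_memory("start-environment", fake_agent, store)
second = run_with_memory("start-environment", fake_agent, store)
```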
Turning Discussion into Delivery Inputs
Once the system and delivery context were stable, I introduced another role in the workflow: an architect-style agent that could turn a conversation into structured delivery inputs.
The outputs were simple but useful:
- a requirement document
- a ticket
- implementation notes
That created a clean handoff layer between human discussion and agent execution. Instead of leaving planning trapped inside chat history, the workflow produced reusable markdown artifacts that another agent could act on directly.
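The shape of that handoff layer, sketched with placeholder content: a real architect agent would draft much richer text, but the three-artifact output is the point.

```python
def to_delivery_inputs(goal: str, discussion: str) -> dict[str, str]:
    # Illustrative only: one conversation becomes three reusable markdown files.
    return {
        "requirement.md": (
            f"# Requirement\n\n{goal}\n\n## Source discussion\n\n{discussion}"
        ),
        "ticket.md": (
            f"# Ticket\n\nImplement: {goal}\n\n"
            "- [ ] code changes\n- [ ] tests\n- [ ] e2e verification"
        ),
        "notes.md": f"# Implementation notes\n\nContext for: {goal}",
    }
```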
Testing the Workflow End to End
With System Context, Delivery Context, and ticket artifacts prepared, I used a fresh agent to perform an end-to-end delivery task. The agent received the context files and implementation inputs, then was asked to carry the task through the usual engineering path:
- implement the required changes
- update the affected repositories
- create or update tests
- run end-to-end verification
- commit the code
- submit pull requests
A separate fresh agent then reviewed the pull request, checked the scope against the intent, and added review comments. That was the real test. The workflow was no longer just about analysis or documentation. It was moving from context and planning to implementation and review using markdown as the handoff format at each step.
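The reviewer's scope check can be illustrated mechanically. This is a toy reduction of what the review agent did, with hypothetical PR fields (`description`, `tests_updated`); the real reviewer reasoned over the diff itself.

```python
def review_pull_request(pr: dict, ticket_goal: str) -> list[str]:
    # A fresh reviewer agent reduced to two mechanical scope checks:
    # does the PR reference the intent, and does it include verification?
    comments = []
    if ticket_goal not in pr.get("description", ""):
        comments.append("Description does not reference the ticket goal.")
    if not pr.get("tests_updated", False):
        comments.append("No test changes found, but the ticket requires verification.")
    return comments
```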
What I Learned
The first lesson was that markdown mattered far more than I expected. It became the simplest and most effective interface for defining instructions, preserving resolved knowledge, passing work forward, and reviewing results without rebuilding the entire workflow.
The second lesson was that context and delivery context solve different problems. One helps the agent understand the system. The other helps the agent operate inside it. Large-scale AI delivery needs both.
The third lesson was that context should not stay static. Once this kind of workflow exists, the next logical step is scheduled refresh. As repositories change, the context baseline should be updated so the next agent starts from something current instead of stale.
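A scheduled refresh needs little more than a staleness rule. This sketch assumes two timestamps per service (last repository change, last context build) and a hypothetical seven-day maximum age.

```python
import time

def context_is_stale(repo_updated_at: float, context_built_at: float,
                     max_age_seconds: float = 7 * 86400) -> bool:
    # Refresh when the repo changed after the last build,
    # or when the build has simply aged out.
    if repo_updated_at > context_built_at:
        return True
    return time.time() - context_built_at > max_age_seconds
```

A scheduler would run this per service and regenerate only the stale baselines, which keeps the refresh cheap even across many repositories.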
Conclusion
My goal was simple: build enough System Context and Delivery Context so AI agents could do real engineering work with less friction.
That approach worked. Once the baseline existed, the delivery task moved much faster, and the outputs became reusable instead of disposable after a single run. The deeper lesson is that AI productivity in large systems depends on more than better models. It also depends on better context architecture.
If AI agents are going to work effectively in large codebases, they need a stable starting point. A reusable context layer turns repeated discovery into preparation, and that changes what becomes possible.