
How we run software projects with AI at Orus

· 14 min read
Pierre Coimbra
Software Engineer

On our last project, the way we planned, built and reviewed the work changed more than the code itself. AI agents were part of the project from day one, no longer a tool we used from time to time. This post is about that method.

The project was a rebuild of our organizations module, one of the oldest and most central parts of the codebase. It models partners, brokers and their members, and it sits upstream of most of what the platform does. Replacing it had been on the backlog for a while. It was the kind of project that stays "not yet" because the risk surface is high: everything touches it, and touching it wrong has cascading consequences.

We shipped it. We sized it at around 30 working days, shipped in about 20, and every milestone landed ahead of its ETA. But the schedule is not the point. The method is, and it is one we want to apply to all our next projects, not just this one.

Why this was a good test case

Replacing a central module in a monorepo is not like adding a feature. The blast radius is wide. You are rewriting something dozens of other modules depend on, in an event-sourced architecture where changing a store contract ripples into the views, reducers and reactors across the whole codebase.

The project had exactly the characteristics that make AI assistance either very useful or very dangerous, depending on how you structure it:

  • High contextual complexity. The agent needs to understand our stores, views and event-handler patterns before it can do anything useful. Generic knowledge of TypeScript and event sourcing is not enough.
  • A wide dependency graph. Any change to the organizations store contract touches downstream views, reducers and reactors in other domains.
  • A mix of reversible and irreversible decisions. Some choices are easy to fix later. Others are not.

Project memory: what the agent knows before it writes anything

The biggest lesson from earlier work with Claude Code is that output quality is determined by how much the agent knows about our codebase. A cold agent writes correct-looking TypeScript that does not compile, invents store patterns that do not exist here, and misses invariants every developer knows by heart. Not because the model is bad, but because it is flying blind.

The monorepo does a lot of the work for us. The agent can go through the real dependency graph, read the actual store definitions from other domains, and see what a view in our codebase concretely looks like. It can compare how the subscriptions module structures its stores to how the claims module does it, and derive the conventions. That is structural context a multi-repo setup cannot give you for free.

On top of that, a single instruction file pins down the conventions that are easy to get wrong. It is committed to the repo, shared by every agent, and deliberately short. A few of the rules from ours:

## Critical rules

1. Never use try/catch for app errors. Use Result types from `@orus.eu/result`.
2. Never use `as` type assertions. Fix the type at the source instead.
3. Views are the only classes allowed to read from stores. Services and routers go through views.
4. Use `timeService.getCurrentDate()`, not `new Date()`, in backend services.
5. Always run `yarn format` before committing. There is no pre-commit hook, CI will fail.
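Rules 1 and 4 are easier to picture with a small example. The sketch below is illustrative only: the `Result` shape, the `ok` and `err` helpers, the `TimeService` interface and the `checkInvitation` function are hypothetical stand-ins, not the actual API of `@orus.eu/result` or of the real time service.

```typescript
// Hypothetical stand-in for a Result type; the real `@orus.eu/result` API may differ.
type Result<T, E> =
  | { readonly ok: true; readonly value: T }
  | { readonly ok: false; readonly error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Hypothetical interface mirroring the `timeService.getCurrentDate()` call from rule 4.
interface TimeService {
  getCurrentDate(): Date;
}

type InvitationError = "invitation-expired";

// The clock is injected, so a test can pin "now" instead of depending on `new Date()`.
function checkInvitation(
  timeService: TimeService,
  expiresAt: Date,
): Result<{ remainingMs: number }, InvitationError> {
  const remainingMs = expiresAt.getTime() - timeService.getCurrentDate().getTime();
  // An expected failure is returned as a value rather than thrown.
  return remainingMs > 0
    ? ok({ remainingMs })
    : err<InvitationError>("invitation-expired");
}

// A fixed clock, as a test would provide it.
const fixedClock: TimeService = {
  getCurrentDate: () => new Date("2024-01-01T00:00:00Z"),
};
```

The caller has to check `ok` before it can read `value`, so the failure path shows up in the types instead of hiding in a `catch`.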

What neither the code nor the rules can provide is the intent behind decisions: the domain rules that live in people's heads, the invariants that are not expressible in TypeScript. For that we write markdown files, scoped to the project:

| File | What it holds |
| --- | --- |
| Project memory | The single source of truth: every decision taken and why, so any agent we start picks up the full context. |
| Per-milestone / per-phase files | The detail for one slice of the work is kept separate so parallel agents on different milestones do not interfere. |

These are written for the project and used during it, not maintained as a permanent knowledge base. The agent reads them before it writes anything. The monorepo gives the structural context, the markdown gives the project context, and neither alone is enough.

Here is the kind of thing that goes into the project memory file:

# Project memory: organizations v2

## Decisions
- Membership is its own entity, keyed by `userId` (not email). Email lookups become
  an explicit user_account -> membership join at the call site. (IRREVERSIBLE)
- Platform roles (the old `orus-staff.ts` lists) are modeled as memberships to the
  `orus` organization, with `role` in `PlatformRole`. (IRREVERSIBLE)
- Cut consumers over to v2 in two waves per store: switch reads first, then deprecate
  the v1 mutation entry points. No fallback path, no feature flag. The drift test is
  the safety net.

## Invariants
- A consumer migrated in the read wave must contain zero `store.append` calls for the
  store being migrated. No read-from-v1 / write-to-v2 in the same flow.
- The forward dispatcher (v1 -> v2) must stay green. Weak dispatcher, stale reads.

## Footguns
- Never retire the orus-staff sync while the source file still feeds it. Running the
  sync against an empty desired state revokes every platform-role membership.
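An invariant like the first one above is also something a machine can check. The sketch below assumes the migrated store is reached through a known identifier. The `organizationStore` name and the `findForbiddenAppends` helper are hypothetical, and a line-by-line text scan is a simplification of what a real lint rule or AST check would do.

```typescript
interface Violation {
  file: string;
  line: number;
  text: string;
}

// Scans source text for `<storeName>.append(` calls, which the read-wave
// invariant forbids in a consumer that has already been migrated.
function findForbiddenAppends(
  sources: Readonly<Record<string, string>>, // file path -> file content
  storeName: string,
): Violation[] {
  // Escape regex metacharacters so the store name is matched literally.
  const escaped = storeName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`\\b${escaped}\\s*\\.\\s*append\\s*\\(`);
  const violations: Violation[] = [];
  for (const file of Object.keys(sources)) {
    const lines = (sources[file] ?? "").split("\n");
    lines.forEach((text, index) => {
      if (pattern.test(text)) {
        violations.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return violations;
}

// Hypothetical file contents, for illustration only.
const migratedConsumers: Record<string, string> = {
  "reader.ts": "const org = organizationViewV2.get(id);\nreturn org;",
  "mixed.ts": "const org = organizationViewV2.get(id);\nawait organizationStore.append(event);",
};
```

Calling `findForbiddenAppends(migratedConsumers, "organizationStore")` reports the `append` call in `mixed.ts` and nothing for `reader.ts`.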

Reversible versus irreversible decisions

Not all decisions carry the same cost of being wrong, and we try to be explicit about the difference.

Reversible decisions are easy to change later: the name of a helper, how a file is structured inside a module, the order of fields in a view type. We delegate these to the agent entirely. It picks, we move on.

Irreversible decisions are hard or impossible to undo once committed. In an event-sourced system this category is larger than in a CRUD app:

  • A new event type added to a store is part of permanent history once it is in production. Changing its structure later means a migration.
  • A cross-domain contract is an API. The endpoints we expose to create and update organizations are an example. Once another system depends on them, changing them is a coordinated effort.
  • A structural choice, like whether membership is modeled as its own entity or as a field on the organization, shapes every store, view and reactor built on top, so it is expensive to walk back.
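To illustrate the first bullet: once a v1 event is in production history, it stays there, and every later shape has to be derived from it. A minimal sketch, assuming a hypothetical `member-added` event whose v1 identified members by email and whose v2 uses `userId`. The event names, fields and the `upcast` function are invented for illustration and are not taken from the real stores.

```typescript
interface MemberAddedV1 {
  type: "member-added";
  version: 1;
  organizationId: string;
  email: string;
}

interface MemberAddedV2 {
  type: "member-added";
  version: 2;
  organizationId: string;
  userId: string;
}

type StoredEvent = MemberAddedV1 | MemberAddedV2;

// Converts any stored version to the latest shape at read time.
// The v1 events themselves are never rewritten: history is permanent.
function upcast(
  event: StoredEvent,
  userIdByEmail: Readonly<Record<string, string>>,
): MemberAddedV2 | undefined {
  if (event.version === 2) return event;
  const userId = userIdByEmail[event.email];
  // An email with no matching account cannot be upcast; the caller decides what to do.
  if (userId === undefined) return undefined;
  return {
    type: "member-added",
    version: 2,
    organizationId: event.organizationId,
    userId,
  };
}

const legacyEvent: StoredEvent = {
  type: "member-added",
  version: 1,
  organizationId: "org-1",
  email: "a@example.com",
};
```

Every consumer pays for this conversion for as long as v1 events exist, which is the cost the checkpoint is meant to weigh up front.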

For anything irreversible, we add an explicit checkpoint. The agent lays out the options with their pros and cons, and depending on the stakes we dig deeper, explore alternatives, challenge the agent, and validate the chosen one with the rest of the team. This mostly happens in the exploration phase, before development starts. It is not a formal checklist, it is a judgment step that sharpens as you work.

It pays off in concrete ways. One of the design documents for this project flagged a single ordering rule as the only hard constraint of an otherwise mechanical migration: never retire the function that syncs internal staff roles while the file that feeds it still exists, because running the sync against an empty desired state would revoke every internal membership at once. Surfacing that footgun on paper, before any code, is the cheapest place to catch it.
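The footgun is easy to reproduce in a few lines. The sketch below assumes the sync is a plain reconciliation between a desired list and the current memberships. The function names are hypothetical, and the guard is one possible safeguard shown for illustration, not a description of the actual sync.

```typescript
interface SyncPlan {
  grant: string[];
  revoke: string[];
}

// Naive reconciliation: anything not in `desired` gets revoked.
function diffMemberships(desired: readonly string[], current: readonly string[]): SyncPlan {
  return {
    grant: desired.filter((id) => current.indexOf(id) === -1),
    revoke: current.filter((id) => desired.indexOf(id) === -1),
  };
}

type SyncRefusal = "empty-desired-state";

// Guarded variant: an empty desired state against existing memberships is
// treated as a missing source, not as an instruction to revoke everyone.
function planRoleSync(
  desired: readonly string[],
  current: readonly string[],
): SyncPlan | SyncRefusal {
  if (desired.length === 0 && current.length > 0) return "empty-desired-state";
  return diffMemberships(desired, current);
}
```

With `desired = []` and two current members, `diffMemberships` plans two revocations while `planRoleSync` refuses to produce a plan.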

The flow

The project ran as one loop: brainstorm, plan, build in subagents with tests first, review, ship. Each step has a clear exit condition before the next one starts, and the human stays on the gates that matter.

Brainstorm

Before any code, we work the rough idea through questions and alternatives. The design is presented back in sections for validation, then written to a design document. This is where the irreversible decisions get surfaced and arbitrated up front, while they are still cheap to change. Some are settled on the spot. The heavier ones go through a tech study review with the team before development starts.

The output is a document that ends with the decisions made explicit, so there is no ambiguity left for the build phase:

## Resolved decisions
- Minimal pending creation. The rest is completed on the existing edit page.
- Reuse the detail card's components, layout and constants, keep a minimal field set.
- `technicalName` is generated by the backend, not a user input.
- Permission granted to `techAdmin` implicitly, via its existing all-permissions filter.
- The router sets no values. Defaults are assembled at the frontend call site.

Plan

Once the design is approved, it is broken into bite-sized tasks, a few minutes of work each, with exact file paths, the code to write, and verification steps. Because we are on a monorepo, the plan's impact map is not an estimate, it is a traversal of the real dependency graph. The agent can tell us, concretely, that switching a given store over to its v2 reads touches these specific files in other domains:

impact of cutting organizations over to v2 reads (read wave)

store    organization v2                      already built, kept fresh by the forward dispatcher
readers  ~28 backend files                    swap organizationView -> organizationViewV2, pure read swap
view     subscriptions-persisted-view-30/31   rebuild must run after organization_v2 is initialized (DI order)
service  partner-api                          reads org by token, cross-domain, behavior must stay identical
writer   applyOrganizationsChangeV3           deprecate only in the second wave, after every reader has moved

That is not analysis, it is a grep. And each task reads like a recipe, with the verification baked in:

## Task 1: declare the `organization.create` permission (techAdmin only)

- [ ] Step 1: write the failing guard test
- [ ] Step 2: run it, confirm it fails
- [ ] Step 3: add `organization.create` to the permissions array
- [ ] Step 4: add the documentation entry
- [ ] Step 5: run the test, confirm it passes
- [ ] Step 6: type-check the package
- [ ] Step 7: checkpoint.

We review the plan, correct misunderstandings, make the calls flagged as ambiguous, and sign off before execution starts. The more thorough the plan, the more latitude the agents get on the rest.
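The impact map described above comes down to walking import edges backwards from the file that changes. A minimal sketch, assuming the import graph is already available as a plain object. The `ImportGraph` type, the `findImpactedFiles` function and the file names are hypothetical. A real setup would derive the graph from the compiler or a search over the repo.

```typescript
// Hypothetical import graph: each file maps to the files it imports.
type ImportGraph = Readonly<Record<string, readonly string[]>>;

function findImpactedFiles(graph: ImportGraph, changedFile: string): string[] {
  // Invert the edges: for each file, which files import it?
  const importersOf = new Map<string, string[]>();
  for (const file of Object.keys(graph)) {
    for (const dependency of graph[file] ?? []) {
      const importers = importersOf.get(dependency) ?? [];
      importers.push(file);
      importersOf.set(dependency, importers);
    }
  }

  // Breadth-first walk from the changed file through its transitive importers.
  const seen = new Set<string>([changedFile]);
  const queue: string[] = [changedFile];
  const impacted: string[] = [];
  for (let current = queue.shift(); current !== undefined; current = queue.shift()) {
    for (const importer of importersOf.get(current) ?? []) {
      if (seen.has(importer)) continue; // already visited, also protects against cycles
      seen.add(importer);
      impacted.push(importer);
      queue.push(importer);
    }
  }
  return impacted.sort();
}

// Illustrative graph only; these paths are made up.
const exampleGraph: ImportGraph = {
  "organizations/store.ts": [],
  "organizations/view.ts": ["organizations/store.ts"],
  "subscriptions/persisted-view.ts": ["organizations/view.ts"],
  "partner-api/service.ts": ["organizations/view.ts"],
  "claims/view.ts": [],
};
```

Here `findImpactedFiles(exampleGraph, "organizations/store.ts")` returns the view and its two downstream consumers, and leaves `claims/view.ts` out because nothing connects it to the store.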

Build in subagents, tests first

With a validated plan, a fresh subagent picks up each task. Because the tasks were designed to be independent during planning, they run in parallel: one agent on the new store and event types, one on the views and reducers, one on the reactors, one on the new UI, one on the tests. One of us can advance all of them at once.

We split the work across model tiers to match the cost to the job. The orchestrator runs on the strongest model we have, Claude Opus, because breaking a task down and judging when the output is stable is where the hard reasoning lives. The well-scoped subtasks it delegates run on a faster, cheaper model, Claude Sonnet. The expensive thinking happens once, at the top of the loop, where it pays for itself.

A subagent does not generate once and stop. The loop acts as an orchestrator. It breaks a task into subtasks, delegates, runs the TypeScript compiler and the tests, reads the errors, fixes them and re-runs until it reaches a stable state. Implementation follows a test-first discipline: write a failing test, watch it fail, write the minimal code to pass, watch it pass, commit. What we pick up is code that compiles and passes the tests, not a diff to debug. The loop leans entirely on the test suite. Weak tests, weak loop.
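The shape of that loop can be sketched in a few lines. This is a simplified illustration, not the actual orchestration: `runChecks` stands in for the type-check and test run, `applyFix` stands in for the agent editing code, and the `maxAttempts` cap is an assumption added so the sketch always terminates.

```typescript
interface CheckResult {
  passed: boolean;
  errors: string[];
}

interface LoopOutcome {
  stable: boolean;
  attempts: number;
  lastErrors: string[];
}

function runUntilStable(
  runChecks: () => CheckResult, // hypothetical: compiler + test suite
  applyFix: (errors: readonly string[]) => void, // hypothetical: the agent's edit step
  maxAttempts: number,
): LoopOutcome {
  let lastErrors: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = runChecks();
    if (result.passed) {
      return { stable: true, attempts: attempt, lastErrors: [] };
    }
    lastErrors = result.errors;
    // Only fix if there is another attempt left to verify the fix.
    if (attempt < maxAttempts) applyFix(result.errors);
  }
  // Out of attempts: hand back the remaining errors instead of pretending it is done.
  return { stable: false, attempts: maxAttempts, lastErrors };
}
```

The loop only ever learns from what `runChecks` reports, which is the "weak tests, weak loop" point in code form: an error the checks cannot see is an error the loop never fixes.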

This is where most of the time saved came from. Not from typing faster, but from not carrying the debugging load ourselves. And the gain was not only in coding. Scoping, impact analysis and planning all moved faster too, which is why the whole project came in well under estimate.

Review, then ship

The bottleneck shifts from "how fast can I write this" to "how fast can I review what comes back", which is the better constraint to be limited by. It is the same idea as treating quality as the pace-maker rather than a tax on speed: the review is what keeps the pace sustainable. Each task comes back with a two-stage review: spec compliance first, then code quality. To keep the volume from flooding us, an automated pass runs before a PR reaches a human: consistency with repo conventions, edge cases not covered by tests, regressions in downstream modules, exhaustiveness on event-version switches. What reaches a person has already cleared that filter, so the conversation is about approach rather than mechanics.
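Exhaustiveness on event-version switches is the kind of check the compiler can carry. A minimal sketch, using a hypothetical two-version event (the `OrganizationRenamed` shape and its fields are invented for illustration):

```typescript
type OrganizationRenamed =
  | { version: 1; name: string }
  | { version: 2; displayName: string; technicalName: string };

// Reaching this function means a case was missed. Because the parameter is
// typed `never`, the call below only type-checks when every version is handled.
function assertNever(value: never): never {
  throw new Error(`Unhandled event version: ${JSON.stringify(value)}`);
}

function displayNameOf(event: OrganizationRenamed): string {
  switch (event.version) {
    case 1:
      return event.name;
    case 2:
      return event.displayName;
    default:
      // Adding a `version: 3` member to the union turns this line into a compile error.
      return assertNever(event);
  }
}
```

With this pattern, adding a new version to the union fails the type-check at every switch that has not been updated, so the gap is caught before review.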

Each of these steps can be enforced by an agent skill that activates at the right moment, so the discipline holds through tooling rather than willpower. A skill is a small workflow document the agent loads on demand:

---
name: prepare-pr
description: Prepare and create a pull request with full sanity checks.
---

# Prepare Pull Request
Runs pre-PR sanity checks (format, lint, type-check, tests), then creates a
properly formatted draft pull request using the repository's template.

No moving from a rough idea to code without a validated design. No implementation before there is a failing test.

The loop is only as strong as the feedback it runs against, and two areas are still work in progress: end-to-end tests, where the agents have a harder time using our test utilities, and the frontend, where keeping the output clean and aligned with our design system takes more guidance than the backend does. We are actively improving how the agents handle both. To manage expectations: this is a practice we are still tuning, not a solved problem.

What we kept human

The planning gate and the reversible/irreversible split do a lot of the work, but two things stay with us on purpose.

  • Irreversible architecture calls. Anything that is hard to walk back goes through a human before it ships, by design. The agent can suggest an option and defend it. The decision to commit is ours.
  • The judgment calls that are not in the spec. Sometimes the right answer is not written down anywhere: a modeling choice that will age better, a trade-off that only makes sense if you know where the product is heading. The agent optimizes for what it can see. We own what it cannot.

There is a second reason to keep a human on the planning gate, beyond catching mistakes. It is how understanding spreads. Every time we go through it, the team builds a sharper, shared picture of the project, the kind that makes the next decision faster. Handing the gate to a fully autonomous agent would save a few minutes and cost us the most useful feedback loop we have.

Where this is heading

We are also pushing on the edges of the loop. A script can already take a ticket, clone a fresh workspace, and run the brainstorm-to-PR sequence with little supervision, and a bot handles the mechanical checks before a human looks at the result. It is early and we keep it on a short leash, but the direction is clear. The more of the scaffolding the agents own, the more our attention is free for the decisions that actually need a human. The bottleneck we are chasing next is the review itself, not the writing.

Conclusion

The organizations module is shipped, but the method is the part we are keeping. Coding was never really the bottleneck on this project. The work that mattered was upstream: the context we gave the agents, and the decisions we made before letting them run. That is exactly where we try to move the work in general, because the cheapest bug is the one you kill before you write it. It is where we will keep putting the effort on the next project.

If this is the kind of work you want to do, building the AI workflow as much as the product it ships, we are hiring software engineers in Paris. Have a look at our open positions.