Building Scalable AI Agent Systems: Three Evolutions
I. December 2025
We needed to add a new feature: group discussions (discussionChat).
This should've been simple. We already had interviewChat—one-on-one conversations where users deeply engage with AI-simulated personas. Group discussion was just scaling from 1-to-1 to 1-to-many: 3-8 personas engaging simultaneously, watching perspectives collide and insights emerge.
In theory, we just needed to:
- Reuse the interview logic
- Adjust prompts to simulate group dynamics
- Tweak the UI to show multiple speakers
The reality: We had to modify 12 files.
Worse, we discovered this:
Three nearly identical agent wrappers. Every new feature required copy-pasting across all three. Every bug fix meant changing it three times.
In that moment, we realized something was fundamentally wrong.
The problem wasn't that our code lacked elegance or abstraction. It was that we were building AI Agent systems with traditional software engineering thinking.
This article chronicles how we escaped this trap—through three architectural evolutions, rethinking how AI Agents should be built from first principles.
II. Rethinking: What is an AI Agent?
Before refactoring, we stopped to ask a fundamental question:
What's the essential difference between AI Agents and traditional software?
The World of Traditional Software
Traditional software is built on state machines:
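A minimal sketch of such a state machine, using a hypothetical order workflow (the states and events are invented for illustration and do not come from our codebase):

```typescript
// Hypothetical order workflow; states and events are illustrative only.
type OrderState = "draft" | "submitted" | "approved" | "rejected";
type OrderEvent = "submit" | "approve" | "reject";

// Explicit transition table: each (state, event) pair maps to at most one next state.
const transitions: Record<OrderState, Partial<Record<OrderEvent, OrderState>>> = {
  draft: { submit: "submitted" },
  submitted: { approve: "approved", reject: "rejected" },
  approved: {},
  rejected: {},
};

// Deterministic transition: anything not listed in the table is an error.
function nextState(state: OrderState, event: OrderEvent): OrderState {
  const target = transitions[state][event];
  if (target === undefined) {
    throw new Error(`Invalid transition: ${state} + ${event}`);
  }
  return target;
}
```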
This model's core assumptions:
- State is explicit: I know exactly where I am
- Transitions are deterministic: Given state + event, next state is unique
- Control is precise: if-else covers all paths
This works beautifully for traditional software. But for AI Agents?
The World of AI Agents
LLMs don't work this way:
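A sketch of what the model actually receives on each turn. The message shape and the conversation content are hypothetical:

```typescript
// Hypothetical message shape; real SDKs use richer types (tool calls, attachments).
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Illustrative conversation. No field records "where we are":
// progress is implied by what has already been said.
const conversation: ChatMessage[] = [
  { role: "system", content: "You are a research assistant." },
  { role: "user", content: "I want to understand young people's coffee preferences." },
  { role: "assistant", content: "I'll interview five college students. Shall I start?" },
  { role: "user", content: "Yes, go ahead." },
];

// The entire history is the model input on every turn.
function buildModelInput(messages: ChatMessage[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}
```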
Where's the "state" here?
- Not in a `state` field
- But in the entire conversation history
The AI infers from conversation history:
- What research does the user want?
- How far have we progressed?
- What should happen next?
This is a completely different paradigm.
Three Core Insights
From this observation, we derived three insights that shaped our architectural evolution.
Insight 1: Conversation as State
Traditional approach: Maintain explicit state
AI-native approach: Infer state from conversation
Why is conversation superior to state machines?
- Natural alignment: LLMs work on message history natively
- Strong fault tolerance: State machines are hard to recover from errors; conversations can be "rewound" and replayed
- Easy extension: Adding new capabilities doesn't require modifying state graphs
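The "rewind and replay" point can be sketched as follows. The message shape and the `rewindTo` helper are hypothetical; the idea is only that recovery means truncating the history and continuing from there, rather than repairing a corrupted state field.

```typescript
type ChatMessage = {
  id: string;
  role: "user" | "assistant" | "tool";
  content: string;
};

// Return a copy of the history up to and including the message with `id`.
// Everything after that point is discarded, so a failed step can be retried.
function rewindTo(messages: ChatMessage[], id: string): ChatMessage[] {
  const index = messages.findIndex((m) => m.id === id);
  if (index === -1) {
    throw new Error(`Message not found: ${id}`);
  }
  return messages.slice(0, index + 1);
}

// Illustrative history in which the last tool call failed.
const log: ChatMessage[] = [
  { id: "m1", role: "user", content: "Interview three coffee drinkers." },
  { id: "m2", role: "assistant", content: "Starting interview 1." },
  { id: "m3", role: "tool", content: "ERROR: persona service timed out" },
];

// Drop the failed tool result and resume from the last good message.
const recovered = rewindTo(log, "m2");
```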
Insight 2: Reasoning-Execution Separation
How humans make decisions:
- Understand intent: "What am I trying to achieve?" → Clarify goals
- Choose method: "How do I do it?" → Execution steps
AI Agents should follow the same pattern:
Why separate?
- Reasoning needs deep thinking (use Claude Sonnet 4)
- Execution needs fast response (can use smaller models)
- Separation of concerns, single responsibility
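One way to express this split is a small routing table that assigns a model tier to each phase. The phase names and model identifiers below are placeholders, not real model IDs or our production configuration.

```typescript
type Phase = "reasoning" | "execution";

interface PhaseConfig {
  model: string; // placeholder identifier, not a real model ID
  purpose: string;
}

// Each phase gets its own model tier and a single responsibility.
const phaseConfig: Record<Phase, PhaseConfig> = {
  reasoning: { model: "large-model", purpose: "clarify goals and choose a method" },
  execution: { model: "small-fast-model", purpose: "carry out the chosen steps" },
};

function modelFor(phase: Phase): string {
  return phaseConfig[phase].model;
}
```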
Insight 3: Simple Over Precise
Facing the "AI forgetfulness" problem, we could:
Option A: Vector DB + Semantic Search
- ✅ Precise retrieval
- ❌ Requires embedding, indexing, complex queries
- ❌ High maintenance cost
Option B: Markdown Files + Full Loading
- ✅ Simple, transparent, user-editable
- ✅ Leverages large context windows (Claude 200K tokens)
- ✅ Easier to debug and understand
We chose Option B.
Why?
- Context windows changed the game: User memory is typically under 10K characters (~3K tokens), so full loading is perfectly viable
- Simple solutions are more reliable: No embedding inconsistency, no retrieval failures
- User control: Memory is transparent, users can view and edit
Four Design Principles
From these three insights, we distilled the core principles of our architecture:
1. Messages as Source of Truth
- All important information lives in messages
- Database only stores derived state (like reports, study logs)
- Similar to Event Sourcing: messages are the event log
2. Configuration over Code
- Use configuration to express differences
- Use code to express commonalities
- Avoid over-abstraction
3. AI as State Manager
- Let AI manage state transitions
- Don't hand-write complex state machines
- Adapt to LLM's capability boundaries
4. Simple, Transparent, Controllable
- Simple beats complex
- Transparent beats black box
- User control beats AI automation
III. Step 1: Message-Driven Architecture
v2.2.0 - 2025-12-27
Problem: Dual Source of Truth
Initially, research data was scattered across three places:
Generating reports required stitching from three places:
Problems:
- Data inconsistency: `interviews.conclusion` and interview content in messages could diverge
- Partial failures: When tool calls fail, data is half-saved, hard to trace full context
- Hard to extend: Adding `discussionChat` requires new table, new tool, new queries
Even worse, tool outputs were inconsistent:
Agents couldn't handle this uniformly, leading to complex code.
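To illustrate the kind of inconsistency we mean, here is a sketch with three hypothetical result shapes. The field names are invented for illustration; the point is that every consumer needs one branch per tool.

```typescript
// Hypothetical result shapes, one per tool.
type InterviewResult = { kind: "interview"; conclusion: string };
type ScoutResult = { kind: "scout"; findings: string[] };
type SearchResult = { kind: "search"; data: { summary: string } };

type ToolResult = InterviewResult | ScoutResult | SearchResult;

// Every consumer has to know each tool's shape, and grows with each new tool.
function toText(result: ToolResult): string {
  switch (result.kind) {
    case "interview":
      return result.conclusion;
    case "scout":
      return result.findings.join("\n");
    case "search":
      return result.data.summary;
  }
}
```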
Solution: Messages as Single Source
Core idea: All research content flows into the message stream. Database only stores derived state.
Key changes:
- Removed 5 specialized save tools
  - Deleted: `saveInterview`, `saveDiscussion`, `saveScoutTask`, ...
  - Reason: Agents output directly to messages, no explicit save needed
- Unified tool output format
  - All research tools return `plainText`
  - Agents can uniformly process all tool results
- Generate studyLog on demand
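A minimal sketch of the last two changes, assuming a hypothetical `ToolOutput` shape with a single `plainText` field and a `summarize` function standing in for the LLM call. Neither is our actual implementation.

```typescript
type ChatMessage = { role: "user" | "assistant" | "tool"; content: string };

// Assumed unified shape: every research tool returns the same field.
interface ToolOutput {
  plainText: string;
}

// Tool results go straight into the message stream; there is no separate save step.
function appendToolResult(messages: ChatMessage[], output: ToolOutput): ChatMessage[] {
  return [...messages, { role: "tool", content: output.plainText }];
}

// `summarize` stands in for an LLM call and is injected so the sketch stays runnable.
type Summarize = (transcript: string) => string;

// The studyLog is derived from messages when needed, never stored as primary data.
function generateStudyLog(messages: ChatMessage[], summarize: Summarize): string {
  const transcript = messages.map((m) => `[${m.role}] ${m.content}`).join("\n");
  return summarize(transcript);
}
```

Because the studyLog is a pure function of the message history, it can be regenerated at any time, including after a partial failure.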
Why This Design?
Reasoning from first principles:
- Conversation as context
  - LLMs need complete context to generate reports
  - Message history is naturally the most complete, most natural context
  - Avoids complexity of "reconstructing context from DB"
- LLMs excel at extraction
  - Generating structured content (studyLog) from conversations is LLM's strength
  - More flexible and reliable than hand-written parsing logic
- Shadow of Event Sourcing
  - Message sequence = event log
  - studyLog, report = derived state
  - Can be replayed and regenerated anytime
Comparison with other approaches:
| Approach | Pros | Cons | Why not chosen |
|---|---|---|---|
| Messages as source | Data consistent, easy to extend | Requires extra LLM call to generate studyLog | ✅ Our choice |
| Traditional state management | Precise control | Complex state sync, hard to trace | Doesn't suit LLM non-determinism |
| Remove DB entirely | Extremely simple | Frontend queries difficult, history hard to manage | Need structured display |
| Event Sourcing | Complete history, replayable | High engineering complexity | Over-engineered for current scale |
Impact
Code simplification:
Development efficiency:
Before:
After:
Cost trade-offs:
✅ Benefits:
- Simplified architecture: deleted 5 tools, simplified 28 files
- Data consistency: full context traceable even on failures
- Easy extension: adding new research methods goes from 12 steps → 3 steps
❌ Costs:
- studyLog generation requires extra LLM call (~2K tokens, ~$0.002)
- Slightly higher token consumption for long conversations
✅ Mitigation:
- Prompt cache reduces repeated token cost by 90%
- Architectural benefits far outweigh costs
III. Step 2: Intent Clarification + Unified Execution
v2.3.0 - 2026-01-06
Problem 1: Vague Requirements → Inefficient Dialogue
After implementing message-driven architecture, adding features became simpler. But user experience wasn't good enough.
When creating research, users often say:
"Want to understand young people's coffee preferences"
This isn't specific enough:
- Which young people? 18-22 college students? Or 23-28 young professionals?
- What method? In-depth interviews? Group discussions? Or social media observation?
- What output? User personas? Market insights? Or product recommendations?
Traditional approach: AI asks multiple questions
Problems:
- Requires 3-5 conversation rounds
- Poor user experience (feels like filling forms)
- AI can't proactively suggest best approaches
Problem 2: 95% Duplicate Code
While adding features became simpler, we discovered a bigger technical debt:
Three nearly identical agent wrappers, totaling 1,211 lines.
Code duplication mainly in:
- Message loading and processing (~80 lines each)
- File attachment handling (~60 lines each)
- MCP integration (~40 lines each)
- Token tracking (~50 lines each)
- Notification sending (~30 lines each)
Every new feature (like webhook integration) required changing all three places.
Solution: Plan Mode + baseAgentRequest
Our solution has two parts:
Part 1: Plan Mode (Intent Clarification Layer)
A separate agent dedicated to intent clarification:
Workflow:
Key design:
- Plan Mode's decisions are recorded in messages
- Study Agent infers intent from messages, no explicit passing needed
- Avoids complexity of context passing
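A sketch of this handoff, with a hypothetical message shape and helper names. In the real system the Study Agent's model reads the history; here a plain prompt builder stands in for that step.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Plan Mode records its decision as an ordinary assistant message.
function recordPlan(messages: ChatMessage[], plan: string): ChatMessage[] {
  return [...messages, { role: "assistant", content: `Confirmed plan: ${plan}` }];
}

// The Study Agent takes only the message history. There is no separate
// `intent` argument: the plan is already part of the context it reads.
function buildStudyAgentPrompt(messages: ChatMessage[]): string {
  const transcript = messages.map((m) => `${m.role}: ${m.content}`).join("\n");
  return `Continue the research based on this conversation:\n${transcript}`;
}

const afterPlanning = recordPlan(
  [{ role: "user", content: "Want to understand young people's coffee preferences" }],
  "in-depth interviews with college students aged 18-22, output user personas",
);
const studyPrompt = buildStudyAgentPrompt(afterPlanning);
```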
Part 2: baseAgentRequest (Unified Executor)
Merge three duplicate agent wrappers into one generic executor:
Agent routing:
Each agent only needs to define configuration:
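A simplified sketch of the idea. The `AgentConfig` fields, the agent names in the routing table, and the synchronous `runModel` stand-in are assumptions for illustration; the real `baseAgentRequest` also handles message loading, attachments, MCP, token tracking, and notifications.

```typescript
// Assumed configuration shape: only what differs between agents.
interface AgentConfig {
  systemPrompt: string;
  tools: string[];
}

// Hypothetical routing table from agent name to configuration.
const agentConfigs: Record<string, AgentConfig> = {
  plan: { systemPrompt: "Clarify the user's research intent.", tools: [] },
  study: {
    systemPrompt: "Execute the confirmed research plan.",
    tools: ["interviewChat", "discussionChat"],
  },
};

// Stand-in for the model call, injected so the sketch runs without a provider.
type RunModel = (systemPrompt: string, tools: string[], input: string) => string;

// Shared logic lives here once; agents differ only by configuration.
function baseAgentRequest(agent: string, input: string, runModel: RunModel): string {
  const config = agentConfigs[agent];
  if (config === undefined) {
    throw new Error(`Unknown agent: ${agent}`);
  }
  return runModel(config.systemPrompt, config.tools, input);
}
```

Adding a new agent in this sketch means adding one entry to `agentConfigs`, with no change to the executor.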
Why This Design?
Reasoning-execution separation rationale:
- Matches cognitive model
  - Human decision-making: first figure out "what to do", then consider "how to do it"
  - System 1 (intuition) vs System 2 (reasoning)
  - Plan Mode = System 2, Study Agent = System 1
- Single responsibility
  - Plan Mode: focuses on intent understanding, doesn't need to know execution details
  - Study Agent: focuses on research execution, doesn't need to handle clarification
  - Each is simpler and easier to maintain
- Messages as protocol
  - Plan Mode's decisions → messages
  - Study Agent reads intent from messages
  - Loosely coupled without losing context
Unified executor rationale:
- Extract, Don't Rebuild
  - Extract common patterns from three similar implementations
  - Not designing abstraction layer from scratch
- Configuration over Inheritance
  - Agent differences expressed through configuration
  - No inheritance or polymorphism
- Plugin-based Lifecycle
  - `customPrepareStep`: dynamic tool control
  - `customOnStepFinish`: custom post-processing
  - Preserve extension points, don't hard-code all logic
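The two hooks might be wired into a step loop like this. Only the hook names `customPrepareStep` and `customOnStepFinish` come from our code; their signatures, the `StepResult` shape, and the `runStep` stand-in are assumptions.

```typescript
interface StepResult {
  text: string;
  done: boolean;
}

interface LifecycleHooks {
  // Decide which tools are available for the upcoming step.
  customPrepareStep?: (stepIndex: number, allTools: string[]) => string[];
  // Observe the result of each step (logging, persistence, notifications).
  customOnStepFinish?: (stepIndex: number, result: StepResult) => void;
}

// Stand-in for one model step, injected so the sketch runs without a provider.
type RunStep = (stepIndex: number, activeTools: string[]) => StepResult;

function runAgentLoop(
  allTools: string[],
  maxSteps: number,
  runStep: RunStep,
  hooks: LifecycleHooks = {},
): StepResult[] {
  const results: StepResult[] = [];
  for (let i = 0; i < maxSteps; i++) {
    // Without a hook, every tool stays available.
    const activeTools = hooks.customPrepareStep?.(i, allTools) ?? allTools;
    const result = runStep(i, activeTools);
    hooks.customOnStepFinish?.(i, result);
    results.push(result);
    if (result.done) break;
  }
  return results;
}
```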
Comparison with other approaches:
| Approach | Pros | Cons | Why not chosen |
|---|---|---|---|
| Plan Mode + baseAgentRequest | Remove duplicate code, separate reasoning-execution | One more abstraction layer | ✅ Our choice |
| Continue copy-pasting | Simple and direct | Tech debt accumulates, hard to maintain | Unsustainable long-term |
| Fully generic agent | Least code | Sacrifices specialization and control | Can't handle business differences |
| Microservices split | Independent deployment | Over-engineered, adds ops complexity | Unnecessary at current scale |
Impact
Code complexity:
But more importantly:
- Cyclomatic Complexity: 12.3 → 6.7 (45% reduction)
- Code duplication: 95% → 0%
Development efficiency:
Before:
After:
User experience:
Before:
After:
Intent clarification: 3-5 conversation rounds → 1 confirmation
III. Step 3: Persistent Memory
v2.3.0 - 2026-01-08
Problem: AI "Amnesia"
With intent clarification and unified architecture, the research workflow was smooth. But long-term users reported a problem:
"Why does the AI ask me what industry I'm in every single time?"
The AI doesn't remember users. Every conversation feels like the first meeting:
- "What industry are you in?"
- "Which dimensions do you care about?"
- "What's your research goal?"
Users feel the AI is "forgetful," and the experience lacks personalization.
Root cause:
LLMs are stateless. Each conversation:
Although we have historical conversations in the DB:
- Cross-conversation info lost: Each research is an independent session
- Important info buried: Key information in long conversations is hard to extract
- No persistent memory: No long-term memory of "who the user is"
Solution: Two-Tier Memory Architecture
We needed a persistent memory system. But how should it be designed?
Inspired by Anthropic's CLAUDE.md approach:
- Simple Markdown files
- User-viewable and editable
- Fully loaded into context
We adopted a similar approach but added automatic update mechanisms.
Data Model
Two-tier architecture:
- Core Memory (core)
  - Markdown format, human-readable
  - Long-term stable user information
  - Example:
- Working Memory (working)
  - JSON format, structured
  - New information to be consolidated
  - Example:
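A sketch of how the two tiers might be typed. Only the tier names `core` and `working` come from our design; the other field names and all sample content are hypothetical.

```typescript
// One pending fact extracted from a conversation, not yet consolidated.
interface WorkingMemoryItem {
  content: string;
  createdAt: string; // ISO 8601 timestamp
}

interface UserMemory {
  core: string; // Markdown, loaded into context in full
  working: WorkingMemoryItem[]; // structured, waiting to be merged into core
}

// Illustrative values only.
const exampleMemory: UserMemory = {
  core: "# About the user\n- Industry: specialty coffee retail\n- Prefers in-depth interviews",
  working: [
    { content: "Now also interested in tea drinkers", createdAt: "2026-01-08T09:00:00Z" },
  ],
};
```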
Automatic Update Mechanism
Two-stage update:
Memory Update Agent (Haiku 4.5):
- Extract new user information from conversations
- Low cost (~$0.001/time)
- Runs in background after each conversation
Memory Reorganize Agent (Sonnet 4.5):
- Consolidate working memory into core memory
- Remove redundancy, merge similar information
- Slightly higher cost (~$0.02/time), but infrequently triggered
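The two stages could be sketched as follows. The `extractFacts` and `consolidate` parameters stand in for the two model calls and are injected; the threshold-based trigger is an assumption, since the description above only says that reorganization runs infrequently. Working memory is simplified to plain strings here.

```typescript
interface UserMemory {
  core: string; // Markdown
  working: string[]; // pending facts, simplified to plain strings
}

type ExtractFacts = (conversation: string) => string[]; // stand-in for the update agent
type Consolidate = (core: string, facts: string[]) => string; // stand-in for the reorganize agent

// Stage 1: after each conversation, append newly extracted facts to working memory.
function updateMemory(
  memory: UserMemory,
  conversation: string,
  extractFacts: ExtractFacts,
): UserMemory {
  return { ...memory, working: [...memory.working, ...extractFacts(conversation)] };
}

// Stage 2: once enough facts accumulate, merge them into core and clear working memory.
// The threshold is a hypothetical trigger.
function reorganizeIfNeeded(
  memory: UserMemory,
  threshold: number,
  consolidate: Consolidate,
): UserMemory {
  if (memory.working.length < threshold) {
    return memory;
  }
  return { core: consolidate(memory.core, memory.working), working: [] };
}
```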
Integration into Conversation Flow
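A sketch of the loading step, assuming memory is appended to the system prompt at the start of each conversation. The section headings and the `buildSystemPrompt` name are hypothetical.

```typescript
interface UserMemory {
  core: string;
  working: string[];
}

// Load the whole memory into the system prompt: no retrieval, no ranking.
function buildSystemPrompt(basePrompt: string, memory: UserMemory | null): string {
  if (memory === null || (memory.core === "" && memory.working.length === 0)) {
    return basePrompt;
  }
  const sections = [basePrompt, "## User memory", memory.core];
  if (memory.working.length > 0) {
    sections.push("## Recent notes", ...memory.working.map((w) => `- ${w}`));
  }
  // Skip empty sections (for example, an empty core with pending notes).
  return sections.filter((s) => s !== "").join("\n\n");
}
```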
Why This Design?
Why Markdown over Vector DB?
- Context window is large enough
  - Claude 3.5 Sonnet: 200K tokens
  - User memory typically < 10K characters (~3K tokens)
  - Full loading is simpler and more accurate than retrieval
- Simple and transparent
  - Markdown is user-readable and editable
  - No embeddings, no vector search, no complex indexing
  - Aligns with Anthropic's philosophy: user control
- Avoid premature optimization
  - Don't need real-time retrieval (low conversation frequency)
  - Don't need precise matching (full text provides enough context)
  - Start with simple solution, optimize when necessary
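The sizing argument can be checked with simple arithmetic. The ratio below is taken from the figures above (10K characters ≈ 3K tokens) and is only a rough estimate, since real tokenizers vary by language and content. The 5% budget is an arbitrary illustrative limit.

```typescript
const CONTEXT_WINDOW_TOKENS = 200000;

// Rough estimate: about 3 tokens per 10 characters, per the figures above.
function estimateTokens(text: string): number {
  return Math.ceil((text.length * 3) / 10);
}

// Returns true when the memory occupies no more than `maxShare` of the context window.
function fitsInContext(memory: string, maxShare = 0.05): boolean {
  return estimateTokens(memory) <= CONTEXT_WINDOW_TOKENS * maxShare;
}

// A 10K-character memory is roughly 3K tokens, about 1.5% of a 200K window.
const tenThousandChars = "x".repeat(10000);
const estimated = estimateTokens(tenThousandChars);
```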
Comparison with mainstream approaches:
| Approach | Storage | Control | Retrieval | atypica choice rationale |
|---|---|---|---|---|
| Anthropic (CLAUDE.md) | File-based | User-driven | Full loading | ✅ Simple, transparent, effective with large context |
| OpenAI | Vector DB (speculated) | AI + user confirmation | Semantic retrieval | ❌ Black box, weak user control |
| Mem0 | Vector + Graph + KV | AI-driven | Hybrid retrieval | ❌ Over-engineered, high maintenance cost |
| MemGPT | OS-inspired tiered | AI self-managed | Tiered retrieval | ❌ Conceptually complex, utility unproven |
We chose Anthropic's simple approach because:
- Fits current scale (personal assistant, not enterprise knowledge base)
- User controllable (transparent, editable)
- As context windows grow, this approach becomes better
Impact
User experience:
Before:
After:
System cost:
Response time:
Low cost, fast response, completely acceptable.
IV. Architecture Comparison: Our Unique Choices
Now let's step back and see how atypica's architecture differs from mainstream AI Agent frameworks.
State Management: Messages vs Memory Classes
| atypica | LangChain | Core Difference |
|---|---|---|
| Messages as source | ConversationBufferMemory | We believe conversation history is the best state |
| Generate studyLog on demand | Pre-compute summary | Avoid sync issues, traceable on failures |
| DB stores derived state | DB stores core state | Similar to Event Sourcing |
Why different?
LangChain's design is influenced by traditional software, believing "state should be explicitly stored and managed."
We believe, for LLMs:
- Conversation history = complete state
- Derived state (studyLog) can be regenerated
- Simpler, more fault-tolerant
Agent Architecture: Configuration vs Graph
| atypica | LangGraph | Core Difference |
|---|---|---|
| Configuration-driven | Graph-driven | We use configuration to express differences, code for commonalities |
| Single executor | Node orchestration | Avoid over-abstraction, good enough is enough |
| Messages as protocol | Explicit node communication | Loosely coupled without losing context |
Why different?
LangGraph pursues generality, using graph orchestration to express arbitrarily complex flows.
We believe, for our scenarios:
- Configuration-driven is simpler: 99% of needs can be met with configuration
- Single executor is sufficient: Don't need graph orchestration's flexibility
- Simpler is more reliable: Fewer abstraction layers, easier to debug
Memory System: Markdown vs Vector DB
| atypica | Mem0 | Core Difference |
|---|---|---|
| Markdown files | Vector + Graph + KV | We choose simple and transparent over precise and complex |
| Full loading | Semantic retrieval | When context window is large enough, full text is better |
| User-editable | AI black box | User trust comes from transparency |
Why different?
Mem0 pursues precise retrieval, using multiple databases in hybrid.
We believe, for personal assistants:
- Simple solution is enough: User memory is typically under 10K characters
- Transparent beats precise: Users can view and edit memory
- Gets better as context grows: At 1M tokens in the future, this approach will crush Vector DB
Core Philosophy Differences
atypica's choices:
- Simple, transparent, controllable
- Adapt to LLM characteristics (large context, non-determinism)
- Start from real pain points, not pursuing architectural perfection
Mainstream frameworks' choices:
- Precise, complex, automatic
- Port traditional software engineering patterns
- Pursue generality and flexibility
Who's right or wrong?
Neither is wrong. It's just:
- Our scenario (personal research assistant) suits simple approaches better
- As context windows grow, simple approaches become better
- User trust comes from transparency, not AI magic
V. Quantitative Impact
Specific impact from three evolutions:
Code Complexity
Development Efficiency
| Task | Before | After | Improvement |
|---|---|---|---|
| Add new research method | 12 files, 2-3 days | 3 files, 2-3 hours | 10x |
| Add new capability (MCP) | Modify 3 places, 1 day | Modify 1 place, 2 hours | 4x |
| Fix bug | Change 3 agents | Change 1 base | 3x |
System Performance
The cost and performance impact is negligible.
User Experience
VI. Lessons Learned
What did we learn from three evolutions?
What We Did Right
1. Incremental refactoring, not big bang
We didn't rewrite the entire system at once. Three evolutions, each step:
- Delivers value independently
- Maintains backward compatibility (keeping `analyst.studySummary` field)
- Can be rolled back
This let us quickly validate ideas and reduce risk.
2. Start from real pain points
We didn't pursue architectural perfection. Each step answered a concrete problem:
- Message-driven: because adding `discussionChat` was too complex
- Unified execution: because there was too much duplicate code
- Persistent memory: because users reported AI forgetfulness
Let problems drive design, not design drive problems.
3. Embrace LLM characteristics
Don't treat LLMs as traditional software:
- Don't hand-write state machines, let AI infer state from conversations
- Leverage large context windows, rather than pursuing precise retrieval
- Let AI generate studyLog, rather than hand-writing parsers
Adapt to LLM's capability boundaries, rather than fighting them.
Costs We Paid
1. Learning curve for abstraction layer
Modifying baseAgentRequest requires understanding:
- 6 phases of execution flow
- Timing of `customPrepareStep` and `customOnStepFinish`
- Generic constraints and type inference
But: clear interfaces and documentation lowered the barrier.
2. Cost of on-demand generation
studyLog generation requires LLM call (~$0.002/time).
But:
- Prompt cache reduces cost by 90%
- Architectural benefits >> small cost
- Acceptable
3. Limitations of simple solutions
Markdown memory isn't suitable for:
- Large-scale knowledge bases (> 100K tokens)
- Complex relational queries
- Multi-dimensional retrieval
But:
- Good enough for personal assistant scenarios
- Can upgrade to Vector DB in the future
- Solve 80% of problems first
Unexpected Benefits
1. Confidence from type safety
During refactoring, the compiler caught 99% of issues.
2. Flexibility of configuration-driven
Adding webhook integration only requires:
All agents automatically gain new capability, no config changes needed.
3. Power of messages as protocol
Plan Mode and Study Agent communicate through messages:
- Decoupled: can be modified independently
- Without losing context: complete decision process in messages
- Traceable: can replay when problems occur
This was an unexpected benefit.
VII. Future Directions
Three evolutions brought atypica closer to general-purpose agents. But there's more to do.
Short-term (3-6 months)
1. Skills Library
- Further modularize tools
- Users can compose their own agents
- Like GPTs, but more flexible
2. Multi-Agent Collaboration
- Not just serial execution
- Parallel research, cross-validation
- Like AutoGPT, but more controllable
Long-term (1-2 years)
3. Evolve toward GEA
- GEA = General Execution Architecture
- Not just research agents, but a universal AI Agent execution framework
- Can run any type of agent
4. Self-Improving Agents
- Agents learn from past executions
- Continuously optimize prompts and strategies
- Get smarter with use
Unchanging Principles
No matter how we evolve, we stick to:
- Simple beats complex
- Transparent beats black box
- User control beats AI automation
VIII. Conclusion
Building AI Agent systems is not a simple extension of traditional software engineering.
We need to rethink:
- What is state? (Conversation history)
- What is an interface? (Message protocol)
- What is control flow? (AI reasoning)
atypica's three evolutions are essentially three cognitive upgrades:
- From database thinking → data flow thinking
  - Don't maintain explicit state, infer state from messages
- From code reuse → configuration-driven
  - Don't pursue perfect abstraction, use configuration to express differences
- From stateless → memory-enhanced
  - Don't rely on precise retrieval, use simple and transparent methods
These choices may not be the most "advanced."
But they are:
- Simple: easy to understand, easy to debug
- Transparent: users know what AI is doing
- Controllable: users can intervene and adjust
- Good enough: solve 80% of problems
And this, perhaps, is the key to building reliable AI systems.