We needed to add a new feature: group discussions (discussionChat).
This should have been simple. We already had interviewChat—one-on-one conversations where users engage deeply with AI-simulated personas. Group discussion was just scaling from 1-to-1 to 1-to-many: 3-8 personas engaging simultaneously, perspectives colliding, insights emerging.
In theory, we just needed to:
The reality: We had to modify 12 files.
Worse, we discovered this:
Three nearly identical agent wrappers. Every new feature required copy-pasting across all three. Every bug fix meant changing it three times.
At that moment, we realized something was fundamentally wrong.
Not that our code was inelegant, or that we lacked abstraction. The problem was that we were building AI Agent systems with traditional software engineering thinking.
This article chronicles how we escaped this trap—through three architectural evolutions, rethinking how AI Agents should be built from first principles.
Before refactoring, we stopped to ask a fundamental question:
What's the essential difference between AI Agents and traditional software?
Traditional software is built on state machines:
This model's core assumptions:
This works beautifully for traditional software. But for AI Agents?
LLMs don't work this way:
Where's the "state" here?
There is no `state` field. The AI infers everything from conversation history:
This is a completely different paradigm.
From this observation, we derived three insights that shaped our architectural evolution.
Traditional approach: Maintain explicit state
AI-native approach: Infer state from conversation
Why is conversation superior to state machines?
How humans make decisions:
AI Agents should follow the same pattern:
Why separate?
Facing the "AI forgetfulness" problem, we could:
Option A: Vector DB + Semantic Search
Option B: Markdown Files + Full Loading
We chose Option B.
Why?
From these three insights, we distilled the core principles of our architecture:
1. Messages as Source of Truth
2. Configuration over Code
3. AI as State Manager
4. Simple, Transparent, Controllable
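The first and third principles can be made concrete with a small sketch. This is a hypothetical illustration (tool names like `interviewChat` and `generateReport` follow the article; `deriveStage` is invented for this example): there is no stored state field, and the current stage is a pure function of the message stream.

```typescript
// Messages as source of truth: no state field is stored anywhere.
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "assistant"; toolCalls: { name: string }[] }
  | { role: "tool"; content: string };

type Stage = "planning" | "researching" | "reporting";

// The current stage is derived from conversation history on demand,
// so it can never drift out of sync with the conversation.
function deriveStage(messages: Message[]): Stage {
  const calls = messages.flatMap((m) =>
    "toolCalls" in m ? m.toolCalls.map((c) => c.name) : []
  );
  if (calls.includes("generateReport")) return "reporting";
  if (calls.some((n) => n === "interviewChat" || n === "discussionChat")) return "researching";
  return "planning";
}

const history: Message[] = [
  { role: "user", content: "Want to understand young people's coffee preferences" },
  { role: "assistant", toolCalls: [{ name: "interviewChat" }] },
  { role: "tool", content: "Interview summary: ..." },
];

console.log(deriveStage(history)); // → researching
```

There is nothing to synchronize: delete the derived value and it can always be recomputed from the messages.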
v2.2.0 - 2025-12-27
Initially, research data was scattered across three places:
Generating reports required stitching from three places:
Problems:
- `interviews.conclusion` and interview content in messages could diverge
- `discussionChat` requires a new table, a new tool, new queries

Even worse, tool outputs were inconsistent:
Agents couldn't handle this uniformly, leading to complex code.
Core idea: All research content flows into the message stream. Database only stores derived state.
Key changes:
- Removed 5 specialized save tools (`saveInterview`, `saveDiscussion`, `saveScoutTask`, ...)
- Unified tool output format (`plainText`)
- Generate studyLog on demand
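A sketch of what the unified output format enables: heterogeneous tool outputs normalized to a required `plainText` field that every agent can consume the same way. The `ResearchToolResult` shape is from the article; the `toResearchResult` helper is hypothetical.

```typescript
// Unified tool result: every research tool must return a human-readable
// plainText, plus optional structured fields (IDs, insights, ...).
interface ResearchToolResult {
  plainText: string;        // human-readable summary, required
  [key: string]: unknown;   // optional structured data
}

// Hypothetical normalizer: a legacy tool that returned only a DB reference
// gets a readable fallback, so agents never need a second query to see content.
function toResearchResult(raw: { plainText?: string; [k: string]: unknown }): ResearchToolResult {
  return {
    ...raw,
    plainText: raw.plainText ?? JSON.stringify(raw), // keep output readable either way
  };
}

const unified = toResearchResult({ interviewId: 123 });
console.log(unified.plainText); // → {"interviewId":123}
```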
Reasoning from first principles:
Conversation as context
LLMs excel at extraction
Shadow of Event Sourcing
Comparison with other approaches:
| Approach | Pros | Cons | Verdict |
|---|---|---|---|
| Messages as source | Data consistent, easy to extend | Requires extra LLM call to generate studyLog | ✅ Our choice |
| Traditional state management | Precise control | Complex state sync, hard to trace | Doesn't suit LLM non-determinism |
| Remove DB entirely | Extremely simple | Frontend queries difficult, history hard to manage | Need structured display |
| Event Sourcing | Complete history, replayable | High engineering complexity | Over-engineered for current scale |
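The "requires extra LLM call" cost in the first row is bounded by caching: the studyLog is generated once from messages, then persisted. A minimal sketch, with `summarize` standing in for the real LLM call and an in-memory `Map` standing in for the database:

```typescript
// On-demand generation with caching: derive the studyLog from messages
// only when it is missing, then persist it.
type Analyst = { id: number; studyLog: string | null };

const db = new Map<number, Analyst>([[1, { id: 1, studyLog: null }]]);
let llmCalls = 0;

// Stand-in for the real LLM summarization call (~$0.002 in production).
function summarize(messages: string[]): string {
  llmCalls++;
  return `Study log derived from ${messages.length} messages`;
}

function getStudyLog(id: number, messages: string[]): string {
  const analyst = db.get(id);
  if (!analyst) throw new Error(`unknown analyst ${id}`);
  if (!analyst.studyLog) {
    analyst.studyLog = summarize(messages); // generate once, from messages
  }
  return analyst.studyLog; // cached thereafter
}

console.log(getStudyLog(1, ["m1", "m2"])); // → Study log derived from 2 messages
getStudyLog(1, ["m1", "m2"]);
console.log(llmCalls); // → 1
```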
Code simplification:
Development efficiency:
Before:
After:
Cost trade-offs:
✅ Benefits:
❌ Costs:
✅ Mitigation:
v2.3.0 - 2026-01-06
After implementing message-driven architecture, adding features became simpler. But user experience wasn't good enough.
When creating research, users often say:
"Want to understand young people's coffee preferences"
This isn't specific enough:
Traditional approach: AI asks multiple questions
Problems:
While adding features became simpler, we discovered a bigger technical debt:
Three nearly identical agent wrappers, totaling 1,211 lines.
Code duplication mainly in:
Every new feature (like webhook integration) required changing all three places.
Our solution has two parts:
A separate agent dedicated to intent clarification:
Workflow:
Key design:
Merge three duplicate agent wrappers into one generic executor:
Agent routing:
Each agent only needs to define configuration:
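A hypothetical sketch of such a config, modeled on the Plan Mode agent described above. Tool behavior is stubbed out; only the shape matters — everything that differs between agents lives in the config, everything shared lives in the executor.

```typescript
// Minimal model of a per-agent configuration consumed by one shared executor.
type ToolFn = (input: string) => string;

interface AgentConfig {
  model: string;
  systemPrompt: string;
  tools: Record<string, ToolFn>;
  maxSteps?: number;
}

// Hypothetical Plan Mode config: clarify intent, then present one plan.
function createPlanModeAgentConfig(): AgentConfig {
  return {
    model: "claude-sonnet-4-5",
    systemPrompt: "You clarify research intent before execution...",
    tools: {
      requestInteraction: (q) => `ask user: ${q}`,   // interact with user
      makeStudyPlan: (goal) => `plan for: ${goal}`,  // display plan, one-click confirm
    },
    maxSteps: 5, // clarification should finish within 5 steps
  };
}

const config = createPlanModeAgentConfig();
console.log(Object.keys(config.tools)); // → [ 'requestInteraction', 'makeStudyPlan' ]
```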
Reasoning-execution separation rationale:
Matches cognitive model
Single responsibility
Messages as protocol
Unified executor rationale:
Extract, Don't Rebuild
Configuration over Inheritance
Plugin-based Lifecycle
- `customPrepareStep`: dynamic tool control
- `customOnStepFinish`: custom post-processing

Comparison with other approaches:
| Approach | Pros | Cons | Verdict |
|---|---|---|---|
| Plan Mode + baseAgentRequest | Remove duplicate code, separate reasoning-execution | One more abstraction layer | ✅ Our choice |
| Continue copy-pasting | Simple and direct | Tech debt accumulates, hard to maintain | Unsustainable long-term |
| Fully generic agent | Least code | Sacrifices specialization and control | Can't handle business differences |
| Microservices split | Independent deployment | Over-engineered, adds ops complexity | Unnecessary at current scale |
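The executor-plus-hooks idea can be sketched in a few lines. This is a simplified, synchronous model of the lifecycle (the real executor streams LLM output); the hook names `customPrepareStep` and `customOnStepFinish` follow the article, everything else is illustrative.

```typescript
// One generic executor; per-agent behavior enters only through config and hooks.
interface ExecutorConfig {
  tools: string[];
  maxSteps: number;
  customPrepareStep?: (step: number, tools: string[]) => string[]; // narrow the tool set
  customOnStepFinish?: (step: number) => void;                     // post-processing
}

// Each "step" here just records which tools were active.
function runAgent(config: ExecutorConfig): string[][] {
  const trace: string[][] = [];
  for (let step = 0; step < config.maxSteps; step++) {
    const active = config.customPrepareStep?.(step, config.tools) ?? config.tools;
    trace.push(active);
    config.customOnStepFinish?.(step);
  }
  return trace;
}

const finished: number[] = [];
const trace = runAgent({
  tools: ["webSearch", "interviewChat", "generateReport"],
  maxSteps: 2,
  // After the first step, pretend a report was generated and restrict tools.
  customPrepareStep: (step, tools) => (step > 0 ? ["generateReport"] : tools),
  customOnStepFinish: (step) => finished.push(step),
});
console.log(trace); // → [ [ 'webSearch', 'interviewChat', 'generateReport' ], [ 'generateReport' ] ]
```

The executor never branches on agent identity; adding a capability to the loop (like the webhook example later in the article) reaches every agent automatically.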
Code complexity:
But more importantly:
Development efficiency:
Before:
After:
User experience:
Before:
After:
Intent clarification: 3-5 conversation rounds → 1 confirmation
v2.3.0 - 2026-01-08
With intent clarification and unified architecture, the research workflow was smooth. But long-term users reported a problem:
"Why does the AI ask me what industry I'm in every single time?"
The AI doesn't remember users. Every conversation feels like the first meeting:
Users feel the AI is "forgetful," and the experience lacks personalization.
Root cause:
LLMs are stateless. Each conversation:
Although we have historical conversations in the DB:
We need a persistent memory system. But how to design it?
Inspired by Anthropic's CLAUDE.md approach:
We adopted a similar approach but added automatic update mechanisms.
Two-tier architecture:
Core Memory (core)
Working Memory (working)
Two-stage update:
Memory Update Agent (Haiku 4.5):
Memory Reorganize Agent (Sonnet 4.5):
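The two-stage flow can be sketched as a pure function. The thresholds (8,000 characters of core memory, 20 working-memory items) follow the article; the reorganize pass is stubbed out where Sonnet would run, and the cheap extraction stands in for Haiku.

```typescript
// Two-tier memory: consolidated Markdown core + raw working items.
interface Memory {
  core: string;                                 // consolidated Markdown
  working: { info: string; source: string }[];  // items awaiting consolidation
}

// Stage gate: reorganize only when thresholds from the article are exceeded.
function needsReorganize(m: Memory): boolean {
  return m.core.length > 8000 || m.working.length > 20;
}

function updateMemory(m: Memory, info: string, source: string): Memory {
  // Stage 1 (Sonnet in production): consolidate working items into core.
  const base: Memory = needsReorganize(m)
    ? { core: m.core /* reorganized by the stronger model */, working: [] }
    : m;
  // Stage 2 (Haiku in production): cheap extraction appends a new item.
  return { ...base, working: [...base.working, { info, source }] };
}

const m1 = updateMemory(
  { core: "# User Information", working: [] },
  "User recently focused on coffee market",
  "chat_123"
);
console.log(m1.working.length); // → 1
```

Running the expensive model only past a threshold is what keeps the average cost near the Haiku price.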
Why Markdown over Vector DB?
Context window is large enough
Simple and transparent
Avoid premature optimization
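Full loading in code: read the Markdown memory, wrap it in a `<UserMemory>` tag, and prepend it to the conversation. The tag convention follows the article; `injectMemory` is a hypothetical helper.

```typescript
// No retrieval step, no embeddings: the whole file is the memory.
interface ChatMessage { role: "user" | "assistant"; content: string }

function injectMemory(memory: string, messages: ChatMessage[]): ChatMessage[] {
  if (!memory) return messages; // no memory yet: pass through unchanged
  return [
    { role: "user", content: `<UserMemory>\n${memory}\n</UserMemory>` },
    ...messages,
  ];
}

const memory = "# User Information\n- Industry: Consumer goods product manager";
const out = injectMemory(memory, [
  { role: "user", content: "Want to do tea beverage research" },
]);
console.log(out.length, out[0].content.startsWith("<UserMemory>")); // → 2 true
```

Because the memory is plain Markdown, the user can open the file, read exactly what the model sees, and edit it.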
Comparison with mainstream approaches:
| Approach | Storage | Control | Retrieval | atypica choice rationale |
|---|---|---|---|---|
| Anthropic (CLAUDE.md) | File-based | User-driven | Full loading | ✅ Simple, transparent, effective with large context |
| OpenAI | Vector DB (speculated) | AI + user confirmation | Semantic retrieval | ❌ Black box, weak user control |
| Mem0 | Vector + Graph + KV | AI-driven | Hybrid retrieval | ❌ Over-engineered, high maintenance cost |
| MemGPT | OS-inspired tiered | AI self-managed | Tiered retrieval | ❌ Conceptually complex, utility unproven |
We chose Anthropic's simple approach because:
User experience:
Before:
After:
System cost:
Response time:
Low cost, fast response, completely acceptable.
Now let's step back and see how atypica's architecture differs from mainstream AI Agent frameworks.
| atypica | LangChain | Core Difference |
|---|---|---|
| Messages as source | ConversationBufferMemory | We believe conversation history is the best state |
| Generate studyLog on demand | Pre-compute summary | Avoid sync issues, traceable on failures |
| DB stores derived state | DB stores core state | Similar to Event Sourcing |
Why different?
LangChain's design is influenced by traditional software, believing "state should be explicitly stored and managed."
We believe, for LLMs:
| atypica | LangGraph | Core Difference |
|---|---|---|
| Configuration-driven | Graph-driven | We use configuration to express differences, code for commonalities |
| Single executor | Node orchestration | Avoid over-abstraction, good enough is enough |
| Messages as protocol | Explicit node communication | Loosely coupled without losing context |
Why different?
LangGraph pursues generality, using graph orchestration to express arbitrarily complex flows.
We believe, for our scenarios:
| atypica | Mem0 | Core Difference |
|---|---|---|
| Markdown files | Vector + Graph + KV | We choose simple and transparent over precise and complex |
| Full loading | Semantic retrieval | When context window is large enough, full text is better |
| User-editable | AI black box | User trust comes from transparency |
Why different?
Mem0 pursues precise retrieval, using multiple databases in hybrid.
We believe, for personal assistants:
atypica's choices:
Mainstream frameworks' choices:
Who's right or wrong?
Neither is wrong. They simply optimize for different contexts and scales.
Specific impact from three evolutions:
| Task | Before | After | Improvement |
|---|---|---|---|
| Add new research method | 12 files, 2-3 days | 3 files, 2-3 hours | 10x |
| Add new capability (MCP) | Modify 3 places, 1 day | Modify 1 place, 2 hours | 4x |
| Fix bug | Change 3 agents | Change 1 base | 3x |
The cost and performance impact is negligible.
What did we learn from three evolutions?
1. Incremental refactoring, not big bang
We didn't rewrite the entire system at once. Three evolutions, each step:
- kept backward compatibility (e.g., retaining the `analyst.studySummary` field)

This let us quickly validate ideas and reduce risk.
2. Start from real pain points
Don't pursue architectural perfection, instead:
- Adding `discussionChat` was too complex

Let problems drive design, not design drive problems.
3. Embrace LLM characteristics
Don't treat LLMs as traditional software:
Adapt to LLMs' capability boundaries rather than fighting them.
1. Learning curve for abstraction layer
Modifying `baseAgentRequest` requires understanding its hooks: `customPrepareStep` and `customOnStepFinish`.

But clear interfaces and documentation lowered the barrier.
2. Cost of on-demand generation
studyLog generation requires an extra LLM call (~$0.002 per call).
But:
3. Limitations of simple solutions
Markdown memory isn't suitable for:
But:
1. Confidence from type safety
During refactoring, the compiler caught 99% of issues.
2. Flexibility of configuration-driven
Adding webhook integration only requires:
All agents automatically gain the new capability; no config changes are needed.
3. Power of messages as protocol
Plan Mode and Study Agent communicate through messages:
This was an unexpected benefit.
Three evolutions brought atypica closer to general-purpose agents. But there's more to do.
1. Skills Library
2. Multi-Agent Collaboration
3. Evolve toward GEA
4. Self-Improving Agents
No matter how we evolve, we stick to:
Building AI Agent systems is not a simple extension of traditional software engineering.
We need to rethink:
atypica's three evolutions are essentially three cognitive upgrades:
From database thinking → data flow thinking
From code reuse → configuration-driven
From stateless → memory-enhanced
These choices may not be the most "advanced."
But they are:
And this, perhaps, is the key to building reliable AI systems.
Code listings referenced throughout the article:

```text
prisma/schema.prisma                        # New Discussion table
src/ai/tools/discussionChat.ts              # New tool
src/ai/tools/saveDiscussion.ts              # Save tool
src/app/(study)/agents/studyAgent.ts        # Add tool to agent
src/app/(study)/agents/fastInsightAgent.ts  # Add again
src/app/(study)/agents/productRnDAgent.ts   # And again
... 6 more files
```

```typescript
// studyAgentRequest.ts (493 lines)
export async function studyAgentRequest(context) {
  const result = await streamText({
    model: llm("claude-sonnet-4"),
    system: studySystem(),
    messages,
    tools: {
      webSearch, interview, scoutTask,
      saveAnalyst, generateReport
      // ... 15 tools
    },
    onStepFinish: async (step) => {
      // Save messages
      // Track tokens
      // Send notifications
      // ... 120 lines of logic
    }
  });
}

// fastInsightAgentRequest.ts (416 lines)
// 95% identical code

// productRnDAgentRequest.ts (302 lines)
// 95% identical code
```

```typescript
class ResearchSession {
  state: 'IDLE' | 'PLANNING' | 'RESEARCHING' | 'REPORTING';
  data: {
    interviews: Interview[];
    findings: Finding[];
    reports: Report[];
  };

  transition(event: Event) {
    switch (this.state) {
      case 'IDLE':
        if (event.type === 'START') this.state = 'PLANNING';
        break;
      case 'PLANNING':
        if (event.type === 'PLAN_COMPLETE') this.state = 'RESEARCHING';
        break;
      // ... more state transitions
    }
  }
}
```

```typescript
const messages = [
  { role: 'user', content: "Want to understand young people's coffee preferences" },
  { role: 'assistant', content: 'I can help you conduct user research...' },
  { role: 'assistant', toolCalls: [{ name: 'scoutTask', args: {...} }] },
  { role: 'tool', content: 'Observed 5 user segments...' },
  { role: 'assistant', content: 'Based on observations, I suggest interviewing 18-25 coffee enthusiasts...' },
  { role: 'assistant', toolCalls: [{ name: 'interviewChat', args: {...} }] },
  // ...
];
```

```typescript
// ❌ Traditional: Explicit state management
interface ResearchState {
  stage: 'planning' | 'researching' | 'reporting';
  completedInterviews: number;
  pendingTasks: Task[];
}
// Need synchronization: state and conversation history can diverge

// ✅ AI-native: Conversation is state
const messages = [...conversationHistory];
// AI infers state from history, no explicit sync needed
const result = await streamText({
  messages, // AI knows what to do
});
```

```text
Plan Mode: Understanding intent
"User says: want to understand young people's coffee preferences"
  → Analyze: needs qualitative research
  → Decide: use group discussion method
  → Output: complete research plan

Study Agent: Executing plan
"Received research plan"
  → Call discussionChat
  → Analyze discussion results
  → Generate insights report
```

```typescript
// Option A (Vector DB + semantic search): precise matching of relevant memories
const query_embedding = await embed(user_message);
const relevant_memories = await vectorDB.search(query_embedding, { topK: 5 });

// Option B (Markdown files + full loading): simple and transparent
const memory = await readFile(`memories/${userId}.md`);
const messages = [
  { role: 'user', content: `<UserMemory>\n${memory}\n</UserMemory>` },
  ...conversationMessages
];
```

```typescript
// Place 1: analyst table
const analyst = await prisma.analyst.findUnique({ where: { id } });
console.log(analyst.studySummary); // "Research summary..."

// Place 2: interviews table
const interviews = await prisma.interview.findMany({ where: { analystId: id } });
console.log(interviews.map(i => i.conclusion));
// ["Interview 1 conclusion", "Interview 2 conclusion"]

// Place 3: messages table
const messages = await prisma.chatMessage.findMany({ where: { userChatId } });
// webSearch results are here
```

```typescript
async function generateReport(analystId) {
  const analyst = await prisma.analyst.findUnique({
    where: { id: analystId },
    include: { interviews: true } // JOIN!
  });
  const messages = await prisma.chatMessage.findMany({
    where: { userChatId: analyst.studyUserChatId }
  });

  // Stitch data together
  const reportData = {
    summary: analyst.studySummary,                  // from analyst table
    interviewInsights: analyst.interviews.map(...), // from interviews table
    webResearch: extractFromMessages(messages)      // from messages table
  };
}
```

```typescript
// interviewChat: content in DB, returns reference
{
  toolName: 'interviewChat',
  output: { interviewId: 123 } // Need another DB query
}

// scoutTaskChat: content in return value
{
  toolName: 'scoutTaskChat',
  output: {
    plainText: "Observation results...", // Content directly returned
    insights: [...]
  }
}
```

```typescript
// ✅ New architecture: Unified output format
interface ResearchToolResult {
  plainText: string;   // Human-readable summary, required
  [key: string]: any;  // Optional structured data
}

// interviewChat also returns plainText
{
  toolName: 'interviewChat',
  output: {
    plainText: "Interview summary: User Zhang San mentioned...", // ← Full content here
    interviewId: 123 // Optional: DB reference
  }
}
```

```typescript
// Don't pre-save, generate when needed
if (!analyst.studyLog) {
  const messages = await loadMessages(studyUserChatId);
  const studyLog = await generateStudyLog(messages); // ← Generate from messages
  await prisma.analyst.update({ where: { id }, data: { studyLog } });
}
```

```text
Deleted files:
- src/ai/tools/saveInterview.ts
- src/ai/tools/saveDiscussion.ts
- src/ai/tools/saveScoutTask.ts
- src/ai/tools/savePersona.ts
- src/ai/tools/saveWebSearch.ts

Simplified files (28):
- Agent configs no longer need save tools
- generateReport doesn't need multi-table JOINs
```

```text
Before: adding discussionChat
1. Create Discussion table
2. Write discussionChat tool
3. Write saveDiscussion tool
4. Add both tools to 3 agents
5. Write discussion query logic
6. Modify generateReport query
Total: 12 files, 2-3 days

After: adding discussionChat
1. Write discussionChat tool (returns plainText)
2. Add tool to agent config
3. generateReport auto-supports (reads from messages)
Total: 3 files, 2-3 hours
```

```text
AI: "Which age group do you want to research?"
User: "18-25 I guess"
AI: "What method? Interviews or surveys?"
User: "Interviews"
AI: "How many people?"
User: "Around 10"
```

```shell
$ wc -l src/app/(study)/agents/*AgentRequest.ts
493 studyAgentRequest.ts
416 fastInsightAgentRequest.ts
302 productRnDAgentRequest.ts
```

```typescript
// src/app/(study)/agents/configs/planModeAgentConfig.ts
export async function createPlanModeAgentConfig() {
  return {
    model: "claude-sonnet-4-5",
    systemPrompt: planModeSystem({ locale }),
    tools: {
      requestInteraction, // Interact with user
      makeStudyPlan,      // Display complete plan, one-click confirm
    },
    maxSteps: 5, // Max 5 steps to complete clarification
  };
}
```

```typescript
// src/app/(study)/agents/baseAgentRequest.ts (577 lines)
interface AgentRequestConfig<TOOLS extends ToolSet> {
  model: LLMModelName;
  systemPrompt: string;
  tools: TOOLS;
  maxSteps?: number;
  specialHandlers?: {
    // Dynamically control which tools are available
    customPrepareStep?: (options) => { messages, activeTools?: (keyof TOOLS)[] };
    // Custom post-processing logic
    customOnStepFinish?: (step, context) => Promise<void>;
  };
}

async function executeBaseAgentRequest<TOOLS>(
  baseContext: BaseAgentContext,
  config: AgentRequestConfig<TOOLS>,
  streamWriter: UIMessageStreamWriter
) {
  // Phase 1: Initialization
  // Phase 2: Prepare Messages
  // Phase 3: Universal Attachment Processing
  // Phase 4: Universal MCP and Team System Prompt
  // Phase 5: Load Memory and Inject into Context
  // Phase 6: Main Streaming Loop
  // Phase 7: Universal Notifications
}
```

```typescript
// src/app/(study)/api/chat/route.ts
if (!analyst.kind) {
  // Plan Mode - intent clarification
  const config = await createPlanModeAgentConfig(agentContext);
  await executeBaseAgentRequest(agentContext, config, streamWriter);
} else if (analyst.kind === AnalystKind.productRnD) {
  // Product R&D Agent
  const config = await createProductRnDAgentConfig(agentContext);
  await executeBaseAgentRequest(agentContext, config, streamWriter);
} else {
  // Study Agent (comprehensive research, fast insights, testing, creative, etc.)
  const config = await createStudyAgentConfig(agentContext);
  await executeBaseAgentRequest(agentContext, config, streamWriter);
}
```

```typescript
// src/app/(study)/agents/configs/studyAgentConfig.ts
export async function createStudyAgentConfig(params) {
  return {
    model: "claude-sonnet-4",
    systemPrompt: studySystem({ locale }),
    tools: buildStudyTools(params), // ← Tools this agent needs
    specialHandlers: {
      // Custom tool control
      customPrepareStep: async ({ messages }) => {
        const toolUseCount = calculateToolUsage(messages);
        let activeTools = undefined;
        // After report generation, restrict available tools
        if ((toolUseCount[ToolName.generateReport] ?? 0) > 0) {
          activeTools = [
            ToolName.generateReport,
            ToolName.reasoningThinking,
            ToolName.toolCallError,
          ];
        }
        return { messages, activeTools };
      },
      // Custom post-processing
      customOnStepFinish: async (step) => {
        // After saving research intent, auto-generate title
        const saveAnalystTool = findTool(step, ToolName.saveAnalyst);
        if (saveAnalystTool) {
          await generateChatTitle(studyUserChatId);
        }
      },
    },
  };
}
```

```text
Deleted:
- studyAgentRequest.ts (493 lines)
- fastInsightAgentRequest.ts (416 lines)
- productRnDAgentRequest.ts (302 lines)
Total: -1,211 lines

Added:
+ baseAgentRequest.ts (577 lines)
+ planModeAgentConfig.ts (120 lines)
+ studyAgentConfig.ts (180 lines)
+ productRnDAgentConfig.ts (80 lines)
Total: +957 lines

Net reduction: -254 lines
```

```text
Before: adding MCP integration
1. Modify studyAgentRequest.ts
2. Modify fastInsightAgentRequest.ts
3. Modify productRnDAgentRequest.ts
4. Test three agents
Time: 2-3 days

After: adding MCP integration
1. Modify baseAgentRequest.ts
2. All agents automatically gain the new capability
Time: 2-3 hours
```

```text
Before:
User: "Want to understand young people's coffee preferences"
AI: "Which age group do you want to research?"
User: "18-25"
AI: "What method do you want to use?"
User: "Interviews I guess"
AI: "How many people?"
... (3-5 conversation rounds)

After:
User: "Want to understand young people's coffee preferences"
AI displays complete plan:
┌─────────────────────────────────────┐
│ 【Research Plan】                    │
│ Goal: Understand 18-25 coffee prefs │
│ Method: Group discussion (5-8 ppl)  │
│ Duration: ~40 minutes               │
│ Output: Consumer insights report    │
│                                     │
│ [Confirm Start] [Modify Plan]       │
└─────────────────────────────────────┘
```

```typescript
const result = await streamText({
  messages: currentConversation, // ← Only current conversation
  // No context from historical conversations
});
```

```prisma
model Memory {
  id          Int    @id @default(autoincrement())
  userId      Int?   // User-level memory
  teamId      Int?   // Team-level memory
  version     Int    // Version management

  // Two-tier architecture
  core        String @default("") @db.Text // Core memory (Markdown)
  working     Json   @default("[]")        // Working memory (JSON, to be consolidated)

  changeNotes String @db.Text // Update notes

  @@unique([userId, version])
  @@index([userId, version(sort: Desc)])
}
```

```markdown
# User Information
- Industry: Consumer goods product manager
- Focus: Young consumer preferences, emerging trends

# Research Style
- Prefers qualitative research (interviews, discussions)
- Values authentic user voices over statistics
```

```json
[
  { "info": "User recently focused on coffee market", "source": "chat_123" },
  { "info": "Prefers group discussion method", "source": "chat_124" }
]
```

```typescript
// src/app/(memory)/actions.ts
async function updateMemory({ userId, conversationContext }) {
  let memory = await loadLatestMemory(userId);

  // Step 1: Reorganize when threshold exceeded (Claude Sonnet 4.5)
  if (memory.core.length > 8000 || memory.working.length > 20) {
    memory = await reorganizeMemory(memory, conversationContext);
  }

  // Step 2: Extract new information (Claude Haiku 4.5)
  const newInfo = await extractMemoryUpdate(memory.core, conversationContext);
  if (newInfo) {
    // Step 3: Insert new information at specified location
    await insertMemoryInfo(memory, newInfo);
  }
}
```

```typescript
// src/app/(study)/agents/baseAgentRequest.ts

// Phase 5: Load Memory
const memory = await loadUserMemory(userId);
if (memory?.core) {
  // Inject at conversation start
  modelMessages = [
    { role: 'user', content: `<UserMemory>\n${memory.core}\n</UserMemory>` },
    ...modelMessages
  ];
}

// Phase 6: Streaming
const result = await streamText({
  messages: modelMessages, // ← Includes user memory
  // ...
});

// Phase 7: Non-blocking memory update
waitUntil(
  updateMemory({ userId, conversationContext: messages })
);
```

```text
Before:
First conversation:
User: "Want to do coffee research"
AI: "What industry are you in?"
User: "Consumer goods"
AI: "What dimensions do you care about?"
...

Second conversation (a week later):
User: "Want to do tea beverage research"
AI: "What industry are you in?" # ← Asks again

After:
First conversation:
User: "Want to do coffee research"
AI: "What industry are you in?"
User: "Consumer goods product manager"
# AI remembers

Second conversation (a week later):
User: "Want to do tea beverage research"
AI: "Based on your background as a consumer goods PM, I suggest..." # ← Remembers!
```

```text
Memory Update (per conversation):
- Model: Claude Haiku 4.5
- Tokens: ~5K
- Cost: ~$0.001

Memory Reorganize (every 20 conversations):
- Model: Claude Sonnet 4.5
- Tokens: ~15K
- Cost: ~$0.02

Average cost: ~$0.002/conversation

Memory loading: +50ms (non-blocking)
Memory update: background, doesn't affect response
```

```text
Duplicate code:
Before: 1,211 lines (three agent wrappers)
After: 0 lines
Reduction: 100%

Total lines of code:
Before: 1,211 lines (duplicates) + others
After: 577 lines (base) + 380 lines (configs) = 957 lines
Net reduction: 254 lines (21%)

Cyclomatic complexity:
Before: avg 12.3
After: avg 6.7
Reduction: 45%

Token consumption (with prompt cache):
- studyLog generation: ~2K tokens (~$0.002)
- Memory update: ~5K tokens (~$0.005)
- Average per conversation: +$0.007

Response time:
- Memory loading: +50ms (non-blocking)
- Plan Mode: +2s (one-time)
- studyLog generation: background, doesn't affect response

Intent clarification:
Before: average 3.2 conversation rounds
After: 1 plan display + 1 confirmation
Improvement: 3x efficiency

AI "memory":
Before: repetitive questions every conversation
After: auto-load user preferences
Improvement: personalized experience

Research startup time:
Before: ~5 minutes (multiple rounds of clarification)
After: ~1 minute (one-click confirm)
Improvement: 5x efficiency
```

```typescript
// Fully type-safe tool handling
const tool = step.toolResults.find(
  t => !t.dynamic && t.toolName === ToolName.generateReport
) as StaticToolResult<Pick<StudyToolSet, ToolName.generateReport>>;

if (tool?.output) {
  const token = tool.output.reportToken; // ← TypeScript knows this field exists
}
```

```typescript
// baseAgentRequest.ts
if (webhookUrl) {
  await sendWebhook(webhookUrl, step);
}
```