Multi-Agent Research Design in Practice
I recently adapted the Virtual Lab framework—originally developed for drug discovery and published in Nature by Swanson, Wu, Bulaong et al.—to study how hospitals respond to Medicare payment shocks.
This post shows how multi-agent collaboration transforms simple research prompts into rigorous econometric designs, why I ultimately skipped fine-tuning for this project, and how this connects to AI-assisted coding tools like Claude Code.
From Simple Prompt to Rigorous Research Design
Initial prompt to the agent team:
“Analyze changes in CMS Inpatient Prospective Payment System (IPPS) Impact Files. Identify quasi-experimental events related to Wage Index Cliffs and Border Discontinuities to understand how hospitals respond to payment shocks.”
What the agents produced after 5 rounds of discussion:
Final Research Question:
How do hospitals respond to Wage Index shocks, what role do cost-shifting behaviors play, how does Market Concentration (HHI) moderate these responses, and what are the implications for health equity?
Final Econometric Specification:
- Design: Staggered-adoption intensity event study using Sun & Abraham estimator
- Treatment: Baseline-frozen OMB delineation changes mapped to hospital-level payment shocks
- Moderation: HHI interaction terms to test “rent capture” hypothesis
- Primary Outcome: Change in wage index wedge (Δwi_wedge) - the cleanest behavioral signal in Impact Files
- Validity Tests: Pre-trend tests for both Shock and Shock×HHI, manipulation tests, support checks
- Inference: Wild bootstrap clustering at CBSA level
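Written out, the core specification behind this list is roughly the following (my sketch; cohort-specific coefficients, controls, and Sun & Abraham's cohort-share weighting are suppressed for readability):

$$
\Delta \mathrm{wi\_wedge}_{ht} = \alpha_h + \lambda_t + \sum_{e \neq -1} \beta_e D^{e}_{ht}\,\mathrm{Shock}_h + \sum_{e \neq -1} \gamma_e D^{e}_{ht}\,\mathrm{Shock}_h \times \mathrm{HHI}_h + \varepsilon_{ht}
$$

where $D^{e}_{ht}$ indicates event time $e$ relative to hospital $h$'s delineation-adoption year, $\mathrm{Shock}_h$ is the baseline-frozen treatment intensity, the pre-period coefficients ($e < -1$) serve as the pre-trend gates for both the main effect and the HHI interaction, and inference uses a wild bootstrap clustered at the CBSA level.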
This evolution happened through structured debate among six specialized agents.
The Agent Team
Each agent brought specific expertise:
- Principal Investigator: Enforced focus on “Impact File-native” variables only
- Health Economist: Designed difference-in-differences specifications
- Industrial Economist: Analyzed market concentration effects, pushed for “rent capture” mechanism tests
- Policy Analyst: Grounded analysis in CMS regulations and Federal Register documentation
- Causal Inference Expert: Rejected standard TWFE under staggered adoption, insisted on Sun & Abraham
- Economics Critic: Flagged when claims exceeded empirical support
Example: Multi-Agent Debate on Methods
Industrial Economist’s proposal:
“Use difference-in-differences and regression discontinuity designs, with propensity score matching.”
Economics Critic’s response:
“Caution against TWFE under staggered adoption—require Sun & Abraham or Callaway-Sant’Anna with diagnostics. Pre-trend gates must be pre-specified for both Shock and Shock×HHI. Do not over-interpret payment-on-exposure regressions without first-stage validation.”
Causal Inference Expert’s synthesis:
“Main estimator should use Sun & Abraham interaction-weighted event study. Define time-invariant, baseline-frozen treatment intensity for clean identification. Include formal pre-trend gates for both main effects and interactions. Show nonparametric HHI bins.”
Result: The team converged on modern causal inference methods, with explicit validity checks and robustness requirements that wouldn’t have emerged from a single agent or typical researcher workflow.
The Fine-Tuning Pipeline: How It Works
Virtual Lab includes a 7-phase pipeline to transform generic LLMs into domain specialists:
Phase 1-2: Literature Search and Selection
Agent receives task:
“Search PubMed for query ‘CMS IPPS impact file policy change quasi-experimental hospital behavior’, request 10 abstracts. Rigorously exclude papers about basic science, pure clinical trials without cost data, or non-US healthcare systems. Only select papers with direct relevance to US hospital economics, payment policy, or market structure.”
Example agent response (Industrial Economist):
[
{
"pmcid": "9175183",
"title": "Adoption and Learning Across Hospitals: The Case of a Revenue-Generating Practice",
"reason": "Studies hospital response to 2008 Medicare documentation reform
affecting revenue—directly relevant to understanding how hospitals
react to payment incentives and inter-hospital learning dynamics."
}
]
The agent correctly filtered out conference abstracts and irrelevant papers, selecting only those studying US hospital behavioral responses to Medicare policy changes.
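Mechanically, this phase is a programmatic literature search followed by agent screening. A rough sketch of the retrieval step using NCBI's public E-utilities API (the query mirrors the task above; Virtual Lab's actual retrieval code may differ):

```python
import requests

# Search PubMed for the top 10 matches to the agent's query.
query = "CMS IPPS impact file policy change quasi-experimental hospital behavior"
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": query, "retmax": 10, "retmode": "json"},
    timeout=30,
)
pmids = resp.json()["esearchresult"]["idlist"]

# Pull the abstracts so an agent can screen them against the inclusion criteria.
abstracts = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    params={"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"},
    timeout=30,
).text
```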
Phase 3-4: Summarization and Training Data Generation
Papers are downloaded and summarized with extreme detail. Here’s an excerpt from the Industrial Economist’s summary of Garmon (2017) on hospital merger screening:
Summary excerpt:
“The paper evaluates whether prospective merger ‘screens’ computed using pre-merger data actually predict post-merger price changes for 28 consummated hospital mergers. It compares traditional structural measures (HHI under different market definitions) versus ‘direct’ unilateral-effects measures (diversion ratios, WTP, UPP) and merger simulation… Key findings: Minimum ΔWTP threshold of 6% and UPP threshold of 4% outperform HHI-based screens. HRR-bed HHI produces many false negatives (flags only 4 of 9 significant price-increase mergers), confirming criticism that EH-like broad markets are too large…”
The summary continues for 15+ pages, covering:
- Policy problem and fiscal implications
- Institutional backdrop (why 1990s enforcement failed)
- Technical details of each screening method (WTP, UPP, diversion ratios)
- Data construction and causal design (synthetic control)
- All empirical findings with specific numbers
- Operational recommendations for local agencies
These detailed summaries are then converted to training examples in OpenAI’s format:
Training data structure:
{
"messages": [
{
"role": "system",
"content": "You are a Health Economist. Your expertise is in
fiscal impact analysis and hospital cost structures..."
},
{
"role": "user",
"content": "Explain the findings of 'Medicaid Disproportionate Share'
regarding hospital policy."
},
{
"role": "assistant",
"content": "## 1) What the document says UC DSH is *supposed* to do
vs. what it *became*
### Original policy purpose...
The document frames Medicaid DSH as a supplemental payment
intended to help hospitals that 'serve a disproportionate
number of low-income patients'... [detailed response]"
}
]
}
Each agent gets 50-200 training examples from its curated literature, split 80/20 into training and validation sets.
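In code, turning the summaries into fine-tuning data is essentially a JSONL conversion plus a random split. A minimal sketch, assuming each agent's material is already a list of (question, answer) pairs (function and variable names are mine, not Virtual Lab's):

```python
import json
import random

def build_jsonl(agent_persona, qa_pairs, train_path, val_path, val_frac=0.2, seed=0):
    """Write OpenAI chat-format fine-tuning examples with an 80/20 train/validation split."""
    examples = [
        {
            "messages": [
                {"role": "system", "content": agent_persona},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        for question, answer in qa_pairs
    ]
    random.Random(seed).shuffle(examples)
    n_val = int(len(examples) * val_frac)
    for path, subset in [(val_path, examples[:n_val]), (train_path, examples[n_val:])]:
        with open(path, "w") as f:
            for ex in subset:
                f.write(json.dumps(ex) + "\n")
```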
Phase 5-7: Fine-Tuning and Evaluation
Training jobs are launched on gpt-4o-mini. To evaluate whether fine-tuning improved performance, the system generates synthetic QA pairs from validation papers.
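Launching such a job with the OpenAI Python SDK looks roughly like this (a sketch; the file name, suffix, and exact model snapshot are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Upload the training file produced in the previous phase.
train_file = client.files.create(
    file=open("health_economist_train.jsonl", "rb"), purpose="fine-tune"
)

# Start a supervised fine-tuning job on gpt-4o-mini.
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",
    suffix="health-economist",
)
print(job.id, job.status)
```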
Example QA pair (from Neprash et al. 2015 on price measurement):
{
"question": "What is the main measurement problem Neprash et al. (2015) address?",
"correctAnswerIndex": 0,
"options": [
"How to construct valid, comparable commercial health care
price measures across geographic markets",
"How to predict individual patient mortality risk using DRG codes",
"How to estimate hospital cost-to-charge ratios from Medicare cost reports",
"How to set optimal Medicaid capitation rates"
]
}
Another example on methodology:
{
"question": "What is a major drawback of using a single 'sentinel service'
price as a proxy for overall market prices?",
"correctAnswerIndex": 0,
"options": [
"Sentinel service prices may be unrepresentative of overall market
pricing and correlate weakly with other services",
"Sentinel service prices require identical service baskets across all markets",
"Sentinel service prices cannot be computed from noncapitated claims",
"Sentinel service prices automatically adjust for case-mix differences"
]
}
The system generates 3-5 multiple-choice questions per validation paper, testing whether models understand:
- Main research questions and policy problems
- Methodological details and their limitations
- Key empirical findings with specific numbers
- Institutional context and fiscal implications
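Scoring a model against these QA pairs is then a simple loop: present the question and numbered options, ask for a single index, and compare it to correctAnswerIndex. A simplified sketch (prompting and answer parsing are more involved in practice):

```python
from openai import OpenAI

client = OpenAI()

def score_model(model_name, qa_pairs):
    """Return the fraction of multiple-choice questions the model answers correctly."""
    correct = 0
    for qa in qa_pairs:
        options = "\n".join(f"{i}. {opt}" for i, opt in enumerate(qa["options"]))
        prompt = (
            f"{qa['question']}\n\n{options}\n\n"
            "Answer with the number of the correct option only."
        )
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = resp.choices[0].message.content.strip()
        if answer and answer[0].isdigit() and int(answer[0]) == qa["correctAnswerIndex"]:
            correct += 1
    return correct / len(qa_pairs)
```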
Evaluation Results: Base vs Fine-Tuned Models
The evaluation compared base gpt-4o-mini against fine-tuned versions on the validation set QA pairs.
How agents answered domain questions:
When asked about hospital merger screening methods (from the validation set):
Base Model Response:
“Hospital merger screening typically uses HHI thresholds from the Horizontal Merger Guidelines. Post-merger HHI > 2500 with Δ > 200 raises concerns. More recent approaches incorporate diversion ratios…”
Fine-Tuned Industrial Economist:
“Garmon (2017) shows that minimum ΔWTP threshold of 6% and UPP threshold of 4% outperform HHI-based screens for predicting post-merger price changes. HRR-bed HHI produces false negatives—flagging only 4 of 9 significant price-increase mergers—confirming that EH-like broad markets are too large. The optimal screening pipeline should compute ΔWTP and UPP from discharge-based choice models, using conservative aggregation across models…”
The fine-tuned response included:
- Specific thresholds (6%, 4%) from the literature
- Empirical validation (4 of 9 mergers)
- Methodological critique (EH-like markets too broad)
- Actionable recommendations (discharge-based choice models)
However, the results were mixed:
On the hold-out validation set, base GPT-4o models often performed as well as or better than the fine-tuned gpt-4o-mini models. The base models' superior general reasoning sometimes compensated for less domain-specific knowledge.
Why I Skipped Fine-Tuning
Despite implementing the full pipeline, I ran the project with base GPT-4o models. Here’s why:
1. Base GPT-4o outperformed fine-tuned gpt-4o-mini
When evaluating on hold-out papers, base GPT-4o produced more balanced responses than fine-tuned gpt-4o-mini. The base model’s broader reasoning capabilities and training data meant it already had strong health economics knowledge.
2. Fine-tuning is iterative and incomplete
The fine-tuning pipeline I implemented is a first pass. Improving performance would require:
- More training examples (currently 50-200 per agent)
- Better prompt engineering in training data
- Iterative refinement based on validation failures
- Potentially fine-tuning larger models (gpt-4o instead of gpt-4o-mini)
3. Cost-benefit for this research stage
For exploratory research design, the marginal benefit of fine-tuning didn’t justify:
- Training costs (~$3/M tokens)
- Higher inference costs for fine-tuned models
- Maintenance burden (retraining as literature evolves)
Connecting to AI-Assisted Coding
Virtual Lab and tools like Claude Code serve complementary roles:
Virtual Lab: Research Design
Agents design the research through structured debate:
Output from longitudinal modeling phase:
“Adopt an OMB delineation adoption design as the anchor event. Use Sun & Abraham staggered-adoption estimator with baseline-defined, time-invariant treatment intensity. Primary outcome: Δ(wi_wedge). Include HHI moderation via linear interactions and HHI-bin event studies. Exclude transition years by default unless modeling blend weights explicitly.”
Claude Code: Implementation
Claude Code takes that specification and writes the implementation: the agents specify what to build, and Claude Code handles how. In practice there was some back-and-forth to get the code working, and I also explored expanding the time horizon.
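To make that concrete, here is a minimal sketch of the kind of event-study code this step produces. It uses plain two-way fixed effects with hypothetical column names rather than the interaction-weighted Sun & Abraham estimator or the wild bootstrap the agents specified (those need dedicated packages), so treat it as scaffolding rather than the final analysis:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per hospital-year with columns
# hospital_id, year, cbsa, adopt_year (OMB delineation adoption),
# shock (baseline-frozen treatment intensity), dwi_wedge (outcome).
df = pd.read_csv("impact_file_panel.csv")
df["rel_year"] = df["year"] - df["adopt_year"]

# Event-time indicators interacted with baseline shock intensity,
# omitting e = -1 as the reference period.
event_terms = []
for e in range(-3, 4):
    if e == -1:
        continue
    col = f"evt_{'m' if e < 0 else 'p'}{abs(e)}"
    df[col] = (df["rel_year"] == e).astype(float) * df["shock"]
    event_terms.append(col)

# Two-way fixed effects via hospital and year dummies; cluster at the CBSA level.
formula = "dwi_wedge ~ " + " + ".join(event_terms) + " + C(hospital_id) + C(year)"
res = smf.ols(formula, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["cbsa"]}
)
print(res.params.filter(like="evt_"))
```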
Ideal Workflow
- Virtual Lab agents debate design → Produce specification
- Feed spec to Claude Code → Generate implementation
- Claude Code runs analysis → Produce results
- Virtual Lab agents critique → Identify issues, suggest refinements
- Claude Code updates code → Iterate to final analysis
Key Lessons
1. Start with base models, fine-tune only when necessary
Test strong prompts first. Fine-tune only for specific, measurable deficiencies (usually style/format, not knowledge).
2. Version control agent prompts
Agent system prompts are code. Use git, code review, changelogs, and A/B testing.
3. Validate agent output rigorously
Agents can be confidently wrong. Always:
- Cross-check citations (agents sometimes hallucinate papers)
- Verify regulatory references against Federal Register
- Test generated code
- Run robustness checks on econometric strategies
4. Log everything
Virtual Lab saves all discussions as JSON. This audit trail is invaluable for understanding decisions and debugging recommendations.
5. Human-in-the-loop for final decisions
Let agents debate alternatives, but humans make final calls on:
- Research questions
- Data inclusion/exclusion
- Publication-ready claims
- Code deployment
Looking Forward
I want to take another pass at fine-tuning agents on the literature using this approach and test it more thoroughly.
As frontier models improve, the bottleneck shifts from “Can agents think at this level?” to “How do we structure agent collaboration effectively?”
Virtual Lab provides one answer: assemble specialists, let them debate, and critique iteratively. I may also explore ways of folding something like this into a custom Claude Code workflow.
Resources:
- Virtual Lab framework (Zou Group)
- My presentation on agent tooling and MCP
- Hospital Impact Files project (example application to health economics)
Credits: Virtual Lab created by Swanson, K., Wu, W., Bulaong, N.L. et al., “The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies,” Nature (2025). https://doi.org/10.1038/s41586-025-09442-9