<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml"/><link href="/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-20T02:41:12+00:00</updated><id>/feed.xml</id><title type="html">blank</title><subtitle>Applying machine learning and data engineering to answer questions in health economics. </subtitle><entry><title type="html">Building Research Intelligence Pipelines with Gemini Grounding</title><link href="/blog/2026/gemini-grounding-research-intelligence/" rel="alternate" type="text/html" title="Building Research Intelligence Pipelines with Gemini Grounding"/><published>2026-01-27T10:00:00+00:00</published><updated>2026-01-27T10:00:00+00:00</updated><id>/blog/2026/gemini-grounding-research-intelligence</id><content type="html" xml:base="/blog/2026/gemini-grounding-research-intelligence/"><![CDATA[<p>I built a research intelligence tool that extracts structured datasets from web sources using Google Gemini’s search grounding feature. The key insight: grounding transforms LLMs from “confident guessers” into “citation machines” that show their work.</p> <h2 id="the-problem-llms-hallucinate-data">The Problem: LLMs Hallucinate Data</h2> <p>Ask an LLM to list recent hospital mergers and you’ll get a confident response. But is it accurate? Are those dates real? Did that acquisition actually happen?</p> <p>Traditional LLM responses have no provenance. You can’t verify claims without manually searching each one. For research applications, “the model said so” isn’t a citation.</p> <h2 id="what-grounding-changes">What Grounding Changes</h2> <p>When Gemini uses Google Search grounding, every claim comes with source URLs. 
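</p> <p>To make the mechanics concrete, here is a minimal sketch of a grounded call and of pulling the source URLs back out. The call shape follows Google’s google-genai SDK documentation at the time of writing, and the metadata field names are assumptions based on the documented response shape; verify both against the current SDK.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: enabling Google Search grounding (google-genai SDK; the exact
# call shape may drift, so check current docs before relying on it):
#
#   from google import genai
#   from google.genai import types
#
#   client = genai.Client()
#   resp = client.models.generate_content(
#       model="gemini-2.0-flash",
#       contents="List hospital mergers announced in Q1 2024.",
#       config=types.GenerateContentConfig(
#           tools=[types.Tool(google_search=types.GoogleSearch())]))
#
# The grounded response carries metadata naming each web source used.
# A pure helper to collect those URLs from a metadata-shaped dict:

def extract_source_urls(grounding_metadata):
    """Collect de-duplicated source URLs from a grounding-metadata dict.
    Field names ('grounding_chunks', 'web', 'uri') are assumptions based
    on the documented response shape."""
    urls = []
    for chunk in grounding_metadata.get("grounding_chunks", []):
        uri = chunk.get("web", {}).get("uri")
        if uri and uri not in urls:
            urls.append(uri)
    return urls
</code></pre></div></div> <p>Each record in the output can then carry the URLs returned for the query that produced it.</p> <p>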
The model searches the web in real-time and grounds its response in actual documents.</p> <p>Instead of generating from training data, the model:</p> <ol> <li>Searches Google based on your query</li> <li>Retrieves relevant documents (news articles, press releases, filings)</li> <li>Synthesizes a response from those documents</li> <li>Returns the source URLs it used</li> </ol> <p>This creates an audit trail. Each data point traces back to a verifiable source.</p> <h2 id="example-tracking-hospital-ma-activity">Example: Tracking Hospital M&amp;A Activity</h2> <p>I wanted to build a dataset of hospital mergers and acquisitions for a given time period. Manually, this means searching news sites, reading SEC filings, cross-referencing dates—hours of work that needs repeating monthly.</p> <p>With grounded search, I define what I’m looking for once:</p> <p><strong>Query configuration:</strong></p> <blockquote> <p>Find Hospital Mergers and Acquisitions. Prioritize official press releases and SEC filings first, then industry publications like Becker’s Hospital Review and Modern Healthcare. Extract: acquirer name, target name, announcement date, deal status, and value.</p> </blockquote> <p><strong>Running a scan for Q1 2024:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Starting scan for: mergers (2024-01-01 to 2024-03-31)
Searching Google with Gemini...
Found 12 sources.
Extracting structured data...
Extracted 8 potential records.
Linking records to sources...
Validating records...
6 valid records.
2 records rejected.
</code></pre></div></div> <p><strong>What comes out:</strong></p> <table> <thead> <tr> <th>Acquirer</th> <th>Target</th> <th>Date</th> <th>Status</th> <th>Value</th> <th>Sources</th> </tr> </thead> <tbody> <tr> <td>Risant Health</td> <td>Cone Health</td> <td>2024-01-16</td> <td>Announced</td> <td>Undisclosed</td> <td>beckershospitalreview.com, fiercehealthcare.com</td> </tr> <tr> <td>Prime Healthcare</td> <td>St. Mary Medical Center</td> <td>2024-02-08</td> <td>Completed</td> <td>$350M</td> <td>sec.gov, modernhealthcare.com</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> <td>…</td> <td>…</td> <td>…</td> </tr> </tbody> </table> <p>Each record links back to the actual source URLs. The rejected records include specific reasons—maybe a date couldn’t be verified, or the source didn’t actually support the claim.</p> <h2 id="the-four-stage-pipeline">The Four-Stage Pipeline</h2> <h3 id="1-grounded-search">1. Grounded Search</h3> <p>Gemini searches the web with source priorities I specify. It returns both synthesized text and the URLs it drew from. This is where grounding does the heavy lifting—I get real-time web results, not stale training data.</p> <h3 id="2-structured-extraction">2. Structured Extraction</h3> <p>The raw search results are unstructured text. A second pass extracts structured records with consistent fields. Gemini’s structured output mode ensures valid data every time.</p> <h3 id="3-source-linking">3. Source Linking</h3> <p>Each extracted record gets matched to its source URLs. This creates the audit trail—anyone reviewing the data can click through to verify the original source.</p> <h3 id="4-validation">4. Validation</h3> <p>An LLM-as-judge validates each record against the source content. Does the extracted date match what the article says? Is the acquirer name spelled correctly? Records below the confidence threshold get flagged with specific issues.</p> <p>The output includes full HTML snapshots of each source page. 
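</p> <p>Capturing those snapshots takes only a few lines. A sketch, where the directory layout, file naming, and metadata fields are my own choices rather than the project’s:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib
import json
import pathlib
import time

def snapshot_source(url, html, out_dir="snapshots"):
    """Persist a fetched page verbatim, keyed by a content hash so that
    re-fetching an unchanged page deduplicates to the same file."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()[:16]
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / (digest + ".html")).write_text(html, encoding="utf-8")
    meta = {
        "url": url,
        "sha256_prefix": digest,
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (out / (digest + ".json")).write_text(json.dumps(meta), encoding="utf-8")
    return digest
</code></pre></div></div> <p>Validation can later re-open the captured HTML instead of re-fetching a URL that may have changed.</p> <p>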
URLs change and pages get removed—capturing the source at extraction time creates a permanent record.</p> <h2 id="why-grounding-matters-for-research">Why Grounding Matters for Research</h2> <h3 id="source-provenance">Source Provenance</h3> <p>Every data point traces to a URL. This transforms research from “I found this somewhere” to “here’s exactly where this came from.” For anything that might end up in a paper or presentation, this is essential.</p> <h3 id="temporal-awareness">Temporal Awareness</h3> <p>Grounding searches the live web. Unlike models with knowledge cutoffs, grounded searches find information published yesterday. For tracking ongoing events—M&amp;A activity, policy changes, market developments—this matters.</p> <h3 id="quality-signals">Quality Signals</h3> <p>The grounding metadata tells you where information came from. You can see if a claim comes from an SEC filing or a random blog post. Source credibility becomes visible.</p> <h3 id="verification-pipeline">Verification Pipeline</h3> <p>Because you have source URLs, you can build automated verification. 
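</p> <p>A cheap first pass can run before the LLM-as-judge call: check whether each extracted claim literally appears in the captured source text. The record schema and threshold below are illustrative, not the project’s actual code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class DealRecord:
    # Illustrative schema; the real pipeline's fields may differ.
    acquirer: str
    target: str
    announce_date: str  # ISO date string as extracted

def validate_against_source(record, source_text, threshold=0.75):
    """Flag records whose fields cannot be found verbatim in the source.
    Returns (passed, confidence, issues); failures carry specific reasons,
    mirroring the pipeline's rejected-record output."""
    lowered = source_text.lower()
    checks = {
        "acquirer": record.acquirer.lower() in lowered,
        "target": record.target.lower() in lowered,
        "announce_date": record.announce_date in source_text,
    }
    confidence = sum(checks.values()) / len(checks)
    issues = [name + " not found in source" for name, ok in checks.items() if not ok]
    return confidence >= threshold, confidence, issues
</code></pre></div></div> <p>In this sketch, only records that pass the string-level screen go on to the more expensive judge call.</p> <p>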
The validation stage cross-references extracted claims against actual source content and flags discrepancies.</p> <h2 id="cost-considerations">Cost Considerations</h2> <p>Each pipeline run makes multiple Gemini API calls:</p> <ul> <li>1 grounded search call</li> <li>1 extraction call</li> <li>1 linking call (batched)</li> <li>N validation calls (one per record)</li> </ul> <p><strong>Gemini 2.0 Flash pricing</strong> (as of early 2025):</p> <ul> <li>Input: ~$0.10 per million tokens</li> <li>Output: ~$0.40 per million tokens</li> <li>Search grounding adds ~$35 per 1,000 grounded queries</li> </ul> <p>For a typical scan returning 10 records with 12 sources, expect roughly:</p> <ul> <li>Search + extraction: ~5K tokens input, ~2K output</li> <li>Validation: ~2K tokens per record</li> <li><strong>Estimated cost per scan: $0.50–$2.00</strong></li> </ul> <p>I’m still instrumenting the pipeline to get precise per-run costs. The main cost driver is the grounding fee—at $0.035 per grounded query, search is the most expensive stage. For high-volume use cases, caching and batching strategies would help.</p> <p><em>Note: Pricing changes frequently. Check <a href="https://ai.google.dev/gemini-api/docs/pricing">Google’s pricing page</a> for current rates.</em></p> <h2 id="limitations">Limitations</h2> <p><strong>Grounding isn’t perfect.</strong> The sources returned are what Gemini used, not necessarily the best sources available. Specifying source priorities in your prompt helps, but the model might still miss authoritative sources.</p> <p><strong>Each stage adds error.</strong> Search might miss relevant results. Extraction might misparse text. Linking might match records to wrong sources. 
The validation stage catches many of these, but logging everything is essential for debugging.</p> <p><strong>Cost and latency.</strong> Grounded search adds latency compared to direct model queries—typically 2-5 seconds for the search to complete.</p> <h2 id="when-to-use-this-approach">When to Use This Approach</h2> <p><strong>Good fits:</strong></p> <ul> <li>Research requiring citations</li> <li>Tracking recent events (news, filings, announcements)</li> <li>Building datasets that need audit trails</li> <li>Domains where accuracy matters more than speed</li> </ul> <p><strong>Poor fits:</strong></p> <ul> <li>General knowledge questions (use base model)</li> <li>Historical data before ~2020 (grounding searches recent web)</li> <li>High-volume, low-stakes applications</li> <li>Real-time applications where latency matters</li> </ul> <h2 id="takeaway">Takeaway</h2> <p>Grounding transforms LLMs from black boxes into transparent research tools. The model still does the synthesis work, but now you can see its sources and verify its claims.</p> <p>For research intelligence—building datasets, tracking events, monitoring markets—this changes what’s possible. Instead of manually searching and compiling, you define what you’re looking for once and let the pipeline handle extraction and verification.</p> <p>The audit trail is the key. Every claim traces to a source. 
That’s what makes the output usable for real research.</p> <hr/> <p><strong>Resources:</strong></p> <ul> <li><a href="https://github.com/health-care-affordability-lab/infra_agent-search">Search Intelligence Framework</a> - The project described in this post</li> <li><a href="https://ai.google.dev/gemini-api/docs/grounding">Gemini Search Grounding</a></li> </ul>]]></content><author><name></name></author><category term="research"/><category term="agents"/><category term="data-engineering"/><category term="healthcare"/><summary type="html"><![CDATA[Using Google's search grounding to build verifiable, source-tracked data extraction pipelines]]></summary></entry><entry><title type="html">Multi-Agent Research Design in Practice</title><link href="/blog/2026/virtual-lab-health-economics/" rel="alternate" type="text/html" title="Multi-Agent Research Design in Practice"/><published>2026-01-23T18:00:00+00:00</published><updated>2026-01-23T18:00:00+00:00</updated><id>/blog/2026/virtual-lab-health-economics</id><content type="html" xml:base="/blog/2026/virtual-lab-health-economics/"><![CDATA[<p>I recently adapted the <a href="https://github.com/zou-group/virtual-lab">Virtual Lab</a> framework—originally developed for drug discovery and <a href="https://doi.org/10.1038/s41586-025-09442-9">published in Nature</a> by Swanson, Wu, Bulaong et al.—to study how hospitals respond to Medicare payment shocks.</p> <p>This post shows how multi-agent collaboration transforms simple research prompts into rigorous econometric designs, why I ultimately skipped fine-tuning for this project, and how this connects to AI-assisted coding tools like Claude Code.</p> <h2 id="from-simple-prompt-to-rigorous-research-design">From Simple Prompt to Rigorous Research Design</h2> <p><strong>Initial prompt to the agent team:</strong></p> <blockquote> <p>“Analyze changes in CMS Inpatient Prospective Payment System (IPPS) Impact Files. 
Identify quasi-experimental events related to Wage Index Cliffs and Border Discontinuities to understand how hospitals respond to payment shocks.”</p> </blockquote> <p><strong>What the agents produced after 5 rounds of discussion:</strong></p> <p><strong>Final Research Question:</strong></p> <blockquote> <p>How do hospitals respond to Wage Index shocks, what role do cost-shifting behaviors play, how does Market Concentration (HHI) moderate these responses, and what are the implications for health equity?</p> </blockquote> <p><strong>Final Econometric Specification:</strong></p> <ul> <li><strong>Design</strong>: Staggered-adoption intensity event study using Sun &amp; Abraham estimator</li> <li><strong>Treatment</strong>: Baseline-frozen OMB delineation changes mapped to hospital-level payment shocks</li> <li><strong>Moderation</strong>: HHI interaction terms to test “rent capture” hypothesis</li> <li><strong>Primary Outcome</strong>: Change in wage index wedge (Δwi_wedge) - the cleanest behavioral signal in Impact Files</li> <li><strong>Validity Tests</strong>: Pre-trend tests for both Shock and Shock×HHI, manipulation tests, support checks</li> <li><strong>Inference</strong>: Wild bootstrap clustering at CBSA level</li> </ul> <p>This evolution happened through structured debate among six specialized agents.</p> <h2 id="the-agent-team">The Agent Team</h2> <p>Each agent brought specific expertise:</p> <ol> <li><strong>Principal Investigator</strong>: Enforced focus on “Impact File-native” variables only</li> <li><strong>Health Economist</strong>: Designed difference-in-differences specifications</li> <li><strong>Industrial Economist</strong>: Analyzed market concentration effects, pushed for “rent capture” mechanism tests</li> <li><strong>Policy Analyst</strong>: Grounded analysis in CMS regulations and Federal Register documentation</li> <li><strong>Causal Inference Expert</strong>: Rejected standard TWFE under staggered adoption, insisted on Sun &amp; Abraham</li> 
<li><strong>Economics Critic</strong>: Flagged when claims exceeded empirical support</li> </ol> <h2 id="example-multi-agent-debate-on-methods">Example: Multi-Agent Debate on Methods</h2> <p><strong>Industrial Economist’s proposal:</strong></p> <blockquote> <p>“Use difference-in-differences and regression discontinuity designs, with propensity score matching.”</p> </blockquote> <p><strong>Economics Critic’s response:</strong></p> <blockquote> <p>“Caution against TWFE under staggered adoption—require Sun &amp; Abraham or Callaway-Sant’Anna with diagnostics. Pre-trend gates must be pre-specified for both Shock and Shock×HHI. Do not over-interpret payment-on-exposure regressions without first-stage validation.”</p> </blockquote> <p><strong>Causal Inference Expert’s synthesis:</strong></p> <blockquote> <p>“Main estimator should use Sun &amp; Abraham interaction-weighted event study. Define time-invariant, baseline-frozen treatment intensity for clean identification. Include formal pre-trend gates for both main effects and interactions. Show nonparametric HHI bins.”</p> </blockquote> <p><strong>Result:</strong> The team converged on modern causal inference methods, with explicit validity checks and robustness requirements that wouldn’t have emerged from a single agent or typical researcher workflow.</p> <h2 id="the-fine-tuning-pipeline-how-it-works">The Fine-Tuning Pipeline: How It Works</h2> <p>Virtual Lab includes a 7-phase pipeline to transform generic LLMs into domain specialists:</p> <h3 id="phase-1-2-literature-search-and-selection">Phase 1-2: Literature Search and Selection</h3> <p><strong>Agent receives task:</strong></p> <blockquote> <p>“Search PubMed for query ‘CMS IPPS impact file policy change quasi-experimental hospital behavior’, request 10 abstracts. Rigorously exclude papers about basic science, pure clinical trials without cost data, or non-US healthcare systems. 
Only select papers with direct relevance to US hospital economics, payment policy, or market structure.”</p> </blockquote> <p><strong>Example agent response (Industrial Economist):</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"pmcid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"9175183"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Adoption and Learning Across Hospitals: The Case of a Revenue-Generating Practice"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"reason"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Studies hospital response to 2008 Medicare documentation reform
               affecting revenue—directly relevant to understanding how hospitals
               react to payment incentives and inter-hospital learning dynamics."</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div> <p>The agent correctly filtered out conference abstracts and irrelevant papers, selecting only those studying US hospital behavioral responses to Medicare policy changes.</p> <h3 id="phase-3-4-summarization-and-training-data-generation">Phase 3-4: Summarization and Training Data Generation</h3> <p>Papers are downloaded and summarized with extreme detail. Here’s an excerpt from the Industrial Economist’s summary of Garmon (2017) on hospital merger screening:</p> <p><strong>Summary excerpt:</strong></p> <blockquote> <p>“The paper evaluates whether prospective merger ‘screens’ computed using pre-merger data actually predict post-merger price changes for 28 consummated hospital mergers. It compares traditional structural measures (HHI under different market definitions) versus ‘direct’ unilateral-effects measures (diversion ratios, WTP, UPP) and merger simulation… Key findings: Minimum ΔWTP threshold of 6% and UPP threshold of 4% outperform HHI-based screens. HRR-bed HHI produces many false negatives (flags only 4 of 9 significant price-increase mergers), confirming criticism that EH-like broad markets are too large…”</p> </blockquote> <p>The summary continues for 15+ pages, covering:</p> <ul> <li>Policy problem and fiscal implications</li> <li>Institutional backdrop (why 1990s enforcement failed)</li> <li>Technical details of each screening method (WTP, UPP, diversion ratios)</li> <li>Data construction and causal design (synthetic control)</li> <li>All empirical findings with specific numbers</li> <li>Operational recommendations for local agencies</li> </ul> <p>These detailed summaries are then converted to training examples in OpenAI’s format:</p> <p><strong>Training data structure:</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"system"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a Health Economist. Your expertise is in
                  fiscal impact analysis and hospital cost structures..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Explain the findings of 'Medicaid Disproportionate Share'
                  regarding hospital policy."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"## 1) What the document says UC DSH is *supposed* to do
                  vs. what it *became*

                  ### Original policy purpose...
                  The document frames Medicaid DSH as a supplemental payment
                  intended to help hospitals that 'serve a disproportionate
                  number of low-income patients'... [detailed response]"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>Each agent gets 50-200 training examples from their curated literature, creating an 80/20 train/validation split.</p> <h3 id="phase-5-7-fine-tuning-and-evaluation">Phase 5-7: Fine-Tuning and Evaluation</h3> <p>Training jobs are launched on <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code>. To evaluate whether fine-tuning improved performance, the system generates synthetic QA pairs from validation papers.</p> <p><strong>Example QA pair (from Neprash et al. 2015 on price measurement):</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"question"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is the main measurement problem Neprash et al. (2015) address?"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"correctAnswerIndex"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"options"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"How to construct valid, comparable commercial health care
     price measures across geographic markets"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"How to predict individual patient mortality risk using DRG codes"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"How to estimate hospital cost-to-charge ratios from Medicare cost reports"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"How to set optimal Medicaid capitation rates"</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Another example on methodology:</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"question"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is a major drawback of using a single 'sentinel service'
               price as a proxy for overall market prices?"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"correctAnswerIndex"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"options"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"Sentinel service prices may be unrepresentative of overall market
     pricing and correlate weakly with other services"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Sentinel service prices require identical service baskets across all markets"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Sentinel service prices cannot be computed from noncapitated claims"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Sentinel service prices automatically adjust for case-mix differences"</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>The system generates 3-5 multiple-choice questions per validation paper, testing whether models understand:</p> <ul> <li>Main research questions and policy problems</li> <li>Methodological details and their limitations</li> <li>Key empirical findings with specific numbers</li> <li>Institutional context and fiscal implications</li> </ul> <h2 id="evaluation-results-base-vs-fine-tuned-models">Evaluation Results: Base vs Fine-Tuned Models</h2> <p>The evaluation compared base <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code> against fine-tuned versions on the validation set QA pairs.</p> <p><strong>How agents answered domain questions:</strong></p> <p>When asked about hospital merger screening methods (from the validation set):</p> <p><strong>Base Model Response:</strong></p> <blockquote> <p>“Hospital merger screening typically uses HHI thresholds from the Horizontal Merger Guidelines. Post-merger HHI &gt; 2500 with Δ &gt; 200 raises concerns. More recent approaches incorporate diversion ratios…”</p> </blockquote> <p><strong>Fine-Tuned Industrial Economist:</strong></p> <blockquote> <p>“Garmon (2017) shows that minimum ΔWTP threshold of 6% and UPP threshold of 4% outperform HHI-based screens for predicting post-merger price changes. HRR-bed HHI produces false negatives—flagging only 4 of 9 significant price-increase mergers—confirming that EH-like broad markets are too large. 
The optimal screening pipeline should compute ΔWTP and UPP from discharge-based choice models, using conservative aggregation across models…”</p> </blockquote> <p>The fine-tuned response included:</p> <ul> <li>Specific thresholds (6%, 4%) from the literature</li> <li>Empirical validation (4 of 9 mergers)</li> <li>Methodological critique (EH-like markets too broad)</li> <li>Actionable recommendations (discharge-based choice models)</li> </ul> <p><strong>However, the results were mixed:</strong></p> <p>On the hold-out validation set, base <code class="language-plaintext highlighter-rouge">GPT-4o</code> models often performed as well or better than fine-tuned <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code> models. The base models’ superior general reasoning sometimes compensated for less domain-specific knowledge.</p> <h2 id="why-i-skipped-fine-tuning">Why I Skipped Fine-Tuning</h2> <p>Despite implementing the full pipeline, I ran the project with base <code class="language-plaintext highlighter-rouge">GPT-4o</code> models. Here’s why:</p> <p><strong>1. Base GPT-4o outperformed fine-tuned gpt-4o-mini</strong></p> <p>When evaluating on hold-out papers, base <code class="language-plaintext highlighter-rouge">GPT-4o</code> produced more balanced responses than fine-tuned <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code>. The base model’s broader reasoning capabilities and training data meant it already had strong health economics knowledge.</p> <p><strong>2. Fine-tuning is iterative and incomplete</strong></p> <p>The fine-tuning pipeline I implemented is a first pass. Improving performance would require:</p> <ul> <li>More training examples (currently 50-200 per agent)</li> <li>Better prompt engineering in training data</li> <li>Iterative refinement based on validation failures</li> <li>Potentially fine-tuning larger models (gpt-4o instead of gpt-4o-mini)</li> </ul> <p><strong>3. 
Cost-benefit for this research stage</strong></p> <p>For exploratory research design, the marginal benefit of fine-tuning didn’t justify:</p> <ul> <li>Training costs (~$3/M tokens)</li> <li>Higher inference costs for fine-tuned models</li> <li>Maintenance burden (retraining as literature evolves)</li> </ul> <h2 id="connecting-to-ai-assisted-coding">Connecting to AI-Assisted Coding</h2> <p>Virtual Lab and tools like Claude Code serve complementary roles:</p> <h3 id="virtual-lab-research-design">Virtual Lab: Research Design</h3> <p>Agents <strong>design</strong> the research through structured debate:</p> <p><strong>Output from longitudinal modeling phase:</strong></p> <blockquote> <p>“Adopt an OMB delineation adoption design as the anchor event. Use Sun &amp; Abraham staggered-adoption estimator with baseline-defined, time-invariant treatment intensity. Primary outcome: Δ(wi_wedge). Include HHI moderation via linear interactions and HHI-bin event studies. Exclude transition years by default unless modeling blend weights explicitly.”</p> </blockquote> <h3 id="claude-code-implementation">Claude Code: Implementation</h3> <p>Take that specification and <strong>implement</strong> it:</p> <p>Claude Code writes the implementation based on agent specifications. Agents provide <strong>what</strong> to build; Claude Code handles <strong>how</strong>.</p> <p>Getting the code to work took some back and forth, and along the way I explored expanding the time horizon.</p> <h3 id="ideal-workflow">Ideal Workflow</h3> <ol> <li><strong>Virtual Lab agents debate design</strong> → Produce specification</li> <li><strong>Feed spec to Claude Code</strong> → Generate implementation</li> <li><strong>Claude Code runs analysis</strong> → Produce results</li> <li><strong>Virtual Lab agents critique</strong> → Identify issues, suggest refinements</li> <li><strong>Claude Code updates code</strong> → Iterate to final analysis</li> </ol> <h2 id="key-lessons">Key Lessons</h2> <p><strong>1.
Start with base models, fine-tune only when necessary</strong></p> <p>Test strong prompts first. Fine-tune only for specific, measurable deficiencies (usually style/format, not knowledge).</p> <p><strong>2. Version control agent prompts</strong></p> <p>Agent system prompts are code. Use git, code review, changelogs, and A/B testing.</p> <p><strong>3. Validate agent output rigorously</strong></p> <p>Agents can be confidently wrong. Always:</p> <ul> <li>Cross-check citations (agents sometimes hallucinate papers)</li> <li>Verify regulatory references against the Federal Register</li> <li>Test generated code</li> <li>Run robustness checks on econometric strategies</li> </ul> <p><strong>4. Log everything</strong></p> <p>Virtual Lab saves all discussions as JSON. This audit trail is invaluable for understanding decisions and debugging recommendations.</p> <p><strong>5. Human-in-the-loop for final decisions</strong></p> <p>Let agents debate alternatives, but humans make final calls on:</p> <ul> <li>Research questions</li> <li>Data inclusion/exclusion</li> <li>Publication-ready claims</li> <li>Code deployment</li> </ul> <h2 id="looking-forward">Looking Forward</h2> <p>I want to take another pass at fine-tuning agents on the literature using this approach and then test it properly.</p> <p>As frontier models improve, the bottleneck shifts from “Can agents think at this level?” to “How do we structure agent collaboration effectively?”</p> <p>Virtual Lab provides one answer: assemble specialists, let them debate, critique iteratively.
I may also explore ways of folding something like this into a custom Claude Code workflow.</p> <hr/> <p><strong>Resources:</strong></p> <ul> <li><a href="https://github.com/zou-group/virtual-lab">Virtual Lab framework</a> (Zou Group)</li> <li><a href="https://github.com/daltonmaurice/dissc-agent-tooling">My presentation on agent tooling and MCP</a></li> <li>Hospital Impact Files project (example application to health economics)</li> </ul> <p><strong>Credits:</strong> Virtual Lab created by Swanson, K., Wu, W., Bulaong, N.L. et al., “The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies,” <em>Nature</em> (2025). <a href="https://doi.org/10.1038/s41586-025-09442-9">https://doi.org/10.1038/s41586-025-09442-9</a></p>]]></content><author><name></name></author><category term="research"/><category term="agents"/><category term="data-engineering"/><summary type="html"><![CDATA[How I used Virtual Lab to create fine-tuned AI agents that collaborate on Medicare payment policy analysis]]></summary></entry></feed>