<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml"/><link href="/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-20T02:41:12+00:00</updated><id>/feed.xml</id><title type="html">blank</title><subtitle>Applying machine learning and data engineering to answer questions in health economics. </subtitle><entry><title type="html">Building Research Intelligence Pipelines with Gemini Grounding</title><link href="/blog/2026/gemini-grounding-research-intelligence/" rel="alternate" type="text/html" title="Building Research Intelligence Pipelines with Gemini Grounding"/><published>2026-01-27T10:00:00+00:00</published><updated>2026-01-27T10:00:00+00:00</updated><id>/blog/2026/gemini-grounding-research-intelligence</id><content type="html" xml:base="/blog/2026/gemini-grounding-research-intelligence/"><![CDATA[<p>I built a research intelligence tool that extracts structured datasets from web sources using Google Gemini’s search grounding feature. The key insight: grounding transforms LLMs from “confident guessers” into “citation machines” that show their work.</p> <h2 id="the-problem-llms-hallucinate-data">The Problem: LLMs Hallucinate Data</h2> <p>Ask an LLM to list recent hospital mergers and you’ll get a confident response. But is it accurate? Are those dates real? Did that acquisition actually happen?</p> <p>Traditional LLM responses have no provenance. You can’t verify claims without manually searching each one. For research applications, “the model said so” isn’t a citation.</p> <h2 id="what-grounding-changes">What Grounding Changes</h2> <p>When Gemini uses Google Search grounding, every claim comes with source URLs. 
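</p> <p>To make the mechanics concrete, here is a minimal sketch of a grounded call and of pulling the source URLs back out. The call shape follows Google’s google-genai SDK documentation at the time of writing, and the metadata field names are assumptions based on the documented response shape; verify both against the current SDK.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: enabling Google Search grounding (google-genai SDK; the exact
# call shape may drift, so check current docs before relying on it):
#
#   from google import genai
#   from google.genai import types
#
#   client = genai.Client()
#   resp = client.models.generate_content(
#       model="gemini-2.0-flash",
#       contents="List hospital mergers announced in Q1 2024.",
#       config=types.GenerateContentConfig(
#           tools=[types.Tool(google_search=types.GoogleSearch())]))
#
# The grounded response carries metadata naming each web source used.
# A pure helper to collect those URLs from a metadata-shaped dict:

def extract_source_urls(grounding_metadata):
    """Collect de-duplicated source URLs from a grounding-metadata dict.
    Field names ('grounding_chunks', 'web', 'uri') are assumptions based
    on the documented response shape."""
    urls = []
    for chunk in grounding_metadata.get("grounding_chunks", []):
        uri = chunk.get("web", {}).get("uri")
        if uri and uri not in urls:
            urls.append(uri)
    return urls
</code></pre></div></div> <p>Each record in the output can then carry the URLs returned for the query that produced it.</p> <p>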
The model searches the web in real-time and grounds its response in actual documents.</p> <p>Instead of generating from training data, the model:</p> <ol> <li>Searches Google based on your query</li> <li>Retrieves relevant documents (news articles, press releases, filings)</li> <li>Synthesizes a response from those documents</li> <li>Returns the source URLs it used</li> </ol> <p>This creates an audit trail. Each data point traces back to a verifiable source.</p> <h2 id="example-tracking-hospital-ma-activity">Example: Tracking Hospital M&amp;A Activity</h2> <p>I wanted to build a dataset of hospital mergers and acquisitions for a given time period. Manually, this means searching news sites, reading SEC filings, cross-referencing dates—hours of work that needs repeating monthly.</p> <p>With grounded search, I define what I’m looking for once:</p> <p><strong>Query configuration:</strong></p> <blockquote> <p>Find Hospital Mergers and Acquisitions. Prioritize official press releases and SEC filings first, then industry publications like Becker’s Hospital Review and Modern Healthcare. Extract: acquirer name, target name, announcement date, deal status, and value.</p> </blockquote> <p><strong>Running a scan for Q1 2024:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Starting scan for: mergers (2024-01-01 to 2024-03-31)
Searching Google with Gemini...
Found 12 sources.
Extracting structured data...
Extracted 8 potential records.
Linking records to sources...
Validating records...
6 valid records.
2 records rejected.
</code></pre></div></div> <p><strong>What comes out:</strong></p> <table> <thead> <tr> <th>Acquirer</th> <th>Target</th> <th>Date</th> <th>Status</th> <th>Value</th> <th>Sources</th> </tr> </thead> <tbody> <tr> <td>Risant Health</td> <td>Cone Health</td> <td>2024-01-16</td> <td>Announced</td> <td>Undisclosed</td> <td>beckershospitalreview.com, fiercehealthcare.com</td> </tr> <tr> <td>Prime Healthcare</td> <td>St. Mary Medical Center</td> <td>2024-02-08</td> <td>Completed</td> <td>$350M</td> <td>sec.gov, modernhealthcare.com</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> <td>…</td> <td>…</td> <td>…</td> </tr> </tbody> </table> <p>Each record links back to the actual source URLs. The rejected records include specific reasons—maybe a date couldn’t be verified, or the source didn’t actually support the claim.</p> <h2 id="the-four-stage-pipeline">The Four-Stage Pipeline</h2> <h3 id="1-grounded-search">1. Grounded Search</h3> <p>Gemini searches the web with source priorities I specify. It returns both synthesized text and the URLs it drew from. This is where grounding does the heavy lifting—I get real-time web results, not stale training data.</p> <h3 id="2-structured-extraction">2. Structured Extraction</h3> <p>The raw search results are unstructured text. A second pass extracts structured records with consistent fields. Gemini’s structured output mode ensures valid data every time.</p> <h3 id="3-source-linking">3. Source Linking</h3> <p>Each extracted record gets matched to its source URLs. This creates the audit trail—anyone reviewing the data can click through to verify the original source.</p> <h3 id="4-validation">4. Validation</h3> <p>An LLM-as-judge validates each record against the source content. Does the extracted date match what the article says? Is the acquirer name spelled correctly? Records below the confidence threshold get flagged with specific issues.</p> <p>The output includes full HTML snapshots of each source page. 
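</p> <p>Capturing those snapshots takes only a few lines. A sketch, where the directory layout, file naming, and metadata fields are my own choices rather than the project’s:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib
import json
import pathlib
import time

def snapshot_source(url, html, out_dir="snapshots"):
    """Persist a fetched page verbatim, keyed by a content hash so that
    re-fetching an unchanged page deduplicates to the same file."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()[:16]
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / (digest + ".html")).write_text(html, encoding="utf-8")
    meta = {
        "url": url,
        "sha256_prefix": digest,
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (out / (digest + ".json")).write_text(json.dumps(meta), encoding="utf-8")
    return digest
</code></pre></div></div> <p>Validation can later re-open the captured HTML instead of re-fetching a URL that may have changed.</p> <p>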
URLs change and pages get removed—capturing the source at extraction time creates a permanent record.</p> <h2 id="why-grounding-matters-for-research">Why Grounding Matters for Research</h2> <h3 id="source-provenance">Source Provenance</h3> <p>Every data point traces to a URL. This transforms research from “I found this somewhere” to “here’s exactly where this came from.” For anything that might end up in a paper or presentation, this is essential.</p> <h3 id="temporal-awareness">Temporal Awareness</h3> <p>Grounding searches the live web. Unlike models with knowledge cutoffs, grounded searches find information published yesterday. For tracking ongoing events—M&amp;A activity, policy changes, market developments—this matters.</p> <h3 id="quality-signals">Quality Signals</h3> <p>The grounding metadata tells you where information came from. You can see if a claim comes from an SEC filing or a random blog post. Source credibility becomes visible.</p> <h3 id="verification-pipeline">Verification Pipeline</h3> <p>Because you have source URLs, you can build automated verification. 
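</p> <p>A cheap first pass can run before the LLM-as-judge call: check whether each extracted claim literally appears in the captured source text. The record schema and threshold below are illustrative, not the project’s actual code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class DealRecord:
    # Illustrative schema; the real pipeline's fields may differ.
    acquirer: str
    target: str
    announce_date: str  # ISO date string as extracted

def validate_against_source(record, source_text, threshold=0.75):
    """Flag records whose fields cannot be found verbatim in the source.
    Returns (passed, confidence, issues); failures carry specific reasons,
    mirroring the pipeline's rejected-record output."""
    lowered = source_text.lower()
    checks = {
        "acquirer": record.acquirer.lower() in lowered,
        "target": record.target.lower() in lowered,
        "announce_date": record.announce_date in source_text,
    }
    confidence = sum(checks.values()) / len(checks)
    issues = [name + " not found in source" for name, ok in checks.items() if not ok]
    return confidence >= threshold, confidence, issues
</code></pre></div></div> <p>In this sketch, only records that pass the string-level screen go on to the more expensive judge call.</p> <p>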
The validation stage cross-references extracted claims against actual source content and flags discrepancies.</p> <h2 id="cost-considerations">Cost Considerations</h2> <p>Each pipeline run makes multiple Gemini API calls:</p> <ul> <li>1 grounded search call</li> <li>1 extraction call</li> <li>1 linking call (batched)</li> <li>N validation calls (one per record)</li> </ul> <p><strong>Gemini 2.0 Flash pricing</strong> (as of early 2025):</p> <ul> <li>Input: ~$0.10 per million tokens</li> <li>Output: ~$0.40 per million tokens</li> <li>Search grounding adds ~$35 per 1,000 grounded queries</li> </ul> <p>For a typical scan returning 10 records with 12 sources, expect roughly:</p> <ul> <li>Search + extraction: ~5K tokens input, ~2K output</li> <li>Validation: ~2K tokens per record</li> <li><strong>Estimated cost per scan: $0.50–$2.00</strong></li> </ul> <p>I’m still instrumenting the pipeline to get precise per-run costs. The main cost driver is the grounding fee—at $0.035 per grounded query, search is the most expensive stage. For high-volume use cases, caching and batching strategies would help.</p> <p><em>Note: Pricing changes frequently. Check <a href="https://ai.google.dev/gemini-api/docs/pricing">Google’s pricing page</a> for current rates.</em></p> <h2 id="limitations">Limitations</h2> <p><strong>Grounding isn’t perfect.</strong> The sources returned are what Gemini used, not necessarily the best sources available. Specifying source priorities in your prompt helps, but the model might still miss authoritative sources.</p> <p><strong>Each stage adds error.</strong> Search might miss relevant results. Extraction might misparse text. Linking might match records to wrong sources. 
The validation stage catches many of these, but logging everything is essential for debugging.</p> <p><strong>Cost and latency.</strong> Grounded search adds latency compared to direct model queries—typically 2-5 seconds for the search to complete.</p> <h2 id="when-to-use-this-approach">When to Use This Approach</h2> <p><strong>Good fits:</strong></p> <ul> <li>Research requiring citations</li> <li>Tracking recent events (news, filings, announcements)</li> <li>Building datasets that need audit trails</li> <li>Domains where accuracy matters more than speed</li> </ul> <p><strong>Poor fits:</strong></p> <ul> <li>General knowledge questions (use base model)</li> <li>Historical data before ~2020 (grounding searches recent web)</li> <li>High-volume, low-stakes applications</li> <li>Real-time applications where latency matters</li> </ul> <h2 id="takeaway">Takeaway</h2> <p>Grounding transforms LLMs from black boxes into transparent research tools. The model still does the synthesis work, but now you can see its sources and verify its claims.</p> <p>For research intelligence—building datasets, tracking events, monitoring markets—this changes what’s possible. Instead of manually searching and compiling, you define what you’re looking for once and let the pipeline handle extraction and verification.</p> <p>The audit trail is the key. Every claim traces to a source. 
That’s what makes the output usable for real research.</p> <hr/> <p><strong>Resources:</strong></p> <ul> <li><a href="https://github.com/health-care-affordability-lab/infra_agent-search">Search Intelligence Framework</a> - The project described in this post</li> <li><a href="https://ai.google.dev/gemini-api/docs/grounding">Gemini Search Grounding</a></li> </ul>]]></content><author><name></name></author><category term="research"/><category term="agents"/><category term="data-engineering"/><category term="healthcare"/><summary type="html"><![CDATA[Using Google's search grounding to build verifiable, source-tracked data extraction pipelines]]></summary></entry><entry><title type="html">Multi-Agent Research Design in Practice</title><link href="/blog/2026/virtual-lab-health-economics/" rel="alternate" type="text/html" title="Multi-Agent Research Design in Practice"/><published>2026-01-23T18:00:00+00:00</published><updated>2026-01-23T18:00:00+00:00</updated><id>/blog/2026/virtual-lab-health-economics</id><content type="html" xml:base="/blog/2026/virtual-lab-health-economics/"><![CDATA[<p>I recently adapted the <a href="https://github.com/zou-group/virtual-lab">Virtual Lab</a> framework—originally developed for drug discovery and <a href="https://doi.org/10.1038/s41586-025-09442-9">published in Nature</a> by Swanson, Wu, Bulaong et al.—to study how hospitals respond to Medicare payment shocks.</p> <p>This post shows how multi-agent collaboration transforms simple research prompts into rigorous econometric designs, why I ultimately skipped fine-tuning for this project, and how this connects to AI-assisted coding tools like Claude Code.</p> <h2 id="from-simple-prompt-to-rigorous-research-design">From Simple Prompt to Rigorous Research Design</h2> <p><strong>Initial prompt to the agent team:</strong></p> <blockquote> <p>“Analyze changes in CMS Inpatient Prospective Payment System (IPPS) Impact Files. 
Identify quasi-experimental events related to Wage Index Cliffs and Border Discontinuities to understand how hospitals respond to payment shocks.”</p> </blockquote> <p><strong>What the agents produced after 5 rounds of discussion:</strong></p> <p><strong>Final Research Question:</strong></p> <blockquote> <p>How do hospitals respond to Wage Index shocks, what role do cost-shifting behaviors play, how does Market Concentration (HHI) moderate these responses, and what are the implications for health equity?</p> </blockquote> <p><strong>Final Econometric Specification:</strong></p> <ul> <li><strong>Design</strong>: Staggered-adoption intensity event study using Sun &amp; Abraham estimator</li> <li><strong>Treatment</strong>: Baseline-frozen OMB delineation changes mapped to hospital-level payment shocks</li> <li><strong>Moderation</strong>: HHI interaction terms to test “rent capture” hypothesis</li> <li><strong>Primary Outcome</strong>: Change in wage index wedge (Δwi_wedge) - the cleanest behavioral signal in Impact Files</li> <li><strong>Validity Tests</strong>: Pre-trend tests for both Shock and Shock×HHI, manipulation tests, support checks</li> <li><strong>Inference</strong>: Wild bootstrap clustering at CBSA level</li> </ul> <p>This evolution happened through structured debate among six specialized agents.</p> <h2 id="the-agent-team">The Agent Team</h2> <p>Each agent brought specific expertise:</p> <ol> <li><strong>Principal Investigator</strong>: Enforced focus on “Impact File-native” variables only</li> <li><strong>Health Economist</strong>: Designed difference-in-differences specifications</li> <li><strong>Industrial Economist</strong>: Analyzed market concentration effects, pushed for “rent capture” mechanism tests</li> <li><strong>Policy Analyst</strong>: Grounded analysis in CMS regulations and Federal Register documentation</li> <li><strong>Causal Inference Expert</strong>: Rejected standard TWFE under staggered adoption, insisted on Sun &amp; Abraham</li> 
<li><strong>Economics Critic</strong>: Flagged when claims exceeded empirical support</li> </ol> <h2 id="example-multi-agent-debate-on-methods">Example: Multi-Agent Debate on Methods</h2> <p><strong>Industrial Economist’s proposal:</strong></p> <blockquote> <p>“Use difference-in-differences and regression discontinuity designs, with propensity score matching.”</p> </blockquote> <p><strong>Economics Critic’s response:</strong></p> <blockquote> <p>“Caution against TWFE under staggered adoption—require Sun &amp; Abraham or Callaway-Sant’Anna with diagnostics. Pre-trend gates must be pre-specified for both Shock and Shock×HHI. Do not over-interpret payment-on-exposure regressions without first-stage validation.”</p> </blockquote> <p><strong>Causal Inference Expert’s synthesis:</strong></p> <blockquote> <p>“Main estimator should use Sun &amp; Abraham interaction-weighted event study. Define time-invariant, baseline-frozen treatment intensity for clean identification. Include formal pre-trend gates for both main effects and interactions. Show nonparametric HHI bins.”</p> </blockquote> <p><strong>Result:</strong> The team converged on modern causal inference methods, with explicit validity checks and robustness requirements that wouldn’t have emerged from a single agent or typical researcher workflow.</p> <h2 id="the-fine-tuning-pipeline-how-it-works">The Fine-Tuning Pipeline: How It Works</h2> <p>Virtual Lab includes a 7-phase pipeline to transform generic LLMs into domain specialists:</p> <h3 id="phase-1-2-literature-search-and-selection">Phase 1-2: Literature Search and Selection</h3> <p><strong>Agent receives task:</strong></p> <blockquote> <p>“Search PubMed for query ‘CMS IPPS impact file policy change quasi-experimental hospital behavior’, request 10 abstracts. Rigorously exclude papers about basic science, pure clinical trials without cost data, or non-US healthcare systems. 
Only select papers with direct relevance to US hospital economics, payment policy, or market structure.”</p> </blockquote> <p><strong>Example agent response (Industrial Economist):</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"pmcid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"9175183"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Adoption and Learning Across Hospitals: The Case of a Revenue-Generating Practice"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"reason"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Studies hospital response to 2008 Medicare documentation reform
               affecting revenue—directly relevant to understanding how hospitals
               react to payment incentives and inter-hospital learning dynamics."</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div> <p>The agent correctly filtered out conference abstracts and irrelevant papers, selecting only those studying US hospital behavioral responses to Medicare policy changes.</p> <h3 id="phase-3-4-summarization-and-training-data-generation">Phase 3-4: Summarization and Training Data Generation</h3> <p>Papers are downloaded and summarized with extreme detail. Here’s an excerpt from the Industrial Economist’s summary of Garmon (2017) on hospital merger screening:</p> <p><strong>Summary excerpt:</strong></p> <blockquote> <p>“The paper evaluates whether prospective merger ‘screens’ computed using pre-merger data actually predict post-merger price changes for 28 consummated hospital mergers. It compares traditional structural measures (HHI under different market definitions) versus ‘direct’ unilateral-effects measures (diversion ratios, WTP, UPP) and merger simulation… Key findings: Minimum ΔWTP threshold of 6% and UPP threshold of 4% outperform HHI-based screens. HRR-bed HHI produces many false negatives (flags only 4 of 9 significant price-increase mergers), confirming criticism that EH-like broad markets are too large…”</p> </blockquote> <p>The summary continues for 15+ pages, covering:</p> <ul> <li>Policy problem and fiscal implications</li> <li>Institutional backdrop (why 1990s enforcement failed)</li> <li>Technical details of each screening method (WTP, UPP, diversion ratios)</li> <li>Data construction and causal design (synthetic control)</li> <li>All empirical findings with specific numbers</li> <li>Operational recommendations for local agencies</li> </ul> <p>These detailed summaries are then converted to training examples in OpenAI’s format:</p> <p><strong>Training data structure:</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"system"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a Health Economist. Your expertise is in
                  fiscal impact analysis and hospital cost structures..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Explain the findings of 'Medicaid Disproportionate Share'
                  regarding hospital policy."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"## 1) What the document says UC DSH is *supposed* to do
                  vs. what it *became*

                  ### Original policy purpose...
                  The document frames Medicaid DSH as a supplemental payment
                  intended to help hospitals that 'serve a disproportionate
                  number of low-income patients'... [detailed response]"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>Each agent gets 50-200 training examples from their curated literature, creating an 80/20 train/validation split.</p> <h3 id="phase-5-7-fine-tuning-and-evaluation">Phase 5-7: Fine-Tuning and Evaluation</h3> <p>Training jobs are launched on <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code>. To evaluate whether fine-tuning improved performance, the system generates synthetic QA pairs from validation papers.</p> <p><strong>Example QA pair (from Neprash et al. 2015 on price measurement):</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"question"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is the main measurement problem Neprash et al. (2015) address?"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"correctAnswerIndex"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"options"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"How to construct valid, comparable commercial health care
     price measures across geographic markets"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"How to predict individual patient mortality risk using DRG codes"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"How to estimate hospital cost-to-charge ratios from Medicare cost reports"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"How to set optimal Medicaid capitation rates"</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Another example on methodology:</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"question"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is a major drawback of using a single 'sentinel service'
               price as a proxy for overall market prices?"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"correctAnswerIndex"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"options"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"Sentinel service prices may be unrepresentative of overall market
     pricing and correlate weakly with other services"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Sentinel service prices require identical service baskets across all markets"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Sentinel service prices cannot be computed from noncapitated claims"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"Sentinel service prices automatically adjust for case-mix differences"</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>The system generates 3-5 multiple-choice questions per validation paper, testing whether models understand:</p> <ul> <li>Main research questions and policy problems</li> <li>Methodological details and their limitations</li> <li>Key empirical findings with specific numbers</li> <li>Institutional context and fiscal implications</li> </ul> <h2 id="evaluation-results-base-vs-fine-tuned-models">Evaluation Results: Base vs Fine-Tuned Models</h2> <p>The evaluation compared base <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code> against fine-tuned versions on the validation set QA pairs.</p> <p><strong>How agents answered domain questions:</strong></p> <p>When asked about hospital merger screening methods (from the validation set):</p> <p><strong>Base Model Response:</strong></p> <blockquote> <p>“Hospital merger screening typically uses HHI thresholds from the Horizontal Merger Guidelines. Post-merger HHI &gt; 2500 with Δ &gt; 200 raises concerns. More recent approaches incorporate diversion ratios…”</p> </blockquote> <p><strong>Fine-Tuned Industrial Economist:</strong></p> <blockquote> <p>“Garmon (2017) shows that minimum ΔWTP threshold of 6% and UPP threshold of 4% outperform HHI-based screens for predicting post-merger price changes. HRR-bed HHI produces false negatives—flagging only 4 of 9 significant price-increase mergers—confirming that EH-like broad markets are too large. 
The optimal screening pipeline should compute ΔWTP and UPP from discharge-based choice models, using conservative aggregation across models…”</p> </blockquote> <p>The fine-tuned response included:</p> <ul> <li>Specific thresholds (6%, 4%) from the literature</li> <li>Empirical validation (4 of 9 mergers)</li> <li>Methodological critique (EH-like markets too broad)</li> <li>Actionable recommendations (discharge-based choice models)</li> </ul> <p><strong>However, the results were mixed:</strong></p> <p>On the hold-out validation set, base <code class="language-plaintext highlighter-rouge">GPT-4o</code> models often performed as well or better than fine-tuned <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code> models. The base models’ superior general reasoning sometimes compensated for less domain-specific knowledge.</p> <h2 id="why-i-skipped-fine-tuning">Why I Skipped Fine-Tuning</h2> <p>Despite implementing the full pipeline, I ran the project with base <code class="language-plaintext highlighter-rouge">GPT-4o</code> models. Here’s why:</p> <p><strong>1. Base GPT-4o outperformed fine-tuned gpt-4o-mini</strong></p> <p>When evaluating on hold-out papers, base <code class="language-plaintext highlighter-rouge">GPT-4o</code> produced more balanced responses than fine-tuned <code class="language-plaintext highlighter-rouge">gpt-4o-mini</code>. The base model’s broader reasoning capabilities and training data meant it already had strong health economics knowledge.</p> <p><strong>2. Fine-tuning is iterative and incomplete</strong></p> <p>The fine-tuning pipeline I implemented is a first pass. Improving performance would require:</p> <ul> <li>More training examples (currently 50-200 per agent)</li> <li>Better prompt engineering in training data</li> <li>Iterative refinement based on validation failures</li> <li>Potentially fine-tuning larger models (gpt-4o instead of gpt-4o-mini)</li> </ul> <p><strong>3. 
Cost-benefit for this research stage</strong></p> <p>For exploratory research design, the marginal benefit of fine-tuning didn’t justify:</p> <ul> <li>Training costs (~$3/M tokens)</li> <li>Higher inference costs for fine-tuned models</li> <li>Maintenance burden (retraining as literature evolves)</li> </ul> <h2 id="connecting-to-ai-assisted-coding">Connecting to AI-Assisted Coding</h2> <p>Virtual Lab and tools like Claude Code serve complementary roles:</p> <h3 id="virtual-lab-research-design">Virtual Lab: Research Design</h3> <p>Agents <strong>design</strong> the research through structured debate:</p> <p><strong>Output from longitudinal modeling phase:</strong></p> <blockquote> <p>“Adopt an OMB delineation adoption design as the anchor event. Use Sun &amp; Abraham staggered-adoption estimator with baseline-defined, time-invariant treatment intensity. Primary outcome: Δ(wi_wedge). Include HHI moderation via linear interactions and HHI-bin event studies. Exclude transition years by default unless modeling blend weights explicitly.”</p> </blockquote> <h3 id="claude-code-implementation">Claude Code: Implementation</h3> <p>Take that specification and <strong>implement</strong> it:</p> <p>Claude Code writes the implementation based on agent specifications. Agents provide <strong>what</strong> to build; Claude Code handles <strong>how</strong>.</p> <p>Getting the code to work took some back and forth, and along the way I explored expanding the time horizon.</p> <h3 id="ideal-workflow">Ideal Workflow</h3> <ol> <li><strong>Virtual Lab agents debate design</strong> → Produce specification</li> <li><strong>Feed spec to Claude Code</strong> → Generate implementation</li> <li><strong>Claude Code runs analysis</strong> → Produce results</li> <li><strong>Virtual Lab agents critique</strong> → Identify issues, suggest refinements</li> <li><strong>Claude Code updates code</strong> → Iterate to final analysis</li> </ol> <h2 id="key-lessons">Key Lessons</h2> <p><strong>1.
Start with base models, fine-tune only when necessary</strong></p> <p>Test strong prompts first. Fine-tune only for specific, measurable deficiencies (usually style/format, not knowledge).</p> <p><strong>2. Version control agent prompts</strong></p> <p>Agent system prompts are code. Use git, code review, changelogs, and A/B testing.</p> <p><strong>3. Validate agent output rigorously</strong></p> <p>Agents can be confidently wrong. Always:</p> <ul> <li>Cross-check citations (agents sometimes hallucinate papers)</li> <li>Verify regulatory references against the Federal Register</li> <li>Test generated code</li> <li>Run robustness checks on econometric strategies</li> </ul> <p><strong>4. Log everything</strong></p> <p>Virtual Lab saves all discussions as JSON. This audit trail is invaluable for understanding decisions and debugging recommendations.</p> <p><strong>5. Human-in-the-loop for final decisions</strong></p> <p>Let agents debate alternatives, but humans make final calls on:</p> <ul> <li>Research questions</li> <li>Data inclusion/exclusion</li> <li>Publication-ready claims</li> <li>Code deployment</li> </ul> <h2 id="looking-forward">Looking Forward</h2> <p>I want to take another pass at fine-tuning agents on the literature using this approach and then test it properly.</p> <p>As frontier models improve, the bottleneck shifts from “Can agents think at this level?” to “How do we structure agent collaboration effectively?”</p> <p>Virtual Lab provides one answer: assemble specialists, let them debate, critique iteratively.
I may also explore ways of folding something like this into a custom Claude Code workflow.</p> <hr/> <p><strong>Resources:</strong></p> <ul> <li><a href="https://github.com/zou-group/virtual-lab">Virtual Lab framework</a> (Zou Group)</li> <li><a href="https://github.com/daltonmaurice/dissc-agent-tooling">My presentation on agent tooling and MCP</a></li> <li>Hospital Impact Files project (example application to health economics)</li> </ul> <p><strong>Credits:</strong> Virtual Lab created by Swanson, K., Wu, W., Bulaong, N.L. et al., “The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies,” <em>Nature</em> (2025). <a href="https://doi.org/10.1038/s41586-025-09442-9">https://doi.org/10.1038/s41586-025-09442-9</a></p>]]></content><author><name></name></author><category term="research"/><category term="agents"/><category term="data-engineering"/><summary type="html"><![CDATA[How I used Virtual Lab to create fine-tuned AI agents that collaborate on Medicare payment policy analysis]]></summary></entry></feed>