Automating Medicare Claims Processing

Automated pipeline for processing Medicare claims data

Ingesting CMS data previously relied on SAS scripts that needed manual adjustments. I modernized the stack by deploying Apache Airflow to automate ingestion and transformation. What used to be a fragile, manual yearly process is now a reliable pipeline handling multi-terabyte datasets.

Technical Stack

Apache Spark: Distributed data transformation and validation
Apache Airflow: Workflow orchestration and scheduling
Python: Pipeline logic and Airflow DAG definitions
Parquet/Arrow: Columnar storage for fast analytics

What It Does

Ingests raw CMS Medicare RIF files
Applies consistent cleaning and validation rules
Outputs analysis-ready datasets partitioned by year
Runs automatically when new data arrives
Reads metadata files automatically to generate schemas for Spark
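
The schema-generation step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the metadata has already been reduced to a simple CSV listing with hypothetical `name` and `type` columns, and the type labels (CHAR, NUM, DATE), the mapping, and the function name `schema_ddl_from_metadata` are all assumptions made for the example. The real CMS metadata layout is different and would need its own parser.

```python
import csv
import io

# Hypothetical mapping from metadata type labels to Spark SQL types.
# The labels (CHAR, NUM, DATE) are assumptions, not the real CMS layout.
SPARK_TYPES = {"CHAR": "STRING", "NUM": "DOUBLE", "DATE": "DATE"}


def schema_ddl_from_metadata(metadata_text: str) -> str:
    """Build a Spark DDL schema string from a CSV column listing."""
    reader = csv.DictReader(io.StringIO(metadata_text))
    fields = []
    for row in reader:
        name = row["name"].strip()
        type_label = row["type"].strip().upper()
        # Fail loudly on unknown types instead of guessing a default.
        if type_label not in SPARK_TYPES:
            raise ValueError(f"Unknown type {type_label!r} for column {name!r}")
        fields.append(f"{name} {SPARK_TYPES[type_label]}")
    return ", ".join(fields)


# Example metadata; the column names are illustrative only.
EXAMPLE_METADATA = """name,type
BENE_ID,CHAR
CLM_FROM_DT,DATE
CLM_PMT_AMT,NUM
"""

if __name__ == "__main__":
    print(schema_ddl_from_metadata(EXAMPLE_METADATA))
```

In PySpark, a DDL string like this can be passed to `spark.read.schema(...)` before loading a file, so column types come from the metadata rather than from inference over the data.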

Resources

NBER Medicare Public Data - Public-use Medicare files and a knowledge base for users of this data