Ingestion Framework for Heterogeneous Research Data

A scalable, modular framework for ingesting research data from heterogeneous sources. It handles messy file formats, inconsistent schemas, and legacy codebases, and produces clean, standardized outputs ready for analysis.

Built for HPC environments, with PySpark for distributed processing. Designed to be extended: add new data sources without touching existing pipelines.
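
One common way to get this kind of extensibility is a source registry: each data source is a small class that registers itself under a name, and the ingestion entry point looks sources up by that name. The sketch below is illustrative only and is not taken from this project. The names `Source`, `register_source`, `ingest`, and `CsvLinesSource` are hypothetical, and it uses plain Python rather than PySpark to stay self-contained.

```python
import csv
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Type

# Hypothetical registry mapping a source name to its reader class.
_SOURCES: Dict[str, Type["Source"]] = {}


class Source(ABC):
    """Hypothetical base class: each source yields standardized records."""

    @abstractmethod
    def read(self, path: str) -> Iterable[dict]:
        ...


def register_source(name: str) -> Callable[[Type[Source]], Type[Source]]:
    """Class decorator that adds a Source subclass to the registry."""
    def decorator(cls: Type[Source]) -> Type[Source]:
        if name in _SOURCES:
            raise ValueError(f"source {name!r} is already registered")
        _SOURCES[name] = cls
        return cls
    return decorator


def ingest(name: str, path: str) -> List[dict]:
    """Look up a registered source by name and collect its records."""
    if name not in _SOURCES:
        raise KeyError(f"unknown source {name!r}")
    return list(_SOURCES[name]().read(path))


# Adding a new source means adding a new class; ingest() stays unchanged.
@register_source("csv_lines")
class CsvLinesSource(Source):
    def read(self, path: str) -> Iterable[dict]:
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                # Normalize header names and trim whitespace from values.
                yield {k.strip().lower(): (v or "").strip() for k, v in row.items()}
```

Under this assumed design, supporting another format would mean writing one more decorated class; existing sources and the `ingest` function are not modified.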