Project
Earth-Science Data Tooling — Packaged Scientific Python
Packaged scientific-Python tooling behind my NASA and UMBC atmospheric research — crawlers, readers, and regridders that turn raw satellite and observatory data into analysis-ready netCDF.
Atmospheric research produces two kinds of code: the notebook you run once to make a figure, and the tooling everything else depends on. This is the second kind — the packaged layer under my NASA and UMBC remote-sensing work that got raw satellite and observatory data into a shape the science could use.
Context
Across two NASA Goddard internships and research at UMBC, the bottleneck was never the analysis — it was the data plumbing. Satellite and ground-station data arrive in awkward formats (HDF4, fixed-width ASCII, vendor text dumps) at sizes that punish naive code. A cluster of focused tools grew up to handle ingestion, conversion, regridding, and visualization.
The problem
Scientific data formats are their own discipline: MODIS products as HDF4, reanalysis and soundings as netCDF or raw ASCII, observatory records as idiosyncratic fixed-width text. Before any analysis, all of it has to be parsed, put on common grids, and stored in a form that reads back fast. At NASA data volumes, “read it all into memory and loop” isn’t an option.
Approach
The work is split across small packages, each doing one job:
- cosmic_crunch: crawls the JPL COSMIC archive and converts ASCII soundings to netCDF4, with an argparse CLI and --processes multiprocessing.
- NOAA observatory readers: packaged readers that parse NOAA ESRL/GMD station data into pandas DataFrames.
- MODIS / CALIPSO anomaly analysis: the analysis layer for a cloud-opacity discrepancy between the MODIS and CALIOP instruments.
- Dark Target ingestion + regridding (private): a toolkit around NASA’s Dark Target aerosol retrieval, covering AERONET/VIIRS ingestion (ASCII → netCDF4), congrid-style resampling, and optical-depth processing on pyhdf/HDF4. The substance is the ingestion and regridding, not a from-scratch reimplementation of the retrieval. Plus a small, unit-tested netCDF recompression tool.
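To make the cosmic_crunch pattern concrete (an argparse CLI that fans per-file conversion out over --processes workers), here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the two-column sounding layout, and the stubbed conversion step are hypothetical and not taken from the package. A real worker would write netCDF4 output where the comment indicates.

```python
import argparse
from multiprocessing import Pool


def parse_sounding(text):
    """Parse a whitespace-delimited sounding into columns of floats.

    Assumed layout (hypothetical): '#' comment lines, then rows of
    'pressure_hPa temperature_K'. The real COSMIC format differs.
    """
    pressure, temperature = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        p, t = line.split()[:2]
        pressure.append(float(p))
        temperature.append(float(t))
    return {"pressure_hPa": pressure, "temperature_K": temperature}


def convert_file(path):
    """Worker: read one ASCII file and return (path, number of levels).

    A real worker would write the parsed columns to a netCDF4 file here.
    """
    with open(path, encoding="ascii") as handle:
        profile = parse_sounding(handle.read())
    return path, len(profile["pressure_hPa"])


def build_parser():
    parser = argparse.ArgumentParser(description="Convert ASCII soundings.")
    parser.add_argument("files", nargs="+", help="ASCII sounding files")
    parser.add_argument("--processes", type=int, default=1,
                        help="number of worker processes")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.processes > 1:
        # Files are independent, so a plain pool.map is enough.
        with Pool(processes=args.processes) as pool:
            results = pool.map(convert_file, args.files)
    else:
        # Serial path: easier to debug a single bad file.
        results = [convert_file(path) for path in args.files]
    for path, n_levels in results:
        print(f"{path}: {n_levels} levels")
    return results
```

Calling main(["a.txt", "b.txt", "--processes", "4"]) would convert the two files with a four-worker pool. When wiring this up as a script, the call to main() belongs under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.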
Common thread: pyhdf (HDF4) and netCDF4 for the formats, NumPy/SciPy/pandas for the work, cartopy/Matplotlib for the maps, and real packaging so the next person could install and run.
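For readers unfamiliar with congrid-style resampling (the regridding step named in the Dark Target item), the idea is to resize a gridded field to an arbitrary new shape. The NumPy sketch below shows two common variants, nearest-neighbour lookup and block averaging. The function names are hypothetical and this is not the private toolkit's code; it also ignores fill values, which real retrievals would need masked first.

```python
import numpy as np


def congrid_nearest(data, new_shape):
    """Resample a 2-D array to new_shape by nearest-neighbour lookup."""
    data = np.asarray(data)
    if data.ndim != 2:
        raise ValueError("expected a 2-D array")
    rows, cols = new_shape
    # Map each output cell centre back to the source cell that contains it.
    row_idx = np.floor((np.arange(rows) + 0.5) * data.shape[0] / rows).astype(int)
    col_idx = np.floor((np.arange(cols) + 0.5) * data.shape[1] / cols).astype(int)
    return data[np.ix_(row_idx, col_idx)]


def block_mean(data, factor):
    """Coarsen a 2-D array by averaging factor x factor blocks."""
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape
    if n_rows % factor or n_cols % factor:
        raise ValueError("array shape must be divisible by factor")
    # Split each axis into (coarse cell, position within cell), then
    # average over the two within-cell axes.
    blocks = data.reshape(n_rows // factor, factor, n_cols // factor, factor)
    return blocks.mean(axis=(1, 3))
```

Nearest-neighbour lookup preserves original values and works for any target shape, which suits categorical fields such as quality flags. Block averaging is the more natural choice when coarsening a continuous quantity such as optical depth, at the cost of requiring an integer coarsening factor.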
At HPC scale
The same MODIS data also drove the distributed-computing side of my UMBC CyberTraining work. I aggregated MODIS products with Apache Spark on a SLURM-scheduled HPC cluster and ran the job as a scaling study: the same aggregation, serial and then parallel across a growing node count, to find where the parallelism actually paid off. The machine-learning coursework alongside it covered distributed supervised learning (Random Forest, logistic regression, and SVM, with feature pipelines and cross-validated tuning on Spark MLlib) and deep learning (training and validating a Keras network). That coursework is the formal-program groundwork under the ML work I do today.
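A scaling study of this kind usually comes down to two numbers per run: speedup relative to the serial baseline and parallel efficiency (speedup divided by node count). The sketch below shows that arithmetic. The function names and the 0.7 efficiency threshold are assumptions for illustration, and no measured timings from the actual study are reproduced here.

```python
def scaling_table(timings):
    """Return (nodes, speedup, efficiency) rows from wall-clock timings.

    timings maps node count -> seconds and must include a 1-node run,
    which is treated as the serial baseline (an assumption of this sketch).
    """
    if 1 not in timings:
        raise ValueError("timings must include a 1-node baseline run")
    baseline = timings[1]
    rows = []
    for nodes in sorted(timings):
        speedup = baseline / timings[nodes]
        efficiency = speedup / nodes  # 1.0 means perfect linear scaling
        rows.append((nodes, speedup, efficiency))
    return rows


def last_efficient_node_count(rows, threshold=0.7):
    """Largest node count whose parallel efficiency is still >= threshold."""
    efficient = [nodes for nodes, _, eff in rows if eff >= threshold]
    return max(efficient) if efficient else None
```

Feeding in one timing per node count gives a table where the efficiency column shows the point at which adding nodes stops paying off, which is the question the study was asking.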
Outcome
The public pieces are on GitHub under AGPL-3.0; some research-specific tooling stays private. Either way, it’s the demonstrable engineering behind “did atmospheric research” — scientific Python at package quality, HDF4/netCDF fluency, satellite remote sensing, and crawl-to-netCDF ETL.