Earth-Science Data Tooling — Packaged Scientific Python

Atmospheric research produces two kinds of code: the notebook you run once to make a figure, and the tooling everything else depends on. This is the second kind — the packaged layer under my NASA and UMBC remote-sensing work that got raw satellite and observatory data into a shape the science could use.

Context

Across two NASA Goddard internships and research at UMBC, the bottleneck was never the analysis — it was the data plumbing. Satellite and ground-station data arrive in awkward formats (HDF4, fixed-width ASCII, vendor text dumps) at sizes that punish naive code. A cluster of focused tools grew up to handle ingestion, conversion, regridding, and visualization.

The problem

Scientific data formats are their own discipline: MODIS products as HDF4, reanalysis and soundings as netCDF or raw ASCII, observatory records as idiosyncratic fixed-width text. Before any analysis, all of it has to be parsed, put on common grids, and stored to read back fast — and at NASA data volumes, “read it all into memory and loop” isn’t an option.
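To make the “read it all into memory and loop” point concrete, here is a minimal sketch of the alternative: stream a fixed-width text file one record at a time and reduce as you go, so memory use stays flat regardless of file size. The column layout, the Record fields, and the sample rows are all made up for illustration; they are not taken from any real product or from the packages described here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Record(NamedTuple):
    station: str
    year: int
    value: float


# Hypothetical fixed-width layout (an assumption for this sketch):
# columns 0-3 station code, 4-8 year, 9-17 measured value.
FIELDS = {"station": slice(0, 4), "year": slice(4, 9), "value": slice(9, 18)}


def iter_records(handle: TextIO) -> Iterator[Record]:
    """Yield one parsed record at a time instead of loading the whole file."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        yield Record(
            station=line[FIELDS["station"]].strip(),
            year=int(line[FIELDS["year"]]),
            value=float(line[FIELDS["value"]]),
        )


def running_mean(records: Iterator[Record]) -> float:
    """Reduce a stream of records to a mean in constant memory."""
    total = 0.0
    count = 0
    for rec in records:
        total += rec.value
        count += 1
    if count == 0:
        raise ValueError("no records to average")
    return total / count


# Synthetic rows, not real observations.
sample = io.StringIO(
    "# station year value\n"
    "TST  2001     1.50\n"
    "TST  2002     2.50\n"
)
mean_value = running_mean(iter_records(sample))
```

The same generator works unchanged on an open file handle, which is the point: the reduction never needs more than one line in memory at a time.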

Approach

The work is split across small packages, each doing one job:

cosmic_crunch: crawls the JPL COSMIC archive and converts ASCII soundings to netCDF4, with an argparse CLI and a --processes option for multiprocessing.
NOAA observatory readers: packaged readers that parse NOAA ESRL/GMD station data into pandas DataFrames.
MODIS / CALIPSO anomaly analysis: the analysis layer for a cloud-opacity discrepancy between the MODIS and CALIOP instruments.
Dark Target ingestion + regridding (private): a toolkit around NASA’s Dark Target aerosol retrieval, covering AERONET/VIIRS ingestion (ASCII → netCDF4), congrid-style resampling, and optical-depth processing on HDF4 files via pyhdf. The substance is the ingestion and regridding, not a from-scratch reimplementation of the retrieval. Alongside it sits a small, unit-tested netCDF recompression tool.
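The “argparse CLI with --processes” shape mentioned for cosmic_crunch can be sketched with the standard library alone. This is an illustration of the pattern, not the package's actual code: the positional argument names, the .txt input extension, and the convert_one worker are all assumptions, and the worker is a placeholder that only computes the output path where a real one would parse the sounding and write netCDF4.

```python
import argparse
from multiprocessing import Pool
from pathlib import Path
from typing import Iterable, List, Optional, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Convert ASCII soundings to netCDF (illustrative sketch)."
    )
    parser.add_argument("input_dir", type=Path, help="directory of ASCII files")
    parser.add_argument("output_dir", type=Path, help="directory for outputs")
    parser.add_argument(
        "--processes",
        type=int,
        default=1,
        help="number of worker processes (1 runs serially)",
    )
    return parser


def convert_one(job: Tuple[Path, Path]) -> Path:
    """Placeholder worker: returns the output path for one input file.

    A real worker would parse `src` and write `dst` as netCDF4 here.
    """
    src, dst_dir = job
    return dst_dir / (src.stem + ".nc")


def convert_all(
    sources: Iterable[Path], output_dir: Path, processes: int
) -> List[Path]:
    """Run the worker over every source, serially or in a process pool."""
    jobs = [(src, output_dir) for src in sorted(sources)]
    if processes <= 1:
        return [convert_one(job) for job in jobs]
    with Pool(processes=processes) as pool:
        return pool.map(convert_one, jobs)


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    sources = args.input_dir.glob("*.txt")  # assumed input extension
    for out in convert_all(sources, args.output_dir, args.processes):
        print(out)
    return 0
```

Called as main(["soundings", "out", "--processes", "4"]), this fans the per-file work out over four processes. Keeping the worker a top-level function that takes one picklable argument is what lets Pool.map dispatch it; when run as a script on platforms that spawn workers, the call to main() belongs under an `if __name__ == "__main__":` guard.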

Common thread: pyhdf (HDF4) and netCDF4 for the formats, NumPy/SciPy/pandas for the work, cartopy/Matplotlib for the maps, and real packaging so the next person could install and run.
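As an illustration of what “congrid-style resampling” means in NumPy terms, here is a minimal nearest-neighbour sketch that resamples an array to an arbitrary new shape. It is not the private toolkit's implementation. The function name congrid_nearest and the cell-centre index mapping are assumptions made for this example, and IDL's CONGRID uses its own conventions and also offers interpolation, which this sketch omits.

```python
from typing import Tuple

import numpy as np


def congrid_nearest(data: np.ndarray, new_shape: Tuple[int, ...]) -> np.ndarray:
    """Resample `data` to `new_shape` by nearest-neighbour lookup."""
    if data.ndim != len(new_shape):
        raise ValueError("new_shape must have one entry per array axis")
    if any(n <= 0 for n in new_shape):
        raise ValueError("new_shape entries must be positive")

    indices = []
    for old, new in zip(data.shape, new_shape):
        # Map the centre of each output cell back onto the input axis
        # and take the input cell that contains it.
        idx = np.floor((np.arange(new) + 0.5) * old / new).astype(int)
        indices.append(np.clip(idx, 0, old - 1))

    # np.ix_ builds an open mesh so each axis is indexed independently.
    return data[np.ix_(*indices)]


grid = np.arange(16).reshape(4, 4)
coarse = congrid_nearest(grid, (2, 2))
fine = congrid_nearest(coarse, (4, 4))
```

Nearest-neighbour lookup preserves the original values, which matters for categorical or flag fields; for continuous fields, an interpolating or area-averaging regrid is usually the better choice.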

At HPC scale

The same MODIS data also drove the distributed-computing side of my UMBC CyberTraining work. I aggregated MODIS products with Apache Spark on a SLURM-scheduled HPC cluster and ran the aggregation as a scaling study (the same job across a growing node count, serial versus parallel) to find where the parallelism actually paid off. The machine-learning coursework alongside it covered both distributed supervised learning (Random Forest, logistic regression, and SVM with feature pipelines and cross-validated tuning on Spark MLlib) and deep learning (training and validating a Keras network). That coursework is the formal groundwork for the ML I build on today.
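The bookkeeping behind a scaling study like this is small enough to sketch. Given wall-clock times for the same job at several node counts, speedup is the baseline time divided by the time at n nodes, and parallel efficiency is that speedup divided by n. The function below assumes the one-node run serves as the serial baseline, and the timings are placeholder numbers chosen for easy arithmetic, not measurements from the study.

```python
from typing import Dict, List, NamedTuple


class ScalingPoint(NamedTuple):
    nodes: int
    seconds: float
    speedup: float
    efficiency: float


def scaling_table(timings: Dict[int, float]) -> List[ScalingPoint]:
    """Compute speedup and efficiency relative to the one-node run."""
    if 1 not in timings:
        raise ValueError("need a one-node timing as the serial baseline")
    baseline = timings[1]
    rows = []
    for nodes in sorted(timings):
        seconds = timings[nodes]
        speedup = baseline / seconds
        rows.append(ScalingPoint(nodes, seconds, speedup, speedup / nodes))
    return rows


# Placeholder wall-clock times in seconds (not real measurements).
placeholder_timings = {1: 120.0, 2: 60.0, 4: 40.0, 8: 30.0}
table = scaling_table(placeholder_timings)
```

Reading the efficiency column is how “where the parallelism actually paid off” becomes a number: with these placeholder timings it holds at 1.0 through two nodes, drops to 0.75 at four, and to 0.5 at eight, which is the kind of knee a scaling study is meant to locate.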

Outcome

The public pieces are on GitHub under AGPL-3.0; some research-specific tooling stays private. Either way, it’s the demonstrable engineering behind “did atmospheric research” — scientific Python at package quality, HDF4/netCDF fluency, satellite remote sensing, and crawl-to-netCDF ETL.