
Data

Summary statistics:

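| Dataset Partition | Trajectory Rows | Fuel Segments | Flights | File Size |
|-------------------|-----------------|---------------|---------|-----------|
| Phase 1 (Train)   | 124,094,050     | 133,984       | 11,088  | 3.2 GB    |
| Phase 1 (Rank)    | 24,499,924      | 24,972        | 1,929   | 616 MB    |
| Phase 2 (Rank)    | 37,877,494      | 61,745        | 2,839   | 943 MB    |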

First-party data

Schema:

The distributions of aircraft type and segment length are heavy-tailed.

image

A visualisation of the fuel burn in a simple altitude/speed plot.

image

A visualisation of the preprocessed trajectory features. Notice that state vectors are irregularly sampled, often with significant time gaps.
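To make the irregular sampling concrete, here is a minimal sketch of how large time gaps could be flagged in one trajectory. The function name `find_time_gaps`, the 60-second threshold, and the assumption that timestamps are sorted and given in seconds are all illustrative and not taken from the microfuel code.

```python
from typing import List, Tuple


def find_time_gaps(
    timestamps: List[float], max_gap_s: float = 60.0
) -> List[Tuple[int, float]]:
    """Return (index, gap_seconds) for each sample that follows a gap > max_gap_s.

    Assumes `timestamps` is sorted ascending and expressed in seconds.
    The 60 s default is an arbitrary illustrative threshold.
    """
    gaps = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt > max_gap_s:
            gaps.append((i, dt))
    return gaps


# Hypothetical state-vector timestamps with two large gaps.
timestamps = [0.0, 1.0, 2.0, 95.0, 96.0, 400.0]
gaps = find_time_gaps(timestamps)
print(gaps)  # [(3, 93.0), (5, 304.0)]
```

A check like this can help decide whether a segment has enough coverage to be used, or whether features spanning a gap should be treated with caution.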

image

Weather Data

Note

Weather data is unused in the v0.1 models. Future versions (v0.2 onwards) will allow optionally specifying the wind component for more accurate predictions.

We augment the trajectory data with u and v wind components extracted from the ARCO ERA5 dataset. This requires installing microfuel with the era5 optional dependency.
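As a reminder of what the two components mean, the sketch below converts u (eastward) and v (northward) wind into speed, meteorological direction, and the component along a ground track. The helper names and the idea of projecting onto the track are illustrative assumptions; the source does not say how the models will consume the wind.

```python
import math
from typing import Tuple


def wind_speed_and_direction(u: float, v: float) -> Tuple[float, float]:
    """Convert eastward (u) and northward (v) wind to speed and direction.

    Direction follows the meteorological convention: the bearing the wind
    blows FROM, in degrees clockwise from north.
    """
    speed = math.hypot(u, v)
    direction_deg = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction_deg


def along_track_wind(u: float, v: float, track_deg: float) -> float:
    """Project the wind onto the ground track (positive means tailwind).

    `track_deg` is the direction of travel in degrees clockwise from north,
    so the unit vector along the track is (sin(track), cos(track)).
    """
    t = math.radians(track_deg)
    return u * math.sin(t) + v * math.cos(t)


# A 10 m/s westerly (blowing from west to east).
speed, direction = wind_speed_and_direction(10.0, 0.0)
print(speed, direction)                    # 10 m/s from bearing 270
print(along_track_wind(10.0, 0.0, 90.0))   # eastbound flight: tailwind
print(along_track_wind(10.0, 0.0, 270.0))  # westbound flight: headwind
```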

  1. The weather data is massive (~565 GB). It is recommended to use an external HDD and symlink it to data/raw/era5:

    mkdir -p /mnt/hdd/microfuel_era5
    ln -s /mnt/hdd/microfuel_era5 data/raw/era5
    
  2. Install the gcloud CLI and run the following to pull specific pressure level slices in NetCDF format.

    uv run scripts/main.py download-era5
    
    • Months: 2025-04..=2025-10
    • Variables: u_component_of_wind, v_component_of_wind
    • Levels: 28 levels (1000..=70 hPa)
  3. We interpolate the 4D weather grid (time, level, lat, lon) onto the 4D flight trajectory coordinates.

    uv run scripts/main.py create-era5 --partition phase1
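The interpolation step can be illustrated with a small, dependency-free sketch of multilinear interpolation on a regular grid. This is an assumption about the general technique, not the actual `create-era5` implementation: the source does not state the interpolation scheme, and the names `interpolate_linear` and `_bracket`, the toy grid, and the synthetic field are all hypothetical. The sketch assumes each axis is sorted ascending with at least two points and does not handle longitude wrap-around.

```python
from bisect import bisect_right
from itertools import product
from typing import Sequence, Tuple


def _bracket(axis: Sequence[float], x: float) -> Tuple[int, float]:
    """Return (i, w) with x == axis[i] * (1 - w) + axis[i + 1] * w."""
    if not axis[0] <= x <= axis[-1]:
        raise ValueError(f"{x} is outside the grid range [{axis[0]}, {axis[-1]}]")
    # Clamp so that x == axis[-1] falls in the last cell with w == 1.
    i = min(bisect_right(axis, x) - 1, len(axis) - 2)
    w = (x - axis[i]) / (axis[i + 1] - axis[i])
    return i, w


def interpolate_linear(axes, values, point) -> float:
    """Multilinear interpolation of nested-list `values` at `point`.

    `axes` is one ascending coordinate list per dimension, here
    (time, level, lat, lon); `values` is indexed in the same order.
    """
    brackets = [_bracket(axis, x) for axis, x in zip(axes, point)]
    total = 0.0
    # Visit the 2**N corners of the enclosing grid cell (16 corners in 4D).
    for corner in product((0, 1), repeat=len(axes)):
        weight = 1.0
        cell = values
        for (i, w), offset in zip(brackets, corner):
            weight *= w if offset else 1.0 - w
            cell = cell[i + offset]
        total += weight * cell
    return total


# Toy 2x2x2x2 grid: time [s], pressure level [hPa], latitude, longitude [deg].
times = [0.0, 3600.0]
levels = [200.0, 250.0]
lats = [50.0, 50.25]
lons = [8.0, 8.25]


def field(t, p, lat, lon):
    """Synthetic field that is linear in every coordinate (not real wind data)."""
    return 0.001 * t + 0.1 * p + lat + lon


u_wind = [[[[field(t, p, la, lo) for lo in lons] for la in lats]
           for p in levels] for t in times]

# One hypothetical trajectory sample, already expressed in grid coordinates.
sample = (1800.0, 225.0, 50.1, 8.2)
u_at_sample = interpolate_linear((times, levels, lats, lons), u_wind, sample)
print(u_at_sample, field(*sample))  # the two values agree up to rounding
```

Because the synthetic field is linear in each coordinate, multilinear interpolation reproduces it up to floating-point rounding, which makes the sketch easy to sanity-check. In practice the trajectory altitude would first have to be mapped to the pressure-level coordinate, a step this sketch leaves out.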