# Quickstart
> **Warning**
>
> The repository is in a pre-alpha state and is not ready for production use. A convenient PyPI package containing inference-only code with slimmed-down dependencies will be released in the future.
## Prerequisites
- Git
- Python 3.10+
- `uv` (highly recommended) or `pip`
- A modern GPU for running `triton` kernels (this requirement will be lifted in the future)
- ~10 GB of disk space for data
## Installation

Clone the repository and sync the environment. We use `uv` to manage the virtual environment and dependencies.

```bash
git clone https://github.com/abc8747/microfuel
cd microfuel
uv sync --extra cli
```
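As an optional sanity check (not a repository command), you can confirm the core package is importable inside the managed environment:

```python
# Run via: uv run python -c "import microfuel; print(microfuel.__file__)"
# The package lives in src/microfuel/, so `uv sync` should make it importable.
import microfuel

print(microfuel.__file__)  # should point into the cloned repository
```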
## Codebase Layout
Understanding the structure will help you navigate the commands:
- `src/microfuel/`: The core library.
    - `datasets/`: Logic for loading raw parquet files (`raw.py`) and generating features (`preprocessed.py`).
    - `model.py`: The GDN, Hypernetwork, and loss functions.
    - `hacks.py`: JIT-compiled kernel patches for `fla` to support variable-length sequences without recompilation.
- `scripts/`: CLI entry points.
    - `main.py`: The primary interface for all tasks (downloading, processing, training).
    - `plots.py`: Visualization tools for debugging and analysis.
## Step-by-Step Reproduction

### 1. Data Ingestion

The PRC Data Challenge 2025 data is hosted on S3. You will need to request team creation and configure your access credentials in `data/config.toml` (create this file if it doesn't exist; see `data/config.example.toml`).
```bash
# downloads raw parquet files to data/raw/
uv run scripts/main.py download-raw
```
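The downloader reads your credentials from `data/config.toml`. The real key names are defined by `data/config.example.toml`; the sketch below only illustrates the general shape (a TOML file parsed at startup) and uses hypothetical keys.

```python
# Minimal sketch of reading data/config.toml. The [s3] table and its key
# names here are hypothetical; copy the real structure from
# data/config.example.toml.
import tomllib  # stdlib on Python 3.11+; on 3.10, install and import `tomli`

with open("data/config.toml", "rb") as f:
    cfg = tomllib.load(f)

# e.g. cfg might look like {"s3": {"access_key": "...", "secret_key": "..."}}
print(sorted(cfg))
```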
For consistency and to avoid confusion, we rename the data partitions:
```bash
cd data/raw
mv flightlist_train.parquet flight_list_phase1.parquet
mv fuel_train.parquet fuel_phase1.parquet
mv flights_train flights_phase1
mv flightlist_rank.parquet flight_list_phase1_rank.parquet
mv fuel_rank_submission.parquet fuel_phase1_rank.parquet
mv flights_rank flights_phase1_rank
mv flightlist_final.parquet flight_list_phase2_rank.parquet
mv fuel_final_submission.parquet fuel_phase2_rank.parquet
mv flights_final flights_phase2_rank
```
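After renaming, it is worth verifying that every partition referenced by the later steps actually exists. This check is not part of the repository's CLI; it simply asserts the layout described above.

```python
# Verify the renamed layout under data/raw/ (a convenience check, not a
# repository command).
from pathlib import Path

raw = Path("data/raw")
expected = [
    "flight_list_phase1.parquet", "fuel_phase1.parquet", "flights_phase1",
    "flight_list_phase1_rank.parquet", "fuel_phase1_rank.parquet", "flights_phase1_rank",
    "flight_list_phase2_rank.parquet", "fuel_phase2_rank.parquet", "flights_phase2_rank",
]
missing = [name for name in expected if not (raw / name).exists()]
print("layout OK" if not missing else f"missing: {missing}")
```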
### 2. Preprocessing
This step applies a Kalman filter and RTS smoother to the raw ADS-B points to generate clean inputs (a toy sketch of the smoothing follows the commands below). It also generates the train/validation splits based on stratified sampling of aircraft types and flight durations.
```bash
# smoothed trajectory vectors (heavy CPU usage, ~30 minutes)
uv run scripts/main.py create-dataset --partition phase1

# generate normalisation statistics (mean/std, ~5 minutes)
uv run scripts/main.py create-stats --partition phase1

# create stratified splits
uv run scripts/main.py create-splits --partition phase1
```
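To build intuition for what `create-dataset` does to each trajectory, here is a toy one-dimensional constant-velocity Kalman filter followed by an RTS smoothing pass. The repository's implementation operates on full ADS-B state vectors and differs in detail; this sketch only demonstrates the forward-filter/backward-smoother structure.

```python
# Toy 1-D constant-velocity Kalman filter + RTS smoother (numpy only).
import numpy as np

def kalman_rts(z: np.ndarray, dt: float = 1.0, q: float = 0.1, r: float = 4.0) -> np.ndarray:
    F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity transition
    H = np.array([[1.0, 0.0]])             # we observe position only
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r]])

    n = len(z)
    x = np.zeros((n, 2)); P = np.zeros((n, 2, 2))    # filtered estimates
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))  # one-step predictions
    xk, Pk = np.array([z[0], 0.0]), np.eye(2) * 10.0

    for k in range(n):  # forward Kalman pass
        xpk, Ppk = F @ xk, F @ Pk @ F.T + Q
        K = Ppk @ H.T @ np.linalg.inv(H @ Ppk @ H.T + R)  # Kalman gain
        xk = xpk + K @ (z[k] - H @ xpk)
        Pk = (np.eye(2) - K @ H) @ Ppk
        x[k], P[k], xp[k], Pp[k] = xk, Pk, xpk, Ppk

    xs, Ps = x.copy(), P.copy()
    for k in range(n - 2, -1, -1):  # backward RTS pass
        C = P[k] @ F.T @ np.linalg.inv(Pp[k + 1])
        xs[k] = x[k] + C @ (xs[k + 1] - xp[k + 1])
        Ps[k] = P[k] + C @ (Ps[k + 1] - Pp[k + 1]) @ C.T
    return xs

rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(size=200))               # a smooth random walk
smoothed = kalman_rts(truth + rng.normal(0, 2, 200))  # add measurement noise
print(np.abs(smoothed[:, 0] - truth).mean())          # well below the noise std (2.0)
```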
### 3. Training

Launch the training loop. Training uses `wandb` for logging.

```bash
uv run scripts/main.py train \
    --partition phase1 \
    --exp-name "quickstart-gdn-v1" \
    --batch-size 64 \
    --lr 4e-4 \
    --epochs 20 \
    --beta 0.999 \
    --loss-type rmse_kg
```
- `--beta`: The Class-Balanced Loss hyperparameter. Higher values (e.g., 0.999) heavily upweight rare aircraft types.
- `--loss-type`: To avoid long-tail distributions, the model outputs the average fuel burn rate over the segment instead of the total fuel burnt in that segment. `rmse_kg` effectively optimises against both for training stability.
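If `--beta` follows the effective-number-of-samples formulation of Class-Balanced Loss (Cui et al., 2019), which the description above suggests, the per-class weights behave as in this illustrative sketch (not the repository's weighting code):

```python
# Class-Balanced Loss weights via the effective number of samples,
# w_c = (1 - beta) / (1 - beta**n_c)   (Cui et al., 2019).
import numpy as np

def class_balanced_weights(counts: np.ndarray, beta: float = 0.999) -> np.ndarray:
    weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return weights / weights.sum() * len(counts)  # normalise to mean 1

# A common airliner type vs. a rare one:
counts = np.array([50_000, 200])
print(class_balanced_weights(counts))  # the rare type gets roughly 5x the weight
```

With `beta = 0.999`, classes with tens of thousands of samples saturate near the same effective count, while a class with only a few hundred flights receives several times the weight.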
### 4. Evaluation
To evaluate a specific checkpoint on the validation set:
```bash
uv run scripts/main.py evaluate \
    data/checkpoints/quickstart-gdn-v1/step00500_123.45.pt \
    --partition phase1 \
    --split validation
```
This will generate a parquet file in `data/predictions/` containing ground truth vs. predicted values for analysis.
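The predictions file is easy to inspect with pandas. The filename and column names below are assumptions (check the parquet that `evaluate` actually writes); the point is simply that ground truth and predictions sit side by side for error analysis.

```python
# Inspect the evaluation output. The path and column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_parquet("data/predictions/quickstart-gdn-v1.parquet")  # assumed path
err = df["predicted"] - df["ground_truth"]                          # assumed columns
print(f"RMSE: {np.sqrt((err**2).mean()):.2f} kg")
print(f"MAE:  {err.abs().mean():.2f} kg")
```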