jumpstation

Distillation Pipeline

Overview

After the targeting suite identifies the minimum viable hardware class for a model, the distillation pipeline compresses the model to fit that target. The pipeline produces model weights at the precision and architecture required by the target device, ready to be packaged into a JumpBundle.

Distillation is where the Turbo’s DX-M1 accelerator provides the most leverage: INT8 quantization-aware training and knowledge distillation loops that take hours on CPU complete in minutes on the DX-M1.


Pipeline Stages

Trained FP32 model
        │
        ▼
┌───────────────────────────┐
│  Quantizer                │
│  Post-training or         │
│  quantization-aware       │
│  training (QAT)           │
│  → INT8 or INT4 weights   │
└─────────────┬─────────────┘
              │  if quantization alone
              │  doesn't meet target:
              ▼
┌───────────────────────────┐
│  Pruner                   │
│  Remove low-saliency      │
│  weights below threshold  │
└─────────────┬─────────────┘
              │  if model still
              │  exceeds target RAM:
              ▼
┌───────────────────────────┐
│  Knowledge Distiller      │
│  Train a smaller student  │
│  model from teacher       │
│  (full-size) outputs      │
└─────────────┬─────────────┘
              │
              ▼
        Compressed model weights
        → packaged into JumpBundle

Each stage is optional. The pipeline runs only the stages required to bring the model within the target hardware’s constraints. The targeting suite’s requirement vector determines which stages are needed.
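As a sketch, that stage-selection logic might look like the following. The requirement-vector field names (`target_precision`, `ram_budget_kb`, and so on) are illustrative assumptions, not the actual JumpStation schema:

```python
# Hypothetical sketch of pipeline stage selection from a requirement
# vector. Field names are illustrative, not the real schema.

def select_stages(req):
    """Return the ordered list of distillation stages to run."""
    stages = []
    # Quantize whenever the target precision is below FP32.
    if req["target_precision"] in ("int8", "int4"):
        stages.append("quantize")
    # Prune if the quantized model still exceeds the RAM budget.
    if req["quantized_ram_kb"] > req["ram_budget_kb"]:
        stages.append("prune")
    # Distill if even a pruned model cannot fit the target.
    if req["pruned_ram_kb"] > req["ram_budget_kb"]:
        stages.append("distill")
    return stages

print(select_stages({
    "target_precision": "int8",
    "quantized_ram_kb": 900,
    "pruned_ram_kb": 400,
    "ram_budget_kb": 512,
}))  # → ['quantize', 'prune']
```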


Quantization (core/distillation/quantizer.py)

Quantization reduces the numerical precision of model weights and activations. JumpStation targets INT8 (for accelerator and capable edge targets) and INT4 (for extreme RAM constraints, e.g. UNO-class targets).

Post-Training Quantization (PTQ)

PTQ converts a trained FP32 model to INT8 without retraining. It requires a small calibration dataset (typically 100–1000 representative samples) to compute activation scale factors.

When to use: Quantization error (from the profiler’s INT8 measurement) is below the quality tolerance threshold. PTQ is fast — minutes on CPU, seconds on DX-M1.

python core/distillation/quantizer.py \
  --model ./my_model.onnx \
  --mode ptq \
  --precision int8 \
  --calibration ./data/calibration/ \
  --output ./model/weights.tflite
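The calibration step behind PTQ can be illustrated with a minimal symmetric per-tensor INT8 sketch. Real calibrators typically use percentile or entropy-based clipping rather than the raw maximum, and `calibrate_scale`/`quantize` here are toy helpers, not the quantizer's API:

```python
# Minimal sketch of symmetric per-tensor INT8 calibration: the scale
# maps the largest observed activation magnitude onto the INT8 range.
# Toy helpers only; production calibration usually clips outliers.

def calibrate_scale(activations):
    """Compute a symmetric INT8 scale from calibration activations."""
    max_abs = max(abs(a) for a in activations)
    return max_abs / 127.0  # symmetric INT8 range is [-127, 127]

def quantize(x, scale):
    """Map a real value to its nearest INT8 code, with clamping."""
    q = round(x / scale)
    return max(-127, min(127, q))

scale = calibrate_scale([0.5, -1.6, 1.0])
print(quantize(0.4, scale))          # → 32
print(quantize(0.4, scale) * scale)  # dequantized approximation of 0.4
```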

Quantization-Aware Training (QAT)

QAT fine-tunes the model with simulated quantization during the forward pass, allowing the weights to adapt to reduced precision before the final conversion. It requires access to training data.

When to use: PTQ quantization error exceeds the quality tolerance. QAT typically recovers 50–90% of the accuracy lost in PTQ.

python core/distillation/quantizer.py \
  --model ./my_model.onnx \
  --mode qat \
  --precision int8 \
  --train-data ./data/train/ \
  --epochs 10 \
  --output ./model/weights_qat.tflite
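The core of QAT is the fake-quantization (quantize-dequantize) op in the forward pass: the network trains against the rounding error it will face after conversion, while gradients flow through unchanged via the straight-through estimator. A toy sketch of the forward math only, not the actual quantizer implementation:

```python
# Sketch of the "fake quantization" op at the heart of QAT: weights are
# quantized and immediately dequantized, so training sees INT8 rounding
# error while the stored weights stay FP32. Forward math only.

def fake_quantize(w, scale):
    """Quantize-dequantize a weight to simulate INT8 precision."""
    q = max(-127, min(127, round(w / scale)))
    return q * scale

scale = 0.02
for w in (0.1234, -0.567, 3.0):
    # The last value saturates the INT8 range and clamps to 127 * scale.
    print(w, "->", fake_quantize(w, scale))
```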

DX-M1 Acceleration

On the Turbo, the DX-M1 accelerates both the calibration pass (PTQ) and the fine-tuning loop (QAT) by running the INT8 simulation natively on the accelerator rather than emulating it in software on the CPU.


Pruning (core/distillation/distiller.py)

Pruning removes weights that contribute least to the model’s output, reducing the model’s parameter count and RAM footprint. JumpStation uses unstructured magnitude pruning as the default — zero-masking the smallest-magnitude weights — with structured pruning (removing entire channels or layers) available for targets with strict latency budgets.

Pruning is applied after quantization in the pipeline (see the stage diagram above) and is typically combined with a short fine-tuning step to recover accuracy.

Target use case: Models that pass quantization error checks but still exceed the target device’s RAM budget.

python core/distillation/distiller.py \
  --model ./my_model.onnx \
  --mode prune \
  --sparsity 0.5 \
  --fine-tune-data ./data/train/ \
  --output ./model/pruned.onnx
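Unstructured magnitude pruning reduces to a threshold over absolute weight values. A toy sketch on a flat weight list, assuming a simple percentile threshold; the real distiller operates on model graphs, not Python lists:

```python
# Sketch of unstructured magnitude pruning: zero-mask the fraction of
# weights with the smallest absolute value. Illustrative only.

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
print(magnitude_prune(w, 0.5))  # → [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```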

Knowledge Distillation (core/distillation/distiller.py)

When the original model architecture is intrinsically too large for the target hardware (even after quantization and pruning), knowledge distillation trains a smaller student model guided by the outputs of the full-size teacher model.

The student learns to match the teacher’s soft output distributions (not just the hard labels), transferring generalization capacity to a fraction of the parameter count.
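The soft-target objective is commonly a temperature-scaled KL divergence between the teacher's and student's softened output distributions. A pure-Python sketch of that loss, assuming raw logits as input; a production loop would also blend in a hard-label cross-entropy term:

```python
import math

# Sketch of the soft-target distillation loss: the student matches the
# teacher's temperature-softened distribution via KL divergence. The
# T*T factor keeps gradient magnitudes comparable across temperatures.

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Loss shrinks as the student's logits approach the teacher's.
print(distill_loss([4.0, 1.0, 0.2], [3.5, 1.5, 0.1]))
```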

The role of baked-in operational data

JumpStation’s distillation pipeline supports operational envelope injection: the calibration dataset used during profiling — which represents the real-world input distribution the model will encounter in deployment — is used as the distillation training set. This means:

  1. The student is trained on the inputs it will actually see, not on generic training data.
  2. The teacher’s knowledge is compressed specifically around the operational envelope.
  3. The resulting student model is often significantly smaller than a generically distilled model at the same quality level.

This is one of the key ways the Turbo’s DX-M1 enables faster, better distillation: the DX-M1 runs teacher inference at full speed during the distillation loop, making it economical to use a large high-quality teacher model.

python core/distillation/distiller.py \
  --teacher ./my_model.onnx \
  --student-arch mobilenetv3_small \
  --mode distill \
  --operational-data ./data/operational_samples/ \
  --target pico \
  --output ./model/student.tflite

Pipeline Output

The distillation pipeline outputs:

  1. Compressed weights file at the target precision and framework format
  2. Compression report — quantization error, parameter count before/after, peak RAM before/after, latency estimate
  3. Updated requirement vector — fed back to the JumpBundle builder for embedding in the manifest
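For orientation, a compression report might carry fields like the following. Every field name here is an assumption for illustration, not the actual report schema:

```python
import json

# Illustrative shape of a compression report; field names are
# assumptions, not the actual JumpStation output format.
report = {
    "model": "my_model",
    "precision": "int8",
    "quantization_error": 0.018,   # e.g. mean output divergence vs FP32
    "params_before": 4_200_000,
    "params_after": 1_100_000,
    "peak_ram_kb_before": 16_800,
    "peak_ram_kb_after": 1_950,
    "latency_ms_estimate": 42.0,
}
print(json.dumps(report, indent=2))
```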

Further Reading