
Models keep getting larger, especially language models, yet search engines need to return results in real time and mobile devices have limited memory and compute, so models often need to be made smaller and faster. ONNX is an open, cross-platform format for machine learning models, and ONNX Runtime (developed by Microsoft) can run and quantize models converted from frameworks such as PyTorch and TensorFlow. This article takes a PyTorch model as an example and uses ONNX to slim it down, achieving a smaller model size and faster inference at a similar level of accuracy. All the code is available on GitHub: model_quatization.

Converting PyTorch Model to ONNX Format

After setting the model to eval mode, you can use torch.onnx.export to convert a PyTorch model to the ONNX format by providing example inputs. The dynamic_axes parameter informs ONNX that our batch size and sequence length are dynamic.

import torch

# model and tokenizer are the fine-tuned DistilBERT checkpoint loaded with transformers (see References)
example_inputs = tokenizer("query: this is a test sentence", return_tensors="pt")
model.eval()
torch.onnx.export(
    model,
    (example_inputs["input_ids"], example_inputs["attention_mask"]),
    "models/distilbert-imdb.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "max_seq_len"},
        "attention_mask": {0: "batch_size", 1: "max_seq_len"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
    export_params=True,
    do_constant_folding=True,
)
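Before quantizing, it is worth loading the exported file with ONNX Runtime and checking that its outputs match the PyTorch model. The snippet below is a minimal sketch of such a check, continuing from the export code above and assuming the model's first output is the classification logits; the tolerance value is illustrative.

import numpy as np
import onnxruntime as ort

# Run the exported model with ONNX Runtime on the same example inputs.
session = ort.InferenceSession("models/distilbert-imdb.onnx", providers=["CPUExecutionProvider"])
onnx_logits = session.run(
    ["output"],
    {
        "input_ids": example_inputs["input_ids"].numpy(),
        "attention_mask": example_inputs["attention_mask"].numpy(),
    },
)[0]

# Compare against the original PyTorch model's logits.
with torch.no_grad():
    torch_logits = model(**example_inputs).logits.numpy()
assert np.allclose(onnx_logits, torch_logits, atol=1e-4)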

Model Quantization

Model quantization replaces the floating-point numbers used to represent a model's weights (and often activations) with lower-precision integers, reducing the model size and speeding up inference. Converting from 32-bit floating-point to 8-bit integers shrinks the weights to roughly a quarter of their original size, and integer arithmetic is cheaper than floating-point, which typically makes inference faster.
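As a toy illustration of the idea (not ONNX Runtime's actual implementation), 8-bit affine quantization maps each float to an int8 value through a scale and zero point, and dequantization only approximately recovers the original values:

import numpy as np

def quantize_int8(x: np.ndarray):
    # Map the float range [min, max] onto the int8 range [-128, 127].
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original floats.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print(weights, dequantize(q, scale, zp))  # close, but not identical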

Quantization can be done by “Post-training Quantization” or “Quantization-aware Training”. The former quantizes an already-trained model directly, while the latter simulates the quantized model's behavior during training to better preserve accuracy. This article uses post-training quantization, which is simpler to implement.

Quantizing the ONNX Model

Quantization can be dynamic or static. Dynamic quantization computes the quantization parameters for activations on the fly from the data seen at inference time, while static quantization pre-computes them using a calibration dataset.

ONNX Runtime's official recommendation is to use dynamic quantization for RNN and transformer-based models and static quantization for CNN models.

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="models/distilbert-imdb.onnx",
    model_output="models/distilbert-imdb.int8.onnx",
    weight_type=QuantType.QInt8,
    extra_options=dict(EnableSubgraph=True),
)
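For comparison, static quantization in ONNX Runtime goes through quantize_static and requires a CalibrationDataReader that feeds representative inputs. The sketch below is an illustration only: the TokenizedReader class and the calibration_sentences list are assumptions, since this article sticks with dynamic quantization for the transformer model.

from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class TokenizedReader(CalibrationDataReader):
    # Hypothetical helper: feeds a handful of tokenized sentences as calibration samples.
    def __init__(self, sentences, tokenizer):
        encoded = [tokenizer(s, return_tensors="np") for s in sentences]
        self._iter = iter(
            [{"input_ids": e["input_ids"], "attention_mask": e["attention_mask"]} for e in encoded]
        )

    def get_next(self):
        return next(self._iter, None)

quantize_static(
    model_input="models/distilbert-imdb.onnx",
    model_output="models/distilbert-imdb.static.int8.onnx",
    calibration_data_reader=TokenizedReader(calibration_sentences, tokenizer),
    weight_type=QuantType.QInt8,
)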

The experiments use one of the DistilBERT models fine-tuned on the IMDB dataset from Hugging Face (see References).

The results running on a MacBook Air M1 CPU and Windows 10 WSL with an i5-8400 CPU are provided below (results may vary on different platforms):

| Model | Size | Inference Time per Instance | Accuracy |
|---|---|---|---|
| PyTorch Model (Mac) | 256 MB | 71.1 ms | 93.8% |
| ONNX Model (Mac) | 256 MB | 113.5 ms | 93.8% |
| ONNX 8-bit Model (Mac) | 64 MB | 87.7 ms | 93.75% |
| PyTorch Model (Win) | 256 MB | 78.6 ms | 93.8% |
| ONNX Model (Win) | 256 MB | 85.1 ms | 93.8% |
| ONNX 8-bit Model (Win) | 64 MB | 61.1 ms | 93.85% |
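The numbers above come from the author's benchmark; a rough sketch of how per-instance latency can be measured with ONNX Runtime looks like the following (the sample sentence and repetition count are illustrative):

import time
import onnxruntime as ort

session = ort.InferenceSession("models/distilbert-imdb.int8.onnx", providers=["CPUExecutionProvider"])
inputs = tokenizer("query: this is a test sentence", return_tensors="np")
feed = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

# Warm up once, then average the latency over repeated runs.
session.run(None, feed)
n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    session.run(None, feed)
print(f"{(time.perf_counter() - start) / n_runs * 1000:.1f} ms per instance")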

Running on GPU

When the quantized ONNX model runs on GPU, any operator without a GPU implementation falls back to the CPU. The extra data transfers between devices can make the quantized model slower than the non-quantized one.
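With ONNX Runtime you choose the GPU by listing execution providers when creating the session; operators without a CUDA kernel fall back to the CPU provider. A minimal sketch, assuming the onnxruntime-gpu package is installed:

import onnxruntime as ort

# List providers in priority order; unsupported ops fall back to CPUExecutionProvider.
session = ort.InferenceSession(
    "models/distilbert-imdb.int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers were actually enabled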

References

  1. ONNX Runtime
  2. GitHub nixiesearch/onnx-convert
  3. DistilBERT IMDB fine-tuned version
  4. IMDB dataset