Models, especially language models, keep getting larger, while search engines need to return results in real time and models on mobile devices have limited memory and compute, so models often need to be made lighter. ONNX is an open, cross-platform format for machine learning models; models from frameworks such as PyTorch and TensorFlow can be converted to it and then optimized with Microsoft's ONNX Runtime. This article takes a PyTorch model as an example and uses ONNX quantization to obtain a smaller model with faster inference at roughly the same accuracy. All the code will be available on GitHub in model_quatization.

Converting PyTorch Model to ONNX Format

After setting the model to eval mode, you can convert a PyTorch model to the ONNX format with torch.onnx.export by providing example inputs. The dynamic_axes parameter tells ONNX that the batch size and sequence length are dynamic.

import torch

# `model` and `tokenizer` below are the fine-tuned DistilBERT checkpoint and its
# tokenizer, loaded beforehand with the transformers library.
example_inputs = tokenizer("query: this is a test sentence", return_tensors="pt")
model.eval()
torch.onnx.export(
    model,
    (example_inputs["input_ids"], example_inputs["attention_mask"]),
    "models/distilbert-imdb.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "max_seq_len"},
        "attention_mask": {0: "batch_size", 1: "max_seq_len"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
    export_params=True,
    do_constant_folding=True,
)
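
As a quick sanity check (not part of the original snippet), the exported graph can be loaded with onnxruntime and run on the same sentence. This is a minimal sketch assuming the tokenizer above is still in scope:

import onnxruntime as ort

# Load the exported model on CPU and run one forward pass.
session = ort.InferenceSession("models/distilbert-imdb.onnx", providers=["CPUExecutionProvider"])
np_inputs = tokenizer("query: this is a test sentence", return_tensors="np")
logits = session.run(
    ["output"],
    {"input_ids": np_inputs["input_ids"], "attention_mask": np_inputs["attention_mask"]},
)[0]
print(logits.shape)  # (batch_size, num_labels)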

Model Quantization

Model quantization replaces the floating-point numbers that represent a model's weights with (mostly) integers, which reduces the model size and speeds up inference. Converting from 32-bit floats to 8-bit integers shrinks the model to roughly a quarter of its original size, and integer arithmetic is cheaper than floating-point arithmetic, so inference is faster.
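
To make the float-to-integer mapping concrete, here is a minimal sketch (my own illustration, not from the original article) of affine int8 quantization with a scale and zero point:

import numpy as np

def quantize_int8(x):
    # Map the range [x.min(), x.max()] onto the int8 range [-128, 127].
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Recover an approximation of the original float values.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print(np.abs(weights - dequantize_int8(q, scale, zp)).max())  # small reconstruction error

Post-training quantization applies this kind of mapping to the weight tensors of an already-saved model.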

Quantization can be done as post-training quantization or quantization-aware training. The former quantizes an already-trained model directly, while the latter simulates quantized behavior during training so that accuracy is better preserved. This article uses post-training quantization, which is the simpler of the two to apply.

Quantizing the ONNX Model

Quantization can be dynamic or static. Dynamic quantization computes the quantization parameters for activations on the fly from the data seen at inference time, while static quantization pre-computes them from a calibration dataset.

ONNX Runtime's official recommendation is to use dynamic quantization for RNN and transformer-based models, and static quantization for CNN models.

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="models/distilbert-imdb.onnx",
    model_output="models/distilbert-imdb.int8.onnx",
    weight_type=QuantType.QInt8,
    extra_options=dict(EnableSubgraph=True),
)
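
For completeness, static quantization (described above but not used here) needs a calibration data reader that feeds representative inputs so that activation ranges can be pre-computed. The sketch below is only an assumption of how that could look with onnxruntime's quantize_static and CalibrationDataReader; IMDBCalibrationReader and calibration_sentences are illustrative names, not part of the original article.

from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class IMDBCalibrationReader(CalibrationDataReader):
    # Feeds a handful of tokenized sentences so activation ranges can be estimated.
    def __init__(self, sentences, tokenizer):
        self._batches = (tokenizer(s, return_tensors="np") for s in sentences)

    def get_next(self):
        batch = next(self._batches, None)
        if batch is None:
            return None
        return {"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]}

calibration_sentences = ["an example movie review", "another representative sentence"]
quantize_static(
    model_input="models/distilbert-imdb.onnx",
    model_output="models/distilbert-imdb.static.int8.onnx",
    calibration_data_reader=IMDBCalibrationReader(calibration_sentences, tokenizer),
    weight_type=QuantType.QInt8,
)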

The experiments use a DistilBERT model fine-tuned on the IMDB dataset, available on Hugging Face (see reference 3 below).

The results from running on a MacBook Air M1 CPU and on Windows 10 WSL with an i5-8400 CPU are given below (results may vary on other platforms):

| Model | Size | Inference Time per Instance | Accuracy |
|---|---|---|---|
| PyTorch Model (Mac) | 256 MB | 71.1 ms | 93.8% |
| ONNX Model (Mac) | 256 MB | 113.5 ms | 93.8% |
| ONNX 8-bit Model (Mac) | 64 MB | 87.7 ms | 93.75% |
| PyTorch Model (Win) | 256 MB | 78.6 ms | 93.8% |
| ONNX Model (Win) | 256 MB | 85.1 ms | 93.8% |
| ONNX 8-bit Model (Win) | 64 MB | 61.1 ms | 93.85% |
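
The article does not show its benchmark code; the following is a rough sketch of how the per-instance latency could be measured with onnxruntime (the function name and run count are my own choices):

import time

import onnxruntime as ort

def mean_latency_ms(model_path, encoded, runs=100):
    # Average wall-clock time of a single forward pass on the CPU.
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]}
    session.run(None, feeds)  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, feeds)
    return (time.perf_counter() - start) / runs * 1000

sample = tokenizer("this movie was surprisingly good", return_tensors="np")
print(mean_latency_ms("models/distilbert-imdb.onnx", sample))       # fp32 model
print(mean_latency_ms("models/distilbert-imdb.int8.onnx", sample))  # quantized model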

Running on GPU

After quantizing the ONNX model, any operator that does not support GPU execution falls back to the CPU. The extra data transfer between devices may add overhead and can make the quantized model slower than the non-quantized one.
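
One way to see which execution providers a session actually ends up using is to request the CUDA provider with a CPU fallback and then inspect the session; a minimal sketch, assuming the onnxruntime-gpu package is installed:

import onnxruntime as ort

# Prefer the GPU, but allow ONNX Runtime to fall back to the CPU for
# operators the CUDA provider cannot handle.
session = ort.InferenceSession(
    "models/distilbert-imdb.int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # providers actually registered for this session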

References

  1. ONNX Runtime
  2. GitHub nixiesearch/onnx-convert
  3. DistilBERT IMDB fine-tuned model
  4. IMDB dataset