
Distilling Efficiency: Experiments in Compressing BAAI/bge-m3 using a Synthetic Dataset

May 16, 2025 · Aleynaahukmet

Originally published on Medium


[Figure: Distillation results]

Introduction

The rapid progress in Natural Language Processing (NLP) is mainly due to the development of large and complex neural network models. While these models perform very well, their size and high computational needs make them hard to use in real-world applications, especially for specific types of text. Many use cases — like semantic search and analyzing synthetic questions — could benefit from AI tools, but large models are often too costly and slow.

This study explores distilling the powerful multilingual BAAI/bge-m3 text embedding model into smaller versions with 8, 6, and 4 layers. The goal is to find the right balance between model size, speed, and retrieval performance using a synthetic dataset of contextual texts and generated questions. As NLP models become more complex, distillation helps make advanced AI more accessible and practical, especially for resource-limited needs.

1. Background

1.1 What is Model Distillation?

Model distillation — also known as knowledge distillation — is a technique used to compress large, powerful models by training smaller models to replicate their behavior. First introduced by Hinton et al. (2015), this approach allows a smaller "student" model to approximate the performance of a larger "teacher" model while using significantly fewer parameters. The result is faster inference, lower memory requirements, and reduced deployment costs — ideal for real-world applications that require scalability and efficiency.

At the heart of this process is the idea that the teacher model has already learned rich patterns and decision boundaries from extensive training data. Instead of training the student model solely on standard "hard" labels (like class IDs), distillation trains it on the "soft targets" produced by the teacher model. These soft targets — such as the probability distributions generated at the output layer — convey more nuanced information about how the teacher interprets the data. This helps the student not just learn the correct answers, but also gain insight into the reasoning behind the teacher's decisions.
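For an embedding model, the teacher's "soft targets" are its output embeddings themselves rather than class probabilities: the student is trained to reproduce the teacher's vector for each input. A minimal sketch of such a loss, combining mean squared error with cosine distance (the exact objective used in this work is not stated, so this is an illustrative assumption):

```python
import math

def distillation_loss(student_emb, teacher_emb, alpha=0.5):
    """Blend of MSE and cosine distance between a student and a
    teacher sentence embedding (both plain lists of floats).
    alpha weights the MSE term; 1 - alpha weights the cosine term."""
    n = len(student_emb)
    mse = sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / n
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    s_norm = math.sqrt(sum(s * s for s in student_emb))
    t_norm = math.sqrt(sum(t * t for t in teacher_emb))
    cos_dist = 1.0 - dot / (s_norm * t_norm)
    return alpha * mse + (1 - alpha) * cos_dist

# A perfect student reproduces the teacher's embedding exactly: loss ~ 0
teacher = [0.1, 0.7, -0.2]
print(distillation_loss(teacher, teacher))
```

In practice this loss would be computed over batches with an autograd framework; the pure-Python version above only illustrates the shape of the objective.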

In this work, the BAAI/bge-m3 model serves as the teacher. It is a state-of-the-art text embedding model, known for its effectiveness and versatility in a wide range of retrieval and representation tasks. It stands out in three important dimensions:

  • Multi-Functionality: BAAI/bge-m3 supports dense retrieval, sparse retrieval (based on lexical matching), and multi-vector retrieval — all within a single model architecture.
  • Multi-Linguality: It works across more than 100 languages, mapping them into a shared semantic space to enable both monolingual and cross-lingual search.
  • Multi-Granularity: The model handles input texts of varying lengths, from short phrases to full-length documents — up to 8192 tokens — making it suitable for diverse use cases.

By distilling this teacher into smaller variants, the goal is to retain as much of its semantic power as possible, while greatly improving efficiency for deployment in real-world systems.

2. Distilled Variants

The primary goal of this work was to create smaller, faster versions of the BAAI/bge-m3 model while retaining as much of its semantic understanding capabilities as possible, particularly for the contextual texts within the synthetic dataset. The motivation stems from the observation that while powerful models are highly effective, their large size can make them prohibitively expensive and slow for large-scale, low-latency applications.

The student models are:

  • altaidevorg/bge-m3-distill-8l (8-layer distilled model)
  • altaidevorg/bge-m3-distill-6l (6-layer distilled model)
  • altaidevorg/bge-m3-distill-4l (4-layer distilled model)

The student models were derived from the BAAI/bge-m3 teacher model by reducing the number of model layers. This is a common and effective approach to creating smaller versions of transformer-based models.

Architecture: Transformer encoder with 4, 6, and 8 layers for the distill-4l, distill-6l, and distill-8l models respectively. The 8-layer model has 366M parameters, a significant reduction from the original 24-layer teacher. All other architectural parameters (hidden size, number of attention heads) match those of the teacher model.
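The post does not specify which of the teacher's 24 layers seed each student. A common scheme (an assumption here, not the authors' confirmed method) is to copy evenly spaced layers so the student spans the teacher's full depth, always keeping the final layer. A sketch of that index selection:

```python
def select_teacher_layers(n_teacher: int, n_student: int) -> list:
    """Pick n_student evenly spaced layer indices out of a teacher
    with n_teacher layers, always including the last layer."""
    step = n_teacher / n_student
    return [round((i + 1) * step) - 1 for i in range(n_student)]

# 24-layer teacher -> 8-, 6-, and 4-layer students
print(select_teacher_layers(24, 8))  # [2, 5, 8, 11, 14, 17, 20, 23]
print(select_teacher_layers(24, 6))  # [3, 7, 11, 15, 19, 23]
print(select_teacher_layers(24, 4))  # [5, 11, 17, 23]
```

The selected layers' weights would then initialize the student before distillation fine-tuning.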

Training Data: Distillation used a mix of general text corpora and domain-specific contextual texts from the synthetic dataset (including question-answer pairs). This combination aimed to ensure both general understanding and adaptation to the target domain.

Use Cases: These models are intended for:

  • Semantic search
  • Information retrieval
  • Document clustering
  • Embedding components in RAG systems, especially for contextual texts in the synthetic dataset

Performance: The 8-layer model processed about 454 texts/sec on a T4 GPU — 2.5x faster than the teacher model (175 texts/sec). It also showed strong alignment with the teacher's output, scoring a Spearman Cosine of 0.965 and MSE of 0.006 on the test set.

Limitations:

  • May lose some depth on complex tasks compared to the full teacher model.
  • Might generalize less effectively to unrelated domains or unsupported languages.
  • Could inherit biases from the teacher model or training data.

3. Experimental Setup

To evaluate the performance of the distilled models, a robust experimental setup was designed, focusing on a semantic search task using a synthetic dataset composed of contextual texts and synthetically generated questions.

3.1 Dataset and Vector Search Process

The dataset used in the evaluation process is a set of synthetically generated questions derived from contextual texts within the synthetic dataset. Each query is converted into a high-dimensional vector representation that captures its semantic meaning. This embedding is then used to search against the pre-indexed document embeddings in the Qdrant vector database. The system retrieves the top-k most semantically similar document IDs, where k is varied across multiple evaluation rounds (10, 20, 30, 40, 50).
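The study runs these searches against Qdrant; as an illustrative stand-in (not the actual Qdrant client code), the same top-k retrieval over cosine similarity can be sketched in memory like this:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_ids(query_emb, doc_embs, k):
    """Return the ids of the k documents most similar to the query.
    doc_embs maps doc id -> pre-computed document embedding."""
    ranked = sorted(doc_embs.items(),
                    key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy 2-d embeddings standing in for real model outputs
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k_ids([1.0, 0.0], docs, 2))  # ['a', 'c']
```

A production setup would index the document embeddings once and let the vector database handle approximate nearest-neighbor search, exactly as described above with Qdrant.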

3.2 Evaluation Metric: Precision@k

Precision@k measures the proportion of overlapping items found within the top k results returned by two different models for the same query. When comparing Model A and Model B:

$$P@k_{A\text{ vs. }B} = \frac{|A\text{'s top-}k\text{ docs} \cap B\text{'s top-}k\text{ docs}|}{k}$$

In this experimental context, Precision@k is used to compare the student models against each other (4L vs. 6L, 4L vs. 8L, 6L vs. 8L) for k values of 10, 20, 30, 40, and 50.
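The metric above reduces to a set overlap between two ranked result lists, which is straightforward to compute (the doc-id lists below are hypothetical, not from the actual experiments):

```python
def precision_at_k(results_a, results_b, k):
    """Overlap-based P@k between two ranked result lists:
    |top-k of A intersect top-k of B| / k."""
    return len(set(results_a[:k]) & set(results_b[:k])) / k

# Hypothetical ranked doc ids returned by two student models
ranked_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ranked_b = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
print(precision_at_k(ranked_a, ranked_b, 10))  # 0.5
```

Note that this compares two models' outputs to each other rather than to ground-truth relevance labels, so it measures retrieval agreement, not absolute retrieval quality.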

4. Results

4.1 Performance (Precision@k Scores)

Observations:

  • The bge-m3-distill-6l and bge-m3-distill-8l models exhibit the highest similarity in their retrieval behavior across all tested values of k. P@k scores range from 0.700 at k=10 to 0.792 at k=50, indicating that their top retrieved documents are largely overlapping. This suggests that, despite having fewer layers, the 6-layer model closely mirrors the retrieval patterns of the more expressive 8-layer variant.

  • In contrast, the bge-m3-distill-4l model shows noticeably lower alignment with both the 6L and 8L models. The P@10 score between the 4L and 8L models is 0.580, which gradually increases to 0.665 at P@50. This indicates that the 4-layer model diverges more in selecting the most relevant documents, particularly in the top-ranked results.

  • A broader trend observed across all inter-model comparisons is that P@k values generally increase or stabilize as k increases. This means that while the most highly ranked results may vary more between models — especially between those with greater architectural differences like 4L vs. 8L — the overlap in retrieved documents grows when a larger set of top results is considered.

Efficiency Gains: The bge-m3-distill-8l model provides an empirically measured inference speedup of approximately 2.5x compared to the original 24-layer BAAI/bge-m3 teacher model, alongside substantial reductions in model size (366M parameters). The 6L and 4L variants are projected to offer even greater efficiency, with estimated speedups of approximately 3.1x and 4.0x respectively.

Conclusion

This study on distilling the BAAI/bge-m3 model has resulted in the development of a series of more lightweight and computationally efficient embedding models. These distilled variants were designed to strike a balance between semantic performance and practical deployment needs such as speed and resource efficiency.

The 8-layer model emerges as a robust baseline, offering a 2.5x speedup over the original 24-layer teacher model with minimal loss in retrieval quality. It serves as a strong default option for most retrieval-based applications.

The 6-layer model demonstrates impressive alignment with the 8-layer variant, achieving a P@50 of 0.792, indicating that it preserves much of the retrieval behavior while being even more resource-efficient. This makes it an attractive choice for environments where performance needs to be balanced with greater throughput or tighter latency constraints.

The 4-layer model, while showing the largest deviation in top-k retrieval overlap (P@10 of 0.580 compared to the 8-layer model), offers the highest projected speedup of approximately 4.0x. Its lightweight architecture makes it especially well-suited for use cases where maximizing inference speed and minimizing computational footprint are paramount, even at the cost of some retrieval precision.

Together, these models provide a flexible set of options for embedding-based applications, enabling practitioners to make informed decisions based on their specific trade-off requirements between speed, scale, and retrieval fidelity.
