By Mohammed EL Houcine Ayoubi · 7 min read


An in-depth look at how knowledge distillation transfers capabilities from large teacher models to smaller, efficient student models, covering response-based, feature-based, and attention-based techniques.

# Knowledge Distillation in Large Language Models

Knowledge distillation is the process of transferring knowledge from a large “teacher” model to a smaller “student” model so that the student preserves as much of the teacher’s capabilities as possible while being more efficient. The teacher (often a large pre-trained LLM) generates labels or logits for a dataset, and the student learns to mimic the teacher’s behavior on the task. Common motivations for distillation include:

  1. Efficiency: Large LLMs are expensive to run. Distilled students are smaller and faster, reducing inference cost.
  2. Latency: Smaller models respond quicker in real time, which is crucial for interactive applications.
  3. Deployability: Tiny models can run on limited hardware (edge devices, mobile) or be hosted privately, avoiding the infrastructure and compliance headaches of massive models.
  4. Scalability: Once a student model is trained, it can be redistributed cheaply and used widely, whereas a giant teacher might be closed or too costly to access constantly.

## Licenses for Knowledge Distillation

Model distillation is allowed when the model’s license explicitly allows derivative works or reuse of its outputs. For example, open-source licenses such as Apache 2.0 or MIT allow modification and redistribution, so distilling a model under those licenses is possible. In contrast, proprietary or custom licenses often forbid using a model’s outputs to train a competing model.

Many permissive licenses explicitly allow model distillation. For instance, NVIDIA’s Nemotron family of models is a prominent example of a distillation-friendly design: they are released under an open model license permitting derivatives, and the Nemotron 3 Nano (30B) model is explicitly described as “open, efficient… approved for distillation workflows”. Also, Google’s FLAN-T5 (Apache 2.0) permits commercial use and modification.

By contrast, restrictive licenses limit or forbid derivatives. Meta’s Llama 2 (Community License) and Llama 3 technically allow creating derivatives, but explicitly forbid using Llama outputs to train any other LLM. OpenAI’s GPT-3 and GPT-4 are closed-source models, and their API terms forbid using their outputs to train competing systems. Many academic or third-party models use Creative Commons licenses such as CC-BY-NC (no commercial use) or CC-BY-NC-ND (no derivatives), which would bar commercial distillation or any derivative model.


## General Introduction to Knowledge Distillation

In the simplest form of distillation, knowledge is transferred to a student model by training it on a transfer set using soft target distributions for each sample. These soft targets are produced by running the teacher model with a raised temperature in its softmax.

Modern neural networks typically output a vector of class probabilities via a softmax layer, which converts the logits $z_{i}$ (one per class) into probabilities $q_{i}$ by comparing each $z_{i}$ to the others:

$$q_{i} = \frac{\exp(z_{i}/T)}{\sum_{j} \exp(z_{j}/T)}$$

This technique is very powerful. When the true labels are known for some or all of the transfer set, the method can be improved by also training the student to predict those hard labels alongside the teacher’s soft targets.
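As a concrete illustration, here is a minimal plain-Python sketch of the tempered softmax above; the function name and the example logits are ours, not from any particular library:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Convert logits z_i into probabilities q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability (result unchanged)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.2]
hard_targets = softmax_with_temperature(logits, T=1.0)  # peaked distribution
soft_targets = softmax_with_temperature(logits, T=4.0)  # flatter "soft" distribution
```

Raising $T$ flattens the distribution, so the student sees how the teacher spreads probability over the non-argmax classes rather than a near one-hot vector.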

*Figure: Knowledge distillation flowchart (teacher–student setup for LLMs).*


## Knowledge Distillation in LLMs

Distillation in the context of LLMs is largely similar. Typically, one first collects input–output pairs (prompts and corresponding outputs or logits) and trains the student to match those outputs. The student’s loss typically blends the usual cross-entropy on true labels with a soft-target loss that compares the student’s output distribution to the teacher’s.

### Response-Based Knowledge Distillation

A classic formulation (from Hinton et al.), often called response-based distillation, softens both teacher and student logits using a temperature $T$ and minimizes the Kullback–Leibler (KL) divergence between the resulting distributions. In effect, one minimizes:

$$L_{KD} = T^{2} \cdot \mathrm{KL}\left( \text{softmax}\left(\frac{z^{(t)}}{T}\right) \,\Big\|\, \text{softmax}\left(\frac{z^{(s)}}{T}\right) \right)$$

Here $z^{(t)}$, $z^{(s)}$ are the teacher and student logits, respectively. Increasing the temperature ($T > 1$) softens the probability distribution, revealing “dark knowledge” about how the teacher spreads probability among classes. In practice, the temperature $T$ and the weighting between soft and hard targets are balanced (for example with a hyperparameter $\alpha$).
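A minimal plain-Python sketch of this objective for a single example's logit vector; the helper names and the $\alpha$-blending convention are illustrative, not a reference implementation:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax (max-shifted for numerical stability)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    """T^2-scaled KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student soft predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl  # T^2 keeps soft-target gradients on the same scale

def total_loss(ce_loss, teacher_logits, student_logits, T=2.0, alpha=0.5):
    """Blend the hard-label cross-entropy with the soft-target KD term via alpha."""
    return (1 - alpha) * ce_loss + alpha * kd_loss(teacher_logits, student_logits, T)
```

In practice one computes this per token over a batch and uses log-space KL for stability, but the structure — soften both distributions, compare them, scale by $T^{2}$, blend with the hard loss — is the same.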

### Feature-Based Distillation

Instead of using only the final output (as in response-based distillation), feature-based distillation encourages the student to produce intermediate representations similar to those of the teacher. The intuition is that the teacher’s hidden states encode hierarchical linguistic features (e.g., syntax, semantics, discourse) that are valuable for downstream tasks. By aligning these representations, the student can internalize a more structured understanding.

$$L_{feat} = \sum_{(i,j)\in P} \lambda_{ij}\, \ell\left(T_{i}, g_{j}(S_{j})\right)$$

where:

  • $P$ is the set of aligned layer pairs (e.g., teacher layer $i$ corresponds to student layer $j$);
  • $T_{i} \in \mathbb{R}^{d_t}$ and $S_{j} \in \mathbb{R}^{d_s}$ are the teacher’s and student’s hidden states at the aligned layers;
  • $g_{j}(\cdot)$ is a learnable projection applied to the student’s features (or $f(\cdot)$ to the teacher’s) to map them into a common space, since the hidden dimensions may differ;
  • $\ell$ is a distance metric and $\lambda_{ij}$ are weighting coefficients.

Common choices for $\ell$:

  • Mean squared error (MSE): $$\ell(a,b) = \lVert a - b\rVert_{2}^{2}$$
  • Cosine similarity: $$\ell(a,b) = 1 - \dfrac{a \cdot b}{\lVert a\rVert \,\lVert b\rVert}$$
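The feature loss can be sketched in plain Python as follows; the toy dimensions, the fixed projection matrix (learnable in a real setup), and the function names are illustrative:

```python
def project(s, W):
    """Projection g(.): map a student feature (dim d_s) into teacher space (dim d_t).
    W is a d_t x d_s matrix; here it is fixed, but in training it is learned."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def feature_loss(aligned_pairs, weights, W):
    """Weighted sum over aligned (teacher, student) layer pairs, as in L_feat."""
    return sum(lam * mse(t, project(s, W)) for lam, (t, s) in zip(weights, aligned_pairs))

# Toy example: teacher hidden size d_t = 3, student hidden size d_s = 2,
# one aligned layer pair (T_i, S_j) with weight lambda = 1.0.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pairs = [([0.5, -1.0, 2.0], [0.4, 1.1])]
loss = feature_loss(pairs, [1.0], W)
```

A real implementation would operate on per-token hidden-state tensors and train the projection jointly with the student, but the alignment-then-distance structure is the same.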

*Figure: Feature-based knowledge distillation flowchart (student–teacher setup).*

### Attention-Based Knowledge Distillation

Attention maps in large language models have been shown to encode linguistic phenomena: different heads may focus on syntactic dependencies, coreference, or local context. They are therefore rich intermediate representations.

The core idea behind Attention Transfer in LLMs is to encourage the student’s attention maps to resemble those of the teacher, thereby transferring the teacher’s internal focus patterns. This is especially effective because attention maps are naturally comparable across models (they have the same dimensionality if sequence length matches, though head counts may differ).

Let $A_{l}^{T} \in \mathbb{R}^{h_t \times n \times n}$ be the attention maps from the teacher layer $l$ (concatenated or averaged across heads), and $A_{m}^{S} \in \mathbb{R}^{h_s \times n \times n}$ from the student layer $m$. We select a set of aligned layer pairs $P$ (e.g., teacher layer $l$ corresponds to student layer $m$). The attention transfer loss is:

$$L_{attn} = \sum_{(l,m)\in P} \lambda_{lm}\, \ell\left(A_{l}^{T}, A_{m}^{S}\right)$$

where $\ell$ is a distance function and $\lambda_{lm}$ are weights.

Common choices for $\ell$:

  • Mean squared error (MSE) on the raw attention weights.
  • KL divergence, treating each query’s attention row as a probability distribution.
  • Cosine similarity between flattened attention vectors.
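With the MSE variant, the attention transfer loss can be sketched in plain Python; the toy two-token maps and function names below are illustrative only:

```python
def attn_mse(A_t, A_s):
    """MSE between teacher and student attention maps (n x n, heads averaged)."""
    n = len(A_t)
    return sum((A_t[i][j] - A_s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def attention_transfer_loss(layer_pairs, weights):
    """Weighted sum of per-pair distances over aligned layers, as in L_attn."""
    return sum(lam * attn_mse(A_t, A_s) for lam, (A_t, A_s) in zip(weights, layer_pairs))

# Toy 2-token example: each row is one query's attention distribution over keys.
teacher_map = [[0.9, 0.1], [0.3, 0.7]]
student_map = [[0.8, 0.2], [0.4, 0.6]]
loss = attention_transfer_loss([(teacher_map, student_map)], [1.0])
```

Because both maps are $n \times n$ for the same sequence length, no projection layer is needed, which is part of why attention transfer is convenient across differently sized models.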

### General Distillation Loss

The total training objective often combines the task loss with the response-, feature-, and attention-based distillation terms:

$$L = L_{task} + \alpha L_{KD} + \beta L_{feat} + \gamma L_{attn}$$

where $L_{task}$ is the original task loss (e.g., cross-entropy for classification) and $\alpha$, $\beta$, $\gamma$ are hyperparameters weighting the distillation terms.
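The combination itself is a simple weighted sum; a sketch, with illustrative default weights (in practice $\alpha$, $\beta$, $\gamma$ are tuned per task):

```python
def total_distillation_loss(l_task, l_kd=0.0, l_feat=0.0, l_attn=0.0,
                            alpha=0.5, beta=0.1, gamma=0.1):
    """L = L_task + alpha*L_KD + beta*L_feat + gamma*L_attn.
    The default coefficients are illustrative, not recommended values."""
    return l_task + alpha * l_kd + beta * l_feat + gamma * l_attn
```

Setting a coefficient to zero disables that term, so the same objective covers pure response-based distillation as a special case.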


## How Data Quality Affects Knowledge Distillation

Knowledge distillation for LLMs hinges on the quality of the data and prompts used to generate teacher outputs. If input prompts are poorly chosen or irrelevant, the teacher’s responses can be off-target, and the student will learn those errors. In fact, theoretical analysis shows that when the prompt distribution diverges from the true task distribution, distillation error grows linearly with that divergence.

A lack of diversity is another common pitfall. If the prompt set and teacher responses cover only a narrow range of topics, styles, or domains, the student will learn a correspondingly narrow behavior. For example, scraped prompt collections like ShareGPT have “low average quality and narrow distributions,” meaning students trained on them may fail to generalize.


## References

  1. Apache License 2.0
  2. MIT License
  3. Meta AI — Llama 2 Community License Agreement
  4. OpenAI — Terms of Use
  5. Google — FLAN-T5 Model Card
  6. EleutherAI — GPT-Neo Model Card
  7. NVIDIA — Nemotron 3 8B Model Card
  8. NVIDIA — Nemotron Technical Report
