---
language:
- en
license: mit
library_name: transformers.js
tags:
- audio
pipeline_tag: automatic-speech-recognition
---

# Distil-Whisper: Distil-Large-v3.5

Distil-Whisper is the knowledge-distilled version of OpenAI's [Whisper-Large-v3](https://huggingface.co/openai/whisper-large-v3), described in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

As the newest addition to the Distil-Whisper English family, Distil-Large-v3.5 maintains the high efficiency of its predecessors while delivering better performance. Compared to earlier models, it has been trained on over 4× more diverse public data (98k hours) and uses a ["patient" teacher](https://arxiv.org/abs/2106.05237) with an extended training schedule and aggressive data augmentation ([SpecAugment](https://arxiv.org/abs/1904.08779)) during distillation. This results in enhanced robustness and accuracy compared to previous Distil-Whisper models, making it suitable as a drop-in replacement.

| Model | Params / M | Rel. RTFx | Short-Form OOD WER | Long-Form OOD WER |
| --- | --- | --- | --- | --- |
| [large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 809 | 1.0 | 7.30 | 10.25 |
| [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3) | 756 | 1.44 | 7.53 | 11.6 |
| [distil-large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5) | 756 | **1.46** | **7.08** | **11.39** |

*Why consider Distil-Large-v3.5 when Whisper-Large-v3-Turbo already exists?*

1. It offers a different balance between accuracy and efficiency: it remains **~1.5x faster** than Whisper-Large-v3-Turbo while performing slightly better on short-form transcription and falling only ~1% behind on long-form transcription.
2. It works perfectly as a draft model for **speculative decoding** with Whisper-Large-v3. Because the encoder was kept frozen during training, only two extra decoder layers need to be loaded, and the encoder is forwarded just once. This achieves ~2x faster inference compared to Whisper-Large-v3 while maintaining identical outputs.

This model is a 🤗 collaborative effort between [Bofeng Huang](https://huggingface.co/bofenghuang), [Eustache Le Bihan](https://huggingface.co/eustlb), [Steven Zheng](https://huggingface.co/Steveeeeeeen), [Vaibhav Srivastav](https://huggingface.co/reach-vb), and [Joshua Lochner](https://huggingface.co/xenova).

## Usage (Transformers.js)

If you haven't already, you can install the [Transformers.js](https://huggingface.co/docs/transformers.js) JavaScript library from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:

```bash
npm i @huggingface/transformers
```

You can then transcribe audio as follows:

```js
import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v3.5-ONNX');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." }
```