---
language:
- en
license: mit
library_name: transformers.js
tags:
- audio
pipeline_tag: automatic-speech-recognition
---

# Distil-Whisper: Distil-Large-v3.5

Distil-Whisper is the knowledge-distilled version of OpenAI's [Whisper-Large-v3](https://huggingface.co/openai/whisper-large-v3), described in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

As the newest addition to the Distil-Whisper English family, Distil-Large-v3.5 maintains the high efficiency of its predecessors while delivering better performance. Compared to earlier models, it has been trained on over 4× more diverse public data (98k hours) and uses a ["patient" teacher](https://arxiv.org/abs/2106.05237) with an extended training schedule and aggressive data augmentation ([SpecAugment](https://arxiv.org/abs/1904.08779)) during distillation. This results in enhanced robustness and accuracy compared to previous Distil-Whisper models, making it suitable as a drop-in replacement.

| Model | Params / M | Rel. RTFx | Short-Form OOD WER | Long-Form OOD WER |
| --- | --- | --- | --- | --- |
| [large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 809 | 1.0 | 7.30 | 10.25 |
| [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3) | 756 | 1.44 | 7.53 | 11.6 |
| [distil-large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5) | 756 | **1.46** | **7.08** | **11.39** |

*Why consider Distil-Large-v3.5 when Whisper-Large-v3-Turbo already exists?*

1. It offers a different balance between accuracy and efficiency: it remains **~1.5x faster** than Whisper-Large-v3-Turbo while performing slightly better on short-form transcription and falling only ~1% behind on long-form transcription.
2. It works perfectly as a draft model for **speculative decoding** with Whisper-Large-v3. Because the encoder was kept frozen during training, only two extra decoder layers need to be loaded, and the encoder is forwarded just once. This achieves ~2x faster inference compared to Whisper-Large-v3 while maintaining identical outputs.

This model is a 🤗 collaborative effort between [Bofeng Huang](https://huggingface.co/bofenghuang), [Eustache Le Bihan](https://huggingface.co/eustlb), [Steven Zheng](https://huggingface.co/Steveeeeeeen), [Vaibhav Srivastav](https://huggingface.co/reach-vb), and [Joshua Lochner](https://huggingface.co/xenova).

## Usage (Transformers.js)

If you haven't already, you can install the [Transformers.js](https://huggingface.co/docs/transformers.js) JavaScript library from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:

```bash
npm i @huggingface/transformers
```

You can then transcribe audio as follows:

```js
import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v3.5-ONNX');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." }
```