Speech Recognition AI: Fine-Tuned Whisper and Wav2Vec2 for Real-Time Audio


This project fine-tunes OpenAI's Whisper (whisper-small) and Facebook's Wav2Vec2 (wav2vec2-base-960h) models for real-time speech recognition using live audio recordings. It's designed for dynamic environments where low-latency transcription is key, such as live conversations or streaming audio.

Model Description

This is a fine-tuned version of OpenAI's Whisper small model and Facebook's Wav2Vec2 base model, optimized for real-time speech-to-text transcription. The models were trained on live 16kHz mono audio recordings, improving transcription accuracy over their base versions for continuous input scenarios.

Features

  • Real-time audio recording: Captures live 16kHz mono audio via microphone input.
  • Continuous fine-tuning: Updates model weights incrementally during live sessions.
  • Speech-to-text transcription: Converts audio to text with high accuracy.
  • Model saving/loading: Automatically saves fine-tuned models with timestamps.
  • Dual model support: Choose between Whisper and Wav2Vec2 architectures.

Note: Currently supports English-only transcription.
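
The capture format in the first bullet can be sketched with the sounddevice package. This is a minimal illustration, not the repository's actual recording code; the function names are illustrative.

```python
# Minimal sketch of live 16 kHz mono capture with sounddevice.
import numpy as np

SAMPLE_RATE = 16_000  # both Whisper and Wav2Vec2 expect 16 kHz mono input

def num_samples(duration_s: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples a recording of duration_s seconds will contain."""
    return int(duration_s * sample_rate)

def record(duration_s: float) -> np.ndarray:
    """Record mono float32 audio from the default microphone."""
    import sounddevice as sd  # imported lazily so the pure helper works without audio hardware
    audio = sd.rec(num_samples(duration_s), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording finishes
    return audio.squeeze()  # shape (samples,), ready for a feature extractor

if __name__ == "__main__":
    clip = record(5.0)
    print(clip.shape)
```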

Installation

Clone the repository and install the dependencies:

git clone https://github.com/bniladridas/speech-model.git
cd speech-model
pip install -r requirements.txt

Optional: On Linux, install the system audio libraries that sounddevice and audio file I/O depend on (PortAudio and libsndfile):

sudo apt-get install libportaudio2 libsndfile1

Usage

Start Fine-Tuning

Fine-tune the model on live audio:

# For Whisper model
python main.py --model_type whisper

# For Wav2Vec2 model
python main.py --model_type wav2vec2

This records audio in real time and updates the model weights continuously. Press Ctrl+C to stop training; the model is saved automatically.

Transcription

Test the fine-tuned model:

# For Whisper model
python test_transcription.py --model_type whisper

# For Wav2Vec2 model
python test_transcription.py --model_type wav2vec2

Records 5 seconds of audio (configurable in code) and generates a transcription.
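
For the Whisper variant, the transcription step likely resembles the sketch below. The checkpoint id comes from the "Loading the Model" section; the helper names are illustrative, not the repository's actual API.

```python
# Sketch of the record-then-transcribe path for the Whisper variant.
import numpy as np

def pcm16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to the float32 range [-1, 1) the models expect."""
    return pcm.astype(np.float32) / 32768.0

def transcribe(audio: np.ndarray, sample_rate: int = 16_000) -> str:
    """Run a float32 mono clip through the fine-tuned Whisper checkpoint."""
    from transformers import WhisperForConditionalGeneration, WhisperProcessor
    model = WhisperForConditionalGeneration.from_pretrained(
        "bniladridas/speech-recognition-ai-fine-tune")
    processor = WhisperProcessor.from_pretrained(
        "bniladridas/speech-recognition-ai-fine-tune")
    inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
    generated_ids = model.generate(inputs.input_features)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```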

Model Storage

Models are saved by default to:

models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]

Example: models/speech_recognition_ai_fine_tune_whisper_20250225

To customize the save path:

export MODEL_SAVE_PATH="/your/custom/path"
python main.py --model_type [whisper|wav2vec2]
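
Assembling that path in code might look like the following sketch: MODEL_SAVE_PATH overrides the default models directory, and the date suffix matches the YYYYMMDD example above. The function name is illustrative.

```python
# Sketch of save-path construction with an environment-variable override.
import os
from datetime import datetime
from typing import Optional

def save_dir(model_type: str, base: Optional[str] = None,
             date: Optional[str] = None) -> str:
    """Build models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]."""
    base = base or os.environ.get("MODEL_SAVE_PATH", "models")
    date = date or datetime.now().strftime("%Y%m%d")
    return os.path.join(base, f"speech_recognition_ai_fine_tune_{model_type}_{date}")
```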

Requirements

  • Python 3.8+
  • PyTorch (torch==2.0.1 recommended)
  • Transformers (transformers==4.35.0 recommended)
  • Sounddevice (sounddevice==0.4.6)
  • Torchaudio (torchaudio==2.0.1)

A GPU is recommended for faster fine-tuning. See requirements.txt for the full list.

Model Details

  • Task: Automatic Speech Recognition (ASR)
  • Base Models:
    • Whisper: openai/whisper-small
    • Wav2Vec2: facebook/wav2vec2-base-960h
  • Fine-tuning: Trained on live 16kHz mono audio recordings with a batch size of 8, using the Adam optimizer (learning rate 1e-5).
  • Input: 16kHz mono audio
  • Output: Text transcription
  • Language: English
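
The fine-tuning hyperparameters above can be wired up roughly as follows. This is a sketch of one incremental update step under those settings, not the repository's train.py; the step function name is illustrative.

```python
# Sketch of one incremental fine-tuning update (Adam, lr 1e-5, batch size 8).
import torch

BATCH_SIZE = 8        # from the Model Details section
LEARNING_RATE = 1e-5  # Adam learning rate from the Model Details section

def fine_tune_step(model, optimizer, input_features, labels):
    """One gradient update on a batch of live recordings; returns the loss."""
    outputs = model(input_features=input_features, labels=labels)  # HF models compute the loss
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()

if __name__ == "__main__":
    from transformers import WhisperForConditionalGeneration
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```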

Loading the Model (Hugging Face)

To load the models from Hugging Face:

from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
processor = WhisperProcessor.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
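
The Wav2Vec2 variant decodes with CTC rather than autoregressive generation. The sketch below loads the base checkpoint named in Model Details (this page does not list a separate fine-tuned Wav2Vec2 repo id); ctc_collapse illustrates what batch_decode does internally.

```python
# Sketch of greedy CTC decoding for the Wav2Vec2 path.

def ctc_collapse(ids, blank_id=0):
    """Collapse repeated frame labels and drop blanks, as CTC decoding does."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

def wav2vec2_transcribe(audio, sample_rate=16_000):
    """Greedy CTC transcription of a float32 mono clip."""
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```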

Repository Structure

speech-model/
├── dataset.py              # Audio recording and preprocessing
├── train.py                # Training pipeline
├── test_transcription.py   # Transcription testing
├── main.py                 # Main script for fine-tuning
├── README.md               # This file
└── requirements.txt        # Dependencies

Training Data

The models are fine-tuned on live audio recordings collected during runtime. No pre-existing dataset is required; users generate their own data via microphone input.

Evaluation Results

Evaluation metrics are not yet available. Future updates will include Word Error Rate (WER) comparisons against the base models.

License

Licensed under the MIT License. See the LICENSE file for details.

Model size: 242M parameters (Safetensors, F32).