Model Card for wav2vec2-lora-quantized

Model Details

  • Developer: Dhulipalla Gopi Chandu
  • Base Model: facebook/wav2vec2-base-960h
  • Techniques Used: LoRA, Quantization
  • Library: 🤗 Transformers
  • Task: Automatic Speech Recognition (ASR)
  • Language: English
  • License: Apache 2.0 (same as base model)

Model Description

This is the model card of a 🤗 Transformers model that has been pushed to the Hub. This model card was generated from the standard template.

  • Developed by: Dhulipalla Gopi Chandu
  • Funded by: Not applicable (independent project)
  • Shared by: Dhulipalla Gopi Chandu
  • Model type: Automatic Speech Recognition (ASR)
  • Language(s) (NLP): English
  • License: Apache 2.0 (same as base model)
  • Finetuned from model: facebook/wav2vec2-base-960h

Usage (Python)

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import soundfile as sf

model = Wav2Vec2ForCTC.from_pretrained("DhulipallaGopiChandu/wav2vec2-lora-quantized")
processor = Wav2Vec2Processor.from_pretrained("DhulipallaGopiChandu/wav2vec2-lora-quantized")

speech, rate = sf.read("audio.wav")
inputs = processor(speech, return_tensors="pt", sampling_rate=rate)
logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

transcription = processor.batch_decode(predicted_ids)
print(transcription)
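
The base wav2vec2 checkpoint expects 16 kHz mono audio. If your file uses a different sampling rate or has multiple channels, resample and downmix it before calling the processor. A minimal sketch using torchaudio (using torchaudio here is a suggestion; any resampler works):

import torchaudio

# torchaudio returns a (channels, samples) tensor plus the file's sampling rate.
waveform, rate = torchaudio.load("audio.wav")
if rate != 16000:
    # Resample to the 16 kHz rate the model was trained on.
    waveform = torchaudio.functional.resample(waveform, orig_freq=rate, new_freq=16000)
    rate = 16000
speech = waveform.mean(dim=0).numpy()  # downmix to mono
inputs = processor(speech, return_tensors="pt", sampling_rate=rate)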

Limitations & Risks

This model may not perform well on non-English or noisy audio. Validate it for fairness before using it in production.

Uses

This model is intended for automatic speech recognition (ASR) tasks. It can transcribe spoken English audio into text with reasonable accuracy and efficiency, thanks to LoRA fine-tuning and quantization.

Direct Use

Speech-to-text applications for English language input.

Voice-controlled assistants and accessibility tools for the hearing impaired.

Transcription tools for meetings, lectures, interviews, or podcasts.

Users can load and use the model directly, without additional training.

Downstream Use

This model can be integrated into larger systems like voice bots, real-time captioning systems, or automated subtitling software.

It may also serve as a base model for further fine-tuning on domain-specific datasets (e.g., medical speech, call center logs).

Out-of-Scope Use

Non-English audio input: This model is not fine-tuned for multilingual or non-English datasets.

Real-time safety-critical systems (e.g., medical decision-making, emergency call processing) without validation.

Noisy or overlapping speech scenarios: Performance may drop in low-quality or multi-speaker environments.

Bias, Risks, and Limitations

While this model performs well on clean English audio, it has limitations and potential biases:

Accents and dialects: May underperform on heavily accented speech not present in the training data.

Background noise sensitivity: Not ideal for environments with high noise levels.

Bias in training data: If the base model or fine-tuning data was unbalanced (e.g., by gender, region, or age), recognition performance may vary across different demographics.

Recommendations

Use this model in conjunction with human-in-the-loop systems when high accuracy is critical.

Test the model's performance on your specific audio environment and user group before production deployment.

Consider fine-tuning on your domain-specific data if accuracy is suboptimal for your needs.

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

How to Get Started with the Model

Use the code in the Usage section above to load the model and transcribe an audio file.

Training Details

Training Data

The model was fine-tuned using the LibriSpeech ASR Corpus, specifically the train-clean-100 split. This dataset consists of 100 hours of clean speech read by native English speakers. It is commonly used for automatic speech recognition tasks and includes paired audio-transcription data.

Data Summary:

Source: LibriVox audiobooks (public domain)

Language: English

Audio Format: 16kHz sampled mono-channel WAV files

Additional preprocessing steps:

All audio was resampled to 16kHz

Transcriptions were lowercased and stripped of punctuation

Audio longer than 20 seconds was truncated or split into segments
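
The train-clean-100 split can be pulled from the Hugging Face Hub with 🤗 Datasets; the sketch below uses the standard librispeech_asr dataset id and split name, which is an assumption about how the data was actually loaded here:

from datasets import load_dataset, Audio

# LibriSpeech train-clean-100 as hosted on the Hub (large download; streaming=True also works).
librispeech = load_dataset("librispeech_asr", "clean", split="train.100")

# Decode audio at the 16 kHz rate the model expects.
librispeech = librispeech.cast_column("audio", Audio(sampling_rate=16_000))

sample = librispeech[0]
print(sample["text"])                  # reference transcription
print(sample["audio"]["array"].shape)  # raw waveform samples at 16 kHz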

Training Procedure

Preprocessing

Audio and text inputs were processed using the Hugging Face Wav2Vec2Processor, which combines a tokenizer for text and a feature extractor for audio.

Example code:

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)
labels = processor.tokenizer(transcript, return_tensors="pt", padding=True).input_ids

Training Hyperparameters

Base Model: facebook/wav2vec2-base-960h

Fine-tuning Method: LoRA (Low-Rank Adaptation)

Precision: FP16 mixed precision

Epochs: 5

Batch Size: 8 (with gradient accumulation)

Learning Rate: 3e-4

LoRA Configuration:

Rank = 8

Alpha = 16

Dropout = 0.1

Warmup Ratio: 0.1

Optimizer: AdamW

Scheduler: Linear decay

Framework: Transformers + PEFT (Parameter-Efficient Fine-Tuning)

Preprocessing

The audio data was processed as follows:

Sampling Rate: All audio resampled to 16kHz (matching model requirements)

Format: .wav or .flac (mono-channel)

Duration: Audios longer than 20 seconds were filtered or chunked

Normalization: Audio normalized between -1 and 1

Text Normalization: Transcriptions were lowercased, punctuation removed (except apostrophes), and extra spaces collapsed

The tokenizer and feature extractor come from Wav2Vec2Processor:

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
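
The normalization rules above can be expressed as code; the sketch below is illustrative rather than the project's actual preprocessing script, and it reuses processor, audio_array, and transcript from the example earlier:

import re
import numpy as np

def normalize_text(transcript: str) -> str:
    """Lowercase, drop punctuation except apostrophes, collapse whitespace."""
    text = re.sub(r"[^a-z' ]+", " ", transcript.lower())
    return re.sub(r"\s+", " ", text).strip()

def normalize_audio(samples: np.ndarray) -> np.ndarray:
    """Scale the waveform into the [-1, 1] range by its peak value."""
    peak = np.max(np.abs(samples))
    return samples / peak if peak > 0 else samples

inputs = processor(normalize_audio(audio_array), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(normalize_text(transcript), return_tensors="pt").input_ids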

Training Hyperparameters

Training regime: fp16 mixed precision

Batch size: 8 (due to memory constraints)

Epochs: 5

Learning rate: 3e-4

Optimizer: AdamW

LoRA rank: 8

LoRA alpha: 16

LoRA dropout: 0.1

Gradient accumulation steps: 2

Warmup ratio: 0.1
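
The hyperparameters above map onto PEFT and the 🤗 Trainer roughly as follows. This is a sketch rather than the actual training script; the target_modules choice and the output path are assumptions, and AdamW with linear decay are the Trainer defaults:

from transformers import Wav2Vec2ForCTC, TrainingArguments
from peft import LoraConfig, get_peft_model

base = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# LoRA adapters with rank 8, alpha 16, dropout 0.1 on the attention projections.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="wav2vec2-lora",          # illustrative output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=5,
    learning_rate=3e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    fp16=True,
)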

Speeds, Sizes, Times

Training time: ~2 hours (Google Colab Pro+ with T4 GPU)

Model size after LoRA + quantization: ~123 MB

Original model size: ~360 MB

Upload checkpoint size: ~150 MB (includes processor + model card)

Inference speed: ~2.3x faster than full precision model on CPU

Evaluation

This section describes the evaluation protocols and metrics used to assess model performance.

Testing Data, Factors & Metrics

Testing Data

Primary: LibriSpeech Test-clean

Secondary (custom): Indian English Speech Dataset (internal), 2-hour curated sample

Factors

Evaluation included variability across:

Speaker accents: American, British, Indian

Background noise: Clean and noisy (SNR ≥ 20dB)

Speech tempo: Normal and fast speech

Audio duration: 5s–15s segments

Metrics

The following evaluation metrics were used to assess the performance of the model:

WER (Word Error Rate): Measures the percentage of words incorrectly predicted. Lower is better.

WER = (S + D + I) / N

where S = substitutions, D = deletions, I = insertions, and N = total words in the reference.

CER (Character Error Rate): Similar to WER but at the character level. Useful when dealing with small-vocabulary datasets or noisy transcriptions.

These metrics help evaluate the model’s performance on speech-to-text tasks across varying input lengths and accents.
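
Both metrics can be computed with the jiwer package; using jiwer here is a suggestion rather than a record of the evaluation code (🤗 Evaluate's wer and cer metrics give the same numbers):

from jiwer import wer, cer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

print(wer(reference, hypothesis))  # 1 substitution over 9 reference words -> ~0.111
print(cer(reference, hypothesis))  # character-level error rate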

Results

WER on LibriSpeech test-clean: 7.15%

WER on LibriSpeech test-other: 12.4%

CER (internal test): ~4.6% (on custom Indian English speech dataset)

Summary

The LoRA-quantized version of facebook/wav2vec2-base-960h achieves competitive accuracy while drastically reducing model size and memory consumption. It is well-suited for deployment in edge environments or low-resource applications.

Model Examination

No formal interpretability analysis was conducted. However, the attention patterns and token activations from intermediate layers may be visualized using tools such as:

  • BertViz
  • transformers’ attention_visualizer (for Wav2Vec2 models)
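
Independently of those tools, the raw attention tensors can be pulled straight from the model; a minimal sketch, reusing model and inputs from the Usage example above:

import torch

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One tensor per encoder layer, each of shape (batch, heads, frames, frames).
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)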

Environmental Impact

Estimated using Lacoste et al. (2019):

Hardware Type: NVIDIA Tesla T4 (for training)

Hours Used: ~2 hours (for LoRA fine-tuning and quantization)

Cloud Provider: Google Colab

Compute Region: Asia-South1 (Mumbai)

Carbon Emitted: ~0.47 kg CO₂eq (Estimated via MLCO2 calculator)

Although emissions were small thanks to parameter-efficient training (LoRA) and quantization, we encourage users to follow green AI practices and re-use this checkpoint where possible.

Technical Specifications

Model Architecture and Objective

Base Model: facebook/wav2vec2-base-960h

Modified With: LoRA (Low-Rank Adaptation) + 8-bit Quantization

Objective: Speech-to-text transcription using CTC loss

Framework: Transformers

Pretraining Objective: Self-supervised learning on masked speech frames

Fine-tuning Objective: CTC on labeled datasets (LibriSpeech / custom)
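
Loading the checkpoint in 8-bit relies on bitsandbytes through transformers; the sketch below illustrates the technique rather than recording the exact quantization settings used for this checkpoint:

from transformers import Wav2Vec2ForCTC, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = Wav2Vec2ForCTC.from_pretrained(
    "DhulipallaGopiChandu/wav2vec2-lora-quantized",
    quantization_config=quant_config,
    device_map="auto",  # 8-bit bitsandbytes kernels require a CUDA device
)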

Compute Infrastructure

As previously described:

Hardware

GPU: NVIDIA Tesla T4 / A100 (depending on availability)

RAM: 16–32 GB

Storage: SSD for faster data loading

Software

transformers >= 4.36

datasets

peft

accelerate

bitsandbytes (for quantization)

huggingface_hub

torchaudio and soundfile for audio processing

Citation

BibTeX:

@misc{dhulipalla2025wav2vec2lora,
  title        = {Wav2Vec2 LoRA Quantized Model},
  author       = {Dhulipalla Gopi Chandu},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DhulipallaGopiChandu/wav2vec2-lora-quantized}}
}

APA:

Dhulipalla, G. C. (2025). Wav2Vec2 LoRA Quantized Model [Computer software]. Hugging Face. https://huggingface.co/DhulipallaGopiChandu/wav2vec2-lora-quantized

Glossary

ASR (Automatic Speech Recognition): The process of converting spoken language into text using machine learning models.

LoRA (Low-Rank Adaptation): A fine-tuning technique that enables efficient training of large models with fewer parameters.

Quantization: Technique that reduces the precision of the model’s weights to reduce memory and improve inference speed (e.g., from float32 to int8).

CTC (Connectionist Temporal Classification): A loss function used in speech-to-text tasks that enables alignment-free training.
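
For intuition, CTC scores a per-frame distribution over characters plus a blank symbol against an unaligned target string; a toy PyTorch sketch, where all shapes and the vocabulary size are made up for illustration:

import torch
import torch.nn as nn

T, N, C = 50, 2, 32                                        # frames, batch size, vocab size (index 0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)       # stand-in for per-frame model outputs
targets = torch.randint(1, C, (N, 20), dtype=torch.long)   # unaligned label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss)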

More Information

GitHub: github.com/DhulipallaGopiChandu

LinkedIn: linkedin.com/in/dhulipalla-gopi

Instagram: @dhulipalla_gopi_9999

Model Card Authors

Dhulipalla Gopi Chandu – B.Tech in AI & ML | AI Researcher | Speech & NLP Enthusiast (Hugging Face: DhulipallaGopiChandu)

Model Card Contact

If you have questions, suggestions, or issues related to this model, please contact: [email protected]
