Wav2Vec2-XLS-R-300m fine-tuned on recordings of native English speakers speaking Korean.
This repository contains a fine-tuned Wav2Vec2-XLS-R-300m model for the Automatic Speech Recognition (ASR) task. The model was trained and evaluated on "the spoken Korean voice of native English speakers" dataset provided by AIHub: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=71469
Creator & Uploader: Sehyun Oh ([email protected])
Data Information
Dataset Name: the spoken Korean voice of native English speakers.
Data Type: Speech recordings of English speakers speaking Korean.
Annotation: Each utterance is annotated with Korean words and phoneme sequences.
Train Set: 50,525 samples, 47.91 hours
Valid Set: 6,510 samples, 6.18 hours
Test Set: 6,315 samples, 6.03 hours
Training Procedure
The model was fine-tuned for ASR using the Hugging Face transformers
library. Below are the training steps:
- Data preprocessing to align audio with word labels.
- Wav2Vec2-XLS-R-300M model fine-tuning with CTC loss.
- Evaluation on validation and test sets.
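The preprocessing step above relies on mapping each transcript to a sequence of label IDs for CTC training. A minimal sketch of building a character-level CTC vocabulary, assuming transcripts are plain Korean strings (the token names follow the common Wav2Vec2 convention and are illustrative, not taken from the original training code):

```python
# Build a character-level vocabulary from transcripts for CTC training.
# In Hugging Face's Wav2Vec2CTCTokenizer convention, [PAD] doubles as
# the CTC blank token and "|" marks word boundaries.
def build_ctc_vocab(transcripts):
    chars = sorted(set("".join(transcripts)))
    vocab = {c: i for i, c in enumerate(chars)}
    # Replace the space character with the word-delimiter token.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

def encode(text, vocab):
    # Map each character to its label ID, falling back to [UNK].
    return [vocab.get(c if c != " " else "|", vocab["[UNK]"]) for c in text]
```

With a vocabulary like this, audio frames are aligned to the label sequence implicitly by the CTC loss rather than by explicit frame-level alignment.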
Training Hyperparameters
- Epochs: 50
- Learning Rate: 0.0001
- Warmup Ratio: 0.1
- Scheduler: Linear
- Batch Size: 8
- Loss Reduction: Mean
- Feature Extractor Freeze: Enabled
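The hyperparameters above map onto Hugging Face transformers settings roughly as follows. This is a hedged configuration sketch, not the author's exact training script; the output directory name is a placeholder:

```python
# Illustrative mapping of the listed hyperparameters to transformers.
from transformers import TrainingArguments, Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",      # Loss Reduction: Mean
)
model.freeze_feature_encoder()      # Feature Extractor Freeze: Enabled

training_args = TrainingArguments(
    output_dir="./wav2vec2-xls-r-300m-ko",  # placeholder path
    num_train_epochs=50,            # Epochs: 50
    learning_rate=1e-4,             # Learning Rate: 0.0001
    warmup_ratio=0.1,               # Warmup Ratio: 0.1
    lr_scheduler_type="linear",     # Scheduler: Linear
    per_device_train_batch_size=8,  # Batch Size: 8
)
```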
Test Results
The model was evaluated on the test dataset with the following performance:
- Word Error Rate (WER): 0.0130
- Character Error Rate (CER): 0.0069
- Phoneme Error Rate (PER): 0.0114
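All three metrics are edit-distance rates over different token granularities: WER over whitespace-separated words, CER over characters, and PER over phonemes. A self-contained sketch of how such scores are computed (a plain Levenshtein distance; real evaluations typically use a library such as jiwer or evaluate):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via single-row dynamic programming.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(ref, hyp):
    # Word Error Rate: edit distance over whitespace tokens.
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref, hyp):
    # Character Error Rate: edit distance over characters.
    return edit_distance(ref, hyp) / len(ref)
```

PER is computed the same way as CER, but over the annotated phoneme sequences instead of raw characters.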
Sample:
- Correct Sentence: 좋은 의견이 있으시면 의견란에 꼭 써 주시기 바랍니다 ("If you have a good suggestion, please be sure to write it in the comments section")
- Predicted Sentence: 좋은 의견이 있으시면 의견란에 꼭 써 주시기 바랍니다
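A transcription like the sample above can be reproduced with the transformers ASR pipeline. A minimal sketch; the repository id and audio filename below are placeholders, and the input should be a 16 kHz mono recording:

```python
from transformers import pipeline

# Placeholder repo id and audio path; substitute this model's actual
# Hugging Face repository and a local 16 kHz WAV file.
asr = pipeline("automatic-speech-recognition", model="<this-repo-id>")
result = asr("sample.wav")
print(result["text"])
```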
Training Logs
TensorBoard logs are available for detailed training analysis:
events.out.tfevents.1742786238.oem-WS-C621E-SAGE-Series.3352548.0, events.out.tfevents.1742889983.oem-WS-C621E-SAGE-Series.3352548.1
Use the following command to visualize logs:
tensorboard --logdir=./logs/