mBART-50 Sinhala Transliteration Model
This model transliterates Romanized Sinhala text to Sinhala script.
Model description
This is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt specialized for Sinhala transliteration. It converts Romanized Sinhala (Sinhala written in Latin characters) into Sinhala script.
Intended uses & limitations
This model is intended for converting Romanized Sinhala text to Sinhala script. Typical uses include:
- Text input conversion in applications
- Helping non-native speakers type in Sinhala
- Converting legacy Romanized text to Sinhala script
How to use
from transformers import MBartForConditionalGeneration, MBartTokenizerFast
# Load model and tokenizer
model_name = "deshanksuman/mbart_50_SinhalaTransliteration"
tokenizer = MBartTokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)
# Set language codes
tokenizer.src_lang = "en_XX" # Using English as source language token
tokenizer.tgt_lang = "si_LK" # Sinhala as target
# Prepare input
text = "heta api mkda krnne"
inputs = tokenizer(text, return_tensors="pt", max_length=128, padding="max_length", truncation=True)
# Generate output
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=96,
    num_beams=5,
    early_stopping=True,
)
# Decode output
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Training data
The model was trained on the deshanksuman/SwaBhasha_Transliteration_Sinhala dataset, which contains pairs of Romanized Sinhala and corresponding Sinhala script text.
Training procedure
The model was trained with the following parameters:
- Learning rate: 5e-05
- Batch size: 16
- Number of epochs: 2
- Max sequence length: 128
- Optimizer: AdamW
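For readers who want to set up a similar fine-tuning run, the listed values can be written as keyword arguments. This is only a sketch: the original training script is not shown, so using `transformers.Seq2SeqTrainingArguments` is an assumption, as are the per-device reading of the batch size and the `adamw_torch` optimizer choice. The names `training_kwargs` and `tokenizer_kwargs` are hypothetical.

```python
# Hypothetical mapping of the listed hyperparameters onto keyword names used by
# transformers.Seq2SeqTrainingArguments. The author's actual script may differ.
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,  # assumption: listed batch size is per device
    "num_train_epochs": 2,
    "optim": "adamw_torch",  # assumption: PyTorch implementation of AdamW
}

# The maximum sequence length is applied at tokenization time, not by the trainer.
tokenizer_kwargs = {"max_length": 128, "truncation": True}

# Possible usage (requires transformers; "out" is a placeholder directory):
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(output_dir="out", **training_kwargs)
```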
The model was trained on sentence-level inputs.
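Since training was done at sentence level, longer text may be handled better if it is split into sentences first and each sentence is transliterated on its own. The sketch below shows one way to do that. The helpers `split_sentences` and `transliterate_text` are hypothetical, the punctuation-based splitting rule is an assumption, and `transliterate_fn` stands in for any function that wraps the model call from "How to use".

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split text on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def transliterate_text(text: str, transliterate_fn: Callable[[str], str]) -> str:
    """Apply a sentence-level transliteration function to each sentence."""
    return " ".join(transliterate_fn(sentence) for sentence in split_sentences(text))


# Demonstration with a stand-in function instead of the real model.
sample = "heta api mkda krnne? Okoma hodai ganu gathiya."
sentences = split_sentences(sample)
demo = transliterate_text(sample, str.upper)
```

In practice `transliterate_fn` would tokenize one sentence, call `model.generate`, and decode the result.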
Examples:
Example 1:
- Romanized: Dakunu koreyawe eithihasika
- Expected: දකුණු කොරියාවේ ඓතිහාසික
- Predicted: දකුණු කොරියාවේ ඓතිහාසික
- Correct: True
Example 2:
- Romanized: Okoma hodai ganu gathiya
- Expected: ඔක්කොම හොදයි ගෑනු ගතිය
- Predicted: ඕකම හොදයි ගනු ගතිය
- Correct: False
Example 3:
- Romanized: Malki akkith ennwa nedenntm godak kemathiyakkila dennm supiriyatam dance
- Expected: මල්කි අක්කිත් එනව නෙදෙන්නටම ගොඩක් කෑමතියිඅක්කිල දෙන්නම සුපිරියටම ඩාන්ස්
- Predicted: මල්කි අක්කිත් එන්නව නෑද්දෑන්ත්ම ගොඩක් කෑමතියිඅකිල දෑන්ඩම් සුපිරියටම ඩාන්ස්
- Correct: False
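The "Correct" field in the examples above appears to be an exact string match between the expected and predicted text. A minimal sketch of that check follows, together with a softer position-wise word accuracy. The exact-match interpretation, the Unicode NFC normalization step, and the function names are assumptions and do not describe the author's evaluation code. The pairs are taken from Examples 1 and 2.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def exact_match(expected: str, predicted: str) -> bool:
    """True when the two strings are identical after normalization."""
    return normalize(expected) == normalize(predicted)


def word_accuracy(expected: str, predicted: str) -> float:
    """Fraction of word positions that match, over the longer word sequence."""
    exp_words = normalize(expected).split()
    pred_words = normalize(predicted).split()
    longest = max(len(exp_words), len(pred_words))
    if longest == 0:
        return 1.0
    matches = sum(e == p for e, p in zip(exp_words, pred_words))
    return matches / longest


# (expected, predicted) pairs copied from Examples 1 and 2.
pairs = [
    ("දකුණු කොරියාවේ ඓතිහාසික", "දකුණු කොරියාවේ ඓතිහාසික"),
    ("ඔක්කොම හොදයි ගෑනු ගතිය", "ඕකම හොදයි ගනු ගතිය"),
]

for expected, predicted in pairs:
    print(exact_match(expected, predicted), word_accuracy(expected, predicted))
```

Word accuracy gives partial credit when only some words are wrong, which exact match cannot show.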