---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
- text-classification
- spam
- english
---

# Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS

Evaluation result logs are available on GitHub: https://github.com/fzn0x/bert-sms-classification

This is my second Natural Language Processing (NLP) project, in which I fine-tuned a bert-base-uncased model to classify spam SMS. It is a huge improvement over my earlier project, https://github.com/fzn0x/bert-indonesian-english-hate-comments.

## How to use this model

```py
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained('fzn0x/bert-spam-classification-model')
model = BertForSequenceClassification.from_pretrained('fzn0x/bert-spam-classification-model')

# Use the GPU when available and switch to inference mode.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def model_predict(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    # Class index 1 is treated as spam, index 0 as ham.
    prediction = torch.argmax(logits, dim=1).item()
    return 'SPAM' if prediction == 1 else 'HAM'

def predict():
    text = "Hello, do you know with this crypto you can be rich? contact us in 88888"
    predicted_label = model_predict(text)
    print(f"1. Predicted class: {predicted_label}") # EXPECT: SPAM

    text = "Help me richard!"
    predicted_label = model_predict(text)
    print(f"2. Predicted class: {predicted_label}") # EXPECT: HAM

    text = "You can buy loopstation for 100$, try buyloopstation.com"
    predicted_label = model_predict(text)
    print(f"3. Predicted class: {predicted_label}") # EXPECT: SPAM

    text = "Mate, I try to contact your phone, where are you?"
    predicted_label = model_predict(text)
    print(f"4. Predicted class: {predicted_label}") # EXPECT: HAM

if __name__ == "__main__":
    predict()
```
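
The `model_predict` function above returns only a hard label. If a confidence score is useful, the two logits can be turned into a probability with softmax. The sketch below is a minimal illustration in plain Python. It assumes, as in the code above, that index 1 is the spam class. The names `spam_probability` and `label_with_threshold` and the `threshold` parameter are hypothetical, and the logits are made-up values, not real model output.

```python
import math

def spam_probability(logits):
    """Convert a [ham_logit, spam_logit] pair into P(spam) via softmax."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    # Index 1 is assumed to be the spam class, matching model_predict.
    return exps[1] / sum(exps)

def label_with_threshold(logits, threshold=0.5):
    """Return (label, P(spam)); `threshold` is a hypothetical tuning knob."""
    p_spam = spam_probability(logits)
    return ("SPAM" if p_spam >= threshold else "HAM"), p_spam

# Illustrative logits only; real values would come from outputs.logits.
label, p = label_with_threshold([-1.2, 2.3])
print(label, round(p, 3))
```

With the real model, the same probability can be read from `torch.softmax(logits, dim=1)[0, 1].item()`. Raising `threshold` above 0.5 trades missed spam for fewer false alarms on legitimate messages.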

## 📚 Citations

If you use this repository or its ideas, please cite the following works. See [`citations.bib`](./citations.bib) for full BibTeX entries.

- Wolf et al., *Transformers: State-of-the-Art Natural Language Processing*, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6)
- Pedregosa et al., *Scikit-learn: Machine Learning in Python*, JMLR 2011.
- Almeida & Gómez Hidalgo, *SMS Spam Collection v.1*, UCI Machine Learning Repository (2011). [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

## 🧠 Credits and Libraries Used

- [Hugging Face Transformers](https://github.com/huggingface/transformers) – model, tokenizer, and training utilities
- [scikit-learn](https://scikit-learn.org/stable/) – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
- Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes)