---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
- text-classification
- spam
- english
---
# Fine-tuned bert-base-uncased model for classifying spam SMS

See GitHub for the evaluation result logs: https://github.com/fzn0x/bert-sms-classification

This is my second Natural Language Processing (NLP) project, in which I fine-tuned a bert-base-uncased model to classify spam SMS messages. It is a big improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.

How to use this model?

```py
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('fzn0x/bert-spam-classification-model')
model = BertForSequenceClassification.from_pretrained('fzn0x/bert-spam-classification-model')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def model_predict(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    prediction = torch.argmax(logits, dim=1).item()
    return 'SPAM' if prediction == 1 else 'HAM'

def predict():
    text = "Hello, do you know with this crypto you can be rich? contact us in 88888"
    predicted_label = model_predict(text)
    print(f"1. Predicted class: {predicted_label}") # EXPECT: SPAM

    text = "Help me richard!"
    predicted_label = model_predict(text)
    print(f"2. Predicted class: {predicted_label}") # EXPECT: HAM

    text = "You can buy loopstation for 100$, try buyloopstation.com"
    predicted_label = model_predict(text)
    print(f"3. Predicted class: {predicted_label}") # EXPECT: SPAM

    text = "Mate, I try to contact your phone, where are you?"
    predicted_label = model_predict(text)
    print(f"4. Predicted class: {predicted_label}") # EXPECT: HAM

if __name__ == "__main__":
    predict()
```
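
The example above returns only the label. If you also want a confidence score, here is a minimal sketch that turns one row of logits into a label and a softmax probability. It assumes the same index convention as the code above (0 = HAM, 1 = SPAM); the helper name `logits_to_label` is hypothetical and not part of this repository.

```python
import math
from typing import List, Tuple

# Assumed label order, taken from the usage example above:
# index 0 = HAM, index 1 = SPAM.
LABELS = ("HAM", "SPAM")


def logits_to_label(logits: List[float]) -> Tuple[str, float]:
    """Return (label, probability) for one row of classifier logits."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability (first index wins on ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```

Inside `model_predict`, this could be called as `logits_to_label(outputs.logits[0].tolist())`, which lets you apply your own threshold instead of relying on the argmax alone.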

## 📚 Citations

If you use this repository or its ideas, please cite the works listed below. See [`citations.bib`](./citations.bib) for full BibTeX entries.

- Wolf et al., *Transformers: State-of-the-Art Natural Language Processing*, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6)
- Pedregosa et al., *Scikit-learn: Machine Learning in Python*, JMLR 2011.
- Almeida & Gómez Hidalgo, *SMS Spam Collection v.1*, UCI Machine Learning Repository (2011). [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

## 🧠 Credits and Libraries Used

- [Hugging Face Transformers](https://github.com/huggingface/transformers) – model, tokenizer, and training utilities
- [scikit-learn](https://scikit-learn.org/stable/) – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
- Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes)