Model Card for PII Detection with DeBERTa
This model is a fine-tuned version of microsoft/deberta for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) entities such as names, SSNs, phone numbers, credit card numbers, addresses, and more.
Model Details
Model Description
This transformer-based model is fine-tuned on a custom dataset to detect sensitive information, commonly categorized as PII. The model performs sequence labeling to identify entities using token-level classification.
- Developed by: Privatone
- Finetuned from model: microsoft/deberta
- Model type: Token Classification (NER)
- Language(s): English
- Use case: PII detection in text
Training Details
Training Data
The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types:
- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
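At the token level these entity types are typically represented with a BIO tagging scheme. The snippet below is a minimal sketch of what such a label set could look like; the authoritative mapping for this model should be read from its config (`id2label`) rather than assumed.

```python
from transformers import AutoConfig

# Hypothetical BIO-style label set for the entity types above (illustrative only)
ENTITY_TYPES = [
    "NAME", "SSN", "PHONE-NO", "CREDIT-CARD-NO",
    "BANK-ACCOUNT-NO", "BANK-ROUTING-NO", "ADDRESS",
]
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))

# The actual labels used by the fine-tuned model:
config = AutoConfig.from_pretrained("AI-Enthusiast11/pii-entity-extractor")
print(config.id2label)
```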
Epoch Logs
Epoch | Train Loss | Val Loss | Precision | Recall | F1 | Accuracy |
---|---|---|---|---|---|---|
1 | 0.3672 | 0.1987 | 0.7806 | 0.8114 | 0.7957 | 0.9534 |
2 | 0.1149 | 0.1011 | 0.9161 | 0.9772 | 0.9457 | 0.9797 |
3 | 0.0795 | 0.0889 | 0.9264 | 0.9825 | 0.9536 | 0.9813 |
4 | 0.0708 | 0.0880 | 0.9242 | 0.9842 | 0.9533 | 0.9806 |
5 | 0.0626 | 0.0858 | 0.9235 | 0.9851 | 0.9533 | 0.9806 |
SeqEval Classification Report
Label | Precision | Recall | F1-score | Support |
---|---|---|---|---|
ADDRESS | 0.91 | 0.94 | 0.92 | 77 |
BANK-ACCOUNT-NO | 0.91 | 0.99 | 0.95 | 169 |
BANK-ROUTING-NO | 0.85 | 0.96 | 0.90 | 104 |
CREDIT-CARD-NO | 0.95 | 1.00 | 0.97 | 228 |
NAME | 0.98 | 0.97 | 0.97 | 164 |
PHONE-NO | 0.94 | 0.99 | 0.96 | 308 |
SSN | 0.87 | 1.00 | 0.93 | 90 |
Summary
- Micro avg F1: 0.95
- Macro avg F1: 0.95
- Weighted avg F1: 0.95
Evaluation
Testing Data
Evaluation was done on a held-out portion of the same labeled dataset.
Metrics
- Precision
- Recall
- F1 (via seqeval)
- Entity-wise breakdown
- Token-level accuracy
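The entity-level numbers above are produced with seqeval. For reference, a report of this form can be generated from gold and predicted BIO tag sequences roughly as follows (the tag sequences here are illustrative, not taken from the actual evaluation set):

```python
from seqeval.metrics import classification_report, f1_score

# One list of BIO tags per sentence (illustrative examples only)
y_true = [["B-NAME", "I-NAME", "O", "B-PHONE-NO", "I-PHONE-NO", "O"]]
y_pred = [["B-NAME", "I-NAME", "O", "B-PHONE-NO", "O", "O"]]

print(classification_report(y_true, y_pred))
print("micro F1:", f1_score(y_true, y_pred))
```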
Results
- Per-entity F1-scores range from 0.90 to 0.97, with an average F1 of about 0.95 across labels, showing robust PII detection.
Recommendations
- Use human review in high-risk environments.
- Evaluate on your own domain-specific data before deployment.
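One way to implement the human-review recommendation above is to act automatically only on high-confidence predictions and route everything else to a reviewer. The pipeline's aggregated output includes a `score` per entity; the threshold below is an illustrative value, not one validated for this model:

```python
from transformers import pipeline

nlp = pipeline(
    "ner",
    model="AI-Enthusiast11/pii-entity-extractor",
    aggregation_strategy="simple",
)

REVIEW_THRESHOLD = 0.90  # illustrative cut-off; tune on your own labeled data

def split_by_confidence(ner_results, threshold=REVIEW_THRESHOLD):
    """Separate entities safe to auto-redact from those needing human review."""
    auto, review = [], []
    for entity in ner_results:
        (auto if entity["score"] >= threshold else review).append(entity)
    return auto, review

auto_entities, review_entities = split_by_confidence(
    nlp("You can reach me at 727-814-3902.")
)
```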
How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the NER pipeline, aggregating subword tokens into entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Post-processing logic to combine subword tokens into full entity strings
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # remove subword prefixes
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # A previous token of this type exists and this piece does not start
            # a new word, so merge it into the previous value
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)
    return entities

def redact_text_with_labels(text):
    ner_results = nlp(text)
    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)
    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each identified entity with its label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")
    return redacted_text

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and post-process the result
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the example and print the result
redacted_example = redact_text_with_labels(example)
print(f"\n==Redacted Example:==\n{redacted_example}")
```