---
library_name: transformers
datasets:
- unimelb-nlp/wikiann
- ai4privacy/open-pii-masking-500k-ai4privacy
- custom
language:
- en
- de
- es
- it
- fr
base_model:
- google/flan-t5-base
---

# Model Card for Flan-T5 Base Token Classifier (NER: LOC, ORG, PER)

This model is a fine-tuned, encoder-only version of `google/flan-t5-base` for **token-level Named Entity Recognition (NER)**. It predicts entity labels (e.g., LOC, ORG, PER) by classifying **individual tokens** in a prompting setup that marks the target token with `<tstart>` and `<tend>`.

---

## Model Details

### Model Description

This model is based on the encoder of the T5 architecture and has been fine-tuned for single-token classification using a prompt-driven approach. Given a sentence, one token is wrapped with `<tstart>` and `<tend>`, and the model predicts the corresponding entity class for that token.

- **Developed by:** pepegiallo
- **Model type:** Encoder-only token classifier
- **Language(s) (NLP):** en, de, fr, it, es
- **License:** MIT
- **Finetuned from model:** google/flan-t5-base

## Uses

### Direct Use

You can use this model to classify named entities (PER, ORG, LOC, or O) one token at a time. This approach is suitable for tasks such as:

- PII detection
- Privacy-preserving document redaction
- Legal or medical text anonymization

### Out-of-Scope Use

- Full-sequence tagging (the model is optimized for classifying one token at a time)
- Multi-token entity recognition without aggregation logic (see the aggregation sketch after the quickstart example below)

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model = AutoModelForSequenceClassification.from_pretrained("pepegiallo/flan-t5-base_ner")
tokenizer = AutoTokenizer.from_pretrained("pepegiallo/flan-t5-base_ner")

# Helper: wrap the target token with <tstart> and <tend> markers
# (the marker strings must match the ones used during fine-tuning)
def wrap_token(text, target_token, tstart="<tstart>", tend="<tend>"):
    return text.replace(target_token, f"{tstart} {target_token} {tend}")

text = "The headquarters of Microsoft is in Redmond."
target_token = "Microsoft"

prompt = "classify token in: " + wrap_token(text, target_token)
inputs = tokenizer(prompt, return_tensors="pt", padding="max_length", truncation=True, max_length=128)

outputs = model(**inputs)
label_id = torch.argmax(outputs.logits, dim=-1).item()

id2label = {0: "LOC", 1: "ORG", 2: "PER", 3: "O"}
print("Predicted entity:", id2label[label_id])
```
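The model classifies one marked token per forward pass, so tagging a full sentence requires looping over candidate tokens and merging adjacent predictions. The snippet below is a minimal aggregation sketch that reuses `model`, `tokenizer`, `wrap_token`, and `id2label` from the quickstart above; the whitespace tokenization and the span-merging heuristic are illustrative assumptions, not part of the released model.

```python
def tag_sentence(text):
    """Sketch: classify each whitespace-separated token, then merge
    consecutive identical non-O labels into entity spans."""
    words = text.split()  # naive whitespace tokenization (assumption)
    labels = []
    for word in words:
        prompt = "classify token in: " + wrap_token(text, word)
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        labels.append(id2label[torch.argmax(logits, dim=-1).item()])

    # Merge consecutive identical non-O labels into (span_text, label) pairs
    spans, current_words, current_label = [], [], None
    for word, label in zip(words, labels):
        if label == current_label and label != "O":
            current_words.append(word)
            continue
        if current_label not in (None, "O"):
            spans.append((" ".join(current_words), current_label))
        current_words, current_label = [word], label
    if current_label not in (None, "O"):
        spans.append((" ".join(current_words), current_label))
    return spans

print(tag_sentence("Angela Merkel visited Microsoft in Redmond"))
# Illustrative output: [('Angela Merkel', 'PER'), ('Microsoft', 'ORG'), ('Redmond', 'LOC')]
```

Because `wrap_token` replaces every occurrence of the target string, words that repeat (or appear inside other words) may be marked more than once; production code would wrap tokens by position instead.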
target_token = "Microsoft" prompt = "classify token in: " + wrap_token(text, target_token) inputs = tokenizer(prompt, return_tensors="pt", padding="max_length", truncation=True, max_length=128) outputs = model(**inputs) label_id = torch.argmax(outputs.logits, dim=-1).item() id2label = {0: "LOC", 1: "ORG", 2: "PER", 3: "O"} print("Predicted entity:", id2label[label_id]) ``` --- ## Training Details The model was fine-tuned for 3 epochs on a multilingual, balanced dataset combining: - `wikiann` (unimelb-nlp) - `open-pii-masking-500k-ai4privacy` - Custom annotated examples with `` / `` tags ### Training Hyperparameters - Model: google/flan-t5-base (encoder only) - Batch size: 128 - Max input length: 128 - Optimizer: AdamW - Learning rate: 3e-5 - Epochs: 3 ## Evaluation ### Metrics The model was evaluated using the following metrics: - Accuracy - Precision (Macro) - Recall (Macro) - F1 Score (Macro) ### Results | Epoch | Training Loss | Validation Loss | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | |-------|---------------|-----------------|----------|-------------------|----------------|------------| | 1 | 0.1702 | 0.1504 | 95.21% | 0.9521 | 0.9521 | 0.9521 | | 2 | 0.1444 | 0.1310 | 95.89% | 0.9588 | 0.9589 | 0.9589 | | 3 | 0.1290 | 0.1246 | 96.14% | 0.9614 | 0.9614 | 0.9614 | --- ## Environmental Impact - **Hardware Type:** [More Information Needed] - **Hours used:** [More Information Needed] - **Cloud Provider:** [More Information Needed] - **Compute Region:** [More Information Needed] ## Technical Specifications - **Architecture:** T5 Encoder + Dense Classification Head - **Precision:** fp32 - **Framework:** PyTorch + Huggingface Transformers ## Citation [optional] ```bibtex @misc{flan-t5-ner, title={Token Classification with Flan-T5 Encoder}, author={pepegiallo}, year={2025}, howpublished={\url{https://huggingface.co/pepegiallo/flan-t5-base_ner}} } ``` ## Model Card Contact For questions, contact: [https://huggingface.co/pepegiallo](https://huggingface.co/pepegiallo)