The first open BERT-based NER model for detecting Personally Identifiable Information (PII) in Uzbek text. Now with bank card detection! Trained on 475K+ samples, it reaches a 96.1% F1 score.
Uzbekistan introduced a new Personal Data Law that requires organizations to protect citizens' personal information. However, there were no tools available to automatically detect PII in Uzbek text before this model.
Personal data leaks happen in log files, training datasets for AI/ML, customer service chat histories, document archives, API responses, and financial transaction records. This model provides automatic PII detection that can be integrated into data pipelines, document processing systems, compliance audit tools, and financial services.
| Category | Pass Rate | Notes |
|---|---|---|
| CARD_NUMBER | 100% (8/8) | All card types detected |
| False Positives | 100% (24/24) | No over-detection |
| DOCUMENT_ID | 100% (8/8) | Passports, licenses |
| DATE | 100% (3/3) | Various formats |
| NAME | 100% (2/2) | Uzbek names |
| ADDRESS | 80% (8/10) | Some edge cases |
| PHONE | 80% (4/5) | Some confusion on denormalized (spelled-out) numbers |
The model predicts 7 labels: TEXT for non-PII tokens plus six PII entity types, including the new CARD_NUMBER type:

| Entity | Detects |
|---|---|
| CARD_NUMBER | Bank card numbers: UzCard, HUMO, Visa, Mastercard |
| NAME | Uzbek names, including full names with patronymics |
| PHONE | Phone numbers in various formats, with or without country codes |
| DATE | Dates in Uzbek format, dot format, and date ranges |
| ADDRESS | Addresses, including cities, districts, streets, and house numbers |
| DOCUMENT_ID | Passport numbers, driver's licenses (including HTA, HAA), and tax IDs |
pip install transformers torch
from transformers import pipeline

# Load the model
ner = pipeline(
    "ner",
    model="islomov/rubai-PII-detection-v1-latin",
    aggregation_strategy="simple"
)

# Detect PII (including card numbers!)
text = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
results = ner(text)

# TEXT marks non-PII spans, so skip it when printing
for entity in results:
    if entity['entity_group'] != 'TEXT':
        print(f"{entity['entity_group']}: {entity['word']}")
from transformers import pipeline

# Load the pipeline once and reuse it; reloading the model on every call is slow
ner = pipeline(
    "ner",
    model="islomov/rubai-PII-detection-v1-latin",
    aggregation_strategy="simple"
)

def has_card_number(text: str) -> bool:
    """Check if text contains any card numbers."""
    results = ner(text)
    return any(r["entity_group"] == "CARD_NUMBER" for r in results)

# Examples
print(has_card_number("Karta: 8600 1234 5678 9012"))  # True
print(has_card_number("Telefon: 90 123 45 67"))  # False
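For data pipelines that need to mask PII rather than just flag it, the pipeline output can drive a simple redaction step. The sketch below is an illustration, not part of the released model: `redact` is a hypothetical helper, and it assumes each result carries `start` and `end` character offsets, which the `transformers` NER pipeline provides when the model uses a fast tokenizer. The entities here are written by hand in that shape so the example runs without downloading the model; real predictions may split spans differently.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected PII span with a [LABEL] placeholder."""
    # TEXT marks non-PII spans, so only the other labels are redacted
    pii = [e for e in entities if e["entity_group"] != "TEXT"]
    # Replace from right to left so earlier character offsets stay valid
    for e in sorted(pii, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


# Hand-written entities mimicking pipeline output (assumption, not model output)
text = "Karta: 8600 1234 5678 9012"
entities = [
    {"entity_group": "TEXT", "start": 0, "end": 6},
    {"entity_group": "CARD_NUMBER", "start": 7, "end": 26},
]
print(redact(text, entities))  # Karta: [CARD_NUMBER]
```

With the real model, `redact(text, ner(text))` would apply the same replacement to whatever spans the model returns.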
Base Model: tahrirchi/tahrirchi-bert-base (Uzbek BERT)
Architecture: BERT for Token Classification
Parameters: ~110M
Hidden Size: 768
Attention Heads: 12
Layers: 12
Max Sequence Length: 256 tokens
Labels: 7 (TEXT, NAME, PHONE, DATE, ADDRESS, DOCUMENT_ID, CARD_NUMBER)
The model was trained on 475,185 samples - the largest Uzbek PII dataset to date, covering 100+ domains including banking, healthcare, education, e-commerce, government services, and financial transactions.
😉 Important Note: No real user personal data was used in training this model. All names, phone numbers, addresses, passport numbers, and card numbers in the dataset are 100% synthetic - generated specifically for training purposes. Your data is safe, we just made up a lot of fake Uzbek people!
| Data Source | Samples | Description |
|---|---|---|
| Original HuggingFace | ~150K | Standard text from rubai-NER-150K-Personal |
| LLM Generated | ~200K | Synthetic data with realistic PII patterns |
| Card Number Samples | ~72K | UzCard, HUMO, Visa in various formats |
| Denormalized | ~50K | Numbers written as words (ASR simulation) |
| Total | 475,185 | After cleaning |
Training a PII detection model for a low-resource language like Uzbek requires careful attention to data preparation and augmentation. Here's a guide based on this model's development:
Use a BERT model pre-trained on your target language. For Uzbek, tahrirchi/tahrirchi-bert-base provides strong language understanding. This dramatically reduces the amount of labeled data needed.
Cover multiple domains (banking, healthcare, education, government, etc.) to ensure the model generalizes well. Include both formal and informal text styles.
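Augment your data with format variations, for example card numbers with and without spaces, phone numbers with and without country codes, and numbers written out as words.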
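One way to produce such variations for card numbers is sketched below. `card_format_variants` is a hypothetical helper, not the author's augmentation code, and it covers only the two layouts this card mentions (grouped with spaces, and unbroken digits).

```python
import re
from typing import List


def card_format_variants(card: str) -> List[str]:
    """Return the same card number grouped in fours and as one unbroken string."""
    digits = re.sub(r"[^0-9]", "", card)  # keep digits only
    groups = [digits[i:i + 4] for i in range(0, len(digits), 4)]
    return [" ".join(groups), digits]


print(card_format_variants("8600 1234 5678 9012"))
# ['8600 1234 5678 9012', '8600123456789012']
```

Each variant changes the character length of the entity, so span labels need to be regenerated for every augmented sample.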
# Recommended hyperparameters (v1.3)
{
    "num_epochs": 5,
    "batch_size": 32,
    "gradient_accumulation_steps": 2,  # effective batch 64
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "max_seq_length": 256,
    "eval_split": 0.05,
    "fp16": True
}
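To get a feel for what these settings imply, the arithmetic below estimates the number of optimizer steps and warmup steps. It is a rough sketch under stated assumptions: training on a single device, the 5% evaluation split taken from the full 475,185 samples, and simple rounding. The actual run may have differed.

```python
import math

# Subset of the recommended hyperparameters that affects the step count
config = {
    "num_epochs": 5,
    "batch_size": 32,
    "gradient_accumulation_steps": 2,
    "warmup_ratio": 0.1,
    "eval_split": 0.05,
}
TOTAL_SAMPLES = 475_185  # dataset size reported in this card

eval_samples = round(TOTAL_SAMPLES * config["eval_split"])
train_samples = TOTAL_SAMPLES - eval_samples

# Assumes one device; with several devices the effective batch grows accordingly
effective_batch = config["batch_size"] * config["gradient_accumulation_steps"]
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * config["num_epochs"]
warmup_steps = round(total_steps * config["warmup_ratio"])

print(effective_batch)   # 64
print(steps_per_epoch)   # 7054
print(total_steps)       # 35270
print(warmup_steps)      # 3527
```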
When tokenizing, carefully align labels with subword tokens. Use -100 for special tokens and subword continuations to exclude them from loss calculation.
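The alignment rule above can be sketched as follows. This is a minimal illustration, not the author's training code: `align_labels` and the sample inputs are hypothetical, and the `word_ids` list mimics what a Hugging Face fast tokenizer returns from `BatchEncoding.word_ids()` (`None` for special tokens, otherwise the index of the source word). The label id `1` standing for NAME is also an assumption.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips this value by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token such as [CLS] or [SEP]
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)      # subword continuation
        previous = wid
    return aligned


# Assumed tokenization: [CLS] Sar ##dor Rustam ##ov [SEP]
word_ids = [None, 0, 0, 1, 1, None]
word_labels = [1, 1]  # hypothetical id 1 = NAME for both words
print(align_labels(word_ids, word_labels))  # [-100, 1, -100, 1, -100, -100]
```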
Key insight: Training on both original text (with digits) and denormalized text (numbers as words) helps the model work with both written documents and ASR/speech transcription outputs. For card numbers, include samples with and without spaces.
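A simple way to simulate denormalized text is to spell out digits as Uzbek words. The sketch below is an assumption about how such data could be generated, not the method used for this dataset: `denormalize_digits` is a hypothetical helper, it reads numbers digit by digit (real speech often groups them, for example as tens), and it writes the apostrophe in `to'rt` and `to'qqiz` as a plain ASCII character.

```python
import re

# Uzbek (Latin script) words for the digits 0-9
UZ_DIGITS = {
    "0": "nol", "1": "bir", "2": "ikki", "3": "uch", "4": "to'rt",
    "5": "besh", "6": "olti", "7": "yetti", "8": "sakkiz", "9": "to'qqiz",
}


def denormalize_digits(text: str) -> str:
    """Replace every run of ASCII digits with space-separated Uzbek digit words."""
    def spell(match: re.Match) -> str:
        return " ".join(UZ_DIGITS[d] for d in match.group())

    return re.sub(r"[0-9]+", spell, text)


print(denormalize_digits("Telefon: 90 123"))
# Telefon: to'qqiz nol bir ikki uch
```

Because the spelled-out form is longer than the original, entity labels have to be recomputed for the denormalized copy rather than reused from the digit version.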
This model is released under Creative Commons Attribution-NonCommercial 4.0. This license was chosen deliberately to balance open access with sustainable development.
Working on open-source AI projects for low-resource languages is genuinely hard. It takes countless hours of data collection, annotation, training, and testing - often done in evenings and weekends alongside other responsibilities.
The CC BY-NC license is an experiment: Can someone working on Uzbek NLP earn enough to continue building useful tools for the community?
If this works, it means more models, more datasets, and more open resources for Uzbek language AI. For non-commercial users - government, education, researchers, individuals - please use it freely. That's why these tools are built.
@misc{rubai-pii-detection-2026,
  author = {Sardor Islomov},
  title = {Rubai PII Detection v1.3 - Uzbek Personal Information Detector},
  year = {2026},
  publisher = {Hugging Face},
  organization = {Rubai AI},
  url = {https://huggingface.co/islomov/rubai-PII-detection-v1-latin},
  note = {BERT-based NER model for detecting PII in Uzbek text, including bank card numbers}
}