NER Model for Privacy

Rubai PII Detection v1.3

The first open BERT-based NER model for detecting Personally Identifiable Information (PII) in Uzbek text. Now with bank card detection! Trained on 475K+ samples, with a 96.1% F1 score.

Islomov Sardor
AI Researcher & Engineer

What's New in v1.3

Why This Model Exists

Uzbekistan introduced a new Personal Data Law that requires organizations to protect citizens' personal information. However, there were no tools available to automatically detect PII in Uzbek text before this model.

Personal data leaks happen in log files, training datasets for AI/ML, customer service chat histories, document archives, API responses, and financial transaction records. This model provides automatic PII detection that can be integrated into data pipelines, document processing systems, compliance audit tools, and financial services.


Performance Metrics

F1 Score: 96.1%
Precision: 96.1%
Recall: 96.2%
Card Detection: 100%
False Positives: 0%
Training Samples: 475K+

Test Suite Results (96 tests)

| Category | Pass Rate | Notes |
|---|---|---|
| CARD_NUMBER | 100% (8/8) | All card types detected |
| False Positives | 100% (24/24) | No over-detection |
| DOCUMENT_ID | 100% (8/8) | Passports, licenses |
| DATE | 100% (3/3) | Various formats |
| NAME | 100% (2/2) | Uzbek names |
| ADDRESS | 80% (8/10) | Some edge cases |
| PHONE | 80% (4/5) | Some denorm confusion |

Supported Entity Types

The model now detects 6 PII entity types, including the new CARD_NUMBER type. Counting the non-PII TEXT label, that makes 7 labels in total:

💳 CARD_NUMBER (New)

Detects bank card numbers: UzCard, HUMO, Visa, Mastercard.

8600 1234 5678 9012 (UzCard)
9860 9876 5432 1098 (HUMO)
4012 8888 8888 1881 (Visa)

👤 NAME

Detects Uzbek names including full names with patronymics.

Sardor Rustamov
Karimova Nilufar Shavkatovna
Abdullayev Sardor Karimovich

📞 PHONE

Detects phone numbers in various formats with or without country codes.

90 123 45 67
+998 91 234 56 78
901234567

📅 DATE

Detects dates in Uzbek format, dot format, and date ranges.

15-yanvar 2025-yil
25.12.2024
31-dekabr dan 2-yanvar gacha

🏠 ADDRESS

Detects addresses including cities, districts, streets, and house numbers.

Toshkent shahri Chilonzor tumani
chilonzor 12-uy
yunusobod 5-kvartal 23-uy

📄 DOCUMENT_ID

Detects passport numbers, driver's licenses (including HTA, HAA), and tax IDs.

AA1234567
HTA159387
HAA123456

Live Examples

Example 1: Bank Card Payment

Input: Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012
Detected PII:

NAME: Sardor Rustamov
PHONE: 90 123 45 67
CARD_NUMBER: 8600 1234 5678 9012

Example 2: Multiple Cards

Input: Kartalarim: 8600 1111 2222 3333 va 9860 4444 5555 6666
Detected PII:

CARD_NUMBER: 8600 1111 2222 3333
CARD_NUMBER: 9860 4444 5555 6666

Example 3: Full Customer Record

Input: Mijoz Rustam Aliyev 2024-yil 15-noyabr kuni 93 456 78 90 raqamidan qo'ng'iroq qildi. Manzili: Samarqand shahri Registon ko'chasi 10-uy. Pasport: AB7654321. Karta: 8600 1234 5678 9012.
Detected PII:

NAME: Rustam Aliyev
DATE: 2024-yil 15-noyabr
PHONE: 93 456 78 90
ADDRESS: Samarqand shahri Registon ko'chasi 10-uy
DOCUMENT_ID: AB7654321
CARD_NUMBER: 8600 1234 5678 9012

Example 4: No False Positives

Input: 5 ta olma oldim, narxi 15000 so'm. Jami 75000 so'm to'ladim. Oylik maoshi 5000000 so'm.
Detected PII: None

✅ Numbers correctly identified as prices/quantities, not PII

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import pipeline

# Load the model
ner = pipeline(
    "ner",
    model="islomov/rubai-PII-detection-v1-latin",
    aggregation_strategy="simple"
)

# Detect PII (including card numbers!)
text = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
results = ner(text)

for entity in results:
    if entity['entity_group'] != 'TEXT':
        print(f"{entity['entity_group']}: {entity['word']}")

Output:
NAME: Sardor Rustamov
PHONE: 90 123 45 67
CARD_NUMBER: 8600 1234 5678 9012

Check for Card Numbers

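from transformers import pipeline

# Load the model once and reuse it; rebuilding the pipeline on every call is slow.
ner = pipeline(
    "ner",
    model="islomov/rubai-PII-detection-v1-latin",
    aggregation_strategy="simple"
)

def has_card_number(text: str) -> bool:
    """Return True if the model tags any span in the text as CARD_NUMBER."""
    results = ner(text)
    return any(r["entity_group"] == "CARD_NUMBER" for r in results)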

# Examples
print(has_card_number("Karta: 8600 1234 5678 9012"))  # True
print(has_card_number("Telefon: 90 123 45 67"))       # False
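Detection is often followed by masking. The sketch below replaces each detected span with a placeholder such as [PHONE]. It assumes the entity dictionaries carry the `start` and `end` character offsets that the aggregated `ner` pipeline normally returns alongside `entity_group`; the `redact` helper itself is hypothetical and not part of the model or the library.

```python
from typing import Iterable, Mapping, Sequence


def redact(text: str, entities: Iterable[Mapping], skip_labels: Sequence[str] = ("TEXT",)) -> str:
    """Replace every detected span with a [LABEL] placeholder."""
    # Process spans from right to left so earlier offsets stay valid.
    spans = sorted(
        (e for e in entities if e["entity_group"] not in skip_labels),
        key=lambda e: e["start"],
        reverse=True,
    )
    for e in spans:
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


# Hand-written entities in the shape the pipeline returns (offsets are character indices).
sample = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
sample_entities = [
    {"entity_group": "NAME", "start": 0, "end": 15},
    {"entity_group": "PHONE", "start": 24, "end": 36},
    {"entity_group": "CARD_NUMBER", "start": 44, "end": 63},
]
print(redact(sample, sample_entities))
# [NAME] telefon [PHONE], karta [CARD_NUMBER]
```

With the real model, the call would be `redact(text, ner(text))`.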

Changelog

v1.3 (Current)
January 2026
  • Added CARD_NUMBER entity type (7 entities total)
  • Supports UzCard (8600), HUMO (9860), Visa (4), Mastercard (51-55)
  • Detects card numbers in spoken format (sakkiz olti nol nol...)
  • Dataset cleaned - removed 3,211 misaligned samples
  • Fixed 2,888 invalid/noise labels
  • Training data: 475,185 samples (up from 400K)
  • Added driver license formats (HTA, HAA)
  • F1 Score: 96.12% (improved from 95.7%)
v1.2
January 2026
  • Added denormalized text support for ASR output
  • Improved address detection
  • Added more document ID formats
v1.0
January 2026
  • Initial release with 5 entity types
  • 400K training samples
  • F1 Score: 95.7%

Model Architecture

Base Model: tahrirchi/tahrirchi-bert-base (Uzbek BERT)
Architecture: BERT for Token Classification
Parameters: ~110M
Hidden Size: 768
Attention Heads: 12
Layers: 12
Max Sequence Length: 256 tokens
Labels: 7 (TEXT, NAME, PHONE, DATE, ADDRESS, DOCUMENT_ID, CARD_NUMBER)
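
Because of the 256-token limit, long documents may need to be split before inference. The following is a minimal sketch of one way to do that, not the author's method: it windows the text by whitespace-separated words and keeps each chunk's character offset. The `chunk_words` helper and its default sizes are assumptions; words are only a rough proxy for subword tokens, so `max_words` should stay well below 256.

```python
import re
from typing import Iterator, Tuple


def chunk_words(text: str, max_words: int = 100, overlap: int = 10) -> Iterator[Tuple[int, str]]:
    """Yield (char_offset, chunk) windows of at most max_words words each."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    # Character span of every whitespace-separated word.
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    step = max_words - overlap
    for i in range(0, len(spans), step):
        window = spans[i:i + max_words]
        start, end = window[0][0], window[-1][1]
        yield start, text[start:end]
        if i + max_words >= len(spans):
            break  # the last window already reached the end of the text
```

Entity offsets returned for a chunk are relative to that chunk, so adding `char_offset` to `start` and `end` maps them back to the full document. Overlapping windows can report the same entity twice, which a caller would need to deduplicate.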

Training Data

The model was trained on 475,185 samples - the largest Uzbek PII dataset to date, covering 100+ domains including banking, healthcare, education, e-commerce, government services, and financial transactions.

😉 Important Note: No real user personal data was used in training this model. All names, phone numbers, addresses, passport numbers, and card numbers in the dataset are 100% synthetic - generated specifically for training purposes. Your data is safe; we just made up a lot of fake Uzbek people!

| Data Source | Samples | Description |
|---|---|---|
| Original HuggingFace | ~150K | Standard text from rubai-NER-150K-Personal |
| LLM Generated | ~200K | Synthetic data with realistic PII patterns |
| Card Number Samples | ~72K | UzCard, HUMO, Visa in various formats |
| Denormalized | ~50K | Numbers written as words (ASR simulation) |
| Total | 475,185 | After cleaning |

How to Train Such a Model

Training a PII detection model for a low-resource language like Uzbek requires careful attention to data preparation and augmentation. Here's a guide based on this model's development:

Step 1: Start with a Strong Base Model

Use a BERT model pre-trained on your target language. For Uzbek, tahrirchi/tahrirchi-bert-base provides strong language understanding. This dramatically reduces the amount of labeled data needed.

Step 2: Create a Diverse Dataset

Cover multiple domains (banking, healthcare, education, government, etc.) to ensure the model generalizes well. Include both formal and informal text styles.

Step 3: Data Augmentation

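Augment your data with format variations, such as phone and card numbers written with and without spaces, with and without the country code, and as digits or as words.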
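
One way to generate such variations is to re-space the same digits into several surface forms. This is a rough sketch under assumptions: the helper names are hypothetical, and the phone and card layouts are taken from the examples earlier on this page rather than from the actual training pipeline.

```python
import re
from typing import List, Sequence


def group_digits(number: str, groups: Sequence[int]) -> str:
    """Re-space a number into blocks of the given sizes, e.g. (4, 4, 4, 4)."""
    digits = re.sub(r"\D", "", number)
    if sum(groups) != len(digits):
        raise ValueError(f"expected {sum(groups)} digits, got {len(digits)}")
    parts, pos = [], 0
    for size in groups:
        parts.append(digits[pos:pos + size])
        pos += size
    return " ".join(parts)


def phone_variants(number: str) -> List[str]:
    """Spaced, compact, and country-code forms of a 9-digit Uzbek number."""
    spaced = group_digits(number, (2, 3, 2, 2))
    return [spaced, spaced.replace(" ", ""), "+998 " + spaced]


def card_variants(number: str) -> List[str]:
    """Spaced and compact forms of a 16-digit card number."""
    spaced = group_digits(number, (4, 4, 4, 4))
    return [spaced, spaced.replace(" ", "")]


print(phone_variants("901234567"))
# ['90 123 45 67', '901234567', '+998 90 123 45 67']
```

When a variant replaces the original span in a training sentence, the token labels have to be regenerated for the new span, since its length and token count change.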

Step 4: Training Configuration

# Recommended hyperparameters (v1.3)
training_config = {
    "num_epochs": 5,
    "batch_size": 32,
    "gradient_accumulation_steps": 2,  # effective batch 64
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "max_seq_length": 256,
    "eval_split": 0.05,
    "fp16": True
}
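
As a sketch of how these settings could be wired into a Hugging Face training run, the snippet below maps the keys above onto `TrainingArguments` parameter names and checks the effective batch size. The mapping and helper names are assumptions; `max_seq_length` and `eval_split` are left out because they apply at tokenization and dataset-split time rather than in `TrainingArguments`.

```python
from typing import Any, Dict

# Same values as the recommended hyperparameters above.
config: Dict[str, Any] = {
    "num_epochs": 5, "batch_size": 32, "gradient_accumulation_steps": 2,
    "learning_rate": 2e-5, "weight_decay": 0.01, "warmup_ratio": 0.1,
    "max_seq_length": 256, "eval_split": 0.05, "fp16": True,
}

# Assumed mapping from the keys above to TrainingArguments parameter names.
KEY_MAP = {
    "num_epochs": "num_train_epochs",
    "batch_size": "per_device_train_batch_size",
    "gradient_accumulation_steps": "gradient_accumulation_steps",
    "learning_rate": "learning_rate",
    "weight_decay": "weight_decay",
    "warmup_ratio": "warmup_ratio",
    "fp16": "fp16",
}


def to_training_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys TrainingArguments understands, under its own names."""
    return {KEY_MAP[k]: v for k, v in cfg.items() if k in KEY_MAP}


def effective_batch_size(cfg: Dict[str, Any], num_devices: int = 1) -> int:
    """Per-device batch size x accumulation steps x number of devices."""
    return cfg["batch_size"] * cfg["gradient_accumulation_steps"] * num_devices


print(effective_batch_size(config))  # 64
```

The resulting dictionary could then be passed as `TrainingArguments(output_dir="out", **to_training_kwargs(config))`, where `output_dir` is a placeholder. Note that `fp16=True` generally requires a CUDA-capable GPU.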

Step 5: Label Alignment

When tokenizing, carefully align labels with subword tokens. Use -100 for special tokens and subword continuations to exclude them from loss calculation.
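
The alignment rule can be sketched as follows. The function takes the per-token word indices that a fast tokenizer exposes through `word_ids()` (with `None` for special tokens) and labels only the first subword of each word. The helper name and the label ids in the example are hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # ignored by the cross-entropy loss used for token classification


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to subword tokens.

    The first subword of each word keeps the word's label; special tokens
    and subword continuations receive IGNORE_INDEX.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:            # [CLS], [SEP], padding
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first subword of a new word
            aligned.append(word_labels[word_id])
        else:                          # continuation of the same word
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Three words with labels [1, 0, 2]; the first and last split into several subwords.
print(align_labels([None, 0, 0, 1, 2, 2, 2, None], [1, 0, 2]))
# [-100, 1, -100, 0, 2, -100, -100, -100]
```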

Key insight: Training on both original text (with digits) and denormalized text (numbers as words) helps the model work with both written documents and ASR/speech transcription outputs. For card numbers, include samples with and without spaces.
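
A simple way to produce denormalized samples is to spell a number digit by digit, as in the "sakkiz olti nol nol..." example from the changelog. This sketch is an assumption about the approach, not the actual data-generation code; real ASR output may also read numbers in groups, which it does not cover.

```python
# Uzbek (Latin script) names of the digits 0-9.
UZ_DIGITS = {
    "0": "nol", "1": "bir", "2": "ikki", "3": "uch", "4": "to'rt",
    "5": "besh", "6": "olti", "7": "yetti", "8": "sakkiz", "9": "to'qqiz",
}


def spell_digits(number: str) -> str:
    """Spell a number digit by digit, skipping spaces and punctuation."""
    return " ".join(UZ_DIGITS[ch] for ch in number if ch in UZ_DIGITS)


print(spell_digits("8600"))
# sakkiz olti nol nol
```

Pairing each original sentence with its spelled-out version gives the model both surface forms of the same entity.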

Use Cases

Limitations

  • Latin script primarily - Cyrillic Uzbek text has limited support (62% accuracy)
  • Denormalized phone/card confusion - Phone numbers as words may sometimes be detected as cards
  • Addresses need context - Casual addresses with house numbers work; landmark-only references don't
  • Not detected: Email addresses, URLs, IP addresses (planned for v2)
  • Recommendations: For Cyrillic text, consider transliteration to Latin before processing. Validate card numbers with Luhn algorithm after detection.
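
The Luhn check recommended above can be sketched as a post-processing step on detected CARD_NUMBER spans. The brand prefixes come from the changelog; the helper names are hypothetical, and the sketch assumes the card schemes in question use Luhn check digits.

```python
import re
from typing import Optional


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if not digits:
        return False
    total = 0
    # Starting from the rightmost digit, double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def card_brand(number: str) -> Optional[str]:
    """Guess the card brand from its prefix (prefixes as listed in the changelog)."""
    digits = re.sub(r"\D", "", number)
    if digits.startswith("8600"):
        return "UzCard"
    if digits.startswith("9860"):
        return "HUMO"
    if digits.startswith("4"):
        return "Visa"
    if digits[:2] in {"51", "52", "53", "54", "55"}:
        return "Mastercard"
    return None


print(luhn_valid("4012 8888 8888 1881"), card_brand("4012 8888 8888 1881"))
# True Visa
```

A span that the model tags as CARD_NUMBER but that fails the checksum is a candidate false positive, for example a denormalized phone number mistaken for a card.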

License: CC BY-NC 4.0

This model is released under Creative Commons Attribution-NonCommercial 4.0. This license was chosen deliberately to balance open access with sustainable development.

FREE Use (Auto-Approved)
  • Government - Ministries, public agencies, municipalities
  • Education - Universities, schools, research institutes
  • Non-profit - NGOs, foundations, charities
  • Individuals - Students, researchers, personal projects
Requires Approval
  • Commercial products and services
  • For-profit company internal tools
  • Revenue-generating applications
  • Contact: islomov49@gmail.com

Why CC BY-NC 4.0?

Working on open-source AI projects for low-resource languages is genuinely hard. It takes countless hours of data collection, annotation, training, and testing - often done in evenings and weekends alongside other responsibilities.

The CC BY-NC license is an experiment: Can someone working on Uzbek NLP earn enough to continue building useful tools for the community?

If this works, it means more models, more datasets, and more open resources for Uzbek language AI. For non-commercial users - government, education, researchers, individuals - please use it freely. That's why these tools are built.

Citation

@misc{rubai-pii-detection-2026,
  author = {Sardor Islomov},
  title = {Rubai PII Detection v1.3 - Uzbek Personal Information Detector},
  year = {2026},
  publisher = {Hugging Face},
  organization = {Rubai AI},
  url = {https://huggingface.co/islomov/rubai-PII-detection-v1-latin},
  note = {BERT-based NER model for detecting PII in Uzbek text, including bank card numbers}
}