NER Model for Privacy

Rubai PII Detection v1.3

The first open BERT-based NER model for detecting Personally Identifiable Information (PII) in Uzbek text. Now with bank card detection! Trained on 475K+ samples with a 96.1% F1 score.

Islomov Sardor

AI Researcher & Engineer

What's New in v1.3

Bank Card Detection - Now detects UzCard (8600), HUMO (9860), Visa, and Mastercard numbers
Improved Accuracy - F1 score increased from 95.7% to 96.1%
Larger Dataset - Training data expanded to 475K+ samples (up from 400K)
Driver License Formats - Added support for HTA, HAA formats
100% Card Detection - All card number formats in the test suite were detected (8/8), with zero false positives on the 24 negative test cases

Why This Model Exists

Uzbekistan introduced a new Personal Data Law that requires organizations to protect citizens' personal information. However, there were no tools available to automatically detect PII in Uzbek text before this model.

Personal data leaks happen in log files, training datasets for AI/ML, customer service chat histories, document archives, API responses, and financial transaction records. This model provides automatic PII detection that can be integrated into data pipelines, document processing systems, compliance audit tools, and financial services.


Performance Metrics

Metric	Value
F1 Score	96.1%
Precision	96.1%
Recall	96.2%
Card Detection	100%
False Positives	0%
Training Samples	475K+

Test Suite Results (96 tests)

Category	Pass Rate	Notes
CARD_NUMBER	100% (8/8)	All card types detected
False Positives	100% (24/24)	No over-detection
DOCUMENT_ID	100% (8/8)	Passports, licenses
DATE	100% (3/3)	Various formats
NAME	100% (2/2)	Uzbek names
ADDRESS	80% (8/10)	Some edge cases
PHONE	80% (4/5)	Some confusion on denormalized (spelled-out) numbers

Supported Entity Types

The model now uses 7 labels: 6 PII entity types, including the new CARD_NUMBER type, plus TEXT for non-PII tokens.

💳

CARD_NUMBER (New)

Detects bank card numbers: UzCard, HUMO, Visa, Mastercard.

8600 1234 5678 9012 (UzCard)
9860 9876 5432 1098 (HUMO)
4012 8888 8888 1881 (Visa)

👤

NAME

Detects Uzbek names including full names with patronymics.

Sardor Rustamov
Karimova Nilufar Shavkatovna
Abdullayev Sardor Karimovich

📞

PHONE

Detects phone numbers in various formats with or without country codes.

90 123 45 67
+998 91 234 56 78
901234567

📅

DATE

Detects dates in Uzbek format, dot format, and date ranges.

15-yanvar 2025-yil
25.12.2024
31-dekabr dan 2-yanvar gacha

🏠

ADDRESS

Detects addresses including cities, districts, streets, and house numbers.

Toshkent shahri Chilonzor tumani
chilonzor 12-uy
yunusobod 5-kvartal 23-uy

📄

DOCUMENT_ID

Detects passport numbers, driver's licenses (including HTA, HAA), and tax IDs.

AA1234567
HTA159387
HAA123456

Live Examples

Example 1: Bank Card Payment

Input: Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012

Detected PII:

NAME: Sardor Rustamov
PHONE: 90 123 45 67
CARD_NUMBER: 8600 1234 5678 9012

Example 2: Multiple Cards

Input: Kartalarim: 8600 1111 2222 3333 va 9860 4444 5555 6666

Detected PII:

CARD_NUMBER: 8600 1111 2222 3333
CARD_NUMBER: 9860 4444 5555 6666

Example 3: Full Customer Record

Input: Mijoz Rustam Aliyev 2024-yil 15-noyabr kuni 93 456 78 90 raqamidan qo'ng'iroq qildi. Manzili: Samarqand shahri Registon ko'chasi 10-uy. Pasport: AB7654321. Karta: 8600 1234 5678 9012.

Detected PII:

NAME: Rustam Aliyev
DATE: 2024-yil 15-noyabr
PHONE: 93 456 78 90
ADDRESS: Samarqand shahri Registon ko'chasi 10-uy
DOCUMENT_ID: AB7654321
CARD_NUMBER: 8600 1234 5678 9012

Example 4: No False Positives

Input: 5 ta olma oldim, narxi 15000 so'm. Jami 75000 so'm to'ladim. Oylik maoshi 5000000 so'm.

Detected PII: None

✅ Numbers correctly identified as prices/quantities, not PII

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import pipeline

# Load the model
ner = pipeline(
    "ner",
    model="islomov/rubai-PII-detection-v1-latin",
    aggregation_strategy="simple"
)

# Detect PII (including card numbers!)
text = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
results = ner(text)

for entity in results:
    if entity['entity_group'] != 'TEXT':
        print(f"{entity['entity_group']}: {entity['word']}")

Output:
NAME: Sardor Rustamov
PHONE: 90 123 45 67
CARD_NUMBER: 8600 1234 5678 9012
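
When the results feed a downstream system, it can help to group them by entity type and drop low-confidence spans. The sketch below works on the list of dicts that the pipeline returns with aggregation_strategy="simple"; the `group_entities` helper, the `min_score` threshold, and the sample scores are illustrative assumptions, not part of the model.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group aggregated NER output by entity type, skipping TEXT spans.

    `min_score` is a hypothetical cutoff; tune it on your own data.
    """
    grouped = defaultdict(list)
    for entity in results:
        if entity["entity_group"] == "TEXT":
            continue  # non-PII span
        if entity["score"] < min_score:
            continue  # too uncertain to report
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written sample in the pipeline's output shape (scores are made up).
sample = [
    {"entity_group": "NAME", "score": 0.99, "word": "Sardor Rustamov"},
    {"entity_group": "TEXT", "score": 0.98, "word": "telefon"},
    {"entity_group": "PHONE", "score": 0.97, "word": "90 123 45 67"},
    {"entity_group": "CARD_NUMBER", "score": 0.31, "word": "8600"},
]

print(group_entities(sample))
```

In real use, pass `ner(text)` instead of the hand-written sample.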

Check for Card Numbers

from transformers import pipeline

# Load the model once and reuse it; rebuilding the pipeline on every call is slow
ner = pipeline(
    "ner",
    model="islomov/rubai-PII-detection-v1-latin",
    aggregation_strategy="simple"
)

def has_card_number(text: str) -> bool:
    """Check if text contains any card numbers."""
    results = ner(text)
    return any(r["entity_group"] == "CARD_NUMBER" for r in results)

# Examples
print(has_card_number("Karta: 8600 1234 5678 9012"))  # True
print(has_card_number("Telefon: 90 123 45 67"))       # False

Changelog

v1.3 (Current)

January 2026

Added CARD_NUMBER entity type (7 entities total)
Supports UzCard (8600), HUMO (9860), Visa (4), Mastercard (51-55)
Detects card numbers in spoken format (sakkiz olti nol nol...)
Dataset cleaned - removed 3,211 misaligned samples
Fixed 2,888 invalid/noise labels
Training data: 475,185 samples (up from 400K)
Added driver license formats (HTA, HAA)
F1 Score: 96.12% (improved from 95.7%)

v1.2

January 2026

Added denormalized text support for ASR output
Improved address detection
Added more document ID formats

v1.0

January 2026

Initial release with 5 entity types
400K training samples
F1 Score: 95.7%

Model Architecture

Base Model: tahrirchi/tahrirchi-bert-base (Uzbek BERT)
Architecture: BERT for Token Classification
Parameters: ~110M
Hidden Size: 768
Attention Heads: 12
Layers: 12
Max Sequence Length: 256 tokens
Labels: 7 (TEXT, NAME, PHONE, DATE, ADDRESS, DOCUMENT_ID, CARD_NUMBER)

Training Data

The model was trained on 475,185 samples - the largest Uzbek PII dataset to date, covering 100+ domains including banking, healthcare, education, e-commerce, government services, and financial transactions.

😉 Important Note: No real user personal data was used in training this model. All names, phone numbers, addresses, passport numbers, and card numbers in the dataset are 100% synthetic, generated specifically for training purposes. Your data is safe: we just made up a lot of fake Uzbek people!

Data Source	Samples	Description
Original HuggingFace	~150K	Standard text from rubai-NER-150K-Personal
LLM Generated	~200K	Synthetic data with realistic PII patterns
Card Number Samples	~72K	UzCard, HUMO, Visa in various formats
Denormalized	~50K	Numbers written as words (ASR simulation)
Total	475,185	After cleaning

How to Train Such a Model

Training a PII detection model for a low-resource language like Uzbek requires careful attention to data preparation and augmentation. Here's a guide based on this model's development:

Step 1: Start with a Strong Base Model

Use a BERT model pre-trained on your target language. For Uzbek, tahrirchi/tahrirchi-bert-base provides strong language understanding. This dramatically reduces the amount of labeled data needed.

Step 2: Create a Diverse Dataset

Cover multiple domains (banking, healthcare, education, government, etc.) to ensure the model generalizes well. Include both formal and informal text styles.

Step 3: Data Augmentation

Augment your data with format variations:

Phone numbers: With/without spaces, dashes, country codes
Dates: Different formats (15-yanvar, 15.01.2025, etc.)
Names: Different orderings (First Last, Last First)
Card numbers: With/without spaces, different card types
Denormalized text: Numbers written as words for ASR compatibility
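
The format variations above can be sketched as a few small generators. This is a minimal illustration, not the pipeline used to build the dataset: the function names are hypothetical, the phone helper assumes a 9-digit national number, and the denormalizer spells numbers digit by digit (as in "sakkiz olti nol nol").

```python
# Uzbek (Latin) words for the digits 0-9.
UZ_DIGITS = ["nol", "bir", "ikki", "uch", "to'rt",
             "besh", "olti", "yetti", "sakkiz", "to'qqiz"]


def card_variants(digits):
    """Return a card number without separators, with spaces, and with dashes."""
    groups = [digits[i:i + 4] for i in range(0, len(digits), 4)]
    return [digits, " ".join(groups), "-".join(groups)]


def phone_variants(national):
    """Return common spellings of a 9-digit national phone number."""
    parts = [national[:2], national[2:5], national[5:7], national[7:]]
    spaced = " ".join(parts)
    dashed = "-".join(parts)
    return [national, spaced, dashed, "+998 " + spaced, "+998" + national]


def spell_digits(text):
    """Denormalize: write every ASCII digit as a word, ignoring separators."""
    return " ".join(UZ_DIGITS[int(ch)] for ch in text if ch in "0123456789")


print(card_variants("8600123456789012"))
print(phone_variants("901234567"))
print(spell_digits("8600 1234"))
```

Each variant keeps the same entity label as the original span, so one labeled sample can yield several training samples.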

Step 4: Training Configuration

# Recommended hyperparameters (v1.3)
training_config = {
    "num_epochs": 5,
    "batch_size": 32,
    "gradient_accumulation_steps": 2,  # effective batch size: 32 * 2 = 64
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "max_seq_length": 256,
    "eval_split": 0.05,
    "fp16": True
}
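
To see what these settings mean in optimizer steps, the arithmetic can be worked through as below. This is a rough sketch assuming a single device, a 475,185-sample dataset, and an evaluation split taken before training; the `schedule_summary` helper is hypothetical, and the exact step count in a real run depends on the trainer.

```python
import math


def schedule_summary(num_samples, eval_split, batch_size,
                     grad_accum, num_epochs, warmup_ratio):
    """Estimate step counts implied by the hyperparameters (single device)."""
    eval_samples = int(num_samples * eval_split)
    train_samples = num_samples - eval_samples
    effective_batch = batch_size * grad_accum  # samples per optimizer step
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return {
        "train_samples": train_samples,
        "eval_samples": eval_samples,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


summary = schedule_summary(
    num_samples=475_185, eval_split=0.05, batch_size=32,
    grad_accum=2, num_epochs=5, warmup_ratio=0.1,
)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions the learning rate warms up over the first tenth of the optimizer steps and decays afterwards.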

Step 5: Label Alignment

When tokenizing, carefully align labels with subword tokens. Use -100 for special tokens and subword continuations to exclude them from loss calculation.
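
The alignment rule can be sketched as a small function. It assumes you already have the per-token word indices that a Hugging Face fast tokenizer exposes through `word_ids()` (None for special tokens); the `align_labels` name and the label ids in the example are illustrative, not the model's actual label mapping.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels):
    """Map word-level labels onto subword tokens.

    Only the first subword of each word keeps the word's label; special
    tokens and subword continuations get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # [CLS], [SEP], padding
            aligned.append(IGNORE_INDEX)
        elif word_id == previous:      # continuation of the same word
            aligned.append(IGNORE_INDEX)
        else:                          # first subword of a new word
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Words: ["Sardor", "Rustamov", "telefon"] with hypothetical ids 1 = NAME, 0 = TEXT.
word_ids = [None, 0, 0, 1, 1, 1, 2, None]
print(align_labels(word_ids, [1, 1, 0]))
```

Labeling only the first subword keeps one prediction per word, which makes word-level evaluation straightforward.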

Key insight: Training on both original text (with digits) and denormalized text (numbers as words) helps the model work with both written documents and ASR/speech transcription outputs. For card numbers, include samples with and without spaces.

Use Cases

Data Privacy Compliance - Detect PII before data sharing to comply with Uzbekistan's Personal Data Law
Financial Services - Detect and mask bank card numbers in transaction logs
Document Redaction - Automatically mask sensitive information in documents
Customer Support - Flag conversations containing personal data for special handling
ML Data Preparation - Clean training datasets by removing or masking PII
Log Sanitization - Filter PII from application logs before storage
Audit & Monitoring - Track PII exposure in communications and documents
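
For the redaction and log sanitization cases, detected spans can be replaced with placeholders using the character offsets in the pipeline output. The sketch below is one possible approach, assuming aggregated results that carry `start` and `end` keys; the `mask_pii` helper and the `[LABEL]` placeholder style are illustrative choices.

```python
def mask_pii(text, entities):
    """Replace each detected PII span with a [LABEL] placeholder."""
    masked = text
    # Work from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        if entity["entity_group"] == "TEXT":
            continue  # not PII, leave untouched
        placeholder = "[" + entity["entity_group"] + "]"
        masked = masked[:entity["start"]] + placeholder + masked[entity["end"]:]
    return masked


text = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
# Hand-written entities in the pipeline's output shape.
entities = [
    {"entity_group": "NAME", "start": 0, "end": 15},
    {"entity_group": "PHONE", "start": 24, "end": 36},
    {"entity_group": "CARD_NUMBER", "start": 44, "end": 63},
]

print(mask_pii(text, entities))
```

In real use, pass the output of `ner(text)` as `entities`.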

Limitations

Latin script primarily - Cyrillic Uzbek text has limited support (62% accuracy)
Denormalized phone/card confusion - Phone numbers written as words may sometimes be detected as card numbers
Addresses need context - Casual addresses with house numbers work; landmark-only references don't
Not detected: Email addresses, URLs, IP addresses (planned for v2)
Recommendations: For Cyrillic text, consider transliterating to Latin before processing. Validate card numbers with the Luhn algorithm after detection.
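
A Luhn check can be applied to each detected CARD_NUMBER span as a post-filter. The sketch below is a standard Luhn implementation; the `luhn_valid` name is illustrative, and whether every card scheme the model covers uses a Luhn check digit is an assumption you should confirm for your data.

```python
def luhn_valid(number):
    """Return True if the digits pass the Luhn checksum.

    Spaces and dashes are ignored; any other non-digit makes the
    number invalid.
    """
    cleaned = number.replace(" ", "").replace("-", "")
    if not cleaned or any(ch not in "0123456789" for ch in cleaned):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4012 8888 8888 1881"))
print(luhn_valid("4012 8888 8888 1882"))
```

Spans that fail the check can be downgraded or sent for manual review rather than dropped outright, since a failed checksum may also mean a typo in a real card number.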

License: CC BY-NC 4.0

This model is released under Creative Commons Attribution-NonCommercial 4.0. This license was chosen deliberately to balance open access with sustainable development.

✅ FREE Use (Auto-Approved)

Government - Ministries, public agencies, municipalities
Education - Universities, schools, research institutes
Non-profit - NGOs, foundations, charities
Individuals - Students, researchers, personal projects

⚠ Requires Approval

Commercial products and services
For-profit company internal tools
Revenue-generating applications
Contact: islomov49@gmail.com

Why CC BY-NC 4.0?

Working on open-source AI projects for low-resource languages is genuinely hard. It takes countless hours of data collection, annotation, training, and testing - often done in evenings and weekends alongside other responsibilities.

The CC BY-NC license is an experiment: Can someone working on Uzbek NLP earn enough to continue building useful tools for the community?

If this works, it means more models, more datasets, and more open resources for Uzbek language AI. For non-commercial users - government, education, researchers, individuals - please use it freely. That's why these tools are built.

Citation

@misc{rubai-pii-detection-2026,
  author = {Sardor Islomov},
  title = {Rubai PII Detection v1.3 - Uzbek Personal Information Detector},
  year = {2026},
  publisher = {Hugging Face},
  organization = {Rubai AI},
  url = {https://huggingface.co/islomov/rubai-PII-detection-v1-latin},
  note = {BERT-based NER model for detecting PII in Uzbek text, including bank card numbers}
}