Open Uzbek Text Correction Models

Rubai Corrector: 3 Models

A practical Uzbek text correction family built for real noise patterns: raw ASR transcripts, broken OCR from old books, lost apostrophes, mixed Uzbek/Russian lines, plus a base checkpoint that anyone can fine-tune for domain-specific cleanup.

Islomov Sardor
AI Researcher & Engineer

Why Rubai Corrector Exists

These models are not generic text beautifiers. They were built around real Uzbek failure modes: apostrophes disappearing, punctuation collapsing, numbers and dates becoming hard to read, OCR damage from scanned books, and transcript lines that are readable but not ready to show to users.

The family started from one strong correction backbone and then split into two focused specializations: one for transcript display cleanup after Uzbek ASR, and one for OCR recovery from older Uzbek books where damaged Cyrillic text must become usable Latin Uzbek again.

  • 🤗 Base Model on Hugging Face
  • 📝 Transcript Model
  • 📚 OCR Books Model
  • 📜 Fine-Tuning Script on HF

  • Shared technical base: google/byt5-small
  • Shared instruction format: correct: {text}
  • Core principle: noisy Uzbek text goes in, cleaner and more readable Uzbek text comes out.
  • Included here: direct links to all three checkpoints and the public fine-tuning script.

Model Family

At a glance: 3 current models, 1 shared backbone, 2 specialized directions, all built on the byte-level ByT5 base.
Foundation
rubai-corrector-base

The foundation checkpoint for Uzbek correction. This is the right starting point when you want to fine-tune your own corrector for banking, call-center text, OCR cleanup, or product UI text.

Tags: noise repair, number formatting, mixed-script cleanup
🤗 Open base checkpoint
ASR Display
rubai-corrector-transcript-uz

A ready-to-use postprocessor for Uzbek ASR output. It improves readability, restores punctuation, normalizes apostrophes, formats numbers and dates, and cleans mixed Uzbek/Russian transcript lines.

Tags: Rubai STT, Kotib STT, display-ready text
🤗 Open transcript checkpoint
Old Books OCR
rubai-corrector-ocr-books-uz

A specialized model for scanned Uzbek books. It prioritizes OCR damage correction first and Cyrillic-to-Latin recovery second, with attention to broken words, punctuation, and multiline prose.

Tags: Cyrillic to Latin, OCR repair, archive reuse
🤗 Open OCR checkpoint
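Since all three models share one instruction format, picking a checkpoint programmatically is a one-line lookup. A minimal sketch follows; note that only the transcript ID is stated verbatim in the Quick Start below, while the base and OCR IDs are assumptions that follow the same `islomov/` Hugging Face namespace and the model names shown above.

```python
# Checkpoint names for the Rubai Corrector family.
# Only the transcript ID is confirmed in this document's Quick Start;
# the other two IDs assume the same "islomov/" namespace.
CHECKPOINTS = {
    "base": "islomov/rubai-corrector-base",
    "transcript": "islomov/rubai-corrector-transcript-uz",
    "ocr_books": "islomov/rubai-corrector-ocr-books-uz",
}

def build_prompt(text: str) -> str:
    """All three models use the same instruction prefix."""
    return f"correct: {text}"

print(build_prompt("rnen universitetda oqiyrnan"))
```

Because the prefix never changes, swapping specializations means changing only the checkpoint ID, not the prompting code.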

Selected Model Outputs

The examples below come from the released models and show what each one is actually meant to do in practice.

Base Model Examples

rubai-corrector-base
Noise repair from OCR / ASR-like corruption
Input

rnen universitetda oqiyrnan

Output

Men universitetda o'qiyman

Number and formatting normalization
Input

narxi yigirma besh ming so'm

Output

Narxi 25 000 so'm

Mixed-script cleanup
Input

privet kak делa

Output

Привет как дела

These examples show that the base model is already a useful correction tool by itself: it repairs noisy text, normalizes numbers, and handles mixed-script cleanup.

Transcript Model Examples

rubai-corrector-transcript-uz
Mixed Uzbek + Russian sentence recovery
Input

bugun yaxshi kun. segodnya xoroshiy den.

Output

Bugun yaxshi kun. Сегодня хороший день.

Russian segment restored inside Uzbek context
Input

men bozorga bordim. tam ya kupil xleb.

Output

Men bozorga bordim. Там я купил хлеб.

Display-text polish for product UI
Input

kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin

Output

Kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin.

Text number becomes display-ready number
Input

narxi yigirma besh ming so'm

Output

Narxi 25 000 so'm

These examples show why this model works as a transcript display normalizer: it cleans mixed Uzbek/Russian lines, restores Russian text, normalizes written numbers, and makes raw transcripts display-ready.

OCR Books Model Examples

rubai-corrector-ocr-books-uz
Broken letters repaired through full Cyrillic-to-Latin recovery
Input

Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар.

Output

Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.

Lexical corruption, quotes, and apostrophes fixed together
Input

"Сотасанми?", дедим, "Сотаман", деди. Пул билан савдо келишмагандан кейин, ош билан айирбош килмокчи бўлди. Кўндим. Ошни козон-позон, ўчок дамгирлари билан шу унга алишдим.

Output

"Sotasanmi?", dedim, "Sotaman", dedi. Pul bilan savdo kelishmagandan keyin, osh bilan ayirbosh qilmoqchi bo'ldi. Ko'ndim. Oshni qozon-pozon, o'choq damgirlari bilan shu unga alishdim.

Verse layout preserved instead of flattened
Input

Айт-чн, йигит, Кимни ўйладинг, Куйлаганинг ким эди, Айт, ким? Ким ўртади ҳажрида сени, Кимга орзунг бўлмоқ муяссар?

Output

Ayt-chi, yigit, Kimni o'ylading, Kuylaganing kim edi, Ayt, kim? Kim o'rtadi hajrida seni, Kimga orzung bo'lmoq muyassar?

These examples demonstrate the OCR model's strengths: it repairs OCR damage, recovers readable Latin Uzbek, and preserves verse-like structure when needed.

How The Models Were Built

1

Robust base first

The family began with a correction backbone trained through a multi-stage curriculum so it would learn realistic corruption patterns instead of only one benchmark-style task.

2

Transcript specialization

After the base stabilized, it was fine-tuned for ASR display cleanup using trusted audio-backed pairs, punctuation polish, and safer mixed Uzbek/Russian targets.

3

OCR books specialization

The OCR model was trained separately from the base because old scanned books need a different focus: broken-word repair, scan noise recovery, and Cyrillic-to-Latin output for readable modern reuse.
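The training data itself is not public, but the failure modes the curriculum targets are described above: rn/m OCR confusion and disappearing apostrophes. The sketch below is illustrative only, showing the kind of synthetic (noisy, clean) pair such a curriculum would produce, not the actual data pipeline.

```python
import random

# Illustrative only: this is NOT the released training pipeline.
# It demonstrates the noise patterns named in this section:
# "m" -> "rn" OCR confusion and lost apostrophes.

def corrupt(text: str, rng: random.Random) -> str:
    """Apply OCR/ASR-style noise to clean Uzbek text."""
    noisy = text
    if rng.random() < 0.5:
        noisy = noisy.replace("m", "rn", 1)  # classic OCR letter confusion
    noisy = noisy.replace("'", "")           # apostrophes often disappear
    return noisy

rng = random.Random(0)
clean = "Men universitetda o'qiyman"
pair = {"input": f"correct: {corrupt(clean, rng)}", "target": clean}
print(pair)
```

A real curriculum would layer many more corruption types (punctuation loss, mixed-script substitutions, scan artifacts), but each stage reduces to generating pairs of this shape.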

Fine-Tune From The Base Model

Project Goal

Build a strong Uzbek correction base, publish the results, and make experimentation easier

A key goal of this family is to provide a strong Uzbek correction base for experiments and turn those experiments into public examples that demonstrate real value. Rather than keeping everything behind a closed product, the aim is to show that Uzbek correction work can be practical, reproducible, and accessible.

rubai-corrector-base is already a capable corrector on its own, and it is also the checkpoint meant for adaptation. If someone wants to build a model for their own noisy text, they should not have to start from scratch.

How To Start

Start from the base model and adapt it to your own data

  • Start from the base model instead of retraining a correction model from scratch.
  • Prepare paired data with noisy input and desired clean output.
  • Keep the same prefix: correct: {text}.
  • Use the provided tools and evaluate on the real task you actually care about.

The core point: the base model is already useful, the fine-tuning path is public, and Uzbek correction experiments should be easier to run than they might seem.
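Preparing the paired data from the steps above can be as simple as writing JSONL rows that keep the shared prefix. A minimal sketch, assuming hypothetical "input"/"target" field names; check the public fine-tuning script on HF for the exact schema it expects.

```python
import json
import pathlib
import tempfile

# Hypothetical field names ("input" / "target") -- verify against the
# public fine-tuning script before training.
pairs = [
    ("narxi yigirma besh ming so'm", "Narxi 25 000 so'm"),
    ("rnen universitetda oqiyrnan", "Men universitetda o'qiyman"),
]

def write_jsonl(pairs, path):
    """Write (noisy, clean) pairs, keeping the family's shared prefix."""
    with open(path, "w", encoding="utf-8") as f:
        for noisy, clean in pairs:
            row = {"input": f"correct: {noisy}", "target": clean}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

out = pathlib.Path(tempfile.mkdtemp()) / "train.jsonl"
write_jsonl(pairs, out)
print(out.read_text(encoding="utf-8").splitlines()[0])
```

From there, point the fine-tuning script at the file and evaluate on your real task, as the list above recommends.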

Quick Start

All models in the family use the same prefix, so switching between them is straightforward.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "islomov/rubai-corrector-transcript-uz"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "ertaga 15 mart kuni soat 17 30 da uchrashamiz"
inputs = tokenizer("correct: " + text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Example:
# Ertaga, 15-mart kuni soat 17:30 da uchrashamiz.
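One practical caveat: ByT5 operates on raw bytes, so long documents consume sequence length quickly. A hedged workaround is to correct sentence-sized chunks and rejoin them; the sketch below uses an assumed 512-byte budget, which is a working heuristic rather than a documented model limit.

```python
# Greedy sentence packing for long inputs. The 512-byte budget is an
# assumption, not a documented limit of the released checkpoints.
def chunk_text(text: str, max_bytes: int = 512) -> list[str]:
    """Pack ". "-separated fragments under a byte budget.

    Known limitation: a single fragment longer than the budget is
    emitted as-is rather than split further.
    """
    chunks, current = [], ""
    for part in text.split(". "):
        candidate = f"{current}. {part}" if current else part
        if len(candidate.encode("utf-8")) > max_bytes and current:
            chunks.append(current)
            current = part
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then prefixed with `correct: ` and sent through `model.generate` exactly as in the Quick Start, and the corrected chunks are rejoined with `". "`.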

Limitations

  • Transcript model: can be too editorial for literal transcript preservation.
  • Transcript model: more reliable on clean audio-backed data than on noisy mixed-script metadata.
  • OCR model: title pages, footnotes, glossary-heavy blocks, and repeated headers can still be hard.
  • OCR model: very damaged scans may still need manual review.
  • Base model: useful on its own, but primarily intended as a fine-tuning foundation.

Credits & Thanks

This work was led by Sardor Islomov, with Davron Ibrokhimov as co-author. Support and collaboration from MetaSell, Kotib, and Global Move are gratefully acknowledged.
