A practical Uzbek text correction family built for real noise patterns: raw ASR transcripts, broken OCR from old books, lost apostrophes, mixed Uzbek/Russian lines, and domain-specific cleanup that people can fine-tune themselves.
These models are not generic text beautifiers. They were built around real Uzbek failure modes: apostrophes disappearing, punctuation collapsing, numbers and dates becoming hard to read, OCR damage from scanned books, and transcript lines that are readable but not ready to show to users.
The family started from one strong correction backbone and then split into two focused specializations: one for transcript display cleanup after Uzbek ASR, and one for OCR recovery from older Uzbek books where damaged Cyrillic text must become usable Latin Uzbek again.
Shared technical base: google/byt5-small
Shared instruction format: correct: {text}
Core principle: noisy Uzbek text goes in, cleaner and more readable Uzbek text comes out.
Included here: direct links to all three checkpoints and the public fine-tuning script.
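The shared instruction format can be sketched as a one-line helper. This is our illustration, not code from the released repos; only the `correct:` prefix itself comes from the documentation above.

```python
# Minimal sketch of the shared instruction format used by all three
# checkpoints. The helper name is illustrative, not from released code.

def build_prompt(text: str) -> str:
    """Wrap noisy text in the family's shared `correct:` instruction format."""
    return "correct: " + text

print(build_prompt("rnen universitetda oqiyrnan"))
# correct: rnen universitetda oqiyrnan
```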
The foundation checkpoint for Uzbek correction. This is the right starting point when you want to fine-tune your own corrector for banking, call-center text, OCR cleanup, or product UI text.
A ready-to-use postprocessor for Uzbek ASR output. It improves readability, restores punctuation, normalizes apostrophes, formats numbers and dates, and cleans mixed Uzbek/Russian transcript lines.
A specialized model for scanned Uzbek books. It is about OCR damage correction first and Cyrillic-to-Latin recovery second, with attention to broken words, punctuation, and multiline prose.
The examples below come from the released models and show what each one is actually meant to do in practice.
Input:  rnen universitetda oqiyrnan
Output: Men universitetda o'qiyman

Input:  narxi yigirma besh ming so'm
Output: Narxi 25 000 so'm

Input:  privet kak делa
Output: Привет как дела
These examples show that the base model is already a useful correction tool by itself: it repairs noisy text, normalizes numbers, and handles mixed-script cleanup.
Input:  bugun yaxshi kun. segodnya xoroshiy den.
Output: Bugun yaxshi kun. Сегодня хороший день.

Input:  men bozorga bordim. tam ya kupil xleb.
Output: Men bozorga bordim. Там я купил хлеб.

Input:  kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin
Output: Kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin.

Input:  narxi yigirma besh ming so'm
Output: Narxi 25 000 so'm
These examples show why this model works as a transcript display normalizer: it cleans mixed Uzbek/Russian lines, restores Russian text, normalizes written numbers, and makes raw transcripts display-ready.
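The written-number normalization above ("yigirma besh ming" becoming "25 000") can be approximated with a toy rule-based parser. The sketch below is only our illustration of the target behavior; the released model learns this end to end rather than using lookup tables.

```python
# Toy rule-based sketch of Uzbek number-word normalization, illustrating
# the target behavior ("yigirma besh ming" -> "25 000"). This lookup-table
# approach is ours; the released model learns the mapping end to end.
UNITS = {"bir": 1, "ikki": 2, "uch": 3, "to'rt": 4, "besh": 5,
         "olti": 6, "yetti": 7, "sakkiz": 8, "to'qqiz": 9}
TENS = {"o'n": 10, "yigirma": 20, "o'ttiz": 30, "qirq": 40, "ellik": 50,
        "oltmish": 60, "yetmish": 70, "sakson": 80, "to'qson": 90}
SCALES = {"ming": 1_000, "million": 1_000_000}

def parse_number_words(words):
    """Parse a sequence of Uzbek number words into an integer."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "yuz":                   # hundreds multiply the running group
            current = max(current, 1) * 100
        elif w in SCALES:                  # thousands/millions close a group
            total += max(current, 1) * SCALES[w]
            current = 0
    return total + current

def format_sum(n):
    """Group thousands with spaces, matching the example outputs."""
    return f"{n:,}".replace(",", " ")

print(format_sum(parse_number_words(["yigirma", "besh", "ming"])))  # 25 000
```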
Input:  Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар.
Output: Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.

Input:  "Сотасанми?", дедим, "Сотаман", деди. Пул билан савдо келишмагандан кейин, ош билан айирбош килмокчи бўлди. Кўндим. Ошни козон-позон, ўчок дамгирлари билан шу унга алишдим.
Output: "Sotasanmi?", dedim, "Sotaman", dedi. Pul bilan savdo kelishmagandan keyin, osh bilan ayirbosh qilmoqchi bo'ldi. Ko'ndim. Oshni qozon-pozon, o'choq damgirlari bilan shu unga alishdim.

Input:  Айт-чн, йигит, Кимни ўйладинг, Куйлаганинг ким эди, Айт, ким? Ким ўртади ҳажрида сени, Кимга орзунг бўлмоқ муяссар?
Output: Ayt-chi, yigit, Kimni o'ylading, Kuylaganing kim edi, Ayt, kim? Kim o'rtadi hajrida seni, Kimga orzung bo'lmoq muyassar?
These examples demonstrate the OCR model's strengths: it repairs OCR damage, recovers readable Latin Uzbek, and preserves verse-like structure when needed.
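For contrast, plain Cyrillic-to-Latin transliteration is mechanical. The naive baseline below is our own sketch, not part of the released model, and it shows exactly the gap the learned OCR model closes: it converts script but passes OCR damage through, e.g. "чн" (a misread "чи") stays "chn" instead of becoming "chi".

```python
# Naive Uzbek Cyrillic -> Latin transliteration baseline (our sketch, not
# the released model). It converts script character by character, so OCR
# damage like "чн" (misread "чи") survives uncorrected -- the learned OCR
# model exists to close precisely this gap.
PAIRS = [
    ("ё", "yo"), ("ж", "j"), ("ц", "ts"), ("ч", "ch"), ("ш", "sh"),
    ("ю", "yu"), ("я", "ya"), ("ғ", "g'"), ("ў", "o'"), ("қ", "q"),
    ("ҳ", "h"), ("а", "a"), ("б", "b"), ("в", "v"), ("г", "g"),
    ("д", "d"), ("е", "e"), ("з", "z"), ("и", "i"), ("й", "y"),
    ("к", "k"), ("л", "l"), ("м", "m"), ("н", "n"), ("о", "o"),
    ("п", "p"), ("р", "r"), ("с", "s"), ("т", "t"), ("у", "u"),
    ("ф", "f"), ("х", "x"), ("э", "e"), ("ъ", "'"), ("ь", ""),
]

def translit(text):
    """Character-by-character Cyrillic -> Latin Uzbek replacement."""
    for cyr, lat in PAIRS:
        text = text.replace(cyr, lat).replace(cyr.upper(), lat.capitalize())
    return text

print(translit("кўз олдим"))   # ko'z oldim
print(translit("чнқаришга"))   # chnqarishga -- the OCR damage survives
```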
The family began with a correction backbone trained through a multi-stage curriculum so it would learn realistic corruption patterns instead of only one benchmark-style task.
After the base stabilized, it was fine-tuned for ASR display cleanup using trusted audio-backed pairs, punctuation polish, and safer mixed Uzbek/Russian targets.
The OCR model was trained separately from the base because old scanned books need a different focus: broken-word repair, scan noise recovery, and Cyrillic-to-Latin output for readable modern reuse.
A key goal of this family is to provide a strong Uzbek correction base for experiments and turn those experiments into public examples that demonstrate real value. Rather than keeping everything behind a closed product, the aim is to show that Uzbek correction work can be practical, reproducible, and accessible.
rubai-corrector-base is already a good tool for correction tasks by itself, and it is also the checkpoint meant for adaptation. If someone wants to build a model for their own noisy text, they should not have to start from zero.
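Because the base is ByT5, fine-tuning data can stay at the byte level. The sketch below shows how a noisy/clean pair maps to token ids under ByT5's fixed scheme (each UTF-8 byte becomes id byte + 3, with ids 0-2 reserved and 1 as EOS). The helper names and the pair text are illustrative only, not taken from the released fine-tuning script.

```python
# Sketch: turning noisy/clean pairs into ByT5-ready token ids for
# fine-tuning on top of rubai-corrector-base. ByT5 has no learned
# vocabulary: each UTF-8 byte maps to id byte + 3 (ids 0-2 are reserved;
# 1 is EOS). Helper names here are ours, not from the released script.
PREFIX = "correct: "
EOS_ID = 1

def encode_byt5(text):
    """Byte-level encoding matching ByT5's fixed id scheme."""
    return [b + 3 for b in text.encode("utf-8")] + [EOS_ID]

def build_pair(noisy, clean):
    """One seq2seq training example in the family's shared format."""
    return {
        "input_ids": encode_byt5(PREFIX + noisy),
        "labels": encode_byt5(clean),
    }

pair = build_pair("rnen universitetda oqiyrnan", "Men universitetda o'qiyman")
print(len(pair["input_ids"]), len(pair["labels"]))
```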
All models in the family use the same correct: {text} prefix, so switching between them is straightforward. The core point: the base model is already useful, the fine-tuning path is public, and Uzbek correction experiments should be easier to run than they might seem.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the transcript-cleanup checkpoint (the other family models load the same way).
model_id = "islomov/rubai-corrector-transcript-uz"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Wrap the noisy text in the shared "correct: " instruction format.
text = "ertaga 15 mart kuni soat 17 30 da uchrashamiz"
inputs = tokenizer("correct: " + text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Example output:
# Ertaga, 15-mart kuni soat 17:30 da uchrashamiz.
This work was led by Sardor Islomov, with Davron Ibrokhimov as co-author. Support and collaboration from MetaSell, Kotib, and Global Move are gratefully acknowledged.