TrOCR-LaTeX (fine-tuned on math handwriting)

Take your handwritten math and turn it into clean LaTeX code. This model is a fine-tuned version of microsoft/trocr-base-handwritten, a transformer-based optical character recognition model, adapted to handwritten mathematical expressions and structured LaTeX output.

Data

Fine-tuned on Google's MathWriting dataset, which contains over 500,000 digital inks of handwritten mathematical expressions, obtained through either manual labelling or programmatic generation.
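
MathWriting ships stroke data (digital inks) rather than raster images, so the inks have to be rendered before an image-based model like TrOCR can consume them. The helper below is a minimal sketch of such a rendering step using Pillow; the stroke format (a list of strokes, each a list of (x, y) points), canvas size, and line width are illustrative assumptions, not the actual training pipeline.

from PIL import Image, ImageDraw

def render_ink(strokes, canvas_size=(384, 256), margin=10):
    """Render a digital ink (list of strokes, each a list of (x, y) points)
    onto a white canvas. Assumed stroke format; not the MathWriting pipeline."""
    # Bounding box of all points, so the ink can be scaled to fit the canvas
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    scale = min(
        (canvas_size[0] - 2 * margin) / max(max_x - min_x, 1),
        (canvas_size[1] - 2 * margin) / max(max_y - min_y, 1),
    )

    image = Image.new('RGB', canvas_size, 'white')
    draw = ImageDraw.Draw(image)
    for stroke in strokes:
        points = [((x - min_x) * scale + margin, (y - min_y) * scale + margin)
                  for x, y in stroke]
        draw.line(points, fill='black', width=2)
    return image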

Intended use & limitations

You can use this model for OCR on a single math expression.

There is degraded performance on very long expressions; because of the image preprocessing, inputs with roughly a 3:2 aspect ratio seem to work best.

  • To work around this, split the image into subimages with an expression chunking scheme and process each one (a minimal sketch follows after this list).
  • To process multiple expressions, chunk them into groups of single expressions and run each through the model.
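
One simple chunking heuristic is to cut a wide image into horizontal tiles close to the 3:2 ratio mentioned above, preferring cut points that fall on near-white pixel columns so symbols are not sliced in half. Below is a minimal sketch of that idea using Pillow; the search window and whitespace threshold are illustrative assumptions, not part of the released model.

from PIL import Image

def chunk_expression(image: Image.Image, target_ratio: float = 1.5, ws_threshold: int = 250):
    """Split a wide expression image into sub-images of roughly 3:2 aspect ratio,
    preferring cuts at near-white columns (assumed heuristic)."""
    gray = image.convert('L')
    width, height = gray.size
    target_width = int(height * target_ratio)
    pixels = gray.load()

    # Mean intensity per pixel column; values near 255 indicate background
    column_mean = [sum(pixels[x, y] for y in range(height)) / height for x in range(width)]

    chunks, start = [], 0
    while width - start > target_width:
        candidate = start + target_width
        # Search a small window around the target cut for the whitest column
        window = range(max(start + 1, candidate - 20), min(width - 1, candidate + 20))
        cut = max(window, key=lambda x: column_mean[x])
        if column_mean[cut] < ws_threshold:
            cut = candidate  # no clean gap found; cut at the target width anyway
        chunks.append(image.crop((start, 0, cut, height)))
        start = cut
    chunks.append(image.crop((start, 0, width, height)))
    return chunks

Each chunk can then be run through the model individually and the decoded LaTeX fragments concatenated.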

How to use (PyTorch)

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Helper function (path to either a JPEG or PNG image)
def open_PIL_image(image_path: str) -> Image.Image:
    image = Image.open(image_path)
    # PNGs may carry an alpha channel; composite onto a white background
    if image_path.split('.')[-1].lower() == 'png':
        image = image.convert('RGBA')
        background = Image.new('RGBA', image.size, 'white')
        image = Image.composite(image, background, image).convert('RGB')
    return image


# Load model and processor from Hugging Face
processor = TrOCRProcessor.from_pretrained('tjoab/latex_finetuned')
model = VisionEncoderDecoderModel.from_pretrained('tjoab/latex_finetuned')


# Load all images as a batch (replace these paths with your own JPEG/PNG files)
paths = ['expression_1.png', 'expression_2.jpg']
images = [open_PIL_image(path) for path in paths]

# Preprocess the images 
preproc_image = processor.image_processor(images=images, return_tensors="pt").pixel_values

# Generate and decode the tokens
# NOTE: max_length default value is very small, which often results in truncated inference if not set 
pred_ids = model.generate(preproc_image, max_length=128)
latex_preds = processor.batch_decode(pred_ids, skip_special_tokens=True)
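
The decoded outputs are plain LaTeX strings, one per input image, and can be inspected directly:

# Print each prediction alongside its source path
for path, latex in zip(paths, latex_preds):
    print(f'{path}: {latex}')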

Training Details

  • Mini-batch size: 8
  • Optimizer: Adam
  • LR Scheduler: cosine
  • fp16 mixed precision
    • Trained using automatic mixed precision (AMP) with torch.cuda.amp for reduced memory usage.
  • Gradient accumulation
    • Used to simulate a larger effective batch size while keeping per-step memory consumption low.
    • Optimizer steps occurred every 8 mini-batches (a minimal sketch of this loop follows below).
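
For reference, here is a minimal sketch of a training step that combines torch.cuda.amp with accumulation over 8 mini-batches. The dataloader and batch field names are placeholders, not the actual training script; the model is the VisionEncoderDecoderModel loaded above.

import torch

# Adam optimizer over the model loaded above; dataloader is a placeholder
# yielding batches with 'pixel_values' and 'labels' tensors
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8  # optimizer step every 8 mini-batches

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    # Forward pass under fp16 autocast (AMP)
    with torch.cuda.amp.autocast():
        loss = model(pixel_values=batch['pixel_values'], labels=batch['labels']).loss
        loss = loss / accum_steps  # average gradients over the accumulation window

    scaler.scale(loss).backward()

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()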

Evaluation

Performance was evaluated using Character Error Rate (CER) defined as:

CER = (Substitutions + Insertions + Deletions) / Total Characters in Ground Truth

  • ✅ Why CER?

    • Math expressions are structurally sensitive. Shuffling even a single character can completely change the meaning.
      • x^2 vs. x_2
      • \frac{a}{b} vs. \frac{b}{a}
    • CER penalizes even small errors in syntax.
  • Evaluation yielded a CER of 14.9%.
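
For completeness, CER can be computed from the character-level Levenshtein distance between a prediction and its ground truth. Below is a minimal standard-library sketch, not the exact evaluation script used here.

def cer(prediction: str, ground_truth: str) -> float:
    """Character Error Rate: edit distance (substitutions + insertions + deletions)
    divided by the number of characters in the ground truth."""
    # Dynamic-programming Levenshtein distance over characters
    prev = list(range(len(ground_truth) + 1))
    for i, p in enumerate(prediction, start=1):
        curr = [i]
        for j, g in enumerate(ground_truth, start=1):
            cost = 0 if p == g else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ground_truth), 1)

print(cer('x_2', 'x^2'))  # one substitution over three characters -> ~0.33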

BibTeX and Citation

The original TrOCR model was introduced in the following paper:

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al.

You can find the source code in their repository.

@misc{li2021trocr,
      title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models}, 
      author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},
      year={2021},
      eprint={2109.10282},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}