Text-to-speech (TTS) document readers from Microsoft, Adobe, Apple, and OpenAI are in service worldwide. They read general plain text relatively well, but they sometimes skip mathematical expressions or read them unsatisfactorily. Most modern academic papers are written in LaTeX, and compiled LaTeX formulas are rendered as distinctive text forms within the document; conventional TTS document readers output only the text as it is recognized, without considering the mathematical meaning of the formulas. To address this issue, we propose MathReader, which integrates OCR, a fine-tuned T5 model, and TTS. On documents containing mathematical formulas, MathReader achieved a lower Word Error Rate (WER) than existing TTS document readers: 0.281, compared with 0.510 for Microsoft Edge and 0.617 for Adobe Acrobat. This should help alleviate the inconvenience faced by users who want to listen to documents, especially those who are visually impaired.
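WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming whitespace tokenization and no text normalization; the preprocessing used in the actual evaluation is not described here, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A reader that says "two" instead of "squared" makes one substitution
# in a four-word reference.
print(word_error_rate("x squared plus one", "x two plus one"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.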
1. OCR (Optical Character Recognition)
- Nougat-small, an OCR model based on a hierarchical vision transformer, converts PDF files into mmd (markup) files.
- These files contain both regular text and LaTeX formulas.
- Since TTS models cannot directly read LaTeX (due to special characters), further processing is required.
2. Extract Formulas
- Nougat marks LaTeX formulas with delimiters (e.g., `\[ ... \]` or `\( ... \)`), so every formula can be identified and separated from the surrounding text.
3. Fine-tuned T5 (LaTeX → Spoken English Translation)
- T5-small was fine-tuned on existing (LaTeX, spoken English) datasets.
- Treats formula reading as a translation task, converting mathematical LaTeX into natural spoken English.
- T5-small was chosen for speed and efficiency, while still producing good quality results.
4. Replace LaTeX with Spoken English
- The translated formulas are inserted back into the mmd file, replacing the LaTeX code with readable English text.
5. TTS (Text-to-Speech)
- The final processed text, now free of LaTeX, is passed to VITS, a TTS model.
- The resulting speech reads formulas aloud by their mathematical meaning instead of skipping them or reciting raw symbols.
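Steps 2 and 4 above can be sketched with the standard library alone. This is a hedged illustration, not the authors' implementation: it assumes formulas are delimited only by `\[ ... \]` and `\( ... \)`, and the names `extract_formulas`, `replace_formulas`, and `toy_translate` are hypothetical. In particular, `toy_translate` is a lookup-table stand-in for the fine-tuned T5-small model of step 3; the OCR and TTS stages are not reproduced.

```python
import re
from typing import Callable, List

# Display math \[ ... \] or inline math \( ... \); non-greedy, may span lines.
FORMULA_RE = re.compile(r"\\\[(.+?)\\\]|\\\((.+?)\\\)", re.DOTALL)


def _formula_body(match: "re.Match[str]") -> str:
    # Exactly one of the two groups is set, depending on the delimiter.
    body = match.group(1) if match.group(1) is not None else match.group(2)
    return body.strip()


def extract_formulas(mmd_text: str) -> List[str]:
    """Step 2: collect every delimited formula, in document order."""
    return [_formula_body(m) for m in FORMULA_RE.finditer(mmd_text)]


def replace_formulas(mmd_text: str, translate: Callable[[str], str]) -> str:
    """Step 4: swap each formula (delimiters included) for its spoken form."""
    return FORMULA_RE.sub(lambda m: translate(_formula_body(m)), mmd_text)


def toy_translate(latex: str) -> str:
    # Hypothetical placeholder for the LaTeX-to-spoken-English model.
    table = {"x^{2}": "x squared", "a+b": "a plus b"}
    return table.get(latex, latex)


text = r"The area is \(x^{2}\) and the sum is \[a+b\]."
print(extract_formulas(text))
print(replace_formulas(text, toy_translate))
```

Passing the translator in as a callable keeps the text handling independent of the model, so a real sequence-to-sequence model could be substituted without changing the extraction or replacement logic.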
@INPROCEEDINGS{HyeonICASSP25,
author={Hyeon, Sieun and Jung, Kyudan and Kim, Nam-Joon and Ryu, Hyun Gon and Do, Jaeyoung},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={MathReader : Text-to-Speech for Mathematical Documents},
year={2025},
pages={1-5},
keywords={Text recognition;Error analysis;Pipelines;Optical character recognition;Graphics processing units;Signal processing;Real-time systems;Mathematical models;Text to speech;Speech processing;OCR;T5;TTS;document reader;LaTeX},
doi={10.1109/ICASSP49660.2025.10890531}
}