Original scientific article

CONTEXT-AWARE RULE-BASED MATH EXPRESSION NORMALISER AND VERBALIZER USING LATEX2TEXT FOR ENHANCED DOCUMENT PREPROCESSING

By
J. Joice

Research Scholar, PG and Research Department of Computer Science, Government Arts and Science College, Tiruppur, India

C. Sathya

Assistant Professor, PG and Research Department of Computer Science, Government Arts and Science College, Tiruppur, India

Abstract

Blind students face a substantial barrier when reading and accessing electronic documents, particularly documents that are noisy or heavily formatted. Traditional NLP models frequently ignore or misinterpret mathematical expressions, in which meaning is carried by symbolic notation. This is a critical problem in education, accessibility, and report-generation applications, where accurate understanding of mathematical content is essential. State-of-the-art document summarisation systems tend to fail on noisy text, disordered document structures, and non-textual content such as equations, images, and charts. This paper introduces a preprocessing model that improves input quality, semantic coherence, and readability. The pipeline consists of sophisticated text cleaning, structure-aware segmentation, and an extensive content-interpretation model. The paper proposes simplifying and verbalising mathematical expressions using a rule-based, context-sensitive component called the Verbalizer Rule (VR). The system translates complex mathematical syntax into human-readable natural-language descriptions by pattern-matching expressions and resolving their semantic meaning from contextual cues. Experiments demonstrate that this method achieves markedly higher readability scores and summarisation quality than state-of-the-art models. In the evaluation, the proposed CARMEN model achieves ROUGE-1, ROUGE-2, and ROUGE-L scores above 0.8333, outperforming the other verbalizers.
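The rule-based verbalisation the abstract describes — pattern-matching LaTeX fragments and substituting natural-language readings — can be sketched as follows. This is a minimal illustration with a handful of hypothetical rules, not the authors' CARMEN/VR implementation; a full system would use a proper LaTeX parser (e.g. pylatexenc's latex2text module, which the title alludes to) rather than flat regular expressions, and its rules would not handle nested braces the way this sketch cannot.

```python
import re

# Hypothetical verbalisation rules, applied in order. Patterns of the form
# [^{}]+ deliberately ignore nested braces to keep the sketch simple.
RULES = [
    (re.compile(r"\\frac\{([^{}]+)\}\{([^{}]+)\}"), r"\1 divided by \2"),
    (re.compile(r"\\sqrt\{([^{}]+)\}"), r"the square root of \1"),
    (re.compile(r"([A-Za-z0-9]+)\^\{?2\}?"), r"\1 squared"),
    (re.compile(r"\\leq"), " less than or equal to "),
    (re.compile(r"="), " equals "),
]

def verbalise(expr: str) -> str:
    """Rewrite a LaTeX math fragment as a natural-language reading."""
    for pattern, replacement in RULES:
        expr = pattern.sub(replacement, expr)
    # Collapse the whitespace the substitutions may have introduced.
    return re.sub(r"\s+", " ", expr).strip()

print(verbalise(r"\frac{a}{b} = c^2"))  # a divided by b equals c squared
```

A context-sensitive verbalizer of the kind the paper proposes would additionally condition each rule on surrounding text (for instance, reading `^2` as "squared" only for scalar operands), which is the part flat pattern tables like this cannot express.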


This is an open access article distributed under the Creative Commons Attribution Non-Commercial (CC BY-NC) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

