Resolving High-Vowel Ambiguity (⟨ï⟩ / ⟨i⟩ / ⟨ı⟩) in OCR-Derived Old Turkic Editions: An Edition-Metadata-Driven Disambiguation Layer



Authors

DOI:

https://doi.org/10.32523/2664-5157-2026-2SI-229-241

Keywords:

Old Turkic, OCR, text normalization, TEI-P5, high vowels, graphemic ambiguity, digital philology, historical corpora, Turcology, disambiguation

Abstract

This study addresses high-vowel ambiguity in OCR-derived Old Turkic texts, in which the graphemic distinction between ⟨ï⟩, ⟨i⟩, and ⟨ı⟩ is frequently neutralized to ⟨i⟩. Existing normalization approaches fail to capture edition-specific orthographic conventions and the variability across editorial traditions. As a result, a significant portion of the original graphemic information is lost during digitization, which reduces the reliability of subsequent linguistic and philological analysis.

To resolve this problem, the paper proposes a dedicated disambiguation layer integrated into a two-layer TEI-P5 encoding framework (orig/reg). The layer operates strictly at the representation level: it does not attempt to reconstruct phonology or modify the original OCR output. Instead, it combines edition-specific metadata with rule-based linguistic cues, including lexical allow-lists, morphological constraints, vowel harmony patterns, and loanword profiles. The model follows a deterministic priority structure, so that each ambiguous case is resolved in a transparent, consistent, and reproducible manner. By design, the framework avoids probabilistic inference and prioritizes philological accountability over statistical generalization.

The model was evaluated on a dataset of 4,485 tokens drawn from thirteen Old Turkic editions published between 1919 and 2023. Of these, 1,837 tokens (about 41%) exhibited unresolved high-vowel ambiguity after normalization. The proposed method disambiguates approximately 88–93% of these cases, depending on the edition, and preserves unresolved forms through explicit TEI <unclear> annotation, so that ambiguity is not obscured but remains visible and accessible for further philological evaluation and interpretation. Comparative analysis across editions further confirmed the stability of the method under varying orthographic conventions.

The results demonstrate that a deterministic, edition-aware approach can significantly improve accuracy compared to baseline methods without sacrificing transparency, reversibility, or interpretability. The study also highlights the role of editorial traditions in shaping graphemic representation and shows that ambiguity can be managed effectively through structured metadata and constrained rule-based systems. The proposed framework offers a reproducible and extensible solution for enhancing the quality, consistency, and interoperability of OCR-derived historical corpora in Turcology and digital philology. Future work will focus on extending the framework to additional Turkic editions and integrating it with corpus-level search and annotation tools.
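
The deterministic priority structure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the rule order, the `Edition` record, the edition keys (`demo_de`, `demo_tr`), the allow-list entry, and the use of `@resp` and `@reason` values are all assumptions made for illustration, and morphological constraints and loanword profiles are omitted. The sketch tries an edition-specific allow-list first, then a simple vowel-harmony cue, and falls back to a TEI <unclear> element when no rule fires; the OCR reading is always kept in <orig>.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from xml.sax.saxutils import escape

BACK = set("aouïı")   # back vowels (both glyphs for the back high vowel)
FRONT = set("äeöü")   # front vowels other than the ambiguous <i>

@dataclass(frozen=True)
class Edition:
    key: str          # hypothetical edition identifier
    back_high: str    # glyph this edition prints for the back unrounded high vowel

# Hypothetical allow-list: (edition key, OCR token) -> regularized form.
ALLOW_LIST = {("demo_tr", "kiz"): "kız"}

def by_allow_list(token: str, ed: Edition) -> Optional[str]:
    return ALLOW_LIST.get((ed.key, token))

def by_harmony(token: str, ed: Edition) -> Optional[str]:
    others = {c for c in token if c in BACK or c in FRONT}
    has_back, has_front = bool(others & BACK), bool(others & FRONT)
    if has_back and not has_front:
        return token.replace("i", ed.back_high)   # back-vowel word
    if has_front and not has_back:
        return token                              # front-vowel word keeps <i>
    return None   # no cue, or mixed vowels (e.g. a loanword): do not guess

# Assumed priority order: the first rule that returns a form wins.
RULES = [("allow-list", by_allow_list), ("vowel-harmony", by_harmony)]

def disambiguate(token: str, ed: Edition) -> Tuple[Optional[str], str]:
    if "i" not in token:
        return token, "unambiguous"
    for name, rule in RULES:
        reg = rule(token, ed)
        if reg is not None:
            return reg, name
    return None, "unresolved"

def to_tei(token: str, ed: Edition) -> str:
    reg, source = disambiguate(token, ed)
    orig = escape(token)
    if reg is None:
        # Unresolved forms stay visible instead of being silently normalized.
        return f'<unclear reason="high-vowel">{orig}</unclear>'
    if source == "unambiguous":
        return f"<w>{orig}</w>"
    return (f"<choice><orig>{orig}</orig>"
            f'<reg resp="#{source}">{escape(reg)}</reg></choice>')

ed_de = Edition("demo_de", "ï")   # edition printing the back high vowel as ï
ed_tr = Edition("demo_tr", "ı")   # edition printing the back high vowel as ı

for tok, ed in [("altin", ed_de), ("altin", ed_tr), ("bilgä", ed_de), ("bir", ed_de)]:
    print(ed.key, to_tei(tok, ed))
```

Because the rules are tried in a fixed order and each one either returns a form or abstains, the same token in the same edition always receives the same output, and the rule that produced it is recorded alongside the regularized form.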


Author Biography

E. Uçar, Friedrich-Schiller-Universität

Doctor of Philology, Professor

References

Arat R.R., 1965. Eski Türk Şiiri [Old Turkic Poetry]. Ankara: Türk Tarih Kurumu Yayınları [Ankara: Turkish Historical Society Publications]. [in Turkish].

Bang W., 1923. Manichaeische Laien-Beichtspiegel [Manichaean Lay Confessional Mirror]. Le Muséon. 36. P. 137–242. [in German].

Carlson J. et al., 2023. Efficient OCR for building a diverse digital history. arXiv preprint arXiv:2304.02737.

Clauson G., 1972. An Etymological Dictionary of Pre-Thirteenth Century Turkish. Oxford: Clarendon Press.

Dietz S. et al., 2015. Die alttürkische Xuanzang-Biographie V: Nach der Handschrift von Leningrad, Paris und Peking sowie nach dem Transkript von Annemarie v. Gabain (Hrsg., Übers., Komm.) [The Old Turkic Xuanzang Biography V: Based on the Manuscripts from Leningrad, Paris and Beijing and the Transcript by Annemarie von Gabain (ed., trans., comm.)]. Wiesbaden: Harrassowitz Verlag. [in German].

Erdal M., Gippert J., Röhrborn K., Zieme P., Nevskaya I., Knüppel M., Özertural Z., Taube J., 2003. Vorislamische Alttürkische Texte: Elektronisches Corpus. [Electronic resource]. Available at: https://vatec2.fkidg1.uni-frankfurt.de

Geng S., 1989. A study of one newly discovered folio of the Uighur Abhidharmakośaśāstra. Central Asiatic Journal. 33. P. 36–45.

Hamilton J.R., 1971. Le conte bouddhique du bon et du mauvais prince en version ouïgoure [The Buddhist Tale of the Good and the Bad Prince in Uighur Version]. Paris: Klincksieck. [in French].

Kaya C., 2023. Uygurca Altun Yaruk: Belgeler [Uighur Altun Yaruk: Documents]. Ankara: Türk Dil Kurumu Yayınları [Ankara: Turkish Language Association Publications]. [in Turkish].

Le Coq A. von, 1919. Kurze Einführung in die uigurische Schriftkunde [A Short Introduction to Uighur Paleography]. Mitteilungen des Seminars für Orientalische Sprachen an der Friedrich-Wilhelms-Universität zu Berlin, Westasiatische Studien [Proceedings of the Seminar for Oriental Languages at the Friedrich Wilhelm University of Berlin, West Asian Studies]. 22. P. 93–109. [in German].

Özateş Ş. et al., 2025. Building foundations for natural language processing of historical Turkish: Resources and models. arXiv preprint arXiv:2501.04828. (Accessed: 20.04.26)

Röhrborn K., 1971. Eine uigurische Totenmesse: Text, Übersetzung, Kommentar [A Uighur Funeral Mass: Text, Translation, Commentary]. Berliner Turfantexte 1. Berlin: Akademie Verlag. [in German].

TEI Consortium, 2024. TEI P5: Guidelines for Electronic Text Encoding and Interchange (Version 4.7.0, tei-c.org). (Accessed: 20.04.26)

Uçar E., 2020. Türkiye’deki Eski Uygurca Metin Neşirleri İçin Kullanılacak Harfçevrim ve Yazıçevrim Kılavuzu [Grapheme and Transliteration Guide for Old Uighur Text Editions in Turkey]. Journal of Old Turkic Studies. 4(1). P. 231–250. [in Turkish].

Uçar E., 2021. Türkiye’deki Manihey Harfli Eski Uygurca Neşirler İçin Harfçevrim ve Yazıçevrim Kılavuzu [Grapheme and Transliteration Guide for Manichaean Script Old Uighur Editions in Turkey]. Journal of Old Turkic Studies. 5(1). P. 161–194. [in Turkish].

Uçar E., 2026. A Normalization Layer for Old Turkic Text Editions in OCR-Based Workflows. (forthcoming).

Zieme P. et al., 2022. Avalokiteśvara-Sūtras: Edition altuigurischer Übersetzungen nach Fragmenten aus Turfan und Dunhuang [Avalokiteśvara Sutras: Edition of Old Uighur Translations Based on Fragments from Turfan and Dunhuang]. Berliner Turfantexte 50. Turnhout: Brepols. [in German].


Published

2026-06-22

How to Cite

Uçar, E. (2026). Resolving High-Vowel Ambiguity (⟨ï⟩ / ⟨i⟩ / ⟨ı⟩) in OCR-Derived Old Turkic Editions: An Edition-Metadata-Driven Disambiguation Layer. Turkic Studies Journal, 229–241. https://doi.org/10.32523/2664-5157-2026-2SI-229-241

Issue

Section

Textology of Turkic Written Monuments