Typological Determination of the Architecture of National Language Corpora: a Comparative Analysis
DOI: 10.26577/EJPh202220262
Abstract
The article examines national language corpora as strategic resources for modern corpus linguistics and the digital humanities. The study is motivated by the need for a typologically oriented comparative analysis of national corpora that accounts for the structural characteristics of languages in corpus design. The aim is to compare the National Corpus of the Kazakh Language, the National Corpus of the Russian Language, and the Turkish National Corpus in order to determine how typological language structure influences corpus architecture and levels of linguistic annotation. The methodological framework combines descriptive and comparative methods, elements of qualitative corpus analysis, and a parametric comparison of corpus size, genre composition, and levels of morphological, morphosyntactic, and semantic annotation. The results show that the corpora of the agglutinative languages (Kazakh and Turkish) prioritize detailed morphological annotation that ensures accurate morpheme segmentation and lemmatization, whereas for inflectional Russian, multi-level morphosyntactic and semantic annotation plays the central role. The study thus substantiates the dependence of corpus architecture on the typological characteristics of a language. Its scientific contribution is a typologically grounded analytical model for comparing national language corpora. In practical terms, the findings can inform the further development of the National Corpus of the Kazakh Language, as well as educational practice and applied natural language processing tasks.
Keywords: corpus linguistics; national language corpus; language typology; morphological annotation; corpus architecture.








