This information was prepared from data in the RSF information-analytical system; the descriptive part is presented as written by the authors. All rights belong to the authors. Materials may be used or reprinted only with the authors' prior consent.



Project Number 22-28-20215

Project title Creation of the speech corpus of the Baltic-Finnic languages of Karelia

Project Lead Rodionova Aleksandra

Affiliation Karelian Research Centre of the Russian Academy of Sciences

Implementation period 2022 - 2023

Research area 08 - HUMANITIES AND SOCIAL SCIENCES, 08-453 - Linguistics

Keywords Speech corpus, Veps language, Karelian language, corpus linguistics, word-sense disambiguation, text tagging, multimedia map



The coming decade (2022-2032) will be the International Decade of Indigenous Languages, which will focus primarily on the rights of speakers of indigenous languages. In its strategic recommendations for the Decade, the Los Pinos Declaration emphasizes the rights of indigenous peoples, including the right to receive education in their own language and to participate in public life using their languages, as preconditions for the survival of indigenous languages, many of which are now on the verge of extinction. The declaration also points to the potential of digital technologies to support the use and preservation of these languages. Linguistic corpora are created to preserve this linguistic wealth and to support further study of the languages of indigenous peoples. In 2016, staff of the Institute of Linguistics, Literature and History and the Institute of Applied Mathematical Research, KarRC RAS, began creating a multilingual corpus called the Open Corpus of the Veps and Karelian Languages (VepKar). The main goal of this project is to create a speech (audio) corpus of Baltic-Finnic speech on the basis of VepKar. The speech module to be developed will be a collection of spoken texts in different dialects of the Karelian and Veps languages, provided with transcription, markup and translation into Russian. The relevance of the research stems, on the one hand, from the need to develop VepKar further, since it is in wide demand both in scientific research and in the development of literary forms of the Karelian and Veps languages. On the other hand, it stems from the insufficient study of the phonetic and phonological systems of Karelian and Vepsian dialectal speech, which is caused by the lack of a sufficient amount of high-quality linguistic audio material.
Applying modern technologies and techniques to the field material accumulated over many decades, together with the latest data, will make it possible to fill a number of gaps in the description of these systems that linguists have previously identified. The scientific novelty of the project lies in the lack of speech corpora of the Baltic-Finnic languages. Digitizing archival and field audio samples of Karelian and Vepsian speech in the speech corpus format will simplify the processing and storage of materials and will make it possible to introduce into scientific circulation, and to make openly accessible, unique audio materials reflecting the state of the Karelian and Vepsian dialects since the middle of the last century. These materials are stored in the Audio Archive of the Institute of Linguistics, Literature and History of the KarRC RAS and are in dire need of digitization to ensure their continued preservation. In the course of the work, new software modules for the VepKar corpus are planned for processing and analyzing Karelian and Veps audio material. One result of the project will be a multimedia map of the dialects of the Baltic-Finnic languages of Karelia, which will let anyone, without leaving home, become acquainted with the varieties of the languages of the indigenous peoples of the republic. A word pronunciation module will be created for including audio recordings in VepKar dictionary entries. The results of the project, which is aimed at studying the newly written Karelian and Veps languages and at preserving and popularizing their dialects, are expected to be in wide demand in Karelia not only for scientific research and language planning, but also in education, culture and tourism, and among ordinary users of the resource.

Expected results
A library of digitized audio recordings in different dialects of the Karelian and Veps languages (at least 100 texts), created to introduce previously unpublished materials to the scientific community.
A VepKar module for publishing, editing and searching audio recordings in the corpus. The module will support the display of markup made in external programs (for example, ELAN). It will allow KarRC employees to add archival audio recordings to the VepKar corpus in the future, and it will give the international community of linguists access to the continually updated VepKar speech corpus.
A study of the phonetic and phonological systems of Vepsian and Karelian speech in the synchronic and diachronic aspects, carried out on the basis of the audio materials. This will make it possible to determine standards and to establish the rules and norms of the newly written Baltic-Finnic languages of the region, and to clarify the available data on the development and state of the dialects of the Karelian and Veps languages.
A word audio recording module for adding audio pronunciations to the VepKar dictionary. In the future, audio for the most frequent words of the Karelian and Veps languages will be added to the dictionary.
A multimedia map of the dialects of the Veps and Karelian languages, which will make it possible to present the full variety of living and lost Baltic-Finnic dialectal speech of Karelia. The map can be used for educational purposes and for the development of tourism in the region.
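
The display of markup made in ELAN, mentioned among the expected results, could start from reading time-aligned segments along the following lines. This is a minimal sketch and not the VepKar implementation: it assumes the markup is stored in ELAN's XML-based EAF format, and the function name `read_segments`, the tier name `transcription` and the sample document are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF document (placeholder text, not real corpus data).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first segment</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second segment</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_segments(eaf_xml, tier_id):
    """Return (start_ms, end_ms, text) tuples for one tier, in document order."""
    root = ET.fromstring(eaf_xml)

    # Map each time slot ID to its offset in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((start, end, text))
    return segments


if __name__ == "__main__":
    for start, end, text in read_segments(SAMPLE_EAF, "transcription"):
        print(start, end, text)
```

Each returned tuple pairs a transcription segment with its start and end offsets in the audio, which is the information a corpus page would need to highlight text in step with playback.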