INFORMATION ABOUT PROJECT,
SUPPORTED BY RUSSIAN SCIENCE FOUNDATION

This information was prepared on the basis of data from the RSF information-analytical system; the substantive part is presented as submitted by the authors. All rights belong to the authors; the use or reprinting of materials is permitted only with the prior consent of the authors.

 

COMMON PART


Project Number 22-28-20215

Project title Creation of the speech corpus of the Baltic-Finnic languages of Karelia

Project Lead Rodionova Aleksandra

Affiliation Karelian Research Centre of the Russian Academy of Sciences

Implementation period 2022 - 2023

Research area 08 - HUMANITIES AND SOCIAL SCIENCES, 08-453 - Linguistics

Keywords Speech corpus, Veps language, Karelian language, corpus linguistics, word-sense disambiguation, text tagging, multimedia map


 

PROJECT CONTENT


Annotation
The coming decade (2022-2032) will be the International Decade of Indigenous Languages, which will focus primarily on the rights of speakers of indigenous languages. In its strategic recommendations for the Decade, the Los Pinos Declaration emphasizes the rights of indigenous peoples, including the right to receive education in their own language and to participate in public life using their languages, as preconditions for the survival of indigenous languages, many of which are now on the verge of extinction. The declaration also points to the potential of digital technologies to support the use and preservation of these languages. Linguistic corpora are created to preserve the linguistic wealth of indigenous peoples and to support further study of their languages. In 2016, staff of the Institute of Linguistics, Literature and History and the Institute of Applied Mathematical Research, KarRC RAS, began creating a multilingual corpus called the Open Corpus of the Veps and Karelian Languages (VepKar). The main goal of this project is to create a speech (audio) corpus of Baltic-Finnic speech on the basis of VepKar. The speech module to be developed will be a collection of spoken texts in different dialects of the Karelian and Veps languages, provided with transcription, markup and translation into Russian. The relevance of the research stems, on the one hand, from the need to further develop VepKar, which is in wide demand both in scientific research and in the development of literary forms of the Karelian and Veps languages. On the other hand, it stems from the insufficient study of the phonetic and phonological systems of Karelian and Vepsian dialectal speech, caused by the lack of a sufficient amount of high-quality linguistic audio material.
Applying modern technologies and techniques to the field material accumulated over many decades, together with the latest data, will make it possible to fill a number of gaps that linguists have previously identified in these systems. The scientific novelty of the project lies in the absence of speech corpora of the Baltic-Finnic languages. Digitizing archival and field audio samples of Karelian and Vepsian speech in the Speech Corpus format will simplify the processing and storage of the materials and will make it possible to introduce into scientific circulation, and make openly accessible, unique audio materials reflecting the state of the Karelian and Vepsian dialects since the middle of the last century. These materials are stored in the Audio Archive of the Institute of Linguistics, Literature and History of the KarRC RAS and urgently need to be digitized to ensure their preservation. In the course of the work, it is planned to develop new software modules for the VepKar corpus for processing and analyzing audio material in the Karelian and Veps languages. One result of the project will be a multimedia map of the dialects of the Baltic-Finnic languages of Karelia, which will allow anyone, without leaving home, to become acquainted with the varieties of the languages of the indigenous peoples of the republic. A word pronunciation module for including audio recordings in VepKar dictionary entries will also be created. The results of the project, which is aimed at studying the newly written Karelian and Veps languages and at preserving and popularizing their dialects, are expected to be in wide demand in Karelia not only for scientific research and language planning, but also in education, culture and tourism, and among ordinary users of the resource.

Expected results
A library of digitized audio recordings in different dialects of the Karelian and Veps languages (at least 100 texts), created to introduce previously unpublished materials to the scientific community. A VepKar module for publishing, editing and searching audio recordings in the corpus. The module will support the display of markup made by external programs (for example, ELAN). It will allow KarRC employees to add archival audio recordings to the VepKar corpus in the future, and the international community of linguists will have access to a constantly updated VepKar speech corpus. A study of the phonetic and phonological systems of Vepsian and Karelian speech in synchronic and diachronic aspects will be carried out on the basis of the audio materials. This will make it possible to determine standards and establish rules and norms for the newly written Baltic-Finnic languages of the region, and to refine the available data on the development and state of the dialects of the Karelian and Veps languages. A word audio recording module for adding audio pronunciations to the VepKar dictionary. In the future, audio of the most frequent words of the Karelian and Veps languages will be added to the VepKar dictionary. A multimedia map of the dialects of the Veps and Karelian languages, which will make it possible to present the full variety of living and lost Baltic-Finnic dialectal speech of Karelia. This map can be used for educational purposes, as well as for the development of tourism in the region.


 

REPORTS


Annotation of the results obtained in 2023
In the reporting year, the project's linguists continued to work with materials from the ILLH Phonogram archive: an audit of the Livvi-Karelian collection was carried out, and the materials of the Karelian Proper collection were described and analyzed. The project participants continued filling the Speech corpus with audio recordings of Karelian and Vepsian speech. Three main sources were selected for this purpose at the second stage of the project: 1. Audio collections of the Phonogram archive of the ILLH KarRC RAS. During the reporting period, the project linguists (A. Rodionova and N. Pellinen) audited the Livvi-Karelian and Karelian Proper audio recordings: the settlements in which Karelian speech was recorded were identified; the recordings were sorted by year; the availability of digitized copies and transcripts for individual tapes was noted; and the recordings were compared with published speech samples. A brief overview of the Livvi-Karelian audio materials stored in the collections of the Phonogram archive of the ILLH KarRC RAS is presented in the publication by A. Rodionova. The sound recording engineer continued digitizing Karelian and Vepsian samples (more than 10 hours from 1960-1970). 2. Audio recordings of radio broadcasts in the Livvi-Karelian and Karelian Proper dialects of the Karelian language, prepared by employees of the ST&RBC “Karelia”. The project's linguists selected interviews with speakers of various Livvi-Karelian and Karelian Proper dialects. 3. Field audio recordings made by the project linguists during the reporting period in places of compact residence of Karelians: three field trips were made (to the village of Mikhailovskoye (Olonets District) and to the Muyezersky District of the Republic of Karelia).
In order to add new audio fragments to the corpus map and to increase public interest in the dialects of the Karelian and Veps languages, at the beginning of the year the project participants announced the launch of the “Listening to my native dialect” marathon. Everyone was invited to take part, either by recording their own speech or by acting as collectors. The project linguists served as the Marathon's experts. In total, during the second year of work, the Speech corpus was filled with 59 audio fragments representing a variety of Karelian and Vepsian oral dialect speech. In the reporting year, the project team developed an application for determining the dialect affiliation of texts in the Karelian language. Using samples of Karelian dialect speech from the Republic of Karelia, materials of the “Murreh” dialect database (http://murreh.krc.karelia.ru/) were supplemented and verified in order to further address pressing problems of Karelian dialectology. All audio fragments prepared in 2023 are shown on the Audio map (http://dictorpus.krc.karelia.ru/ru/corpus/audiotext/map): 18 audio samples in the Livvi-Karelian dialect (Vedlozero, Vidlitsa, Kotkozero, Nekkula, Rypushkalitsa, Syamozero, Tulmozero dialects), 2 audio samples in the Ludian dialect (Central Ludian and Mikhailovskoye dialects), 24 audio samples in the Karelian Proper dialect (Tikhvin, Panozero, Maslozero, Rugozero, Tolmachi, Poduzhemye, Porosozero, Voknavolok, Reboly dialects) and 8 audio samples in the Veps language (Northern Veps dialect). In addition, the corpus contains 7 audio samples of the speech of Karelians from Border Karelia who were resettled to Finland in the 1940s (Salmi, Impilakhti, Suoyarvi, Korbiselga, Suistamo dialects).
Some samples are also supplemented with photographs of settlements and informants, which allows the user to become immersed in the atmosphere of Karelian and Vepsian villages (http://dictorpus.krc.karelia.ru/ru/corpus/text/4276). In addition to outreach purposes and the task of maintaining the vitality of dialect speech, the map can be actively used for educational purposes, for example, in courses on Karelian and Vepsian dialectology. It is planned to continue filling the Audio map after the completion of this project. In parallel with the filling of the Speech corpus, the module developed at the first stage of the project, which lets an informant voice a prepared list of vocabulary words and phrases on the VepKar website and directly in a dictionary entry of the corpus, made it possible during the reporting period to voice 5 thousand words in the Livvi-Karelian dialect and more than 2 thousand words in the Ludian dialect of the Karelian language. All recordings are stored in the VepKar corpus database and are available online (general list: http://dictorpus.krc.karelia.ru/ru/dict/audio; in dictionary entries: http://dictorpus.krc.karelia.ru/ru/dict/lemma/45492). Based on the dialect markers of the Karelian language identified at the research stage of the work, the module “Determination of dialect affiliation” was developed (http://dictorpus.krc.karelia.ru/ru/experiments/dialect_dmarker). This application determines the dialect of the Karelian language for text entered by the user. During the reporting period, 4 presentations were given at international (1), all-Russian (1) and regional (2) conferences. Five articles (2 international RSCI, 3 RSCI), 1 abstract and 1 manual were published.
Other ways of publishing the results of the project include publications in the media in Russian and national languages (about 20): ST&RBC “Karelia”, SAMPO TV 360, the printed publications “Karjalan Sanomat” (“News of Karelia”) and “Oma Mua” (“Native Land”), and the website of the Karelian Research Centre RAS.
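The idea behind a marker-based dialect determination module of the kind described in this annotation can be illustrated with a minimal sketch. It assumes that each dialect is associated with a set of characteristic word forms and that the dialect with the most matching tokens is chosen. The marker table, the function name `guess_dialect` and the placeholder tokens are hypothetical and are not taken from the VepKar implementation or from the project's actual list of dialect markers.

```python
from collections import Counter

# Hypothetical marker table: placeholder tokens stand in for
# dialect-specific word forms. These are NOT the project's real markers.
HYPOTHETICAL_MARKERS = {
    "livvi": {"marker_a", "marker_b"},
    "ludian": {"marker_c"},
    "karelian_proper": {"marker_d", "marker_e"},
}


def guess_dialect(text, markers=HYPOTHETICAL_MARKERS):
    """Return (dialect, hits) for the dialect with the most marker hits.

    Returns (None, 0) when no marker occurs in the text.
    """
    tokens = text.lower().split()
    hits = Counter()
    for dialect, forms in markers.items():
        # Count every token that is one of this dialect's marker forms.
        hits[dialect] = sum(1 for token in tokens if token in forms)
    dialect, count = max(hits.items(), key=lambda item: item[1])
    return (dialect, count) if count > 0 else (None, 0)
```

A real module would presumably rely on phonetic and morphological patterns rather than exact token matches; exact matching is used here only to keep the sketch short.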

 

Publications

1. Pellinen N.A. Социолингвистический портрет информанта в проекте «Создание Речевого корпуса прибалтийско-финских языков Карелии» Краеведческие чтения: Краеведение в образовании, просвещении, науке. Материалы XVII научной конференции (15-16 февраля 2023 года). Петрозаводск: НБ РК, 2023. [Электронное издание], С. 130-133 (year - 2023)

2. Rodionova A.P. Людиковские диалектные материалы в Открытом корпусе вепсского и карельского языков (ВепКар) Краеведческие чтения: Краеведение в образовании, просвещении, науке. Материалы XVII научной конференции (15-16 февраля 2023 года). Петрозаводск: НБ РК [Электронное издание], С. 134-138 (year - 2023)

3. Rodionova A.P., Krizhanovskaya N.B., Pellinen N.A. Речевой корпус ВепКар как инструмент сохранения диалектной речи прибалтийско-финских народов Карелии Ежегодник финно-угорских исследований, Т. 17. Вып. 3. С. 343-351. (year - 2023)

4. Rodionova A.P., Novak I.P. ВепКар: от словаря к корпусу и от корпуса к словарю Сборник докладов Круглого стола по вопросам терминологии, орфографии и топонимии на языках коренных народов Республики Карелия, 19 мая 2023 г., г. Петрозаводск. [Электронная публикация], С. 17-23 (year - 2023)

5. Rodionova A.P., Pashkova T.V. Коллекции ливвиковских диалектных материалов Фонограммархива Института языка, литературы и истории Карельского научного центра РАН Финно-угорский мир, Т. 15, № 2. С. 189–199. (year - 2023) https://doi.org/10.15507/2076-2577.015.2023.02.189-19

6. Rodionova A.P. Речевой корпус прибалтийско-финских языков Карелии: архитектура и возможности Тезисы LI Международной научной филологической конференции имени Людмилы Алексеевны Вербицкой, секция «Прикладная и математическая лингвистика». СПб: СПбГУ., С. 632-633 (year - 2023)

7. Krizhanovskaya N.B., Krizhanovskiy A.A. ВепКар: руководство для пользователей : учебное пособие : учебное электронное издание Федеральный исследовательский центр «Карельский научный центр Российской академии наук», Институт прикладных математических исследований КарНЦ РАН. — Петрозаводск : КарНЦ РАН, 40 с. (year - 2023)

8. - comment on the start of the Marathon of recordings of Karelian and Vepsian speech (in Russian) ST&RBC “Karelia”, https://www.youtube.com/watch?v=m-QQW85U8U4&t=1s (year - )

9. - comment on the start of the Marathon of recordings of Karelian and Vepsian speech (in the Ludian dialect of the Karelian language) ST&RBC “Karelia”, https://www.youtube.com/watch?v=tYH611xhZE0&t=1s (year - )

10. - comment on the Marathon of recordings of Vepsian and Karelian speech (in Russian) SAMPO TV 360, https://sampotv360.ru/2023/03/28/zhitelej-karelii-priglashayut-prisoedinitsya-k-marafonu-slushayu-rodnoj-govor/ (year - )

11. - comment on the interim results of the Marathon of recordings of Vepsian and Karelian speech “Listening to my native dialect” ST&RBC “Karelia”, https://www.youtube.com/watch?v=KhXx9bqYGsQ; https://www.youtube.com/watch?v=eAymiFTEVpQ&t=97s (year - )

12. - comment in the Ludian dialect of the Karelian language on the results of the Marathon of recordings of Vepsian and Karelian speech “Listening to my native dialect” ST&RBC “Karelia”, https://www.youtube.com/watch?v=KdlBe_s7hd8&t=1s (year - )

13. - comment on the start of the Marathon of recordings of Karelian and Vepsian speech (in the Ludian dialect of the Karelian language) ST&RBC “Karelia”, https://vk.com/speechvepkar?w=wall-218387061_23 (year - )

14. - comment on the results of the Marathon of recordings of Karelian and Vepsian speech (in the Ludian dialect of the Karelian language) ST&RBC “Karelia”, https://vk.com/speechvepkar?w=wall-218387061_34 (year - )

15. - Liity mukaan puheiden keräämiseen Karjalan Sanomat, Karjalan Sanomat. № 4 (16570). С. 9 (year - )

16. - Kuulen omua pakinua Oma Mua, Oma mua. № 4 (1644). С. 3. (year - )

17. - VepKar-korpuksesta apua opiskeluun Karjalan Sanomat, Karjalan Sanomat. № 8 (16574). С. 13. (year - )

18. - Murrekartta täyttyy uusista puhenäytteistä Karjalan Sanomat, Karjalan Sanomat. № 32 (16598). С. 14. (year - )

19. - Murrekartasta kuulee nykyihmisten ääniä Karjalan Sanomat, Karjalan Sanomat. № 40 (16606). С. 12. (year - )

20. - Kuulen omad paginad- marafonan tuloksed Oma Mua, Oma mua. № 41 (1681). С. 7. (year - )

21. - Story about the Marathon “Listening to my native dialect, or How ordinary people helped Karelian scientists” Karelian Stories by Alena Syantti (Karjalaisie juttuja), https://www.youtube.com/watch?v=z2iBaJVTBaA&t=22s (year - )

22. - On the 2023 field season ST&RBC “Karelia”, https://vk.com/viestitkarjala?w=wall-48634186_17676; https://vk.com/viestitkarjala?w=wall-48634186_17715 (year - )

23. - Rekikunnat tutkivat taas karjalaisia Karjalan Sanomat, Karjalan Sanomat. № 27. (year - )

24. - Information on the field trip to the Muyezersky District of the Republic of Karelia KarRC RAS website, http://www.krc.karelia.ru/news.php?id=5131&plang=r (year - )


Annotation of the results obtained in 2022
In the reporting year, the structure of the VepKar corpus database was changed, which made it possible to upload audio recordings of texts and individual words into it. In parallel, the linguists of the project determined a list of dialects of the Karelian and Vepsian languages for the further selection of audio materials to be transferred to the Speech Corpus. Because Karelian and Vepsian speakers also live or lived outside the republic (Murmansk, Vologda, Leningrad, Novgorod, Tver regions) and their dialects are no less valuable for solving the problems of Karelian and Vepsian dialectology, it was decided to expand the boundaries of the project to the North-West of the Russian Federation. To fill the corpus with audio recordings of Karelian and Vepsian speech, three main sources were chosen at the first stage of the project: 1) audio collections of the Phonogram Archive of the ILLH Karelian Research Centre of the Russian Academy of Sciences. During the reporting period, more than 10 hours of samples of Karelian and Vepsian speech recorded on tape in 1959-1990 were digitized: Kestenga, Yushkozero, Poduzhemye, Porosozero, Tikhvin, Valday, Vesyegonsk and Tolmachy dialects of the Karelian Proper dialect of the Karelian language, as well as Northern and Middle Veps dialects. In addition, audio recordings made by the Institute's researchers in digital format in 2003-2021 were selected to fill the corpus: Keret, Oulanga, Kestenga, Voknavolok, Reboly, Tolmachy dialects of the Karelian Proper dialect, Middle Lude, Southern Lude and Mikhailovskoye dialects of the Lude dialect, and also Southern Veps dialects; 2) audio recordings of radio broadcasts in the Livvi dialect of the Karelian language, prepared by employees of the STRBC “Karelia”.
The linguists of the project selected interviews with speakers of different Livvi dialects: Syamozero, Tulmozero, Vedlozero, Vidlitsa, Kotkozero, Rypushkala and Nekkula dialects; 3) field audio recordings made by the project linguists during the reporting period in places of compact residence of Karelians: Medvezhyegorsk, Olonets and Kondopoga districts, the territory of Kostomuksha (Republic of Karelia) and Rameshkovsky district (Tver region). For each recording selected for the corpus, fragment boundaries were determined, and the fragments were cut and loaded into the database. The project participants filled in the metadata of the audio recordings in detail, transcribed the audio fragments and translated them into Russian. Complete grammatical markup (the grammatical form of each word of the text was determined) and semantic markup (the meaning of each word of the text was noted) were carried out. The result of this work is the "Speech corpus" module (http://dictorpus.krc.karelia.ru/ru/corpus/speech_corpus), which presents the texts of the corpus accompanied by audio recordings, as well as the search filters needed for work (search by language / dialect, place and year of recording, informant and collector, source). In total, during the first year of work, the Speech corpus was filled with 50 audio fragments representing a variety of Karelian and Veps oral dialect speech. At the same time, various sources for filling the subcorpora were tested. For the Karelian Proper and Lude subcorpora, archival recordings and fresh field recordings made specifically for the project's tasks were used. The Livvi subcorpus was filled with materials from Karelian radio broadcasts. The Vepsian subcorpus included only archival data. At the next stage of work, it is planned to change the sources used within the subcorpora. Of particular value to the project participants is a fragment of a recording of Valday speech, the only one found to date.
To make it easier for users to work with the Speech corpus and to provide a visual representation of the audio material, a multimedia audio map of the dialects of Baltic-Finnic speech of Karelia and adjacent regions was developed (http://dictorpus.krc.karelia.ru/ru/corpus/audiotext/map). All the prepared audio fragments are shown on the map: 15 audio samples in the Livvi dialect (Vedlozero, Vidlitsa, Kotkozero, Nekkula, Rypushkala, Syamozero, Tulmozero dialects), 7 audio samples in the Lude dialect (Middle Lude, Southern Lude and Mikhailovskoye dialects), 21 audio samples in the Karelian Proper dialect (Valdai, Vesyegonsk, Voknavolok, Dyorzha, Keret, Kestenga, Oulanga, Padany, Poduzhemye, Porosozero, Reboly, Tikhvin, Tolmachy, Yushkozero dialects) and 7 audio samples in the Vepsian language (Northern Veps, Middle Veps Eastern, Middle Veps Western and Southern Veps dialects). In addition, an image loading module was developed, which will make it possible to add expedition photographs from different years to the map in the future, so that the user can become immersed in the atmosphere of Karelian and Vepsian villages (http://dictorpus.krc.karelia.ru/ru/corpus/text/4276). In addition to outreach purposes and the task of maintaining the vitality of dialect speech, the map can be actively used for educational purposes, for example, in courses on Karelian and Vepsian dialectology. In parallel with the development of the “Speech corpus”, a module was developed and implemented that lets an informant voice a prepared list of vocabulary words and phrases on the VepKar website and directly in a dictionary entry, which made it possible to voice 3 thousand words in the Livvi dialect and more than 1 thousand words in the Lude dialect of the Karelian language during the reporting period. All recordings are stored in the VepKar corpus database. The metadata of an audio recording are the speaker's name (an informant from the VepKar database), the recording date and the word identifier.
Recording can also be done in the field. As part of the research phase of the project, the task was to identify dialectal markers of Karelian and Vepsian speech, mainly on the basis of the audio fragments loaded into the corpus. In the process of transcribing the recordings, the main phonetic dialect-differentiating features were identified; these include, first of all, the features of the systems of ascending and descending diphthongs and of front-lingual fricative consonants, the final vowel of the initial and inflectional forms of words, the features of the consonant alternation system, etc. In the course of the morphological markup of the transcribed texts, the main dialectal features of the grammatical systems of the Karelian and Vepsian dialects were determined, which include the features of the case systems, differences in the number of tense forms of the conditional mood, the features of the formation of reflexive verb forms, etc. Semantic markup, in turn, made it possible to identify lexical inter-dialect correspondences, i.e. lexemes that have different meanings in different dialects, and concepts that are named by different words in different dialects. During the reporting period, 3 presentations were made at international (1), interregional (1) and regional (1) conferences. Two articles were published (WoS, HAC), and one article was prepared and submitted to a journal (WoS). Other ways of publishing the results of the project include publications in the media in Russian and in the national languages of the Republic of Karelia (6 in total): STRBC “Karelia”, the printed edition “Karjalan Sanomat”, the KarRC RAS website, and the group “Young researchers of ILLH” (VKontakte).
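The search filters of the "Speech corpus" module listed above (language / dialect, place and year of recording, informant and collector, source) can be pictured as equality filters over a metadata record. The following is a minimal sketch under that assumption; the class `AudioFragment`, the function `search` and all sample values are hypothetical placeholders and do not reproduce the VepKar database schema or its actual records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFragment:
    """Hypothetical metadata record for one audio fragment."""
    language: str
    dialect: str
    place: str
    year: int
    informant: str
    collector: str
    source: str


# Placeholder data for illustration only, not real corpus entries.
SAMPLE = [
    AudioFragment("Karelian", "Livvi", "Place A", 2000,
                  "Informant 1", "Collector 1", "radio broadcast"),
    AudioFragment("Karelian", "Karelian Proper", "Place B", 2001,
                  "Informant 2", "Collector 1", "field recording"),
    AudioFragment("Veps", "Northern Veps", "Place C", 2002,
                  "Informant 3", "Collector 2", "archive"),
]


def search(fragments, **criteria):
    """Keep the fragments whose fields equal every given criterion."""
    return [
        fragment for fragment in fragments
        if all(getattr(fragment, field) == value
               for field, value in criteria.items())
    ]
```

For example, `search(SAMPLE, language="Karelian", collector="Collector 1")` keeps the first two placeholder records, while a combination that matches nothing returns an empty list.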

 

Publications

1. Aleksandra P. Rodionova О коллекциях людиковских диалектных материалов Фонограммархива ИЯЛИ КарНЦ РАН Ученые записки Петрозаводского государственного университета, № 7, Т. 44, С. 64–70 (year - 2022) https://doi.org/10.15393/uchz.art.2022.818

2. Novak Irina Petrovna, Krizhanovskaya Natalia Borisovna Система восходящих дифтонгов в говорах карельского языка Карелии: сравнение методов кластеризации Вестник угроведения, 2022. Т. 12. № 3. С. 486–496 (year - 2022) https://doi.org/10.30624/2220-4156-2022-12-3-486-496

3. - Marina Tolstyh. Puhemalleja on esillä teksteissä ja äänitteissä Karjalan Sanomat, Karjalan Sanomat, № 21 (16537), 2022, c. 8 (year - )

4. - VepKar vaalii vähemmistökieliä Karjalan Sanomat, Karjalan Sanomat. № 42 (16558), 2022, с. 9 (year - )

5. - comment on the Speech corpus for the website KarRC RAS website, http://www.krc.karelia.ru/news.php?id=4853 (year - )

6. - comment on the RSF project “Creation of the speech corpus of the Baltic-Finnic languages of Karelia” and the field trip to the Medvezhyegorsk District KarRC RAS website, http://www.krc.karelia.ru/news.php?id=4719&plang=r (year - )

7. - TV interview about the field trip to the Padany cluster of villages of the Medvezhyegorsk District (in Karelian) STRBC "Karelia", https://www.youtube.com/watch?v=Xezh62OarGQ (year - )

8. - series of riddles for the public based on materials from the expedition to the Padany cluster of villages of the Medvezhyegorsk District VK page "Young researchers of ILLH", posts of 20-27.06.2022: https://vk.com/youngresearchers_illh (year - )