Meadow Mari corpora


Welcome to the start page of Meadow Mari language corpora: the Main corpus of literary Meadow Mari (press) and the Corpus of social media in Meadow Mari language.


This is the main page of the website that hosts linguistic corpora of the Meadow Mari language. Currently, two corpora are available: the corpus of contemporary written literary Meadow Mari (“the Main corpus”) and the corpus of social media in Meadow Mari. They differ in what kind of texts they contain, but have mostly identical annotation and search capabilities. Here is a brief comparison:

Language
  • Main corpus: Meadow Mari
  • Social media corpus: Meadow Mari and Russian
Size
  • Main corpus: 2.63 million words
  • Social media corpus: 3.59 million words (the Mari part), 15.11 million words (the Russian part)
Texts
  • Main corpus: contemporary press, Wikipedia (up to May 2019)
  • Social media corpus: open posts and comments by Mari-speaking vkontakte users (up to May 2019)
Language variety
  • Main corpus: in most cases, standard written literary Meadow Mari or close to it
  • Social media corpus: language of digital communication: closer to the spoken variety, influenced by the dialects and the Russian language, contains numerous instances of code switching
Annotation (the same in both corpora, apart from coverage)
  • automatic morphological annotation (lemmatization, part of speech, all inflectional features); 90.7% of words analyzed in the Main corpus and 86.4% in the Social media corpus (only tokens that do not contain digits or Latin characters are taken into account)
  • no disambiguation
  • annotation of Russian loanwords
  • annotation of several lexical/semantic classes and word formation: animate/human nouns, body parts, transport, different classes of proper names, several nominal derivational suffixes
  • glossing
  • Russian translation of lemmata
Metadata (Main corpus)
  • title of the text
  • author or title of the newspaper
  • creation year (exact date in the case of newspapers)
  • genre
Metadata (Social media corpus)
  • group name (for groups)
  • publicly available user metadata: sex (for everyone); if available, also birth year (grouped in 5-year spans); real names and nicknames of the users are hidden
  • creation year
  • message type (post/comment)
  • language (tagged automatically, independently for each sentence)
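
The coverage figures in the comparison above count only tokens without digits or Latin characters. The sketch below shows one way such a share could be computed. The token layout (pairs of a word form and a list of analyses), the function name and the sample data are assumptions made for illustration; this is not the code used to produce the published figures.

```python
import re

# Tokens containing a digit or a Latin letter are excluded from the count,
# following the note in the comparison above.
EXCLUDED = re.compile(r"[0-9A-Za-z]")


def coverage(tokens):
    """Return the share of counted tokens that have at least one analysis.

    `tokens` is a list of (wordform, analyses) pairs. This layout is an
    assumption made for the example, not the corpus storage format.
    """
    counted = [(wf, ana) for wf, ana in tokens if not EXCLUDED.search(wf)]
    if not counted:
        return 0.0
    analyzed = sum(1 for _, ana in counted if ana)
    return analyzed / len(counted)


# Hand-made sample, not corpus data.
sample = [
    ("пырыс", [{"lemma": "пырыс", "tags": ["N", "sg", "nom"]}]),
    ("абвгд", []),       # out-of-vocabulary: counted, but not analyzed
    ("2019", []),        # contains digits: not counted
    ("vkontakte", []),   # contains Latin characters: not counted
]

print(coverage(sample))
```

Excluding the last two tokens keeps numbers and foreign-script strings from lowering the reported share, since the analyzer is not expected to handle them.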

Apart from the corpora available here, there also exist corpora and other resources for Mari developed by Jeremy Bradley.

You can find more detailed information about the Meadow Mari social media corpus and its development in the paper cited below. Please consider citing it if your research is based on this corpus:

Timofey Arkhangelskiy. 2019. Corpora of social media in minority Uralic languages. Proceedings of the fifth Workshop on Computational Linguistics for Uralic Languages, pages 125–140, Tartu, Estonia, January 7 - January 8, 2019.

What is a corpus?

A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the Meadow Mari corpora.

— Who needs corpora?

First of all, corpora are used by linguists. The search engine and annotation of corpora are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word пырыс followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.
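
To make the two example queries concrete, here is a minimal sketch of how they could be evaluated over annotated tokens. It does not use the corpus search engine: the token layout, the function names and the toy token sequence are assumptions, and the word forms and their analyses were written by hand for illustration. The tags come from the tagset listed on this page.

```python
# Hand-made toy token sequence (not a real sentence from the corpus).
# Each token carries a list of analyses, because the corpora are not
# disambiguated and a word may have more than one analysis.
sentence = [
    {"wf": "пырысын", "analyses": [{"lemma": "пырыс", "tags": {"N", "sg", "gen"}}]},
    {"wf": "пырыс", "analyses": [{"lemma": "пырыс", "tags": {"N", "sg", "nom"}}]},
    {"wf": "кая", "analyses": [{"lemma": "каяш", "tags": {"V", "npst", "3", "sg"}}]},
]


def has_tags(token, *tags):
    """True if at least one analysis of the token carries all given tags."""
    return any(set(tags) <= ana["tags"] for ana in token["analyses"])


def has_lemma(token, lemma):
    """True if at least one analysis of the token has the given lemma."""
    return any(ana["lemma"] == lemma for ana in token["analyses"])


def genitive_nouns(tokens):
    """'Find all nouns in the genitive case.'"""
    return [t["wf"] for t in tokens if has_tags(t, "N", "gen")]


def lemma_followed_by_verb(tokens, lemma):
    """'Find all forms of a word followed by a verb' (adjacent pairs)."""
    return [
        (a["wf"], b["wf"])
        for a, b in zip(tokens, tokens[1:])
        if has_lemma(a, lemma) and has_tags(b, "V")
    ]


print(genitive_nouns(sentence))
print(lemma_followed_by_verb(sentence, "пырыс"))
```

Matching on any analysis rather than on a single one is a deliberate choice in this sketch: without disambiguation, a query that required the one correct analysis would have nothing to check against.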

— Can I use the corpus as a library?

No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each sentence you get, i.e. look at its neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This is done for copyright protection.
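
The behavior described above (hits returned in random order, context expansion capped per sentence) can be pictured with the following sketch. It is not the implementation of the corpus platform: the class, the function and the cap of two expansions are hypothetical and chosen only to illustrate the idea.

```python
import random


class Hit:
    """One search hit: a sentence index plus a capped context window."""

    def __init__(self, index, max_expansions=2):
        self.left = self.right = index
        self.remaining = max_expansions  # hypothetical cap, not the real limit

    def expand(self, n_sentences):
        """Widen the window by one sentence on each side if the budget allows."""
        if self.remaining == 0:
            return False
        self.remaining -= 1
        self.left = max(0, self.left - 1)
        self.right = min(n_sentences - 1, self.right + 1)
        return True


def search(sentences, word, seed=None):
    """Return hits for sentences containing `word`, in random order."""
    hits = [Hit(i) for i, s in enumerate(sentences) if word in s.split()]
    random.Random(seed).shuffle(hits)
    return hits


# Placeholder sentences; only the searched word is a real Mari word.
sentences = ["a b", "c пырыс", "d e", "f g", "h i", "пырыс j"]
for hit in search(sentences, "пырыс", seed=1):
    hit.expand(len(sentences))
    print(sentences[hit.left:hit.right + 1])
```

Because every hit has its own expansion budget, the visible window around a sentence stays small no matter how many queries are made, which is what prevents reading a whole text.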

— Can I use the corpus as a dictionary?

Each Mari word in the corpus has a Russian translation (no English translations are available at the moment). However, the translations are only provided as auxiliary information for users who do not speak Mari. They are kept short and simple by design: they do not list all senses and do not provide usage examples like real dictionaries. If you want to know how to translate a word, the right way to do so is to consult a dictionary, e.g. here.

— What is morphological annotation and how do you get it?

The corpora located here are lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. dictionary/citation form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpora in question are too large for manual annotation to be feasible, they were annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Meadow Mari inflection. The analyzer together with the necessary materials is freely available in my bitbucket repository. Automatic annotation unfortunately means that, first, out-of-vocabulary words are not annotated, and, second, some words receive several analyses and remain ambiguous. Russian sentences in the social media corpus were annotated with the mystem analyzer.
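
The two consequences of automatic annotation mentioned above can be illustrated with a short sketch that sorts tokens into unanalyzed, unambiguous and ambiguous ones. The data layout, the function names and the sample are assumptions; the entries marked as placeholders do not correspond to real Mari words, and this is not the analyzer's own output format.

```python
from collections import Counter


def status(analyses):
    """Classify a token by the number of analyses it received."""
    if not analyses:
        return "unanalyzed"    # e.g. an out-of-vocabulary word
    if len(analyses) == 1:
        return "unambiguous"
    return "ambiguous"         # several analyses, none of them ruled out


def summarize(tokens):
    """Count tokens per status; `tokens` holds (wordform, analyses) pairs."""
    return Counter(status(ana) for _, ana in tokens)


# Hand-made sample; "абвгд" and "WORD" are placeholders, not Mari words.
sample = [
    ("пырыс", [{"lemma": "пырыс", "tags": ["N", "sg", "nom"]}]),
    ("абвгд", []),
    ("WORD", [
        {"lemma": "lemma_a", "tags": ["N", "sg", "nom"]},
        {"lemma": "lemma_b", "tags": ["V", "imp", "2", "sg"]},
    ]),
]

print(summarize(sample))
```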

Mari language

Meadow (or Eastern) Mari is one of several Mari languages, which form a group in the Uralic family. The number of Meadow Mari speakers is estimated at 356,000. Mari languages use Cyrillic orthography based on the Russian alphabet, with several additional letters. There is limited vowel harmony in Mari. Almost all morphological markers are suffixes that mostly attach to the stem agglutinatively. Nominal grammatical categories are number, case, and possessiveness. The direct object can be marked either in the nominative or in the accusative (differential object marking, DOM). The default word order in the sentence is SOV (subject – object – verb).

Tagset

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of tags used in the Meadow Mari corpora. Both corpora use an identical set of tags.

  • A — adjective
  • ADV — adverb
  • CONJ — conjunction
  • IMIT — ideophone
  • INTRJ — interjection
  • N — noun
  • NUM — numeral
  • PARENTH — parenthetic word
  • PART — particle
  • PN — proper noun (subtype of nouns)
  • POST — postposition
  • PREDIC — predicative
  • PRO — pronoun
  • V — verb
  • 1 — 1st person
  • 1pl — 1pl possessive suffix
  • 1sg — 1sg possessive suffix
  • 2 — 2nd person
  • 2pl — 2pl possessive suffix
  • 2sg — 2sg possessive suffix
  • 3 — 3rd person
  • 3pl — 3pl possessive suffix
  • 3sg — 3sg possessive suffix
  • abbr — abbreviation
  • acc — accusative case
  • add — additive particle
  • anim — animate noun
  • attr — any attributive
  • attr_an — general attributive in -an
  • attr_le — attributive in -le
  • attr_loc — locative attributive in -se
  • attr_neg — negative attributive in -di̮me
  • body — body part
  • case_comp — case compounding
  • caus — causative (-i̮kt-)
  • com — comitative
  • comp — comparative (-rak)
  • cvb — any converb
  • cvb.consec — consecutive converb
  • cvb.gen — general converb
  • cvb.neg — negative converb
  • cvb.prec — precedence converb
  • cvb.sim — converb of simultaneity
  • dat — dative case
  • dem — demonstrative pronoun
  • emph — emphatic particle
  • famn — family name
  • gen — genitive case
  • hort — hortative particle
  • hum — human
  • ill — illative case
  • imp — imperative
  • indef — indefinite pronoun
  • inf — infinitive
  • lat — lative
  • loc — locative/inessive
  • missp — typo
  • neg — negative form
  • nmlz — nominalization
  • nom — nominative case
  • nonposs — non-possessive form
  • npst — non-past tense
  • opt — optative/desiderative mood
  • ord — ordinal number
  • pass — passive
  • patrn — patronymic
  • pers — personal pronoun
  • persn — personal (given) name
  • pl — plural
  • pl.assoc — associative plural
  • plen — full form of adjectives/numerals
  • pst — first past tense
  • pst2 — second past tense
  • ptcp — any participle
  • ptcp.act — active participle
  • ptcp.neg — negative participle
  • ptcp.pass — passive participle
  • ptcp.prosp — prospective participle
  • refl — reflexive pronoun
  • rus — Russian borrowing (or borrowing through Russian)
  • rus_afx — Russian affix with native stem
  • sg — singular
  • short — short form of adjectives/numerals
  • supernat — noun that denotes a supernatural being (this category is a byproduct of animacy/humanness annotation; since it is not clear whether these cases should be classified as human, we put them in a separate box, so that users can decide that for themselves)
  • sim — similitive case (in -la)
  • topn — toponym (geographical name)
  • transport — transport
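
As a small illustration of how the tags above can be read, the sketch below expands a tag string into descriptions. Only a subset of the tagset is included, and the comma-separated string format and the function name are assumptions made for the example rather than a description of how the corpora store or display tags.

```python
# A subset of the tagset listed above, copied as tag -> description.
TAGSET = {
    "N": "noun",
    "V": "verb",
    "acc": "accusative case",
    "gen": "genitive case",
    "pl": "plural",
    "sg": "singular",
    "npst": "non-past tense",
    "3": "3rd person",
}


def describe(tag_string, sep=","):
    """Expand a separator-delimited tag string into (tag, description) pairs.

    Tags missing from TAGSET are reported as "unknown tag" instead of
    raising an error, so incomplete tables still give usable output.
    """
    tags = (part.strip() for part in tag_string.split(sep))
    return [(t, TAGSET.get(t, "unknown tag")) for t in tags if t]


for tag, meaning in describe("N,gen,sg"):
    print(f"{tag}: {meaning}")
```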

The tagset for the Russian-language part (Russian sentences in the social media corpus) can be found in the Russian National Corpus.

Authors

The corpora and the morphological analyzer are developed and maintained by Timofey Arkhangelskiy. The first versions of the corpora were released in 2019 as part of his postdoctoral project supported by the Alexander von Humboldt Foundation. The background picture was kindly provided by Aigul Zakirova. The corpora are hosted by the School of Linguistics at HSE, Moscow.

Contacts


If you have questions, would like to propose collaboration, or have noticed an error in the corpus, please contact Timofey Arkhangelskiy. Typos in blogs and social media do not count as errors: these texts are left "as is". You can also use the Meadow Mari morphological analyzer and the tsakorpus corpus platform, which are open source and freely available.