The Bulgarian National Corpus: Theory and Practice in Corpus Design

Svetla Koeva; Ivelina Stoyanova; Svetlozara Leseva; Rositsa Dekova; Tsvetana Dimitrova; Ekaterina Tarpomanova

doi:10.15398/jlm.v0i1.33

Authors

Svetla Koeva Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Ivelina Stoyanova Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Svetlozara Leseva Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Rositsa Dekova Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Tsvetana Dimitrova Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Ekaterina Tarpomanova Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences

Keywords

corpus design, Bulgarian National Corpus, computational linguistics

Abstract

The paper discusses several key concepts related to the development of corpora and reconsiders them in light of recent developments in NLP. On the basis of an overview of present-day corpora, we conclude that the dominant practices of corpus design do not adequately utilise the available technologies and, as a result, fail to meet the demands of corpus linguistics, computational lexicology and computational linguistics alike.

We proceed to lay out a data-driven approach to corpus design, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies, which allow fast collection, automatic metadata description and annotation of large amounts of data. In essence, we propose that corpus design should be centred on amassing large amounts of mono- and multilingual texts and on providing them with a detailed metadata description and high-quality multi-level annotation.

We go on to illustrate this concept with a description of the compilation, structuring, documentation, and annotation of the Bulgarian National Corpus (BulNC). At present it consists of a Bulgarian part of 979.6 million words, which constitutes the corpus kernel, and 33 Bulgarian-X language corpora totalling 972.3 million words, for 1.95 billion words altogether. The BulNC is supplied with a comprehensive metadata description, which allows us to organise the texts according to different principles. The Bulgarian part of the BulNC is automatically processed (tokenised and split into sentences) and annotated at several levels: morphosyntactic tagging, lemmatisation, word-sense annotation, and annotation of noun phrases and named entities. Some levels of annotation are also applied to the Bulgarian-English parallel corpus, with the prospect of expanding multilingual annotation both in terms of linguistic levels and in the number of languages for which it is available. We conclude with a brief evaluation of the quality of the corpus and an outline of its applications in NLP and linguistic research.