Corpus Studies

A corpus is a collection of authentic texts that is the object of a literary or linguistic study. In modern corpus linguistics, these texts are held in electronic format and can be accessed quickly and flexibly using corpus processing software. Although most definitions stress that a corpus has to be assembled for a specific purpose, even the World Wide Web can be considered a corpus, as long as its content is the object of linguistic study.
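
One typical function of corpus processing software is the concordance, or keyword-in-context (KWIC) display. The following is a minimal Python sketch of that idea, not the behavior of any particular tool; the function name `kwic`, the `width` parameter, and the two-sentence toy corpus are assumptions made for illustration.

```python
import re


def kwic(texts, keyword, width=30):
    """Return (text name, left context, match, right context) for each hit.

    `texts` maps a text name to its content. Matching is case-insensitive
    and restricted to whole words.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for name, text in texts.items():
        for match in pattern.finditer(text):
            # Take up to `width` characters on each side of the hit.
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Pad the contexts so the keyword lines up in a column.
            lines.append((name, left.rjust(width), match.group(0), right.ljust(width)))
    return lines


# Toy corpus, invented for this example.
corpus = {
    "doc1": "A corpus is a collection of authentic texts.",
    "doc2": "The corpus can be searched with concordance software.",
}

for name, left, hit, right in kwic(corpus, "corpus"):
    print(f"{name}  {left} [{hit}] {right}")
```

Aligning the keyword in a central column makes recurrent patterns to its left and right easier to spot, which is why concordance lines are usually displayed this way.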

Corpus Creation and Processing

Corpus designers have to make informed decisions about the types of language to include in their corpora and in what proportions. It was long postulated that a corpus has to be representative of a particular type of language production. However, representativeness is difficult to achieve with textual data. Corpus types differ according to their design principles. A general-purpose, single-language corpus might include both written texts and transcriptions of spoken language. Monolingual, bilingual, and multilingual corpora are made up of texts in one, two, or more than two languages respectively.

There may also be a trade-off between including fewer, but more useful, full-length texts and settling for partial texts. Texts can be selected either randomly or manually, and texts that are not yet in electronic form may be converted by typing or scanning. Depending on the intended use of the corpus, structural or linguistic annotation, such as part-of-speech tagging or syntactic and semantic annotation, may be desirable. Corpus-based approaches thus allow researchers to process large amounts of empirical data for different purposes, such as hypothesis testing and observation.
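
Random selection and a simple form of part-of-speech annotation can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the file names, the function names `select_texts` and `to_vertical`, and the hand-assigned tags are all hypothetical, and no real tagger is involved.

```python
import random


def select_texts(candidates, k, seed=0):
    """Randomly select `k` texts; a fixed seed makes the selection repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, k))


def to_vertical(tagged_tokens):
    """Serialize (word, tag) pairs with one token per line, tab-separated."""
    return "\n".join(f"{word}\t{tag}" for word, tag in tagged_tokens)


# Hypothetical pool of candidate files for the corpus.
candidates = [f"text_{i:02d}.txt" for i in range(1, 11)]
sample = select_texts(candidates, k=3, seed=42)
print(sample)

# Tags assigned by hand for illustration; in practice an automatic
# tagger would produce them and a human might correct them.
tagged = [("Corpora", "NOUN"), ("support", "VERB"), ("translators", "NOUN"), (".", "PUNCT")]
print(to_vertical(tagged))
```

Recording the seed alongside the corpus documentation lets others reproduce the same selection. The one-token-per-line layout is only one of several ways to store annotation.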

Corpus-based Translation Studies (CTS)

This approach was introduced by Mona Baker, and early work investigated recurrent features that distinguish translations from non-translated language production. These features are called universals of translation. According to this research, translated texts tend to be more explicit, to use more conventional grammar and lexis, and to be simpler than either the source text or other texts in the target language. The distinctive behavior or style of individual translators also has to be taken into consideration, as the choice of corpora and analysis techniques reflects different understandings of style in translation.
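
The claim that translated texts tend to be simpler can be made measurable in several ways. One simple proxy is the type-token ratio (distinct words divided by total words), where a lower ratio suggests a less varied vocabulary. The Python sketch below is an assumption about how such a comparison might be set up, not a method described in this article; the two sentences are invented and prove nothing about real translations.

```python
import re


def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return distinct word forms divided by total word tokens (0.0 if empty)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented examples standing in for a translated and a non-translated text.
translated = "the report says the plan is good and the plan is clear"
non_translated = "the report praises a lucid, carefully argued plan"

print(round(type_token_ratio(translated), 3))
print(round(type_token_ratio(non_translated), 3))
```

The ratio falls as texts get longer, so it is only meaningful when the texts compared are of similar length or when the ratio is computed over fixed-size chunks.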

Corpora can provide valuable aid in various fields, such as specialized translation. Beyond written translation, corpus linguistics also has applications in audiovisual translation and interpreting. Compiling corpora in these areas often requires transcription and audio or video alignment. Even with access to existing resources, methodological issues may arise given the complex nature of such data. Despite these challenges, the outcomes are valuable, and CTS continues to contribute to translation studies.