BaTelÒc: A Text Base for the Occitan Language



Authors: Myriam Bras & Marianne Vergez-Couret
Edition: CLLE-ERSS, UMR 5263 Université Toulouse Jean Jaurès & CNRS

Language Documentation & Conservation Special Publication No. 9 (January 2016):
Language Documentation and Conservation in Europe
ed. by Vera Ferreira and Peter Bouda, pp. 133–149

Year: 2016
File: PDF
Country: Occitania, Europe

Language Documentation, as defined by Himmelmann (2006), aims at compiling and preserving linguistic data for studies in linguistics, literature, history, ethnology, and sociology. This initiative is vital for endangered languages such as Occitan, a Romance language spoken in southern France and in several valleys of Catalonia and Italy. The documentation of a language concerns all its modalities, covering spoken and written language, various registers and so on.

Nowadays, Occitan documentation mostly consists of data from linguistic atlases, virtual libraries from the modern to the contemporary period, and text bases for the Middle Ages. BaTelÒc is a text base for the modern and contemporary periods. To cover a wide range of text collections, BaTelÒc gathers not only written literary texts (prose, drama and poetry) but also other genres such as technical texts and newspapers. Enough material is already available to foresee a text base of hundreds of millions of words.

BaTelÒc not only aims at documenting Occitan; it is also designed to provide tools to explore texts (different criteria for corpus selection, concordance tools and more complex enquiries with regular expressions). As for linguistic analysis, the second step is to enrich the corpora with annotations. Natural Language Processing of endangered languages such as Occitan is very challenging. It is not possible to transpose existing models for resource-rich languages directly, partly because of the spelling, dialectal variations, and lack of standardization.
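
The kind of concordance query with regular expressions mentioned above can be illustrated with a minimal sketch. This is not the BaTelÒc implementation: the function name `concordance`, the `Hit` type, the `width` parameter and the sample sentence are all assumptions made up for illustration.

```python
import re
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    """One keyword-in-context result (hypothetical structure)."""
    left: str
    match: str
    right: str


def concordance(text: str, pattern: str, width: int = 30) -> Iterator[Hit]:
    """Yield each regex match with up to `width` characters of context on each side."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    for m in regex.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        yield Hit(left, m.group(0), right)


# Toy sentence invented for this example; it is not taken from the text base.
sample = "Lo gat manja. La gata dormís. Los gats jògan."

# The pattern matches every word form beginning with "gat".
for hit in concordance(sample, r"\bgat\w*", width=10):
    print(f"{hit.left:>10} [{hit.match}] {hit.right}")
```

A pattern such as `\bgat\w*` retrieves several inflected forms in one query, which is one simple way a regular expression can absorb some of the variation found in non-standardized texts.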

With BaTelÒc we aim at providing corpora and lexicons for the development of basic natural language processing tools, namely OCR and a part-of-speech tagger based on tools that were initially designed for machine translation and that take variation into account.

Occitan is a Romance language, spoken in southern France and in several valleys of Spain and Italy. The number of speakers is hard to estimate: according to several studies, it lies between 600,000 and 2,000,000 (Martel 2007, Sibille 2010). Occitan is not a unitary language; it has several varieties. The most accepted classification of Occitan dialects was suggested by Bec (1995) and includes Auvernhàs, Gascon, Lengadocian, Lemosin, Provençau, and Vivaro-aupenc.

Occitan is not standardized as a whole. Nevertheless, it has been written since the Middle Ages and has a very important literary tradition. Its literature has been translated into other languages (Mistral, Boudou, Rouquette, Manciet, etc.). Although much less socialized than it was before the Second World War, Occitan is now present in newspapers, on the internet, on the radio and television, and in some public schools and universities. Non-governmental organizations maintain and spread Occitan: the Felibrige, the Institut d’Estudis Occitans, the associative network of immersive schools Calandreta, and the linguistic training institute for adults, the Centre de Formacion Professionala Occitan. However, Occitan has no official status in France.

In this paper, we present a text base for the Occitan language, called BaTelÒc. Nowadays, Occitan documentation mostly consists of dialectological data (several regional linguistic atlases are gathered in the THESOC database, searchable online), digitized lexicographic data (a few bilingual dictionaries are searchable online), virtual libraries (books in PDF format) from the modern period (Bibliotheca Tholosana Occitana, 16th–18th centuries) to the contemporary period (CIEL d’Òc, 19th–21st centuries), and machine-readable texts of the Middle Ages (Concordance of Medieval Occitan (Ricketts et al. 2001) and Linguistic Corpus of Old Gascon (Field 2013)). In addition, the CIRDOC (International Center for Occitan Documentation) is developing a multimedia library, Occitanica, which offers access to a multiplicity of sources: written texts, images, virtual exhibitions, documentary films, sound recordings, etc.

The BaTelÒc project aims at complementing those resources with machine-readable texts for the modern and contemporary periods (see Bras 2006, Bras & Thomas 2011 for a description of the experimental version of the text base). It aims at developing a wide range of text collections by gathering written literary texts (prose, drama and poetry) and other genres such as technical texts and newspapers, and also by embracing dialectal and spelling variations.

