Cataloguing linguistic diversity: Glottolog/Langdoc

Nordhoff, Sebastian, Max Planck Institute for Evolutionary Anthropology, Germany, sebastian_nordhoff@eva.mpg.de

Hammarström, Harald, Max Planck Institute for Evolutionary Anthropology, Germany, harald_hammarstroem@eva.mpg.de

Overview

Glottolog/Langdoc is a comprehensive database linking 180k bibliographical references to 21k languoids (language families, languages, dialects). It provides extensive query possibilities for human users and subscribes to the principles of Linked Open Data (Heath & Bizer 2011) as far as machine users are concerned.

The aim of Glottolog/Langdoc is to provide near-total bibliographical coverage of descriptive resources for the world’s languages. Every reference is treated as a resource, as is every ‘languoid’ (Good & Hendryx-Parker 2006). References are linked to the languoids which they describe, and languoids are linked to the references which describe them.
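
The two-way linking between references and languoids can be pictured as a pair of indexes that are always updated together. The following is only a minimal sketch of that idea, not the project's actual schema; the class name `LinkIndex` and all identifiers are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Toy bidirectional index between references and languoids.

    Hypothetical illustration; the identifiers used below are invented.
    """

    def __init__(self):
        self.languoids_by_ref = defaultdict(set)
        self.refs_by_languoid = defaultdict(set)

    def link(self, ref_id, languoid_id):
        # Record the link in both directions so either side can be queried.
        self.languoids_by_ref[ref_id].add(languoid_id)
        self.refs_by_languoid[languoid_id].add(ref_id)

    def references_for(self, languoid_id):
        # Sorted list of references describing the given languoid.
        return sorted(self.refs_by_languoid.get(languoid_id, ()))

    def languoids_for(self, ref_id):
        # Sorted list of languoids described by the given reference.
        return sorted(self.languoids_by_ref.get(ref_id, ()))


index = LinkIndex()
index.link("ref:1", "languoid:a")
index.link("ref:2", "languoid:a")
index.link("ref:2", "languoid:b")
print(index.references_for("languoid:a"))
print(index.languoids_for("ref:2"))
```

Keeping both directions explicit makes the two lookups equally cheap, at the cost of storing each link twice.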

Computational treatment and modeling of language resources has so far mainly concentrated on major languages with a research tradition in NLP and some commercial viability. When we leave the industrialized countries, language resources become very scarce. Treebanks or annotated corpora seem like fanciful ideas when the sum total of resources treating a language amounts to a description of its verbs and a treatise on its phonology from a local university, which is for instance the case for the Niger-Congo language Aduge. Before one can start thinking about developing a WordNet or similar larger resources for these languages, one must take stock of the resources which exist, however arcane they might be. This is one of the aims of the Glottolog/Langdoc project. The resources are tagged for resource type (grammar, word list, text collection, etc.), macroarea (roughly, continents), and language.

Use cases

Four different user groups can be distinguished: linguistic diversity researchers, statisticians, Semantic Web engineers, and linguistic empiricists.

Linguistic diversity researchers

The first group covers linguists interested in the world-wide distribution of linguistic diversity (cf. Evans & Levinson 2009), for the most part typologists. These researchers are for instance interested in the distribution of subject, verb, and object in the languages of the world (SVO, SOV, VSO, OVS, OSV, VOS), or in the size of the phonemic inventory. The emerging patterns can be related to human cognition on the one hand (SOV and SVO have distinct processing advantages; Hawkins 2004) and to known migration patterns of humans as they settled the global land mass on the other (Atkinson 2011). In order to acquire the necessary data points, descriptions of various languages have to be perused, taking genetic and geographical factors into account. This means that substantial bibliographical information has to be collected. Glottolog/Langdoc aims at providing near-total coverage of the literature on the world’s lesser-known languages, including grey literature. Note that Glottolog/Langdoc only provides the bibliographical records, not the referenced works themselves. All bibliographical information can be downloaded as txt, html or bibtex. Zotero integration is also provided. The provision of references is complemented by links to sites where a copy could be obtained (WorldCat, GoogleBooks, Open Library).

Statistical analysis

The links established between 180k references and 21k languoids allow for statistical analyses of the following kinds:

What is the descriptive coverage of a particular language?
What is the descriptive coverage of a particular language family or area (Hammarström & Nordhoff, in press)?


	Austronesian	non-Austronesian	Total
grammar	93 (17.82%)	114 (13.82%)	207 (15.37%)
grammar sketch	104 (19.92%)	148 (17.94%)	252 (18.71%)
phonology or sim.	55 (10.54%)	54 (6.55%)	109 (8.09%)
wordlist or less	270 (51.72%)	509 (61.70%)	779 (57.83%)

How many languages have so far been described (Hammarström & Nordhoff 2011)?
What is the status of the least described language families in the world (Hammarström 2010)?
In which geographic area is the research focus on phonology, in which area is syntax deemed more interesting?
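
The percentages in the table above follow directly from the raw counts: each cell is divided by its column total. The sketch below recomputes them; the counts are copied from the table, while the function and variable names are illustrative only.

```python
# Raw counts copied from the table above.
counts = {
    "grammar": {"Austronesian": 93, "non-Austronesian": 114},
    "grammar sketch": {"Austronesian": 104, "non-Austronesian": 148},
    "phonology or sim.": {"Austronesian": 55, "non-Austronesian": 54},
    "wordlist or less": {"Austronesian": 270, "non-Austronesian": 509},
}


def coverage_shares(counts):
    """Return per-column percentages (two decimals), including a Total column."""
    # Add a Total column by summing each row.
    with_total = {
        row: dict(cells, Total=sum(cells.values()))
        for row, cells in counts.items()
    }
    # Column totals are the denominators for the percentages.
    columns = ["Austronesian", "non-Austronesian", "Total"]
    column_sums = {
        col: sum(cells[col] for cells in with_total.values()) for col in columns
    }
    return {
        row: {
            col: round(100.0 * cells[col] / column_sums[col], 2)
            for col in columns
        }
        for row, cells in with_total.items()
    }


shares = coverage_shares(counts)
for row, cells in shares.items():
    print(row, cells)
```

Running this reproduces the figures given in the table, for example 17.82% for Austronesian grammars and 57.83% for the overall share of languages with a wordlist or less.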

Semantic Web engineers

Next to XHTML, Glottolog/Langdoc data are also available as RDF, making use of a number of established ontologies such as rdfs, skos,¹ gold,² lexvo,³ wgs84,⁴ bibo,⁵ and Dublin Core.⁶ This means that Glottolog/Langdoc data can be integrated into other projects making use of the aforementioned ontologies.
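
To give an impression of what such RDF data can look like, the sketch below writes a few statements about a languoid in N-Triples syntax using only the Python standard library. The SKOS and Dublin Core namespace URIs are the published ones; the base URI, the identifiers, and the choice of `skos:broader` and `dcterms:isReferencedBy` as properties are assumptions made for illustration and are not taken from the project's actual output.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
DCTERMS = "http://purl.org/dc/terms/"
BASE = "http://example.org/sketch/"  # hypothetical base URI, not the project's


def uri(value):
    # Wrap a URI in angle brackets, as N-Triples requires.
    return "<" + value + ">"


def literal(value):
    # Quote a plain string literal, escaping backslashes and double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def languoid_triples(languoid_id, label, parent_id=None, reference_ids=()):
    """Build (subject, predicate, object) triples for one languoid."""
    subject = uri(BASE + "languoid/" + languoid_id)
    triples = [(subject, uri(SKOS + "prefLabel"), literal(label))]
    if parent_id is not None:
        # Assumption: the genealogical parent is modeled with skos:broader.
        parent = uri(BASE + "languoid/" + parent_id)
        triples.append((subject, uri(SKOS + "broader"), parent))
    for ref_id in reference_ids:
        # Assumption: descriptive references are attached via dcterms:isReferencedBy.
        ref = uri(BASE + "reference/" + ref_id)
        triples.append((subject, uri(DCTERMS + "isReferencedBy"), ref))
    return triples


def to_ntriples(triples):
    # One statement per line, terminated by " ."
    return "\n".join("%s %s %s ." % t for t in triples)


triples = languoid_triples(
    "lang1", "Example language", parent_id="fam1", reference_ids=["ref1"]
)
print(to_ntriples(triples))
```

Because the properties come from shared vocabularies, a consumer that already understands SKOS or Dublin Core can interpret such statements without knowing anything specific about the publishing project.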

Empiricists

A number of researchers are uneasy with the way languages are currently defined and language codes assigned by the current registrar, SIL International. Some spurious languages do get codes, while some existing languages do not get a code. Some language families see a multiplication of their members (e.g. there are over 40 Quechuan languages) while this ‘splitter’ approach is not observed in other areas of the world (Nordhoff & Hammarström 2011). SIL draws on the Ethnologue⁷ for the set of living languages it provides codes for. The Ethnologue lists about 7000 languages, but does not always disclose the information the inclusion is based on. As a result, there are a number of ‘languages’ which do not seem to relate to anything in the real world, dangling pointers so to speak. They have an Ethnologue entry, a name, a code assigned by SIL, but no referent. Glottolog has the stated goal of giving a reference for every language it includes. The project distinguishes ‘established languoids’, for which scientific documentation is available, from ‘provisional languoids’, for which we do not have dedicated descriptions. The long-term goal is to get rid of ‘provisional languoids’ by either finding a description, making them ‘established’, or by discarding them. This will allow linguists to always be able to ascertain on what grounds a given language is argued to exist.
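
The distinction between ‘established’ and ‘provisional’ languoids amounts to a simple rule: a languoid counts as established as soon as at least one dedicated description is known for it. A minimal sketch of that rule follows; the function name and the sample identifiers are hypothetical, and the project's real criteria may be more nuanced.

```python
def classify_languoids(languoid_ids, descriptions_by_languoid):
    """Label each languoid 'established' or 'provisional'.

    descriptions_by_languoid maps a languoid id to a collection of
    references that describe it; missing or empty entries mean that
    no dedicated description is known.
    """
    status = {}
    for languoid_id in languoid_ids:
        has_description = bool(descriptions_by_languoid.get(languoid_id))
        status[languoid_id] = "established" if has_description else "provisional"
    return status


# Invented sample data for illustration.
descriptions = {"lang1": ["ref1", "ref2"], "lang2": []}
status = classify_languoids(["lang1", "lang2", "lang3"], descriptions)
print(status)
```

Under this rule, finding a single description for a provisional languoid is enough to move it to the established set.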

Technologies

Glottolog/Langdoc draws on over 20 input bibliographies, which are enriched with information about document type and languages covered using machine-learning techniques (Hammarström 2011). The project uses the Pylons web framework. Glottolog/Langdoc supports content negotiation (XHTML and RDF) and is part of the Linguistic Linked Open Data Cloud (Chiarcos et al. 2012).
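
Content negotiation means that the same URI returns XHTML or RDF depending on the `Accept` header sent by the client. The sketch below shows the general idea of how a server might pick a representation; it is not the project's Pylons code, and the two media type strings are assumptions about how XHTML and RDF would be labeled.

```python
# Assumed media types for the two representations named in the text.
SUPPORTED = ["application/xhtml+xml", "application/rdf+xml"]


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # The sort is stable, so equal q-values keep the client's order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries


def negotiate(header, supported=SUPPORTED):
    """Return the best supported media type, or None if nothing matches."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return None


print(negotiate("application/rdf+xml"))
print(negotiate("text/html;q=0.9, application/rdf+xml;q=0.5"))
print(negotiate("*/*"))
```

A browser that accepts anything would thus receive the human-readable page, while a Semantic Web client asking explicitly for RDF would receive the machine-readable data from the same address.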

Comparison

There are a number of related projects with slightly different foci. The WEBALL project,⁸ for instance, lists genetic, bibliographic, and geographic information about the languages of Africa, but not beyond. We incorporate a recent version of the WEBALL database in Glottolog/Langdoc. OLAC⁹ aggregates references to linguistic resources, but has a slightly different data model. Furthermore, OLAC uses federated data aggregated via OAI-PMH while Glottolog/Langdoc uses a static repository. OLAC does include genetic information, but this seems to be ad hoc. For instance, OLAC includes Austronesian and Eastern Malayo-Polynesian, but not the lower grouping Oceanic (still over 1000 languages). Furthermore, the coverage of OLAC and Glottolog/Langdoc is different. OLAC has more information on major languages, an area Glottolog/Langdoc disregards. Many of the higher numbers from OLAC come in fact from a single documentation project hosted at the MPI for Psycholinguistics in Nijmegen. These high numbers have technical reasons (every recording in the MPI archives is counted as one resource) and do not translate to an equally high number of publications. The following table gives the top languages by number of references for OLAC and Glottolog/Langdoc. A dagger † signals languages which profit from overcounting of resources from corpus1.mpi.nl.


OLAC	Glottolog/Langdoc
English (9044)	Swahili (1826)
German (5770)	Hausa (1542)
Dutch (5239)	Nama (Namibia) (1270)
Japanese (4317)	Afrikaans (1155)
Spanish (3331)	Central Yupik (1129)
Turkish (3091)	Standard French (1059)
French (1964)	Zulu (1048)
Yuracare (1576) †	South Levantine Arabic (990)
Oxchuc Tzeltal (1390) †	Tlingit (987)
Central Yupik (1312)	Gwich’in (905)
Turkish Sign Language (1157) †	Yoruba (900)
Yele (1152) †	Kabyle (879)
Aleut (1135)	Aleut (740)
Beaver (1080) †	Thai (730)
Dutch Sign Language (1063) †	Koyukon (728)
Tlingit (1028)	Xhosa (728)
North Alaskan Inupiatun (1024)	Akan (709)
Gwich’in (955)	Pulaar (703)
Czech (943)	Ewe (693)
Polish (892)	Tswana (690)

Multitree’s¹⁰ focus is on gathering all genetic classifications of the world’s languages, not so much on references. Glottolog uses Multitree information for lower levels, but its own, more conservative, classification for higher levels. Multitree has 141 families while the Glottolog main tree has 429 (including isolates).

The Ethnologue,¹¹ finally, lists languages and references, but does not subscribe to the document-centric approach Glottolog/Langdoc employs. These projects cover similar ground, but with different specializations. As an example, Ethnologue lists 45 references for Hausa, while Glottolog lists 1826; the figures for Thai are 37 and 730, respectively. Ethnologue has a similar number of language families to Multitree (132) and is thus less ‘splitting’ than Glottolog. The Ethnologue lists population figures and language development; this information is not found in Glottolog.

The current technical limitations mean that there is a substantial duplication of work. It is hoped that the use of RDF and related technologies will reduce this duplication. There is for instance no need for all these projects to keep their own database mapping geographical coordinates and countries to languages, nor is there a need for several databases of language names. Publication of these resources according to the principles of Linked Data (Chiarcos et al. 2012) will mean that the data can easily be repurposed and integrated into other applications.

Glottolog/Langdoc provides URIs for references and languoids so that other resources can easily link to or retrieve from Glottolog/Langdoc. While Glottolog/Langdoc takes a critical stance towards ISO 639-3, all relevant information can nevertheless also be accessed via the ISO 639-3 code.

References

Atkinson, Q. D. (2011). Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa. Science 332: 346.

Chiarcos, C., S. Nordhoff, and S. Hellmann, eds. (2012). Linked Data in Linguistics. Representing Language Data and Metadata. Springer. Companion volume of the Workshop on Linked Data in Linguistics 2012 (LDL-2012), held in conjunction with the 34th Annual Meeting of the German Linguistic Society (DGfS), March 2012, Frankfurt/M., Germany.

Evans, N., and S. Levinson (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32: 429-492.

Fabre, A. (2005). Diccionario Etnolingüístico y guía Bibliográfica de los Pueblos Indigenas Sudamericanos. Book in Progress at http://butler.cc.tut.fi/ fabre/BookInternetVersio/Alkusivu.html accessed May 2005.

Good, J., and C. Hendryx-Parker (2006). Modeling Contested Categorization in Linguistic Databases. Proceedings of the EMELD 2006 Workshop on Digital Language Documentation: Tools and Standards: The State of the Art. Lansing, Michigan. June 20-22, 2006 http://www.linguistlist.org/emeld/workshop/2006/papers/GoodHendryxParker-Modelling.pdf.

Hammarström, H. (2010). The Status of the Least Documented Language Families in the World. Language Documentation & Conservation 4: 177-212.

Hammarström, H. (2011). Automatic Annotation of Bibliographical References for Descriptive Language Materials. In P. Forner, J. Gonzalo, J. Kekäläinen, M. Lalmas, and M. de Rijke (eds.), Proceedings of the CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation, LNCS, vol. 6941. Berlin: Springer, pp. 62-73.

Hammarström, H., and S. Nordhoff (2011). How many languages have so far been described? Paper presented at NWO Endangered Languages Programme Conference, Leiden, April 2011.

Hammarström, H., and S. Nordhoff (in press). Achievements and Challenges in the Description of the Languages of Melanesia. In M. Klamer and N. Evans (eds.), Melanesian Languages on the Edge of Asia, Special Issue of Language Documentation & Conservation.

Hawkins, J. A. (2004). Efficiency and Complexity in Grammars. Oxford UP.

Heath, T., and C. Bizer (2011). Linked Data – Evolving the Web into a Global Data Space. San Rafael: Morgan & Claypool.

Maho, J. (2001). African Languages Country by Country: A Reference Guide, Göteborg Africana Informal Series, vol. 1. Department of Oriental and African Languages, Göteborg University, 5th ed.

Nordhoff, S., and H. Hammarström (2011). Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In Proceedings of the First International Workshop on Linked Science 2011, CEUR Workshop Proceedings, vol. 783. URL http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/LISC/nordhoff.pdf.

Notes

1. http://www.w3.org/2004/02/skos/

2. http://linguistics-ontology.org

3. http://lexvo.org

4. http://www.w3.org/2003/01/geo/wgs84_pos

5. http://bibliontology.com/

6. http://dublincore.org/

7. http://www.ethnologue.com

8. http://sumale.vjf.cnrs.fr/Biblio/

9. http://www.language-archives.org/

10. http://multitree.org/

11. http://www.ethnologue.com