Glottolog/Langdoc is a comprehensive database linking 180k bibliographical references to 21k languoids (language families, languages, dialects). It provides extensive query possibilities for human users and subscribes to the principles of Linked Open Data (Heath & Bizer 2011) as far as machine users are concerned.
The aim of Glottolog/Langdoc is to provide near-total bibliographical coverage of descriptive resources for the world’s languages. Every reference is treated as a resource, as is every ‘languoid’ (Good & Hendryx-Parker 2006). References are linked to the languoids which they describe, and languoids are linked to the references which describe them.
Computational treatment and modeling of language resources has so far mainly concentrated on major languages with a research tradition in NLP and some commercial viability. When we leave the industrialized countries, language resources become very scarce. Treebanks or annotated corpora seem like fanciful ideas when the sum total of resources treating a language amounts to a description of its verbs and a treatise on its phonology from a local university, which is for instance the case for the Niger-Congo language Aduge. Before one can start thinking about developing a WordNet or similar larger resources for these languages, one must take stock of the resources which exist, however arcane they might be. This is one of the aims of the Glottolog/Langdoc project. The resources are tagged for resource type (grammar, word list, text collection, etc.), macroarea (roughly, continents), and language.
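The tagging and linking described above can be illustrated with a minimal Python sketch. It is not the project's actual data model: the class name `Reference`, the record keys, and the document type labels are hypothetical and chosen only for illustration. The sketch shows how references tagged for document type, macroarea, and language can be turned into a languoid-to-references index.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """A bibliographical record with the three kinds of tags named in the text."""
    key: str          # hypothetical record identifier
    doctype: str      # resource type; the labels used below are illustrative
    macroarea: str    # roughly, a continent
    languoids: tuple  # names of the languoids the reference describes


def build_index(references):
    """Map each languoid to the set of reference keys that describe it."""
    by_languoid = defaultdict(set)
    for ref in references:
        for languoid in ref.languoids:
            by_languoid[languoid].add(ref.key)
    return by_languoid


# The two kinds of resources the text mentions for Aduge (keys are invented).
refs = [
    Reference("aduge-verbs", "verb description", "Africa", ("Aduge",)),
    Reference("aduge-phonology", "phonology", "Africa", ("Aduge",)),
]
index = build_index(refs)
print(sorted(index["Aduge"]))
```

Each reference already lists its languoids, so the index supplies the opposite direction of the link: from a languoid to the references which describe it.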
Four different user groups can be distinguished: language diversity researchers, statisticians, Semantic Web engineers, and linguistic empiricists.
The first group covers linguists interested in the world-wide distribution of linguistic diversity (cf. Evans & Levinson 2009), for the most part typologists. These researchers are for instance interested in the distribution of subject, verb, and object in the languages of the world (SVO, SOV, VSO, OVS, OSV, VOS), or in the size of the phonemic inventory. The emerging patterns can be related to human cognition on the one hand (SOV and SVO have distinct processing advantages; Hawkins 2004) and to known migration patterns of humans as they settled the global land mass on the other (Atkinson 2011). In order to acquire the necessary data points, descriptions of various languages have to be perused, taking genetic and geographical factors into account. This means that substantial bibliographical information has to be collected. Glottolog/Langdoc aims at providing near-total coverage of the literature on the world’s lesser-known languages, including grey literature. Note that Glottolog/Langdoc only provides the bibliographical records, not the referenced documents themselves. All bibliographical information can be downloaded as txt, html or bibtex. Zotero integration is also provided. The provision of references is complemented by links to sites where a copy could be obtained (WorldCat, GoogleBooks, Open Library).
The links established between 180k references and 21k languoids allow for statistical analyses of the following kinds:
Next to XHTML, Glottolog/Langdoc data are also available as RDF, making use of a number
of established ontologies such as rdfs, skos,
A number of researchers are uneasy about the way languages
are defined and language codes are assigned by the current registrar, SIL International.
Some spurious languages do get codes, while some existing languages do not get a code.
Some language families see a multiplication of their members (e.g. there are over 40
Quechuan languages) while this ‘splitter’ approach is not observed in other areas of the
world (Nordhoff & Hammarström 2011). SIL draws on the Ethnologue
Glottolog/Langdoc draws on over 20 input bibliographies, which are enriched with information about document type and languages covered using machine-learning techniques (Hammarström 2011). The project uses the Pylons web framework. Glottolog/Langdoc supports content negotiation (XHTML and RDF) and is part of the Linguistic Linked Open Data Cloud (Chiarcos et al. 2012).
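As an illustration of content negotiation from the client side, the following Python sketch builds two requests for the same resource URI that differ only in their `Accept` header. The URI is a placeholder under `example.org`, not an actual Glottolog/Langdoc address, and the assumption that the server answers exactly these two media types is ours. The requests are only constructed, not sent.

```python
import urllib.request

RDF_XML = "application/rdf+xml"   # media type a machine user might ask for
XHTML = "application/xhtml+xml"   # media type a browser might ask for


def build_request(uri, want_rdf):
    """Build a request whose Accept header states the preferred representation."""
    accept = RDF_XML if want_rdf else XHTML
    return urllib.request.Request(uri, headers={"Accept": accept})


# Placeholder URI: the real URI scheme is not given in the text.
LANGUOID_URI = "http://example.org/languoid/aduge"

machine_request = build_request(LANGUOID_URI, want_rdf=True)
human_request = build_request(LANGUOID_URI, want_rdf=False)

print(machine_request.get_header("Accept"))
print(human_request.get_header("Accept"))
```

The point of the design is that one URI identifies the resource, while the representation returned depends on what the client declares it can process.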
There are a number of related projects with slightly different foci. The WEBALL project for instance [...] OLAC [...] Austronesian and Eastern
Malayo-Polynesian, but not the lower grouping Oceanic
(still over 1000 languages). Furthermore, the coverage of OLAC and Glottolog/Langdoc is
different. OLAC has more information on major languages, an area Glottolog/Langdoc
disregards. Many of the higher numbers from OLAC come in fact from a single
documentation project hosted at the MPI for Psycholinguistics in Nijmegen. These high
numbers have technical reasons (every recording of the MPI archives is counted as one
resource) and do not translate to an equally high number of publications. The following
table gives the top languages by number of references for OLAC and
Glottolog/Langdoc. A dagger † signals languages which profit from overcounting of
resources from corpus1.mpi.nl.
Multitree’s
The Ethnologue
The current technical limitations mean that there is substantial duplication of work. It is hoped that the use of RDF and related technologies will reduce this duplication. There is for instance no need for each of these projects to keep its own database mapping geographical coordinates and countries to languages, nor is there a need for several databases of language names. Publication of these resources according to the principles of Linked Data (Chiarcos et al. 2012) will mean that the data can easily be repurposed and integrated into other applications.
Glottolog/Langdoc provides URIs for references and languoids so that other resources can easily link to or retrieve from Glottolog/Langdoc. While Glottolog/Langdoc takes a critical stance towards ISO 639-3, all relevant information can nevertheless also be accessed via the ISO 639-3 code.
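How another resource might link to such URIs can be sketched in Python by writing statements in N-Triples syntax. The base URI and the languoid identifiers are placeholders and do not reflect the actual Glottolog/Langdoc URI scheme. The choice of `rdfs:label` and `skos:broader` is an assumption for illustration; the text only states that established ontologies such as rdfs and skos are used.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
BASE = "http://example.org/languoid/"  # placeholder, not the real URI scheme


def uri_triple(subject, predicate, obj):
    """Serialize a statement whose object is another resource."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def literal_triple(subject, predicate, text):
    """Serialize a statement whose object is a plain string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


# Oceanic is a lower grouping than Eastern Malayo-Polynesian.
oceanic = BASE + "oceanic"
eastern_mp = BASE + "eastern-malayo-polynesian"

lines = [
    literal_triple(oceanic, RDFS_LABEL, "Oceanic"),
    uri_triple(oceanic, SKOS_BROADER, eastern_mp),
]
print("\n".join(lines))
```

Because both statements refer to languoids by URI only, a project that reuses these URIs does not need to maintain its own copy of the names or the classification.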
Atkinson, Q. D. (2011). Phonemic
Diversity Supports a Serial Founder Effect Model of Language Expansion from
Africa. Science 332: 346.
Chiarcos, C., S. Nordhoff, and S. Hellmann,
eds. (2012). Linked Data in Linguistics. Representing
Language Data and Metadata. Springer. Companion volume of the Workshop
on Linked Data in Linguistics 2012 (LDL-2012), held in conjunction with the 34th
Annual Meeting of the German Linguistic Society (DGfS), March 2012,
Frankfurt/M., Germany.
Evans, N., and S. Levinson (2009). The
myth of language universals: Language diversity and its importance for cognitive
science. Behavioral and Brain Sciences 32: 429-492.
Fabre, A. (2005). Diccionario
Etnolingüístico y Guía Bibliográfica de los Pueblos Indígenas Sudamericanos.
Book in progress at http://butler.cc.tut.fi/~fabre/BookInternetVersio/Alkusivu.html,
accessed May 2005.
Good, J., and C. Hendryx-Parker (2006).
Modeling Contested Categorization in Linguistic Databases. Proceedings of the EMELD 2006 Workshop on Digital Language Documentation:
Tools and Standards: The State of the Art. Lansing, Michigan. June
20-22, 2006 http://www.linguistlist.org/emeld/workshop/2006/papers/GoodHendryxParker-Modelling.pdf.
Hammarström, H. (2010). The Status of
the Least Documented Language Families in the World. Language
Documentation & Conservation 4: 177-212.
Hammarström, H. (2011). Automatic
Annotation of Bibliographical References for Descriptive Language Materials. In
P. Forner, J. Gonzalo, J. Kekäläinen, M. Lalmas, and M. de Rijke (eds.),
Proceedings of the CLEF 2011 Conference on Multilingual and Multimodal
Information Access Evaluation, LNCS, vol. 6941. Berlin:
Springer, pp. 62-73.
Hammarström, H., and S. Nordhoff
(2011). How many languages have so far been described? Paper presented at NWO
Endangered Languages Programme Conference, Leiden, April 2011.
Hammarström, H., and S. Nordhoff (in
press). Achievements and Challenges in the Description of the Languages of
Melanesia. In M. Klamer and N. Evans (eds.), Melanesian languages on the Edge of
Asia, Special Issue of Language Documentation &
Conservation.
Hawkins, J. A. (2004). Efficiency and Complexity in Grammars. Oxford UP.
Heath, T., and C. Bizer (2011). Linked Data - Evolving the Web into a Global Data Space.
San Rafael: Morgan & Claypool.
Maho, J. (2001). African Languages Country by Country: A Reference Guide, Göteborg Africana Informal Series, vol. 1. Department of
Oriental and African Languages, Göteborg University, 5th ed.
Nordhoff, S., and H. Hammarström (2011). Glottolog/Langdoc:
Defining dialects, languages, and language families as collections of resources.
In Proceedings of the First International Workshop on Linked Science 2011, CEUR Workshop Proceedings, vol. 783. URL
http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/LISC/nordhoff.pdf.