
Nakhimovsky, Alexander, Colgate University, USA
Good, Jeff, Department of Linguistics, University at Buffalo, USA
Myers, Tom, N-Topus Software, USA

Tools for language documentation

The two main outputs of traditional language documentation are lexicons and corpora of texts. In the last two decades, it has been proposed to augment these in a number of ways (Himmelmann 1998, 2006):

  • The primary data for language documentation should be made available in the form of (digital) audio or video recordings made in the field.
  • Corpora should consist of time-aligned and annotated transcripts of those recordings. Time alignment makes explicit how a given set of annotations relates to a media segment. Since there is a one-to-one correspondence between text and media segments, the latter can be searched by their text and annotations.
  • Text annotations should generally take the form of Interlinear Glossed Text (IGT) (see Palmer & Erk 2007). An example of IGT, from Haspelmath (1993), is shown below. The format has a tree-like structure, which can be represented via a ‘nested’ table. Its basic components include a transcriptional representation of data from the language being described, which is further broken down into words and their component morphemes. Each sentence is associated with a free translation, and each word is associated with (possibly specialized) glosses.
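
The tree-like structure of IGT and its alignment with media segments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the millisecond fields and the find_segments helper are hypothetical, and the two sentences are invented placeholder data, not the Haspelmath (1993) example and not drawn from any real language.

```python
from dataclasses import dataclass, field


@dataclass
class Morpheme:
    form: str   # morpheme as transcribed
    gloss: str  # gloss for this morpheme


@dataclass
class Word:
    form: str
    morphemes: list[Morpheme] = field(default_factory=list)

    @property
    def gloss(self) -> str:
        # Word-level gloss built by joining the morpheme glosses with hyphens.
        return "-".join(m.gloss for m in self.morphemes)


@dataclass
class Sentence:
    start_ms: int       # start of the aligned media segment
    end_ms: int         # end of the aligned media segment
    transcription: str
    translation: str    # free translation of the whole sentence
    words: list[Word] = field(default_factory=list)


def find_segments(corpus: list[Sentence], gloss: str) -> list[tuple[int, int]]:
    """Return (start_ms, end_ms) for every sentence containing the given gloss."""
    return [
        (s.start_ms, s.end_ms)
        for s in corpus
        if any(m.gloss == gloss for w in s.words for m in w.morphemes)
    ]


# Invented placeholder data (hypothetical forms and glosses).
corpus = [
    Sentence(0, 1800, "tala-mi sun", "The dogs sleep.", [
        Word("tala-mi", [Morpheme("tala", "dog"), Morpheme("mi", "PL")]),
        Word("sun", [Morpheme("sun", "sleep")]),
    ]),
    Sentence(1800, 3500, "tala ro", "The dog runs.", [
        Word("tala", [Morpheme("tala", "dog")]),
        Word("ro", [Morpheme("ro", "run")]),
    ]),
]

# Searching by annotation returns the media segments to play back.
plural_segments = find_segments(corpus, "PL")
```

Because each sentence carries its own time span, a search over glosses or text yields media segments directly, which is the practical benefit of time alignment described above.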

Existing tools supporting the production of lexicons and time-aligned corpora are not, at present, well integrated. One testimony to this is the continuing use of Toolbox (formerly known as Shoebox), a database utility optimized for the creation of IGT and lexicons, initially developed in the 1980s by SIL International (SIL). Toolbox lacks native support for time-aligned annotation and, more strikingly, proper data validation. However, one of its key features, the integration of a text database with a lexical database to facilitate automated glossing, has yet to be effectively replicated in any other widely used tool. A recent major revision of FieldWorks Language Explorer (FLEx), also from SIL, positions that tool to fill this gap (see Rogers 2010 for a recent review). FLEx is now cross-platform and internally uses a native XML format, which can form the basis for its integration into a set of interoperable tools.

In Europe, there is a major center for the development of language documentation and language archiving technology (LAT) software at the Max Planck Institute for Psycholinguistics. One of the tools it has produced, ELAN, has become widely adopted for the creation of time-aligned annotations (see Berez 2007 for a review). Toolbox, ELAN and FLEx are the most commonly used specialized tools for language documentation. While all three do some things well, none covers the entire range of language documentation tasks. For example, ELAN has no support for lexicon building, and Toolbox and FLEx have no support for time alignment or for the audio playback that is of great help in transcribing a media file. In addition, none provides direct support for the creation of publishable outputs, a gap that is most frequently filled by Microsoft Word, even though it offers no ready means of interoperating with custom linguistic software.

While ELAN has import/export modules for Toolbox, there was, until recently, no way to share data between ELAN and FLEx. In 2009, one of the authors (TJM) led the development of a FLEx-import module that was integrated into a release of ELAN. This effort was severely handicapped by the limitations of FLEx’s native data storage format at that time, which have been partly addressed in its present XML format. Since then, we have been working with both the ELAN and FLEx teams to establish a means for lossless two-way interchange of files between ELAN and FLEx. This will make it possible for a linguist to, for example, exploit the time-alignment functionality of ELAN and the lexicon and parsing functionalities of FLEx without losing information produced by either tool when exporting/importing data between them.

The main obstacle to achieving this goal is that the two programs have very different internal data models: ELAN has the capability to create structures that are not replicable in FLEx (for instance, to represent overlapping utterances from more than one speaker), while FLEx allows words to be associated with a wealth of grammatical information that cannot be encoded using ELAN. At the same time, both ELAN and FLEx contain some overlapping information in their representations of IGT. Our proposed solution to achieve full interoperation is to provide both ELAN and FLEx with a unified underlying representation that holds data from both programs. Some parts of that representation will be ignored by ELAN, and other parts will be ignored by FLEx, but new tools can be developed to ensure that a lossless round trip between the two programs will be possible (see Cochran et al. 2007 for an earlier discussion). This approach can also broaden the interoperability between ELAN and Toolbox, and indeed facilitate interoperability between Toolbox and FLEx as well.
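
The idea of a shared representation of which each tool sees only a part can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the actual ELAN or FLEx data model: triples are plain Python tuples, the predicate names and the view and write_back helpers are hypothetical, and the data is invented.

```python
# A triple is (subject, predicate, object); the unified graph is a set of triples.
Triple = tuple[str, str, str]

# Hypothetical vocabularies: the predicates each tool is assumed to understand.
ELAN_PREDICATES = {"text", "startMs", "endMs", "overlaps"}
FLEX_PREDICATES = {"text", "partOfSpeech", "lexEntry"}


def view(graph: set[Triple], known: set[str]) -> set[Triple]:
    """Return the part of the unified graph that one tool can see."""
    return {t for t in graph if t[1] in known}


def write_back(graph: set[Triple], known: set[str], edited: set[Triple]) -> set[Triple]:
    """Replace the tool's visible triples with its edited ones.

    Triples whose predicates the tool ignores are carried over untouched,
    which is what makes the round trip lossless in this toy model.
    """
    hidden = {t for t in graph if t[1] not in known}
    return hidden | edited


# Invented data: one word with timing (ELAN side) and grammar (FLEx side).
unified = {
    ("w1", "text", "tala"),
    ("w1", "startMs", "1800"),
    ("w1", "endMs", "2400"),
    ("w1", "overlaps", "w7"),
    ("w1", "partOfSpeech", "noun"),
    ("w1", "lexEntry", "lex:tala"),
}

# The FLEx-like tool corrects the transcription without ever seeing the timing.
flex_view = view(unified, FLEX_PREDICATES)
edited = (flex_view - {("w1", "text", "tala")}) | {("w1", "text", "talla")}
after = write_back(unified, FLEX_PREDICATES, edited)

# The ELAN-like tool now sees the corrected text alongside its own timing data.
elan_view = view(after, ELAN_PREDICATES)
```

In this sketch the shared predicate (text) is the overlapping information, and the edit made through one tool's view reaches the other tool while the tool-specific triples survive unchanged.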

Resource Description Framework (RDF) graphs of the sort associated with the Semantic Web (Allemang & Hendler 2010) are an obvious choice for a unifying representation. One of the main purposes of RDF is specifically to merge heterogeneous representations of overlapping data. It also has a standard query language and a rapidly growing arsenal of tools for development.

We would like to emphasize the generality of this solution: given two programs with overlapping but partially incompatible data models, RDF, perhaps augmented by Web Ontology Language (OWL), can be used as an interlingua that can represent both. (See Farrar & Langendoen 2009 for discussion of the use of OWL in descriptive linguistics.) When this representation is accessed by one of the programs (e.g., by using SPARQL, the query language for Semantic Web data – see DuCharme 2011), parts of the representation will be ignored, but the editing done on the overlapping part can be preserved for eventual access by the other program.
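
To make the querying step concrete, the following sketch implements a very small matcher for triple patterns, the core idea behind a SPARQL basic graph pattern. It is a stand-in written with the standard library only, not a SPARQL engine; the match function, the predicate names and the data are hypothetical. The pattern used corresponds roughly to the SPARQL query SELECT ?w ?text WHERE { ?w ex:partOfSpeech "noun" . ?w ex:text ?text }, where the ex: prefix is likewise an assumption.

```python
def match(graph, patterns, binding=None):
    """Yield variable bindings that satisfy every triple pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(graph, rest, candidate)


# Invented data mixing timing and grammatical information.
graph = {
    ("w1", "text", "tala"),
    ("w1", "partOfSpeech", "noun"),
    ("w1", "startMs", "1800"),
    ("w2", "text", "ro"),
    ("w2", "partOfSpeech", "verb"),
    ("w2", "startMs", "2400"),
}

# A query that touches only the predicates a grammar-oriented tool cares about;
# the timing triples are simply never matched.
nouns = list(match(graph, [("?w", "partOfSpeech", "noun"), ("?w", "text", "?text")]))
```

A program that queries in this way never has to know about the parts of the representation it does not use, which is why those parts can be left intact for the other program.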

Since May 2011, we have maintained a discussion group in which both ELAN and FLEx developers participate, allowing the development of a concrete implementation of this proposed solution. A key area of progress has been devising an appropriate system for assigning globally unique identifiers to be used in both ELAN and FLEx to allow the overlapping data available to both tools to be effectively tracked.
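
The role of shared identifiers can be shown with a brief sketch. The abstract does not state which identifier scheme was adopted, so the use of random UUIDs in URN form here is purely an assumption, as are the record layouts and the mint_id and merge_by_id helpers.

```python
import uuid


def mint_id() -> str:
    """Mint a globally unique identifier as a URN (random version-4 UUID)."""
    return uuid.uuid4().urn


def merge_by_id(*records: dict) -> dict[str, dict]:
    """Combine records from different tools that carry the same identifier."""
    merged: dict[str, dict] = {}
    for record in records:
        merged.setdefault(record["id"], {}).update(record)
    return merged


# The identifier is minted once for a word token and then stored by both tools.
word_id = mint_id()

# Hypothetical records: timing on the ELAN side, grammar on the FLEx side.
elan_record = {"id": word_id, "start_ms": 1800, "end_ms": 2400}
flex_record = {"id": word_id, "part_of_speech": "noun"}

merged = merge_by_id(elan_record, flex_record)
```

Since both records carry the same identifier, the overlapping data can be recognized as describing one object regardless of which tool last edited it.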

Serving diverse communities through interoperation

An endangered language documentation project typically results in two collections of materials, one addressed to scholars, the other addressed to the local community. The requirements for the scholarly collection are quite precisely defined in terms of allowable data formats, required metadata, and specifications of access restrictions (see, e.g., Johnson 2004; Austin 2006; Nathan 2011). By contrast, there has been little discussion of a general systematic approach to creating materials for local communities, nor even an attempt to create principled specifications for them (though see Nathan 2006; Nakhimovsky & Myers 2003). As a result, materials for local communities are frequently developed as a separate effort, unrelated to the scholarly archive.

The approach presented in the first section of the paper creates an opportunity for a more general solution. We suggest that a unified RDF representation is well suited for the creation of materials for local communities due to rapidly developing trends in data dissemination technology, including the increasing adoption of Semantic Web technologies as a means of data exchange, the widespread deployment of cloud services, and the increasing use of social media in even relatively marginalized communities, especially via mobile devices like smartphones (see, e.g., the BOLD project for an example as well as Nathan 2010).

The overall strategy for developing community materials could be as follows. Using SPARQL queries, extract suitable materials from the overall RDF collection (e.g., streamable media, community-selected texts) and assemble them, possibly dynamically, into a community website to be accessed via the internet or a local server. Such a model would allow community members to reassemble the data to suit their needs while also ensuring they have access to the most up-to-date analyses of the linguists. Moreover, in principle, the same basic techniques that permit tool interoperation via a unified RDF representation could also allow communities to enrich the RDF themselves, thereby adding information (e.g., metadata or cultural annotations) to the dataset and allowing them to take a more active role in the construction of the documentary record of their language.
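
The extraction-and-assembly strategy can be sketched as follows. This is a simplified illustration using plain tuples in place of an RDF store and a list comprehension in place of a SPARQL query; the predicate names (title, mediaUrl, communitySelected), the file paths and the helper functions are all hypothetical.

```python
from html import escape

# Invented collection: two recordings, only one selected by the community.
graph = {
    ("text1", "title", "How the river was made"),
    ("text1", "mediaUrl", "media/text1.mp3"),
    ("text1", "communitySelected", "true"),
    ("text2", "title", "Elicitation session 14"),
    ("text2", "mediaUrl", "media/text2.mp3"),
    ("text2", "communitySelected", "false"),
}


def objects(graph, subject, predicate):
    """Return all objects for a subject/predicate pair, sorted for stable output."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def community_items(graph):
    """Extract (title, media URL) pairs for community-selected texts."""
    selected = sorted(
        s for s, p, o in graph if p == "communitySelected" and o == "true"
    )
    return [
        (objects(graph, s, "title")[0], objects(graph, s, "mediaUrl")[0])
        for s in selected
    ]


def render_page(items):
    """Assemble the extracted items into a minimal HTML list with audio players."""
    rows = "\n".join(
        f'<li>{escape(title)} <audio controls src="{escape(url)}"></audio></li>'
        for title, url in items
    )
    return f"<ul>\n{rows}\n</ul>"


page = render_page(community_items(graph))
```

Because the page is generated from the collection each time, re-running the extraction after the linguists update their analyses, or after community members add their own annotations, would yield an updated page without separate maintenance.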


This work has been supported by the National Science Foundation [0553546 A.N.; 1065619 A.N., T.M.; 0715246 J.G.].


Allemang, D., and J. Hendler (2010). Semantic Web for the working ontologist: Effective modeling in RDFS and OWL (second edition). Amsterdam: Elsevier.

Austin, P. K. (2006). Data and language documentation. In J. Gippert, N. P. Himmelmann & U. Mosel (eds.), Essentials of language documentation. Berlin: Mouton de Gruyter, pp. 87-112.

Berez, A. (2007). Technology review: EUDICO Linguistic Annotator (ELAN). Language Documentation and Conservation 1: 283-289.

BOLD. (accessed March 4, 2012).

Cochran, M., J. Good, D. Loehr, S. A. Miller, S. Stephens, B. Williams, and I. Udoh (2007). Report from TILR Working Group 1: Tools interoperability and input/output formats. Toward the Interoperability of Language Resources Workshop, Stanford, California, July 2007.

DuCharme, B. (2011). Learning SPARQL. Sebastopol, CA: O’Reilly.

ELAN. (accessed March 4, 2012).

Farrar, S., and D. T. Langendoen (2009). An OWL-DL implementation of GOLD: An ontology for the Semantic Web. In A. Witt & D. Metzing (eds.), Linguistic modeling of information and markup languages: Contributions to language technology. Berlin: Springer, pp. 45-66.

FLEx. (accessed March 4, 2012).

Haspelmath, M. (1993). A Grammar of Lezgian. Berlin: Mouton.

Himmelmann, N. (1998). Documentary and descriptive linguistics. Linguistics 36: 161-195.

Himmelmann, N. P. (2006). Language documentation: What is it and what is it good for? In J. Gippert, N. P. Himmelmann & U. Mosel (eds.), Essentials of language documentation. Berlin: Mouton de Gruyter, pp. 1-30.

Johnson, H. (2004). Language documentation and archiving, or how to build a better corpus. In P. K. Austin (ed.), Language documentation and description, volume 2. London: SOAS, pp. 140-153.

LAT. (accessed March 4, 2012).

Nakhimovsky, A. D., and T. Myers (2003). Digital video annotations for education. Paper presented at the International Conference on Engineering Education, Valencia, Spain.

Nathan, D. (2006). Thick interfaces: Mobilizing language documentation with multimedia. In J. Gippert, N. P. Himmelmann & U. Mosel (eds.), Essentials of language documentation. Berlin: Mouton de Gruyter, pp. 363-379.

Nathan, D. (2010). Archives 2.0 for endangered languages: From disk space to MySpace. International Journal of Humanities and Arts Computing 4: 111-124.

Nathan, D. (2011). Digital Archiving. In P. K. Austin & J. Sallabank (eds.), The Cambridge handbook of endangered languages. Cambridge: Cambridge UP, pp. 255-274.

Palmer, A., and K. Erk (2007). IGT-XML: An XML format for interlinearized glossed texts. Proceedings of the Linguistic Annotation Workshop. Stroudsburg, Pennsylvania: Association for Computational Linguistics, pp. 176-183.

Rogers, Ch. (2010). Technology review: Fieldworks Language Explorer (FLEx) 3.0. Language Documentation and Conservation 4: 74-84.