TXSTEP – an integrated XML-based scripting language for scholarly text data processing

Ott, Wilhelm, Universität Tübingen, Germany, wilhelm.ott@uni-tuebingen.de

Ott, Tobias, Stuttgart Media University, Germany, ott@hdm-stuttgart.de

Gasperlin, Oliver, pagina GmbH publication technologies, Germany, oliver.gasperlin@pagina-tuebingen.de

Introduction

With TXSTEP, we present and put up for discussion the prototype of a new, powerful XML-based tool for scholarly research in the text-based humanities. Its architecture is based on more than 40 years of experience in supporting humanities projects at the University of Tübingen and beyond.

The purpose of TXSTEP is not to provide another toolbox containing ready-made solutions for pre-defined problems. Of course, tools like these are adequate for many purposes; but we see no urgency to add a further one to the existing packages of this kind.

Instead, TXSTEP has been designed as a high-performance scripting environment for the serious humanities scholar and other professionals in text data processing who face problems not easily solvable by XSLT or other means. TXSTEP gives them complete control over every detail of the data processing part of their projects.

Humanities software: basic requirements

Software for serious humanities research has to have certain basic qualities:

it must be easy to handle, so that the scholar who is an expert in his field, but not in programming or computer science, can use it safely;
it must be flexible enough to be adapted to the special requirements of each project, be it a philological analysis of a text or the preparation of a critical edition;
it should support not only single phases of a project, but all its stages and steps, including (for an editorial project) first transcription of the sources and collation of the transcribed texts, evaluation of the variant readings, constitution of the edition text and the critical apparatus, up to (and even beyond) the preparation of the final publication of text, apparatuses, and indexes.

TXSTEP tries to reconcile these somewhat contradictory requirements by defining the fundamental operations necessary for the processing of textual data, and by providing a separate program module for each of these basic functions, which can be used without any knowledge of conventional programming or scripting languages.

The solution: 1. Modularity

These modules may be combined almost arbitrarily: each module reads from and writes to a single basic file structure, so the modules can be chained like Unix filters.

Where necessary, the user can adapt the individual modules to special requirements by changing default parameters (e.g. to provide a sort key for a non-Latin alphabet) or by supplying additional ones (e.g. to omit the definite article from the sort key for titles in bibliographic records).
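The idea of a user-supplied sort key that omits the definite article can be sketched in a few lines of Python. This is only an illustration of the principle, not TXSTEP or TUSTEP syntax; the function name title_sort_key, its articles parameter, and the sample titles are assumptions made for this example.

```python
def title_sort_key(title, articles=("the",)):
    """Return a sort key that ignores a leading definite article and letter case.

    Hypothetical helper for illustration; not part of TXSTEP or TUSTEP.
    """
    words = title.split()
    # Drop the first word only if it is an article and something follows it.
    if len(words) > 1 and words[0].casefold() in articles:
        words = words[1:]
    return " ".join(words).casefold()


# Sample data invented for this sketch.
titles = ["The Tempest", "Hamlet", "The Alchemist", "Macbeth"]

# The titles themselves stay unchanged; only the ordering uses the key.
sorted_titles = sorted(titles, key=title_sort_key)
print(sorted_titles)
```

Passing a different tuple, for example articles=("der", "die", "das"), would adapt the same key to German titles without touching the sorting step itself.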

However limited the scope of the individual modules may be, the flexibility of their combination is illustrated by the fact that, for example, there is no dedicated program for generating an alphabetical word list. For this purpose the user combines three modules: the module for text decomposition (for which he has to provide the parameters defining the single elements and the sort keys), the SORT module, and the module which reduces identical or partially identical records in the sorted file to single index entries and adds, when required, information such as frequency counts and/or references to the source text.
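The three-step combination described above (decompose, sort, reduce to index entries) can be sketched in Python as a chain of small functions. This is a conceptual illustration only and does not use TXSTEP or TUSTEP syntax; the function names, the record layout (sort key, word form, line number), and the sample text are assumptions made for this example.

```python
import re
from itertools import groupby


def decompose(lines):
    """Step 1: split each line into word forms with a sort key and a line reference."""
    for line_no, line in enumerate(lines, start=1):
        for word in re.findall(r"\w+", line):
            # Record layout (assumed): (sort key, original word form, line number).
            yield (word.casefold(), word, line_no)


def sort_records(records):
    """Step 2: sort the records by sort key, then by line reference."""
    return sorted(records, key=lambda r: (r[0], r[2]))


def build_index(sorted_records):
    """Step 3: reduce records with identical keys to one entry with count and references."""
    index = []
    for key, group in groupby(sorted_records, key=lambda r: r[0]):
        group = list(group)
        refs = sorted({line_no for _, _, line_no in group})
        index.append((key, len(group), refs))
    return index


# Sample text invented for this sketch.
text = ["The cat saw the dog", "the dog ran"]

# Each step consumes the output of the previous one, like a filter chain.
word_list = build_index(sort_records(decompose(text)))
for entry in word_list:
    print(entry)
```

Because each step only depends on the shared record layout, any one of them could be replaced (for instance by a different sort key) without changing the other two.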

The modules provided by TXSTEP include:

collation of different versions of a text; output of the differences in a synoptic list (for visual inspection) and, for automatic processing, in a suitably structured file;
text correction and enhancement not only by an interactive editor, but also in batch mode, e.g. by means of correction instructions prepared beforehand (by manual transcription, or by program, e.g. the collation module);
decomposing texts into elements (e.g. word forms) according to rules provided by the user;
building logical entities (e.g. bibliographic records) consisting of more than one element or line of text;
sorting such elements or entities according to the sort keys provided by the preparatory modules, accounting also for non-Latin alphabetical rules and other sorting criteria;
preparing indexes by generating entries from the sorted elements;
transforming textual data by selecting records or elements, by replacing strings or text parts, by rearranging, complementing or abbreviating text parts;
integrating external information into a file by means of acronyms;
updating cross-references;
converting textual data from TUSTEP files into file formats used by other systems (e.g. for statistical analysis or for electronic publication) and vice versa.

As the output of any one of these modules may serve as input to any other module, the range of research problems for which this system may be helpful is quite wide.

The solution: 2. XML interface to an established text processing and analysis suite

In fact, TUSTEP, the Tübingen System of Text Processing tools, has been developed over the past 40 years along these lines. It has been and still is successfully used for many humanities projects in the German-speaking part of the world, as can be seen at www.tustep.org.

However, since TUSTEP’s syntax is proprietary, not intuitive, and reputed to be difficult to learn, users tend to resort to other, often less effective, tools or to less specialized programming languages.

TXSTEP responds to this situation by providing a user-friendly XML syntax, allowing beginners and advanced programmers to utilize the whole scope of TUSTEP services in a modern, established scripting environment. The benefits are obvious: support of an open standard, widespread dissemination, programming in any XML editor, syntax highlighting, code completion and intelligible APIs. Moreover, TXSTEP benefits from the fact that there is no need to change the program’s actual core. TUSTEP itself is open source, and TXSTEP will soon be as well.

The TXSTEP prototype

Development of TXSTEP began in 2009, when Tobias Ott, research associate and lecturer at the ‘Stuttgart Media University’ and CEO of pagina GmbH (a service provider for publishing houses), first came up with the idea of building an XML interface to the syntax of TUSTEP commands. This would remove, all at once, most of the barriers that usually prevent people from using TUSTEP:

it would offer an up-to-date established syntax,
it would allow TUSTEP scripts to be drafted in the same XML editor used for writing XSLT or other XML-based scripts,
it would let you enjoy the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying your code,
it would offer – to a certain degree – a self teaching environment by commenting on the scope of every step,
it would avoid many syntactical errors, even compared to the original TUSTEP scripting environment.

In the meantime, this idea resulted in a prototype of TXSTEP which we plan to demonstrate in more detail during the poster session. The prototype already contains the most important features of all the modules of TUSTEP listed above.

Not contained in TXSTEP is TUSTEP’s typesetting module, which has been designed to meet the ambitious layout demands of publications in humanities research, including those needed by critical editions. The user may, however, use it in the original TUSTEP environment to publish in print the results gained with TXSTEP, or he may even include it, in original TUSTEP syntax, in his TXSTEP scripts.

One of the features of TXSTEP is its capability to process almost all forms of textual data, whether XML data or plain text files. TXSTEP is therefore also the right tool when textual data must first be processed in order to produce, for example, TEI data, or when the markup of insufficiently tagged XML data needs to be enhanced.

The proposed demo is based on this prototype and shows the current state of our work in progress. The demonstration of TXSTEP’s functionality will include tasks which cannot easily be performed by existing XML tools.