Lay, Marie Hélène, University of Poitiers, France, marie-helene.lay@univ-poitiers.fr

Introduction

The efficiency of search engines rests on the principle that the information sought can be retrieved by ‘looking for words’ that convey it, and that these words can be identified by the strings of characters of which they are composed. This view takes for granted that words are always spelt in the same way and that they comply with orthographic rules.

This is not the case for texts produced during the French Renaissance. Making older texts available in order to archive and disseminate cultural heritage therefore raises a particular problem. In texts published in French before the 18th century, spellings are not consistent, as standard spelling had not yet been ‘invented’. One and the same word may therefore be spelt in a variety of ways. This is not only a time-related variation, as would be expected from the evolution of the language between the 15th and the 17th century. Within a single book, many different spellings may be found for the same word: côté may appear as coté, cotté, cote, costé, or couste; the verb savoir may be spelt either scavoir or sçavoir; ‘je sais’ may be spelt ‘ie sçay’; and its past participle ‘su’ may appear as ‘sceu’.

It is therefore necessary to adapt search engines based on word-form identification if they are to provide the service expected. Several strategies can be envisaged, and the purpose of this paper is to focus on those that draw on linguistic expertise, built either into the documents themselves (by annotation) or into the search engine (by query extension). The solutions considered were produced in the context of the Virtual Humanistic Library Project and its evolution (www.bvh.univ-tours.fr). This part of the project, called VARIALOG, is financed by a Google Digital Humanities Research Award.

Methodological alternatives

The BVH/VHL context, considered here, is that of a highly expert environment of a relatively moderate size aiming at a complete editorial treatment and the dissemination of annotated and validated resources. Within this context, two solutions have been designed:

  1. Text annotation with linguistic information obtained through lemmatization. The forms retrieved, whatever their spelling, are lemmatized under a canonical form, which then becomes the pivot of further queries: for example, the lemma nuit groups together forms such as nuits (which is ‘regular French’) and nuyctz (an old written form). A first solution, HUMANISTICA (Lay 2000), was based on the adaptation of a probabilistic tagger/lemmatizer. The results achieved were satisfactory, but the adaptation of the analyzer had to be started all over again to take into account the specific features of this highly heterogeneous corpus. Another solution, ANALOG (Lay 2010a, 2010b), was therefore developed. It provides a computer-assisted annotation environment and is currently being used in the BVH project. But enriching texts through linguistic annotation is a slow and costly process. Although this solution remains very useful for building a reference environment, it is nonetheless desirable to provide efficient query tools for texts that are already available but not yet annotated.
  2. Query extension, which does not require the lemmatization process. The aim is not to produce exactly the right forms (as in EEBO [VosPos], Impact, ToTrTaLe, LGeRM, or the projects for Old Czech and Old German). We will do so in order to support an editorial process (DISSIMILOG), but here we simply want to spot all the written forms that could correspond to a query, regardless of spelling variation.
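To make the first solution concrete, the following minimal Python sketch shows how a lemma can serve as the pivot of a query once forms have been annotated. The small annotation table and the function names are hypothetical illustrations; they are not taken from HUMANISTICA or ANALOG.

```python
from collections import defaultdict

# Hypothetical annotations: (attested form, lemma) pairs, as they might be
# validated by an annotator. The examples are those quoted in this paper.
ANNOTATIONS = [
    ("nuit", "nuit"),
    ("nuits", "nuit"),
    ("nuyctz", "nuit"),
    ("sçavoir", "savoir"),
    ("scavoir", "savoir"),
    ("sceu", "savoir"),
]

def build_lemma_index(annotations):
    """Group every attested form under its lemma (the pivot of later queries)."""
    index = defaultdict(set)
    for form, lemma in annotations:
        index[lemma].add(form)
    return index

def forms_for_query(query, annotations):
    """Return all attested forms that share a lemma with the queried form or lemma."""
    index = build_lemma_index(annotations)
    form_to_lemma = {form: lemma for form, lemma in annotations}
    # If the query is an attested form, go through its lemma; otherwise
    # treat the query itself as a lemma.
    lemma = form_to_lemma.get(query, query)
    return sorted(index.get(lemma, set()))

print(forms_for_query("nuyctz", ANNOTATIONS))
```

A lookup of this kind only works for forms that have already been annotated, which is precisely the limitation that motivates the second solution.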

VariaLog: Principles

To solve the problem of spelling variation, one has to go back to observational evidence. Two directions may be taken in this respect: either observe the texts or observe the variants attested for a given form.

  1. Concerning text observation, the aim is to evaluate, for a given text, the number of forms that do not correspond to the norm. Moreover, one must take into consideration the extent to which the texts can be compared. We intend to illustrate this with two short extracts from Montaigne and Rabelais, two authors of paramount significance.
  2. Concerning the observation of the variants attested for one word, the idea is to formulate the rules which govern the production of non-standard forms. We will then build rules to extend queries, turning the search for a word into the search for all the forms assumed by this word, and match the results against the forms found in the texts.

    Comparing the forms searched for with their spellings in the texts yields a typology of the situations that occur. The form searched for may be identical to the attested one (raisons/raisons), or the link may be very weak (impératrice/empériere). Between these two extremes, a whole gradation of situations can be organised on a linguistic basis: relations between sounds and their different spellings in modern French (c=ss; n=nn; r=rr; s=z; t=th; ai=ei,ai,ey,ay,oi,oy; [uv]=u,v; u=eu), inflectional history (serais/seray/serois) and morphological history (hôpital/hospitalier; forêt/forestier; advis/avis). Owing to the structural instability of these linguistic data, equivalences between character strings are difficult to track statistically and no model-based approach can be developed. But linguistic knowledge helps to recognize regular replacement patterns, which can be turned into rules.

  3. The next point to be taken into account is the relevance of the rules: they must find all the forms concerned (little silence, i.e. good recall) while avoiding too much noise (good precision).
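Recall and precision are understood here in their usual information-retrieval sense. As a minimal sketch, assuming that the generated forms and the relevant (attested) forms are available as sets, they can be computed as follows. The two example sets are invented for illustration and are not results from the project.

```python
def recall_precision(generated, relevant):
    """Return (recall, precision) of a set of generated forms.

    recall    = share of relevant forms that were generated (low silence)
    precision = share of generated forms that are relevant (low noise)
    """
    generated, relevant = set(generated), set(relevant)
    hits = generated & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(generated) if generated else 1.0
    return recall, precision

# Hypothetical data: four attested spellings of "côté" and five generated forms,
# three of which are attested.
relevant = {"coté", "cotté", "costé", "couste"}
generated = {"coté", "cotté", "costé", "cotte", "coste"}
print(recall_precision(generated, relevant))
```

In this invented case one attested form is missed (silence) and two generated forms are noise.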

To test the first results, a small corpus of 7 words (vices/une/face/fesse/lu/vu/souverain) was transformed by the substitution rules mentioned above. The results do contain all the relevant forms, but the 7 words were extended to 118445 forms. Because the process is combinatorial, the number of generated forms obviously grows with the length of the word.
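The combinatorial growth can be illustrated with a minimal sketch. It assumes a simplified, hypothetical rule table restricted to single-letter rewrites drawn from the equivalences listed above, applied without any context. It is not the VariaLog rule set and does not reproduce the figure reported here.

```python
from itertools import product

# Hypothetical, context-free rewrites (single letters only), loosely based on
# the equivalences c=ss, n=nn, r=rr, s=z, t=th, [uv]=u,v and u=eu.
SIMPLE_RULES = {
    "c": ["ss"],
    "n": ["nn"],
    "r": ["rr"],
    "s": ["z"],
    "t": ["th"],
    "u": ["v", "eu"],
    "v": ["u"],
}

def expand_unconstrained(word, rules=SIMPLE_RULES):
    """Apply every rule at every position, with no contextual restriction."""
    # Each letter can stay as it is or be replaced by any of its alternatives.
    choices = [[ch] + rules.get(ch, []) for ch in word]
    # The variants are the Cartesian product of the per-letter choices.
    return {"".join(combo) for combo in product(*choices)}

for word in ["vu", "vices", "souverain"]:
    print(word, len(expand_unconstrained(word)))
```

Since the number of variants is the product of the number of choices at each position, every additional rewritable letter multiplies the size of the output, and most of the forms produced are linguistically implausible.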

The solution chosen to fix this problem is to describe, for each rule, the context in which the substitution is allowed. This strongly constrains the application of the rules and limits their productivity. This contextualisation is based on a good knowledge of the linguistic processes involved. In the example given below, 8 simple rules are transformed into 9 more complex rules. Most of the time, one simple rule gives rise to 5 to 15 contextualised rules.

(?<=[bdflmnprstv])u=eu
ain=ein,ain,eyn,ayn
(?<=[aeiouy])c(?=[eiy])=ss
^s(?=[eiy]) = c
(?<=[aeiouy])ss(?=[eiy])=c
(?!^.+)v = u
(?!^.+)n(?<!.+$)=nn
^u = v
(?!^.+)r(?<!.+$)=rr
s$ = z

The results achieved are satisfactory: the rules produce all the linguistically permissible variants, and the number of variants is much lower. The 7 words generate 37 forms.
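The effect of contextualisation can be sketched in Python with the `re` module. The sketch below is an assumption about how such rules might be applied, not a description of the VariaLog implementation: each rule is a regular expression with its replacements, every match site may or may not be rewritten, and overlapping sites are never combined. Only some of the rules shown above are used. Those whose context is written with `(?!^.+)` or `(?<!.+$)` are left out, because their intended reading is not spelled out here and Python's `re` only accepts fixed-width lookbehind.

```python
import re
from itertools import combinations

# Subset of the contextualised rules quoted above: (pattern, replacements).
# The identity replacement "ain" is dropped, since the unchanged word is
# always part of the output.
RULES = [
    (r"(?<=[bdflmnprstv])u", ["eu"]),
    (r"ain", ["ein", "eyn", "ayn"]),
    (r"(?<=[aeiouy])c(?=[eiy])", ["ss"]),
    (r"^s(?=[eiy])", ["c"]),
    (r"(?<=[aeiouy])ss(?=[eiy])", ["c"]),
    (r"^u", ["v"]),
    (r"s$", ["z"]),
]

def find_sites(word, rules=RULES):
    """List every place where a rule may apply: (start, end, replacement, pattern)."""
    sites = []
    for pattern, replacements in rules:
        for match in re.finditer(pattern, word):
            for replacement in replacements:
                sites.append((match.start(), match.end(), replacement, pattern))
    return sorted(sites)

def expand(word, rules=RULES):
    """Map each variant of `word` to the list of rule patterns used to derive it."""
    sites = find_sites(word, rules)
    variants = {}
    for size in range(len(sites) + 1):
        for chosen in combinations(sites, size):
            # Skip combinations in which two rewrites touch the same characters.
            if any(a[1] > b[0] for a, b in zip(chosen, chosen[1:])):
                continue
            parts, pos = [], 0
            for start, end, replacement, _ in chosen:
                parts.append(word[pos:start])
                parts.append(replacement)
                pos = end
            parts.append(word[pos:])
            variants.setdefault("".join(parts), [site[3] for site in chosen])
    return variants

for word in ["vices", "vu", "souverain"]:
    print(word, sorted(expand(word)))
```

Because only part of the rule set is used, the counts obtained with this sketch differ from the figure reported above. Keeping the list of patterns alongside each variant makes it possible to show later which rules produced a given form.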

VariaLog: Tool description

The tool itself is designed to be user-friendly, especially for tuning the rules and evaluating their consequences (efficiency and non-regression tests). It is a freely available Java program which first transforms a list of words into an extended list of forms, using a given rule set. It then locates in a text the old-spelling forms that correspond to the requested form. The output of this last step is an HTML file in which each identified variant is graphically highlighted (or set in bold). Moreover, each form is linked to a bubble showing the rules used to derive the variant. A table summarising the rules used for the text is also available, which makes human validation straightforward. The tool is being proposed for integration into an XTF platform.
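The highlighting step can be approximated by the following Python sketch. It is an illustration under stated assumptions and not the Java tool itself: the `highlight` function, the use of a `<b>` element and the use of the `title` attribute to stand in for the bubble are all hypothetical choices, and the variant table is supplied by hand.

```python
import html
import re

def highlight(text, variants):
    """Return an HTML fragment in which known variants are set in bold.

    `variants` maps a lower-cased form to a description of the rules that
    derived it; the description is exposed through the `title` attribute.
    """
    pieces = []
    # Splitting on a capturing group keeps both the words and the text between them.
    for piece in re.split(r"(\w+)", text):
        rule = variants.get(piece.lower())
        if rule is None:
            pieces.append(html.escape(piece))
        else:
            pieces.append('<b title="%s">%s</b>' % (html.escape(rule), html.escape(piece)))
    return "".join(pieces)

# Hypothetical variant table for the query "vices".
variants = {"vices": "(identical)", "vissez": "c=ss; s$=z"}
print(highlight("Les vissez & les vices", variants))
```

Escaping both the text and the rule description keeps the output well-formed even when the source text contains characters such as `&` or `<`.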

Conclusion

Using a rule-based approach, VariaLog is designed to identify all the written forms that are likely to correspond to a query, since it is insensitive to variations in spelling. The recall rate (tested on 5000 forms) provides evidence that all the linguistically permissible variants in French are produced by the rules, so long as the problem is simply one of spelling (nuit/nuyct) and not a morphological one (e.g. impératrice/empérière). As far as precision is concerned, the rules may sometimes generate more ambiguity than anticipated. If ‘o’ becomes ‘ou’, then école becomes écoule, which is not an acceptable variation, but volant becomes voulant, which is an acceptable variation; as a result, volant will correspond to vouloir (‘want’), thus increasing the ambiguity of this form, which already means ‘flying, robbing, wheel, flounce, shuttlecock’. The generated ambiguity is no different from standard ambiguities, even in an orthographic environment.

Already used to search several French dialects, VariaLog can be used to process any form of spelling variation, in any language. One only needs to supply one’s own specific spelling rules or dictionary. Our aim is to help locate spelling variation efficiently. User feedback is most welcome.

References

Baron, A., and P. Rayson (2009). Automatic standardization of texts containing spelling variation, how much training data do you need? In M. Mahlberg, V. González-Díaz, and C. Smith (eds.), Proceedings of the Corpus Linguistics Conference, CL2009. University of Liverpool, UK, pp. 20-23.

Burnard, L. (1995). Text Encoding for Information Interchange – An Introduction to the Text Encoding Initiative. Proceedings of the Second Language Engineering Conference, 1995.

Craig, H., and R. Whipp (2010). Old spellings, new methods: automated procedures for indeterminate linguistic data. Literary and Linguistic Computing 25(1): 37-52.

Demonet, M.-L., and M. H. Lay (2011). Digitizing European Renaissance prints: a 3-year experiment on image-and-text retrieval. International Workshop on Digital Preservation of Heritage (IWDPH07). Kolkata, 2007.

Erjavec, T. (2011). Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2011, Portland, pp. 33-38.

Hana, J., A. Feldman, and K. Aharodnik (2011). A Low-budget Tagger for Old Czech. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2011, Portland, pp. 10-18.

Lay-Antoni, M.-H., et al. (2000). Adaptation d’un lemmatiseur au corpus rabelaisien: naissance d’Humanistica. JADT 2000, Lausanne.

Lay, M.-H., et al. (2010). Pour une exploration humaniste des textes: AnaLog. JADT 2010, Rome.

Sánchez Marco, C., G. Boleda, and L. Padró (2011). Extending the tool, or how to annotate historical language varieties. ACL-HLT Workshop, 2011, Portland, pp. 1-9.

Scheible, S., R. J. Whitt, M. Durrell, and B. Bennett (2011). Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text. ACL-HLT Workshop, 2011, Portland, pp. 10-18.

Souvay, G., and J. M. Pierrel (2009). LGeRM: Lemmatisation des mots en moyen français. TAL 50(2): 149-172.

Thaisen, J. (2011). Probabilistic Analysis of Middle English Orthography: the Auchinleck Manuscript. Digital Humanities Conference Abstracts, 2011, Stanford.

www.bvh.univ-tours.fr/

http://www.c-tei.org

http://www.bvh.univ-tours.fr/XML-TEI/index.asp

http://www.bvh.univ-tours.fr/Epistemon/philologic.asp

http://xtf.cdlib.org/documentation/programming

http://www.monkproject.org/

http://eebo.chadwyck.com/help/whatis_wh.htm

http://impactocr.wordpress.com/

http://panini.northwestern.edu/mmueller/vospos.pdf