Tecling: Technologies for Linguistic Analysis

Warning: This program and the documentation have been updated on 1 May, 2018.

All sorts of errors may arise. Please report if you find anything strange.

KWiCo:

This project is part (or a "spin-off") of Jaguar, which is currently under development. KWiCo is a corpus indexing algorithm. It takes a corpus as input and produces a table with an index of the corpus, thus significantly reducing the time needed to retrieve concordances, especially when the corpus is very large. In the case of Wikipedia, for instance, given its size one cannot use tools that scan the whole space of the corpus to find a string of text for efficiency reasons. Popular tools for finding words in a text, such as Unix’s GREP command (Hume, 1988), for instance, would not be of help with this scale of data. An index of the corpus, instead, maps the positions of all lexical units, and makes the retrieval of contexts a very fast process. Many procedures have been developed for corpus indexing in the tradition of information retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Manning et al., 2008). There are also some open-source programs for indexing, such as Apache-Lucene’s SolR ( http://lucene.apache.org/solr/ ). This is just another approach to the problem. Why is KWiCo different? Well, it may not look simple in this page, but it is. And it is also very fast.

KWiCo's algorithm is implemented here as a Perl script. At the moment it is actually a lot of different scripts that have to be run manually one after the other. Of course this is temporary. We are in the process of putting all together in a single system, which is Jaguar ( http://www.tecling.com/jaguar ). This first draft of documentation only describes the use of Kwico by the command line, as it focuses only on the essentials.

This documentation provides a general description of the algorithm and its implementation. You can also jump directly to the download section or the part that explains how to use it. Before you do anything, keep in mind that this is not meant to be used in Windows!

Examples of concordances extracted with Kwico, in html format.

Instructions for download and use

The Download section for KWiCo:

All the thing conveniently wrapped in a single zip file: KWiCode.zip
This is the content of the file:

Script	Purpose
wiklean.pl	Converts WikiDump files into plain text
convertFiles.pl	Converts directories from Latin1 (ISO-8859-1) to UTF-8
errorchecker.pl	It is used to detect errors in POS-taggers
divide.pl	Divides a large corpus in segments
indexCorp.pl	Starts the indexing process
compile.pl	Finishes the indexing process
kwico.pl	With the corpus indexed, retrieves the contexts
index.pl	An perl script that executes all scripts automatically

Understanding the whole process

The following are the main steps of the indexing method:

the corpus is divided in parts
a part is just a directory in the computer’s file system, and has a correlative number code as name
each directory contains n files (n~100), also named with correlative numbers
each file contains m tokens (m~10000, in a tagged file with each token is displayed in one line )

The limits between texts in the corpus are ignored in the index, because the whole corpus is treated as single block of text. This is just an arbitrary decision with no effects for our purposes (users interested in text boundaries may consider changing this).

Apart from this partitioning of the corpus, there is a number of tables that register the positions of all the instances of vocabulary units in the corpus. This includes two types of tables:

form-lemma tables: associate the word forms as they appear in corpus with their lemma forms
lemma-file tables: relate each lemma with its positions in the corpus

These tables are distributed in a folder tree. The index have two folders, one for lemmas (lem folder), and another for forms (for folder). Each folder have a recursive structure with the first three letters of each word indexed from corpus. Each letter represent one folder. So, we can find the words forms starting with the same three letters in the same path. For instance, we extracted a word form table from the folder tree /path/to/corpus/index/for/a/v/i :

avioneta
        NC avioneta     544
avionetas
        NC avioneta     219

This is the information about the wordforms avioneta (‘light aircraft’) and its plural form avionetas. It identifies both word forms with the common noun (‘NC’) lemma, and records the frequency of occurrence of each word form in the corpus.

In the case of the ‘lemma-file’ tables, they are also classified by the first three letters of the words, and their structure is shown in the following table, with a fragment of the index table for the lexical unit avioneta:

Directory	File	Frequency	Token position
1	634	1	578
...	...	...	...
2	1113	2	513, 590
2	1743	1	180
...	...	...	...

According to our index structure, this table indicates the directory, the file, the frequency and the token positions of the lexical unit in each file, as comma separated numbers when there is more than one in the same file (purists may prefer a different table for such column. This would be possible with a simple change: the cells of this column would contain an integer that relates us to a different table that contains only one token position per record and another field for the integer that brought us to this second table; conventional or not, both solutions are equaly arbitrary and for the time being, we leave as is).

So, let's get our hands dirty. Let's apply this KWiCo indexer to Wikipedia in English, as a test. First we will download a more or less recent version, like the following 12Gb file (!!!):
https://dumps.wikimedia.org/enwiki/20160801/enwiki-20160801-pages-articles.xml.bz2

Once uncompressed, this will produce a rather large file. Check that you have enough space on your disk before you continue.

First step: cleaning the corpus

Sometimes, corpora come already tagged and ready to be used. But most of the times, some preprocessing needed. For instance, we may have to clean the corpus of different sorts of tagging, as it is the case with Wikipedia. The cleaning procedure will reduce it to less than a half its size, but the tagger will need three times that amount of space.

There is a script that will help us get rid of all the metadata: wiklean.pl(see the documentation in the same script).

As this is a rather large corpus, it would be better if we do a small test first with a fragment only:

head -10000 enwiki-20160801-pages-meta-current.xml > wikipediafile.xml

Let's first run the script with the sample and check if everything looks fine. Only then should we proceed with the full corpus:

perl wiklean.pl wikipediafile.xml wikiclean.txt

(Notice we don't use the > symbol before wikiclean.txt).

In the case of our test with the English Wikipedia, the uncompressed plain text file resulting from the cleaning process has a size of 58Gb. This is a lot of text. Take a look inside the corpus to see if everything seems all right (change the 10000 value to any number of lines):

head -10000 fullclean.txt | tail

In the event of encoding problems (ie, always) simply apply:

iconv -c -f utf8 -t Latin1 file.txt > fileiso.txt

and then back to utf8:

iconv -c -f Latin1 -t utf8 fileiso.txt > file.txt

Of course it is a suboptimal solution because it will make you loose the corrupted characters.

If everything looks fine after that, then at this point you should either delete or compress the file(s) with the original xml data in order to save disk in space. This includes any file converted in Latin1 left behind.

Second step: POS-tagging the corpus

Obviously, this step is not necessary if the corpus is already POS-tagged, which would the case for instance with the TenTen series. Else, we will proceed with tokenization, lemmatization and Part-of-Speech (POS) tagging. TreeTagger (Schmid, 1994) was used for these tasks because it is a widely known, freely available and language independent tool. Applied to the Spanish Wikipedia, we assume it will produce something in the lines of:

Word form	POS-tag	Lemma
La	ART	el
historia	NC	historia
del	PDEL	del
buzo	NC	buzo
que	CQUE	que
es	VSfin	ser
...	...	...

Other tools can be used for the same purpose, such as Freeling (Atserias et al., 2006), as long as the output is formated in the same way: a tab-separated three column table with the form, the POS-tag and the lemma.

Once the corpus has been cleaned from extraneous codes, we are ready to apply the tagger. Assuming we are using TreeTagger, this can be achieved with the following command (where "~" stands for the place you installed the program):

~/treetagger/cmd/tree-tagger-english wikiclean.txt > wiki.tag

The above command is for English, but if you are working with other language, say, French, then you should use:

~/treetagger/cmd/tree-tagger-french wikiclean.txt > wiki.tag

Third step: indexing the corpus

Eventually, the tagging process will finish (it will take hours for TreeTagger to tag such a corpus and it will take a lot of resources: check the temperature of your machine regularly issuing the "sensors" comand). Once it finishes, we will be ready for the next step, which is to divide the file in smaller files. This is done with the script divide.pl:

perl divide.pl inputfile

The argument inputfile is the name of your file, wiki.tag in the case of our previous example. There is also a second optional argument, the number of lines of each file, to offer some control of the number of files created on disc. The default value is 10000 lines. It will necer split a sentence. The result is in a "inputfile_partes" folder.

And now the next part... the indexing process propper. This is done with the indexCorp.pl script, like this, where inputfile is the name of the same file that was processed by divide.pl:

perl indexCorp.pl inputfile

The result will be a new folder called something like "inputfile_tables". Now is the time for even another process (we are almost there!) done by the compile.pl script.

Important: If you run out of memory, you can use a second parameter to compile forms first, and then, in a second round, compile only the lemmas. Default behaviour is to attempt the processing of both lemma and forms at the same time.

perl compile.pl inputfile (form/lemma)

Again, the argument is the same file you used with divide.pl. Once this process is finished, a new folder "inputfile_index" will be created. This is the new index of your corpus and it will be used by the kwico.pl script later to extract the concordances. Now it is important that you remember to delete the "inputfile_tables" folder created by indexCorp.pl once you finish running compile.pl.

By now you are probably thinking: "Why so many scripts?". Well, there is an "allautomat.pl" script which runs all processes automatically taking the path to the corpus as a single argument, but we thought it would be better to explain the process step by step in case problems arise on the way.

And that's it! You have finally indexed your corpus. Now you are ready to exploit it, using our fine KWiCo tool:

perl kwico.pl --corpus [path] --query gouda (--form/lemma) (--verbose) (--max 100) (--random) (--context sentence) (--html) (--mwe)

This means that we are looking for the word gouda (a type of cheese) in an indexed corpus (one you already indexed). The arguments in parentheses are optional. We indicate that the query is in lemma form, which is the default behaviour; with (--v), that we want the verbose results (not the default); that we want a maximum of 100 results; that the context is a sentence (or a symmetric window of n tokens) and, finally, that we want the lines in html format (also not the default behavior). The query can also be a multi-word expression. If this is the case, param --mwe has to be added.

Oh, it is fast!


*» The universe is not perfect, but it's working on it.* ABOUT RESEARCH SOLUTIONS SOFTWARE CONTACT		Technologies for Linguistic Analysis