WICACHO: A project for semantic tagging
Version: April 24, 2026. This version works on both Linux and Windows.
Announcement: we are about to release a new version that combines the two semantic taggers published so far by Tecling.com: Tatatag (2025), the tagger that uses Wiktionary, and Wicacho (2026), the one you are looking at now, which uses Wikipedia. The new version is called Wikacho and, as soon as it is published, the two predecessors will be deprecated and no longer maintained.
This website offers documentation, code, data and a web demo of the semantic tagger described in the following paper:
Nazar, R.; Renau, I. (2026). Wikipedia used as a semantic tagger: some preliminary results in Spanish. Procesamiento del Lenguaje Natural, n. 76, p. 279-292.
Documentation and source code
Before you use this, bear in mind that it is all very experimental and things may fail from time to time. It is also advisable to take a look at the paper before anything else. We will do our best to keep this documentation clear and up to date. If you find an error or need help, send an email to rogelio dot nazar at gmail dot com
Step 0: Dependencies
The Perl interpreter: Wicacho is written in Perl, so the Perl interpreter must be available on the machine running it. On Linux and other Unix-like environments it comes installed by default. We have tested the code on Windows and it works well, but there Perl has to be installed separately, since it does not come with the operating system. Among other options, Windows users can download and install the Perl interpreter for free from here: https://strawberryperl.com/
The POS-tagger: Wicacho uses UDPipe, which also needs to be installed separately, from here: https://ufal.mff.cuni.cz/udpipe. Wicacho does not use the POS-tagger for the generation of the models, though, because that would be too costly. If you want to use a different POS-tagger, it must either comply with the Universal Dependencies standard, or you will have to adapt Wicacho's code to ensure compatibility.
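Once UDPipe and a Spanish model are installed, you can check that everything works with a command along these lines (the paths and options are only an example; the exact invocation depends on your installation and UDPipe version, so check the UDPipe documentation):
/opt/udpipe/src/udpipe --tokenize --tag /opt/udpipe/languages_models/spanish-ancora-ud-2.5-191206.udpipe input.txt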
Instructions for running the semantic tagger
Step 1: Download the code
If you just want to use the semantic tagger on your local PC, download and unzip the code, data and model files. The content of the data and model files must be placed in a 'data' folder in the same location as the source code or, if you prefer, change the code to read the data from somewhere else.
Step 2: Adjust paths to the location of the POS-tagger on your machine
You may have to adjust the code to make sure the paths to the installation
of the POS-tagger are correct. Search for the $postagger and $langmodel variables, defined like this:
my $postagger = "/opt/udpipe/src/udpipe";
my $langmodel = "/opt/udpipe/languages_models/spanish-ancora-ud-2.5-191206.udpipe";
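On Windows, the same variables must point to the equivalent locations on your machine; for instance, with UDPipe unpacked under C:\udpipe (these paths are purely illustrative):
my $postagger = "C:/udpipe/udpipe.exe";
my $langmodel = "C:/udpipe/spanish-ancora-ud-2.5-191206.udpipe";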
For efficiency, the program avoids applying the POS-tagger twice to the same text. The first time you apply wicacho.pl to a text 'input.txt', a file 'input.txt.tag' is created containing the output of the POS-tagger. The next time you analyze the same text (for instance, after making changes in the script), the program checks, before calling the POS-tagger, whether a file with the same name plus the '.tag' extension exists and, if so, assumes it is the same document and reuses it (so delete the '.tag' file if you change the content of the input file).
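The following is only a minimal sketch of that idea, not Wicacho's actual code; the UDPipe invocation and the $input variable are assumptions for the sake of the example:
# reuse the cached POS-tagged version if it already exists
my $cached = "$input.tag";
unless (-e $cached) {
    system("$postagger --tokenize --tag $langmodel $input > $cached");
}
# ...and from here on read the POS tags from $cached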
Step 3: Run the semantic tagger
Once you have a POS-tagger up and running, you can run the tagger with the following command:
perl wicacho.pl input.txt (Spanish) > output.htm
The first argument, input.txt, must be a UTF-8-encoded Spanish text file. The 'Spanish' argument is optional, since for the moment Spanish is the only language the tagger can operate with. The output of the script is an HTML file like those you see on the demo. For the time being, the system processes one document at a time (be patient: the new version, coming soon, allows for batch processing of folders).
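In the meantime, if you need to tag several files, a simple shell loop will do the job one file at a time (Bash example; file and folder names are illustrative):
for f in corpus/*.txt; do perl wicacho.pl "$f" > "$f.htm"; done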
Happy tagging!
Instructions for the generation of models
Step 1
In order to replicate the generation of the models, download the above-mentioned source code and data files, plus a copy of the Wikipedia index file (eswiki-20241020-abstract.xml, compressed as eswikiabstractszip.zip to 250 MB), an XML document that contains only the title and the first paragraph of each page. It may look something like this:
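(The following fragment is only an illustration based on the standard format of the Wikipedia abstracts dumps; the actual file may differ in some details.)
<doc>
<title>Wikipedia: Gabriela Mistral</title>
<url>https://es.wikipedia.org/wiki/Gabriela_Mistral</url>
<abstract>Gabriela Mistral fue una poeta, diplomática y pedagoga chilena...</abstract>
</doc>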
Step 2
Run the conversion script:
perl xml2csv.pl eswiki-20241020-abstract.xml > wikitable.csv
This script strips all the XML code out of the Wikipedia file and presents
a table with one entry per line.
Step 3
In order to generate the models, you first need to delete any previously generated models, or comment out the lines in the code where these models are defined (where the %files hash is declared), as in the sketch below. This is because the training of the program has two phases, and the absence of models tells the program that it is in the first phase.
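For instance, taking the %files hash shown in step 6 below, a first-phase configuration could look like this (the commented-out entries are the statistical models that do not exist yet):
my %files = (
'pila' => 'masterPil.txt',
'apellido' => 'masterSurname.txt',
'lugar' => 'masterPlace.txt',
'stoplist' => 'stoplist.txt',
'semtypes' => 'semantypes_es.txt',
'triggers' => 'triggers_es29mar2026.txt',
# 'model' => '...', # not generated yet
# 'morfeo' => '...', # not generated yet
# 'strongpairs' => '...', # not generated yet
);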
The only data needed for this first phase of the training is the 'hard-coded' data, i.e., the data files with the '.txt' extension contained in the data.zip file. Place the unzipped data in the same folder as the Perl scripts and run:
perl wicacho.pl -train wikitable.csv > raw.csv
This script assigns every Wikipedia entry to one of the categories of the
CPA Ontology. It is, thus, an information extraction and word sense disambiguation process.
It may well take up to 30 minutes for the training to complete, depending on your hardware.
Here is a random sample of 1000 results. The sample shows 97% precision and 64% recall. Recall is low at this point, but this is not a problem, because the leftovers will be reanalyzed in the second pass of the training, after the statistical models are generated.
Step 4
Generate the first statistical model:
perl model.pl raw.csv > model-31march2026.csv
This script takes only a few minutes to execute.
Step 5
Generate the second statistical model:
perl morfeo.pl raw.csv > strongpairs.csv
This one takes only seconds to execute.
The script Morfeo is part of another of our projects,
which we hope will be documented soon with its own website.
In the meantime, we use this version of the script as a complement to Wicacho's statistical model.
This version of Morfeo actually produces two results. One is the output that gets written (strongpairs.csv), a list of strong hyponym-hypernym pairs extracted from raw.csv; the other is a model of suffixes, silently produced by the script. You will find this model in the same folder, and its name follows this pattern: morfodel-1773603880.csv, where the number is the result of the time() function (the number of seconds since 'the Epoch').
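In Perl terms, the naming convention amounts to something like the following (a sketch of the pattern only, not necessarily the literal code of the script):
my $morfodel = sprintf("morfodel-%d.csv", time()); # e.g. morfodel-1773603880.csv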
Step 6
In this step we apply the second phase of the training procedure, this time using the statistical models just generated. The purpose is to increase the proportion of classified Wikipedia entries by retrying those that were left with the UNKNOWN tag in step 3.
It is necessary, at this point, to manually edit the code of the wicacho.pl
script. Find the lines where the %files hash table is declared and update the file names with those you just generated.
my %files = (
'pila' => 'masterPil.txt',
'apellido' => 'masterSurname.txt',
'lugar' => 'masterPlace.txt',
'stoplist' => 'stoplist.txt',
'semtypes' => 'semantypes_es.txt',
'triggers' => 'triggers_es29mar2026.txt',
'model' => 'model-31march2026.csv',
'morfeo' => 'morfodel-1773666228.csv',
'strongpairs' => 'strongpairs.csv',
);
Specifically, you have to update the filenames of the keys
'model', 'morfeo' and 'strongpairs', i.e., the names of the
statistical models just created.
Notice that there is a difference between files with the .txt extension and those with .csv: the former are 'hard-coded' into the system, while the latter are generated during training. Once the hash is updated, run the second training pass:
perl wicacho.pl -train raw.csv > database.csv
The execution is similar to what was done in step 3, the difference being that now the input file is raw.csv (the result of step 3) and the output is a new file, database.csv. This new result is similar to raw.csv, but with a new column added after the category initially assigned. In many cases the new category is the same as the one originally assigned, but very frequently the category was initially UNKNOWN and, in those cases, the statistical model is able to venture a category and thus improve recall. This is the file that will be used in the final step.
Step 7 (and final)
You have now created all the necessary materials, so you are ready to apply the semantic tagger to any text in Spanish.
The only task left is to open the wicacho.pl script once more and add the line defining the database (the file that resulted from step 6):
my %files = (
'pila' => 'masterPil.txt',
'apellido' => 'masterSurname.txt',
'lugar' => 'masterPlace.txt',
'stoplist' => 'stoplist.txt',
'semtypes' => 'semantypes_es.txt',
'triggers' => 'triggers_es29mar2026.txt',
'model' => 'model-31march2026.csv',
'morfeo' => 'morfodel-1773666228.csv',
'strongpairs' => 'strongpairs.csv',
'database' => 'database.csv',
);
This only needs to be done once, the first time you use it.
Afterwards, you can tag all the text you want using the same configuration.
For instructions on how to run the semantic tagger go to the beginning
of this document.