WICACHO: A project for semantic tagging

Version: April 21, 2026. We have updated the documentation, and users can now download the models generated by the system, bypassing the steps needed to generate them. Instructions to do so are kept here anyway for documentation and replication purposes.

Announcement: we are about to release a new version that combines the two semantic taggers published so far by Tecling.com: Tatatag (2025), the tagger that uses Wiktionary, and Wicacho (2026), the one you are looking at now, which uses Wikipedia. The new version is called Wikacho and will be published very soon. As soon as that happens, the two predecessors will be classified as deprecated and no longer maintained.



©TECLING 2026.
This website offers documentation, code, data and webdemo of the semantic tagger described in the following paper:
Nazar, R.; Renau, I. (2026). Wikipedia used as a semantic tagger: some preliminary results in Spanish. Procesamiento del Lenguaje Natural, n. 76, p. 279-292.

We will do our best to keep this documentation up to date. If you need help, send an email to rogelio dot nazar at gmail dot com

Documentation and source code

Before you use this, bear in mind the instructions below.

Instructions for running the semantic tagger


Step 1: Download the code

If you just want to use the semantic tagger on your local PC, download and unzip the code, data and model files:

The content of the data and model files must be placed in a 'data' folder in the same location as the source code (or otherwise modify the code to read the data from elsewhere).
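As a sketch, the expected layout might look like this; only the 'data' folder name comes from the text above, the other names are illustrative stand-ins:

```shell
# Sketch of the expected layout; only the 'data' folder name is prescribed,
# the rest of the names are illustrative:
mkdir -p wicacho/data
touch wicacho/wicacho.pl         # stand-in for the unzipped source code
touch wicacho/data/model.csv     # stand-in for the unzipped data/model files
ls wicacho wicacho/data
```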


Step 2: Install a POS-tagger on your machine


The script calls a POS-tagger that must already be installed on the system where it runs. Here we use UDPipe. You may have to adjust the code to make sure the paths to the installation are correct. Search for the $udpipe and $langmodel variables, defined like this:
my $udpipe = "/opt/udpipe/src/udpipe";
my $langmodel = "spanish-ancora-ud-2.5-191206.udpipe";

You can use other taggers that comply with the Universal Dependencies standard, or otherwise modify this code to interpret another format.
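As a quick sanity check that the paths are right, you can try the tagger from the shell. This is only a sketch, assuming a standard UDPipe 1 binary at the path shown above; adjust both paths to your installation:

```shell
# Paths mirror the $udpipe and $langmodel variables above; adjust as needed.
UDPIPE=/opt/udpipe/src/udpipe
MODEL=spanish-ancora-ud-2.5-191206.udpipe
if [ -x "$UDPIPE" ] && [ -f "$MODEL" ]; then
    # Tokenize and tag a one-line sample; the output is CoNLL-U.
    echo "Esto es una prueba." | "$UDPIPE" --tokenize --tag "$MODEL"
else
    echo "UDPipe or the language model not found; check the paths."
fi
```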

For efficiency, the program avoids applying the POS-tagger twice to the same text. The first time you apply wicacho.pl to a text 'input.txt', a file 'input.txt.tag' is created, containing the output of the POS-tagger. The next time you analyze the same text (for instance, after making changes to the script), the program checks, before calling the POS-tagger, whether a file with the same name plus the '.tag' extension exists; if so, it assumes it is the same document and reuses it.
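The caching rule can be sketched as follows (a shell illustration of the logic, not the actual Perl code; remember to delete the .tag file if you edit input.txt itself, or the stale tagging will be reused):

```shell
INPUT=input.txt
printf 'Hola mundo.\n' > "$INPUT"
if [ -f "$INPUT.tag" ]; then
    echo "reusing cached POS output: $INPUT.tag"
else
    echo "calling the POS-tagger and caching its output"
    touch "$INPUT.tag"           # stand-in for the real tagger output
fi
```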

Step 3: Run the semantic tagger

Once you have a POS-tagger up and running, you can run the tagger with the following command:

perl wicacho.pl input.txt Spanish > output.htm

The first argument, input.txt, must be a UTF-8-encoded Spanish text file. The 'Spanish' argument is optional, as for the moment Spanish is the only language the system can operate with. The output of the script is an HTML file like those you see in the demo.
For the time being, the system processes one document at a time (be patient: the new version, coming soon, allows for batch processing of folders).
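Until batch processing arrives, a plain shell loop can cover a folder. A minimal sketch, where the corpus/ and out/ names are illustrative and the echo stands in for the real call:

```shell
mkdir -p corpus out
printf 'Texto de prueba.\n' > corpus/ejemplo.txt
for f in corpus/*.txt; do
    out="out/$(basename "$f" .txt).htm"
    # Replace 'echo' with the real call once wicacho.pl is in place:
    echo "perl wicacho.pl $f Spanish > $out"
done
```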

Happy tagging!





Instructions for the generation of models


Step 1

In order to replicate the generation of the models, download the above-mentioned source code and data files, plus a copy of the Wikipedia index file (eswiki-20241020-abstract.xml, compressed as eswikiabstractszip.zip, about 250 MB), an XML document that contains only the title and the first paragraph of each page. It may look something like this:
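Entries in the Wikipedia abstracts dump typically look roughly like the following; this is an illustrative reconstruction, not copied from the file:

```xml
<doc>
  <title>Wikipedia: Madrid</title>
  <url>https://es.wikipedia.org/wiki/Madrid</url>
  <abstract>Madrid es la capital de España ...</abstract>
</doc>
```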


Step 2


Run the conversion script:
perl xml2csv.pl eswiki-20241020-abstract.xml > wikitable.csv

This script strips all the XML markup out of the Wikipedia file and produces a table with one entry per line.
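To illustrate the idea, here is a sketch only: the real xml2csv.pl may use a different column layout and handles many more details than this toy one-entry example:

```shell
# Build a one-entry sample and flatten it to a single CSV-like line.
cat > sample.xml <<'XML'
<doc><title>Wikipedia: Madrid</title><abstract>Madrid es la capital de España.</abstract></doc>
XML
sed -n 's/.*<title>Wikipedia: \([^<]*\)<\/title>.*<abstract>\([^<]*\)<\/abstract>.*/\1;\2/p' sample.xml
# prints: Madrid;Madrid es la capital de España.
```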

Step 3


Place the script for information extraction in the same folder as your unzipped data.zip and run:
perl wicacho.pl -train wikitable.csv > raw.csv 

This script assigns every Wikipedia entry to one of the categories of the CPA Ontology. It is, thus, an information extraction and word sense disambiguation process.
It may well take up to 30 minutes for the training to complete, depending on your hardware.
Here is a random sample of 1000 results. The sample shows 97% precision and 64% recall. Recall is low at this point, but this is not a problem: the leftovers will be reanalyzed in a second pass, after the statistical models are generated.

Step 4

Generate the first statistical model:
perl model.pl raw.csv > model-31march2026.csv 

This script takes only a few minutes to execute.

Step 5

Generate the second statistical model:
perl morfeo.pl raw.csv > strongpairs.csv 

This one takes only seconds to execute.
The script Morfeo is part of another of our projects, which we hope to document soon with its own website. In the meantime, we use this version of the script as a complement to Wicacho's statistical model.
This version of Morfeo actually produces two results. One is the output that gets written (strongpairs.csv), a list of strong hyponym-hypernym pairs extracted from raw.csv; the other is a model of suffixes, silently produced by the script. You will find this model in the same folder, with a name following this pattern: morfodel-1773603880.csv, where the number is the result of the time() function (the number of seconds since 'the Epoch').
Notice that this script uses the iconv command. If you want to run it on Windows, you will have to find a similar alternative. As already stated, testing this code on Windows machines is still pending work, but that too will be done eventually.
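Since the timestamp makes the morfodel filename unpredictable, here is a sketch for locating the most recent one and for checking that iconv is available; the touch line creates stand-in files purely for illustration:

```shell
touch morfodel-1773603880.csv morfodel-1773666228.csv   # stand-ins for real models
# Highest timestamp = most recently generated model:
ls morfodel-*.csv | sort -t- -k2 -n | tail -1
# Morfeo relies on iconv; make sure it is available:
command -v iconv >/dev/null && echo "iconv found" || echo "iconv missing"
```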

Step 6

In this step we apply the training procedure again, this time using the statistical models just generated. The goal is to increase the proportion of classified Wikipedia entries by retrying those that were left with the UNKNOWN tag in step 3.
It is necessary, at this point, to manually edit the code of the wicacho.pl script. Find the lines where the %files hash table is declared and update the filenames:
my %files = ( 
'pila' => 'masterPil.txt',
'apellido' => 'masterSurname.txt',
'lugar' => 'masterPlace.txt',
'stoplist' => 'stoplist.txt',
'semtypes' => 'semantypes_es.txt',
'triggers' => 'triggers_es29mar2026.txt',
'model' => 'model-31march2026.csv',  
'morfeo' => 'morfodel-1773666228.csv',
'strongpairs' => 'strongpairs.csv',
);

Specifically, you have to update the filenames of the keys 'model', 'morfeo' and 'strongpairs', i.e., the names of the statistical models just created.
Notice that there is a difference between files with the .txt extension and those with .csv: the former are "hard-coded" into the system, while the latter are generated during training.
perl wicacho.pl -train raw.csv > database.csv

The execution is similar to what was done in step 3, except that now the input file is raw.csv (the result of step 3) and the output is a new file, database.csv.
This new result is similar to raw.csv, but with a new column after the category initially assigned. Often the new category is the same as the one originally assigned, but it also happens very frequently that the category was initially UNKNOWN, and in those cases the statistical model is able to venture a category and improve recall. This is the file that will be used in the final step.

Step 7

Install a POS-tagger on your machine.
As POS-tagger, for now this system uses UDPipe, but there are other alternatives (see Project Porcus for some of them).
If you want to use a POS-tagger other than UDPipe, however, you will have to adapt the code of wicacho.pl to ensure compatibility.

Step 8 (and final)


You have now created all the necessary materials, so you are ready to apply the semantic tagger to any text in Spanish.
The only task left is to open the wicacho.pl script again and add the line defining the database (the file that resulted from step 6):
my %files = ( 
'pila' => 'masterPil.txt',
'apellido' => 'masterSurname.txt',
'lugar' => 'masterPlace.txt',
'stoplist' => 'stoplist.txt',
'semtypes' => 'semantypes_es.txt',
'triggers' => 'triggers_es29mar2026.txt',
'model' => 'model-31march2026.csv',  
'morfeo' => 'morfodel-1773666228.csv',
'strongpairs' => 'strongpairs.csv',
'database' => 'database.csv',
);

This only needs to be done once, the first time you use it. Afterwards, you can tag all the text you want using the same configuration.
For instructions on how to run the semantic tagger go to the beginning of this document.