WICACHO: A project for semantic tagging
Version: April 21, 2026. We have updated the documentation: users can now
download the models generated by the system and thereby skip the steps needed
to generate them.
Instructions to do so are kept here anyway, for documentation and replication purposes.
Announcement: we are about to release a new version that combines
the two semantic taggers published so far by Tecling.com: Tatatag (2025), the tagger that uses Wiktionary, and Wicacho (2026), the very one you are looking at now, which uses Wikipedia. The new version is called Wikacho and will be published very soon. As soon as that happens, the two predecessors will be classified as deprecated and no longer maintained.
©TECLING 2026.
This website offers documentation, code, data and webdemo of the semantic tagger described in the following paper:
Nazar, R.; Renau, I. (2026). Wikipedia used as a semantic tagger: some preliminary results in Spanish. Procesamiento del Lenguaje Natural, n. 76, p. 279-292.
We will do our best to keep this documentation up to date. If you need help, send an email to rogelio dot nazar at gmail dot com
Documentation and source code
Before you use this, bear in mind that:
- This is work in progress and things are changing all the time.
- It is best to take a look at the paper before anything else.
- This code has been developed and tested in Linux. Testing in other operating systems is pending work.
- Wicacho uses a POS-tagger (UDPipe), so you will have to install it. It does not use the POS-tagger for the generation of the models, though, because it would be too costly.
- You will also need the iconv command, available on Linux platforms (this is temporary, though, as it will not be needed in the future version).
Instructions for running the semantic tagger
Step 1: Download the code
If you just want to use the semantic tagger on your local PC,
download and unzip the code, data and model files:
The content of the data and model files must be placed in a 'data'
folder in the same place as the source code (or otherwise change this
code to read the data somewhere else).
Step 2: Install a POS-tagger on your machine
The script will call a POS-tagger that is supposed to be already installed
on the system where it is running. Here we use
UDPipe. You may have to adjust the code to make sure the paths to the installation
are correct. Search for the $udpipe and $langmodel variables, defined like this:
my $udpipe = "/opt/udpipe/src/udpipe";
my $langmodel = "spanish-ancora-ud-2.5-191206.udpipe";
You can use other taggers that comply with the
Universal Dependencies
standard, or otherwise modify this code to interpret another format.
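Taggers that follow the Universal Dependencies conventions emit CoNLL-U output. The following Python sketch shows how that format can be read; it is for illustration only and is not part of wicacho.pl.

```python
# Minimal reader for CoNLL-U, the Universal Dependencies format that UDPipe
# emits: ten tab-separated columns per token, '#' comment lines, blank lines
# between sentences. A Python sketch for illustration, not part of wicacho.pl.
def parse_conllu(text):
    """Return (form, lemma, upos) triples from a CoNLL-U string."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():   # skip ranges like '1-2'
            continue
        tokens.append((cols[1], cols[2], cols[3]))     # FORM, LEMMA, UPOS
    return tokens

sample = ("# text = Las casas\n"
          "1\tLas\tel\tDET\t_\t_\t2\tdet\t_\t_\n"
          "2\tcasas\tcasa\tNOUN\t_\t_\t0\troot\t_\t_\n")
print(parse_conllu(sample))  # [('Las', 'el', 'DET'), ('casas', 'casa', 'NOUN')]
```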
For efficiency, the program avoids applying the POS-tagger twice to the same text.
The first time you apply wicacho.pl to a text 'input.txt', a file 'input.txt.tag'
is created, containing the output of the POS-tagger. The next time you analyze the same
text (for instance, after making changes in the script), the program checks, before calling the
POS-tagger, whether there is a file with the same name but with the '.tag' extension added; if so, it assumes it is the same document and reuses it.
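The caching behaviour just described can be sketched as follows (in Python, not the actual Perl code; the function and its arguments are illustrative):

```python
# Sketch of the caching idea: reuse input.txt.tag when it exists, otherwise
# call the tagger once and store its output next to the input file.
import os
import subprocess

def pos_tag_cached(path, tagger_cmd):
    """Return the POS-tagged version of the file at `path`, caching it."""
    cache = path + ".tag"
    if os.path.exists(cache):        # same name + '.tag' => assume same document
        with open(cache, encoding="utf8") as fh:
            return fh.read()
    tagged = subprocess.run(tagger_cmd + [path], capture_output=True,
                            text=True, check=True).stdout
    with open(cache, "w", encoding="utf8") as fh:
        fh.write(tagged)
    return tagged
```

Note that the cache is keyed on the filename only, so if you change the contents of input.txt you must delete input.txt.tag to force re-tagging.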
Step 3: run the semantic tagger
Once you have a POS-tagger up and running, you can run the tagger with the following command:
perl wicacho.pl input.txt (Spanish) > output.htm
The first argument, input.txt, must be a UTF8-encoded Spanish text file.
The 'Spanish' argument is optional, as for the moment Spanish is the only language
the system can operate with. The output of the script is an HTML file
like those you see in the demo.
For the time being, the system processes one document at a time
(be patient: the new version, coming soon, allows for batch processing of folders).
Happy tagging!
Instructions for the generation of models
Step 1
In order to replicate the generation of the models, download the above-mentioned
source code and
data files, plus a copy of the Wikipedia index file (eswiki-20241020-abstract.xml, compressed as eswikiabstractszip.zip to 250 MB), an XML document that contains only the title and the first paragraph of each page.
Step 2
Run the conversion script:
perl xml2csv.pl eswiki-20241020-abstract.xml > wikitable.csv
This script strips all the XML code out of the Wikipedia file and presents
a table with one entry per line.
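The same transformation can be sketched in Python. The element names (doc, title, abstract) below are assumptions about the dump's layout, as is the tab-separated output; xml2csv.pl is the authoritative implementation, so check both against your copy of the file:

```python
# Illustrative Python version of the xml2csv idea: keep only the title and
# the abstract of each entry, one tab-separated line per entry.
# The <doc>/<title>/<abstract> element names are assumptions here.
import xml.etree.ElementTree as ET

def xml_to_rows(xml_text):
    rows = []
    for doc in ET.fromstring(xml_text).iter("doc"):
        title = (doc.findtext("title") or "").removeprefix("Wikipedia: ")
        abstract = doc.findtext("abstract") or ""
        rows.append(title + "\t" + abstract)
    return rows

sample = """<feed>
  <doc><title>Wikipedia: Andorra</title><abstract>Andorra es un pais de Europa.</abstract></doc>
</feed>"""
print(xml_to_rows(sample))  # ['Andorra\tAndorra es un pais de Europa.']
```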
Step 3
Place the script for information extraction in the same folder as your unzipped data.zip and run:
perl wicacho.pl -train wikitable.csv > raw.csv
This script assigns every Wikipedia entry to one of the categories of the
CPA Ontology. It is, thus, an information extraction and word sense disambiguation process.
It may well take up to 30 minutes for the training to complete, depending on your hardware.
Here is a random sample of 1000 results. The sample shows 97% precision and 64% recall. Recall is low at this point, but this is not a problem: the leftovers will be reanalyzed in a second pass, after the statistical models are generated.
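For reference, precision and recall follow their standard definitions. The counts in the sketch below are invented so as to mirror the reported figures; they are not the actual evaluation data:

```python
# Standard precision/recall arithmetic; the counts are made up for
# illustration and are not the actual evaluation data.
def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# e.g. 1000 gold entries, 660 classified, 640 of them correctly:
p, r = precision_recall(true_pos=640, false_pos=20, false_neg=360)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.97 recall=0.64
```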
Step 4
Generate the first statistical model:
perl model.pl raw.csv > model-31march2026.csv
This script takes only a few minutes to execute.
Step 5
Generate the second statistical model:
perl morfeo.pl raw.csv > strongpairs.csv
This one takes only seconds to execute.
The script Morfeo is part of another of our projects,
which we hope will be documented soon with its own website.
In the meantime, we use this version of the script as
a complement of Wicacho's statistical model.
This version of Morfeo actually produces two results.
One is the output that gets written (strongpairs.csv), which is
a list of strong hyponym-hypernym pairs extracted from raw.csv;
the other is a model of suffixes, silently produced by the
script. You will find this model in the same folder, with a
name that follows this pattern: morfodel-1773603880.csv, where
the number is the result of the time() function (the number of
seconds since 'the Epoch').
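The naming pattern can be reproduced with the same recipe in any language; here is a small Python sketch (illustrative only):

```python
# The morfodel filename pattern: 'morfodel-' + Unix timestamp + '.csv',
# where the timestamp is the number of seconds since the Epoch,
# exactly what Perl's time() returns.
import time

def morfodel_name(seconds=None):
    if seconds is None:
        seconds = int(time.time())
    return f"morfodel-{seconds}.csv"

print(morfodel_name(1773603880))  # morfodel-1773603880.csv
```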
Notice that this script uses the iconv command. If you
want to use this on Windows, you would have to find a similar
alternative. As already stated, testing this code on Windows machines is pending work, but
that too will be done, eventually.
Step 6
In this step we are going to apply the training procedure again, this time
using the statistical models just generated. This is to increase the proportion
of classified Wikipedia entries by retrying the classification of those that were left
with the UNKNOWN tag in step 3.
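Abstractly, the second pass works like this (a sketch, not the actual wicacho.pl logic; the category labels are illustrative):

```python
# Sketch of the second-pass idea: keep the first-pass category when there is
# one, and ask the statistical model only for entries still tagged UNKNOWN.
# Category labels here are illustrative, not the system's actual inventory.
def second_pass(entries, model_guess):
    """entries: (term, category) pairs; model_guess: term -> category or None."""
    out = []
    for term, cat in entries:
        if cat == "UNKNOWN":
            cat = model_guess(term) or "UNKNOWN"   # the model may still abstain
        out.append((term, cat))
    return out

guesses = {"cardiologo": "Human"}
entries = [("perro", "Animal"), ("cardiologo", "UNKNOWN"), ("zzz", "UNKNOWN")]
print(second_pass(entries, guesses.get))
# [('perro', 'Animal'), ('cardiologo', 'Human'), ('zzz', 'UNKNOWN')]
```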
It is necessary, at this point, to manually edit the code of the wicacho.pl
script. Find the lines where the %files hash table is declared and update the filenames:
my %files = (
'pila' => 'masterPil.txt',
'apellido' => 'masterSurname.txt',
'lugar' => 'masterPlace.txt',
'stoplist' => 'stoplist.txt',
'semtypes' => 'semantypes_es.txt',
'triggers' => 'triggers_es29mar2026.txt',
'model' => 'model-31march2026.csv',
'morfeo' => 'morfodel-1773666228.csv',
'strongpairs' => 'strongpairs.csv',
);
Specifically, you have to update the filenames of the keys
'model', 'morfeo' and 'strongpairs', i.e., the names of the
statistical models just created.
Notice that there is a difference between
files with the .txt extension and those with .csv: the former are "hard-coded" into the system, while the latter are generated during training.
perl wicacho.pl -train raw.csv > database.csv
The execution is similar to what was done in step 3, the difference
being that now the input file is raw.csv (the result of step 3) and
the output is a new file database.csv, which is the new result.
This new result is similar to raw.csv, but with a new column
after the category initially assigned. Often the new category
is the same as the original one, but it also happens very frequently that
the category was initially UNKNOWN and the statistical model is then
able to venture a category, improving recall. This is the file that will be used
in the final step.
Step 7
Install a POS-tagger on your machine.
As POS-tagger, this system currently uses UDPipe, but there are
other alternatives (see
Project Porcus for some of them).
If you want to use a POS-tagger other than UDPipe, however, you would have
to adapt the code of wicacho.pl to ensure compatibility.
Step 8 (final)
You have created all the necessary materials so you are ready to apply the semantic tagger to any text in Spanish.
The only task left is to open again the wicacho.pl script and add the line defining the database (the file that resulted from step 6):
my %files = (
'pila' => 'masterPil.txt',
'apellido' => 'masterSurname.txt',
'lugar' => 'masterPlace.txt',
'stoplist' => 'stoplist.txt',
'semtypes' => 'semantypes_es.txt',
'triggers' => 'triggers_es29mar2026.txt',
'model' => 'model-31march2026.csv',
'morfeo' => 'morfodel-1773666228.csv',
'strongpairs' => 'strongpairs.csv',
'database' => 'database.csv',
);
This only needs to be done once, the first time you use it.
Afterwards, you can tag all the text you want using the same configuration.
For instructions on how to run the semantic tagger go to the beginning
of this document.