Kind - The Taxonomy Project

Documentation and Source Code

Latest update: June 1st, 2025

1. Introduction

The 2025 version of Kind, the taxonomy induction algorithm, is based on two main components: a rule-based algorithm for the extraction of hypernyms from a dictionary, and a statistical word sense disambiguation algorithm, essential for the coherent construction of the hypernymy chains. Compared to previous versions, this one handles the polysemy of nouns better at all levels. Take, for example, the English input noun airplane. Kind will classify it as a kind of aircraft, then as a kind of vehicle, and so on, until a maximum level of generality or abstraction is reached: in this case, the noun is a kind of entity (and not a property or an event, etc.).


airplane

Sense 1 of airplane
----> aircraft a powered heavier-than-air aircraft with fixed wings
--------> vehicle a vehicle capable of atmospheric flight due to interaction with the air, such as buoyancy or lift
------------> machine
----------------> artifact
--------------------> inanimate
------------------------> physical object
----------------------------> entity

Sense 2 of airplane
----> game a game to encourage small children to eat, in which the parent or carer pretends a spoonful of food is an aircraft flying into the child's mouth
--------> activity
------------> event
----------------> eventuality 


In the example we can also see how Kind handles polysemy at the entry level, but it does so at higher levels as well. This is the case, for instance, of the entry noun vodka, which is classified as a kind of spirit and, despite spirit being a polysemous noun, at the next level it is correctly classified as a kind of liquid.
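
To make this statistical sense selection a little more concrete, here is a toy sketch in Perl of the co-occurrence overlap intuition behind it. It is not the distributed code: the sense labels and word lists are invented purely for illustration, and the real module relies on the language model described in section 3.3.

# wsd_toy.pl -- toy illustration only, not part of the Kind distribution.
# It selects a sense of the hypernym "spirit" for the entry "vodka" by
# counting the overlap between invented co-occurrence lists.
use strict;
use warnings;

my %entry_cooc = map { $_ => 1 } qw(bottle drink alcohol distilled shot);

my %senses = (
    'spirit (beverage)' => [qw(alcohol drink distilled whisky bottle)],
    'spirit (ghost)'    => [qw(soul ghost dead haunted supernatural)],
);

my ($best, $best_score) = ('', -1);
for my $sense (sort keys %senses) {
    my $score = grep { $entry_cooc{$_} } @{ $senses{$sense} };   # shared context words
    print "$sense: overlap = $score\n";
    ($best, $best_score) = ($sense, $score) if $score > $best_score;
}
print "Selected sense for 'vodka': $best\n";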

2. Materials

The whole logic of this project is to connect nouns, obtained from the dictionary, to a ``Core Ontology'' (Coreont, for short), a selection of ca. 300 of the most general nouns in a language, described in the next subsection. The second material is a dictionary, in this case the Wiktionary, although in principle any other dictionary could be used. Finally, the third material is a large reference corpus, that is, a corpus of the general language, from which a language model is obtained. In our case we used the EnTenTen18, but any other large corpus could also be used. Here we only describe the Coreont in detail, because the Wiktionary was built by its community and the EnTenTen corpus by a third party.

2.1. The Coreont (a.k.a. Core Ontology, a.k.a. the Top-Ontology)

As already mentioned in the introduction, the starting point of this process is what we call the Coreont, a tree-shaped arrangement of ca. 300 of the most general concepts in any Western language, starting with those such as entity, property, event, group and so on (Figure 1).


Figure 1. A fragment of the Coreont

The Coreont is based on the CPA Ontology (http://pdev.org.uk), developed by Patrick Hanks for the Pattern Dictionary of English Verbs, but we freely adapted it to our purposes, adding or eliminating terms according to their frequency in the corpus. We then translated it into Spanish and French, and will eventually translate it into other languages as well. In order to replicate the experiments in another language, adapting this Coreont would be the starting point.

The following table shows a fragment of the Coreont file in text format (a minimal parsing sketch is given after Table 1). Notice how the tabs indicate the depth of the hypernymy chains. Thus, the noun Animate comes after three tabs, which means that it is a child of physical object, which comes after two tabs, and a grandchild of entity, which comes after only one tab. Notice, also, that some of the nouns are written with an initial capital letter. We use capitalization to distinguish the nouns that were in the original CPA Ontology from those we added later.

# coreont_en.txt
# English version of the Coreont.
# Take a look at the documentation before you do anything else:
# http://www.tecling.com/kind

Anything
	Entity
		Abstract Entity
			Concept
				Proposition
					Narrative
				Rule
					Permission
				Dispute
				Information
				study
				discipline
				line
				circle
			Information Source
				Document
					Agreement
					article
					map

...

Table 1. A fragment of the Coreont as a text file, showing only the top nodes.
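
Since the tab depth is the only structural information in the file, reading it back as a tree is straightforward. The following is a minimal Perl sketch (not part of the distribution) that prints each node of coreont_en.txt together with its parent:

# read_coreont.pl -- minimal sketch, not part of the distribution.
# Reads the tab-indented Coreont file and prints "child  is-a  parent".
use strict;
use warnings;

open(my $fh, '<', 'coreont_en.txt') or die "Cannot open coreont_en.txt: $!";

my @stack;                                   # $stack[$depth] = last node seen at that depth
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*#/ or $line =~ /^\s*$/;   # skip comments and blank lines
    my ($tabs) = $line =~ /^(\t*)/;          # leading tabs give the depth
    my $depth  = length $tabs;
    (my $node  = $line) =~ s/^\t+//;         # node name without the indentation
    my $parent = $depth > 0 ? $stack[$depth - 1] : 'TOP';
    print "$node\tis-a\t$parent\n";
    $stack[$depth] = $node;                  # this node may be the parent of what follows
}
close $fh;

Run on the fragment above, this prints, for instance, that Concept is-a Abstract Entity and Narrative is-a Proposition.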

The versions of the Coreont for the languages we have experimented with so far can be downloaded as a single zip file at the following link:
coreont.zip
They can also be downloaded individually by language. In this case, it is best to use the secondary mouse button and then `download as...' so you can open the files in a code editor. Otherwise, the browser may not be able to handle the special characters contained in the Spanish version.

  1. coreont_en.txt (English Coreont)
  2. coreont_es.txt (Spanish Coreont)

3. Source code

The rest of this documentation explains each of the modules of the source code, which is written entirely in the Perl programming language. The explanation is based on the English version of the project, but the code should be useful for other languages as well, with minor adjustments. The source-code files are zip-compressed, so they should be inflated after download. Note that this is not intended for Windows users: we assume you are using Linux or Unix, and we are unsure it would ever work in a Windows environment. If you want to try anyway, you should first install the Perl interpreter (https://www.perl.org/) before running the scripts.
At the moment, the source code is divided into the following modules:

3.1. Pre-processing of the Wiktionary
This module comprises two scripts, prepared by our colleague Daniel Mora, which are used to download and clean up the Wiktionary.
kind-dictcleanup.zip

This zip file contains the following:
3.1.1. filter_monolexical_titles.pl: This script reads the Wiktionary XML dump (e.g., enwiktionary-20250301-pages-articles.xml) and extracts the definitions that start with a hash (#) from the English noun section of pages that have a monolexical title.
Usage:
`$ perl filter_monolexical_titles.pl`
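
As a rough illustration of what this step does, the following sketch streams a dump line by line and prints the '#' definition lines found in the English / Noun sections of single-word titles. It is not the distributed script, and real Wiktionary pages are more varied (for instance, part-of-speech headers may be nested under Etymology sections), so treat it only as a simplified picture of the logic.

# filter_sketch.pl -- simplified illustration only, NOT the distributed
# filter_monolexical_titles.pl. Streams the XML dump line by line,
# remembers the current page title, and prints the '#' definition lines
# found inside the ==English== / ===Noun=== section of single-word titles.
use strict;
use warnings;

my ($title, $in_english, $in_noun) = ('', 0, 0);

while (my $line = <>) {
    if ($line =~ m{<title>([^<]+)</title>}) {
        ($title, $in_english, $in_noun) = ($1, 0, 0);   # a new page starts here
        next;
    }
    next if $title eq '' || $title =~ /[\s:]/;          # monolexical, main-namespace titles only
    if ($line =~ /^==([^=].*?)==\s*$/) {                # language header, e.g. ==English==
        $in_english = ($1 eq 'English');
        $in_noun    = 0;
    }
    elsif ($line =~ /^===+\s*(.+?)\s*=+\s*$/) {         # part-of-speech header, e.g. ===Noun===
        $in_noun = ($in_english && $1 eq 'Noun');
    }
    elsif ($in_noun && $line =~ /^#[^#*:]/) {           # plain definition lines start with '#'
        print "$title\t$line";
    }
}

It would be run as, e.g., `$ perl filter_sketch.pl enwiktionary-20250301-pages-articles.xml`.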
3.1.2. clean_enwiktionary.pl: This script reads the output of the previous script and cleans it up using a series of regular expressions, removing parentheses and wiki templates, along with fixed strings and metadata.
Usage:
`$ perl clean_enwiktionary.pl `
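
The following lines give a rough idea of the kind of regular expressions involved. This is not the distributed clean_enwiktionary.pl, only a simplified sketch of the clean-up it performs; the real script handles many more cases:

# clean_sketch.pl -- simplified illustration only, NOT the distributed
# clean_enwiktionary.pl. Reads definition lines on standard input and
# prints a cleaned-up version of each.
use strict;
use warnings;

while (my $line = <STDIN>) {
    $line =~ s/\{\{[^{}]*\}\}//g;                    # drop wiki templates such as {{lb|en|...}}
    $line =~ s/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/$1/g;  # keep only the visible text of [[links]]
    $line =~ s/\([^()]*\)//g;                        # drop parenthesised material
    $line =~ s/^#+\s*//;                             # drop the leading hash of definition lines
    $line =~ s/\s+/ /g;                              # normalise whitespace
    $line =~ s/^\s+|\s+$//g;
    print "$line\n" if $line =~ /\S/;
}

With these rules, an input line such as `# {{lb|en|aeronautics}} A powered [[fixed-wing aircraft|aircraft]].` would come out as `A powered aircraft.`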
Here you have a copy of the result of the clean up of the English Wiktionary:
enwiktionary_cleaned.zip
Warning: If you are planning to replicate the experiments in a different language, bear in mind that the Wiktionaries for different languages are not identical. Each language edition introduces subtle differences that must be taken into account.

3.2. Extraction of hypernyms from dictionaries
This script takes as input the output of the previous process. The result is the same dictionary, but now with the hypernym extracted from each definition.
kind-hypextract.zip
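
To give a rough idea of the rule-based extraction, here is a toy sketch (not the distributed script): it simply picks, as the candidate hypernym, the first word of the definition that also appears in the entry list described just below. The rules in kind-hypextract are considerably more elaborate.

# hypextract_sketch.pl -- toy illustration only, NOT the distributed
# kind-hypextract script. Picks as candidate hypernym the first word of
# the definition that is also a known (frequent) noun.
use strict;
use warnings;

# Load the list of frequent nouns, one per line.
my %noun;
open(my $nf, '<', 'cleanouns.txt') or die "Cannot open cleanouns.txt: $!";
while (<$nf>) { chomp; $noun{lc $_} = 1 }
close $nf;

# Illustrative entry; in the real pipeline the definitions come from the
# cleaned Wiktionary produced in step 3.1.
my $entry      = 'airplane';
my $definition = 'a powered heavier-than-air aircraft with fixed wings';

my $hypernym;
for my $word (split /\s+/, lc $definition) {
    $word =~ s/[^a-z-]//g;                          # strip punctuation
    next if $word =~ /^(?:a|an|the|of|or|with)$/;   # skip function words
    if ($noun{$word} && $word ne lc $entry) {
        $hypernym = $word;                          # first known noun = candidate hypernym
        last;
    }
}
printf "%s -> %s\n", $entry, $hypernym // 'UNKNOWN';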
You'll notice the distributed script uses a file called ``cleanouns.txt''. This is a frequency-filtered version of the list of Wiktionary entries; the full list can be generated using grep:

grep -P '^[^\t]' enwiktionary_cleaned.txt > nouns_en.txt

The reason we use this file is to filter out infrequent words, using frequency counts from the EnTenTen corpus: a hypernym should never be a very infrequent word. The previous command will produce the full list of entries (``nouns_en.txt''), with approximately 211,000 nouns. Here, however, you have the list reduced to 44,000 nouns according to frequency (i.e., we retained the most frequent ones in the EnTenTen corpus):
cleanouns.zip
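
The sketch below shows roughly how such a reduction can be done. It is not part of the distribution, and it assumes a hypothetical frequency list ``freqlist_en.txt'' with one ``noun<TAB>count'' line per noun, obtained from the reference corpus; the 44,000 cut-off is the figure mentioned above.

# freqfilter_sketch.pl -- illustration only, not part of the distribution.
# Keeps the 44,000 most frequent nouns of nouns_en.txt according to a
# hypothetical corpus frequency list (freqlist_en.txt: "noun<TAB>count").
use strict;
use warnings;

my $keep = 44_000;

my %freq;
open(my $ff, '<', 'freqlist_en.txt') or die "Cannot open freqlist_en.txt: $!";
while (<$ff>) { chomp; my ($w, $n) = split /\t/; $freq{lc $w} = $n }
close $ff;

open(my $nf, '<', 'nouns_en.txt') or die "Cannot open nouns_en.txt: $!";
chomp(my @nouns = <$nf>);
close $nf;

my @sorted = sort { ($freq{lc $b} // 0) <=> ($freq{lc $a} // 0) } @nouns;
$#sorted = $keep - 1 if @sorted > $keep;      # truncate to the cut-off
print "$_\n" for @sorted;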

3.3. Creation of a language model
kind-langmodel.zip
This is a series of scripts that take a corpus as input (in our case, the EnTenTen) and produce a co-occurrence language model. The result is a table that connects each noun to other semantically related words. This table is then used by the word sense disambiguation step in the next module.

This file contains the following scripts:
3.3.1. firstpass.pl: This script is controlled by the next one, langmodel.pl, so you do not have to run it directly.
3.3.2. langmodel.pl: This is the script you have to run. It takes a compressed corpus such as the EnTenTen as input, inflates each file, and for each file produces a co-occurrence matrix using the firstpass.pl script. The code contains a little more detail in the form of comments.
3.3.3. filtermodel.pl: This one takes the tables resulting from 3.3.2 and refines them. It needs to be run twice: the first run creates a stoplist that is used in the second run.
3.3.4. listready.pl: This one reads the result of the previous script, filtermodel.pl, and produces the list of co-occurrences that will be used by the next script.
3.3.5. weighter.pl: This is the final script, which takes the output of the previous one and computes the Mutual Information coefficient (a sketch of this computation follows below). With this step the model is ready.
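
The Mutual Information step can be pictured with the following sketch. It is not weighter.pl itself: the input file name and its ``word1<TAB>word2<TAB>count'' format are assumptions made purely for illustration.

# pmi_sketch.pl -- illustration only, NOT the distributed weighter.pl.
# Computes pointwise mutual information from a co-occurrence table with
# lines of the form "word1<TAB>word2<TAB>count".
use strict;
use warnings;

my (%pair, %freq);
my $total = 0;

open(my $in, '<', 'pairs.txt') or die "Cannot open pairs.txt: $!";
while (<$in>) {
    chomp;
    my ($w1, $w2, $n) = split /\t/;
    $pair{"$w1\t$w2"} += $n;
    $freq{$w1}        += $n;
    $freq{$w2}        += $n;
    $total            += $n;
}
close $in;

# PMI(w1, w2) = log2( p(w1,w2) / (p(w1) * p(w2)) )
for my $key (sort keys %pair) {
    my ($w1, $w2) = split /\t/, $key;
    my $p12 = $pair{$key} / $total;
    my $pmi = log($p12 / (($freq{$w1} / $total) * ($freq{$w2} / $total))) / log(2);
    printf "%s\t%s\t%.4f\n", $w1, $w2, $pmi;
}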

Here is a sample of the corpus to let you reproduce experiments (legal reasons prevent us from sharing it in full):
entensample.zip

And here is a copy of the result, a tiny English language model (20 MB uncompressed):
minimod.zip

3.4. Population of the Coreont
This script takes as input the output of the previous modules, i.e., the hypernyms extracted from the dictionary (3.2) and the language model (3.3). The output is a single taxonomy that connects all the nouns.
kind-popul.zip
This script uses the unfiltered list of entries generated earlier with grep (``nouns_en.txt''), a short stoplist (``shorstoplist_eng.txt''), which is one of the outputs of the language model generation scripts, and, of course, the English version of the Coreont, already described.
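
The core idea of this step can be pictured as follows: starting from any noun, follow the extracted hypernyms upwards until a node of the Coreont is reached. The sketch below is only a toy illustration of that idea, with a hard-coded hypernym table and without the word sense disambiguation that the real module performs.

# popul_sketch.pl -- toy illustration only, NOT the distributed kind-popul
# script (and without word sense disambiguation). Follows the hypernym of
# each noun upwards until a Coreont node is reached.
use strict;
use warnings;

my %hyper = (                      # toy hypernym table (output of 3.2)
    airplane => 'aircraft',
    aircraft => 'vehicle',
    vehicle  => 'machine',
    machine  => 'artifact',
);
my %coreont = map { $_ => 1 } qw(artifact entity event property);   # toy Coreont nodes

sub chain {
    my ($noun) = @_;
    my @chain  = ($noun);
    my %seen   = ($noun => 1);
    while (my $next = $hyper{ $chain[-1] }) {
        last if $seen{$next}++;    # stop on circular definitions
        push @chain, $next;
        last if $coreont{$next};   # stop once the Coreont is reached
    }
    return @chain;
}

print join(' ----> ', chain('airplane')), "\n";
# prints: airplane ----> aircraft ----> vehicle ----> machine ----> artifact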

Here you can download the full taxonomies in English and Spanish:
EnglishTaxonomyMay11-2025.zip
SpanishTaxonomyOctober6-2024.zip

Disclaimer: No one on our team has revised or corrected these resources. We do not have the manpower for that, and it would be pointless, as they are huge. We know for a fact that approximately 15% of their content is incorrect, so use them with care.


We are aware that this documentation is extremely succinct, but we will be expanding it in the following days!!!
If you would like to send inquiries you are welcome to do so at rogelio (dot) nazar (at) gmail (dot) com .
