Kind - The Taxonomy Project

Documentation and Source Code

Last edited: December 14, 2020
We have just jettisoned the embed module: it was too complicated and added nothing to the final classification.
At some later point we will document this former module (which will be renamed Wembe, for word-embedding):

This documentation is based on our latest paper:

+ Nazar, R.; Balvet, A.; Ferraro, G.; Marín, R.; Renau, I. (2020). Pruning and repopulating a lexical taxonomy: experiments in Spanish, English and French. Journal of Intelligent Systems, vol. 30 num. 1, pp. 376-394.

1. Introduction

The taxonomy induction algorithm is composed of a series of modules that run sequentially. They are controlled by a central processor tasked with producing a full taxonomy chain for an input noun. Take, for example, the input noun airplane, in English. Kind will classify it as a kind of aircraft, and then as a kind of vehicle and so on, until a maximum level of generality or abstraction is reached: in this case, the noun is a kind of entity (and not a property or an event, etc.).

1. -- > entity
2. ---- > physical object
3. ------ > inanimate
4. -------- > artifact
5. ---------- > vehicle
6. ------------ > aircraft
7. -------------- > airplane

In order to produce these chains, the central processor articulates the output of the different modules using a top-down and a bottom-up approach. That is, given an input noun, the algorithm starts with the top-down approach by asking the most general questions: Is it an entity, a property or an event? Then, if it is an entity, it asks whether it is an abstract or concrete entity, then whether it is animate or inanimate, and so on, until it reaches an intermediate level of abstraction, such as finding out that the noun refers to a type of vehicle, a type of weapon, a type of fish, etc. The latter are nouns that are general enough to be included in a ``Core Ontology'' (Coreont, for short), a selection of ca. 300 of the most general nouns in a language, described in Section 2 of this documentation. It is at this point that the inverse procedure starts, i.e. the bottom-up strategy, searching for the most specific or immediate hypernyms. With all data gathered, the algorithm chooses the best way to connect the two segments of the taxonomy chain.

The rest of this documentation consists of the explanation of each of the modules that constitute the Kind algorithm. Each of these modules performs a sub-process of hypernymy identification and, as already stated, the central processor is charged with the decision-making process, combining the results of all previous modules and finally connecting lexical units to a basic structure containing the most general categories of the language.

All the source code of the algorithm is contained in a single Perl script, with no dependencies.

This script also needs a series of data files (all in text format), which are also available on this website. For convenience, we have everything compressed in a single zip file, at the following address:

In addition, there is also a series of evaluation datasets available in Section 10.

2. The Coreont (a.k.a. Core Ontology, a.k.a. the Top-Ontology)

As already mentioned in the introduction, the starting point of this process is what we call the Core Ontology (or Coreont, for short), a tree-shaped arrangement of ca. 300 of the most general concepts in any Western language, starting with those such as entity, property, event, group and so on (Figure 1). This structure is used by the top-down modules.

Figure 1. A fragment of the Coreont

The Coreont is loosely based on the CPA Ontology, developed by Patrick Hanks for the Pattern Dictionary of English Verbs (Hanks & Pustejovsky, 2005), but we freely adapted it to our purposes, adding or eliminating terms according to their frequency in the corpus. We then translated it into Spanish and French, and will eventually translate it into the other languages we work on. In order to replicate these experiments in another language, adapting this Coreont would be the starting point.

The following table shows a fragment of the Coreont file in text format. Notice how the tabs indicate the depth of the hypernymy chains. Thus, the noun animate comes after three tabs, and under other categories. This means that it is a child of physical object, which comes after two tabs, and a grandchild of entity, which comes after only one tab.

# Coreont file for English
# comments can be included when necessary 
# they must be inserted after the symbol '#' 

	entity
		abstract entity
		physical object
			animate


Table 1: A fragment of the Coreont as a text file with only the top nodes.
# Coreont file for English
# comments can be included when necessary 
# they must be inserted after the symbol '#' 

		abstract entity
					social science
						political science
					physical science
						mechanical engineering
					computer science


Table 2: Another fragment of the Coreont, now with some basic content.
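For readers replicating the setup, the tab-depth encoding shown in these tables can be parsed in a few lines. The following sketch is in Python for illustration only (the actual Kind implementation is a single Perl script); the function name and the sample string are ours:

```python
def parse_coreont(text):
    """Parse a tab-indented Coreont file into (hyponym, hypernym) pairs."""
    parents = {}   # depth -> most recent node seen at that depth
    pairs = []
    for line in text.splitlines():
        line = line.split('#', 1)[0].rstrip()        # drop comments
        if not line.strip():
            continue                                  # skip blank lines
        depth = len(line) - len(line.lstrip('\t'))    # tabs encode depth
        node = line.strip()
        parents[depth] = node
        if depth > 1 and (depth - 1) in parents:
            pairs.append((node, parents[depth - 1]))  # child of last shallower node
    return pairs

sample = "\tentity\n\t\tabstract entity\n\t\tphysical object\n\t\t\tanimate\n"
print(parse_coreont(sample))
# → [('abstract entity', 'entity'), ('physical object', 'entity'), ('animate', 'physical object')]
```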

The versions of the Coreont in each of the three analyzed languages are already contained in the source code zip file, but there are also links to download them individually. Right-click and choose `download as...' so that you can open them in a code editor; otherwise your browser may not be able to handle the special characters they contain.

  1. coreont_en.txt (English Coreont)
  2. coreont_es.txt (Spanish Coreont)
  3. coreont_fr.txt (French Coreont)
  4. coreont.txt (A clean Coreont to start with a new language)

3. The initial taxonomy

For the whole process to work, a critical mass of data is needed, and that is a taxonomy larger than the Coreont. That means that we need an initial population of the taxonomy in order to have the algorithm working at full capacity. This is because the algorithm learns from the data it already has in the taxonomy at any given point (the phase we call ``feature extraction'', described in Section 6) in order to guess the best semantic classification of any new noun (the phase we call ``population'', i.e., the process of populating the taxonomy, described in Section 7).

There are different ways to achieve this initial (and possibly noisy) taxonomy. For the sake of clarity, we will keep that part in a separate article, since it is only useful for people who want to replicate these experiments in a different language. Right now (December 11, 2020) that other site is still under construction, but it will be finished in a few more days. It is called the Dicco module, and when finished, it will be stored here:

For the time being, and to avoid further delays, we will describe here the work that is readily available in Spanish, English and French. As we said, the Dicco module produces a rather noisy initial taxonomy for those languages, and so the Kind algorithm starts with a self-cleaning process (what we call the ``pruning'' phase, described in Section 5) before it uses this initial taxonomy as an example for its self-learning.

4. The corpora, their indexing and the extraction of concordances

As most of the modules are corpus-based, we need a large textual corpus to operate. The corpora from which we extracted the material were the French, English and Spanish versions of the TenTen corpus (Jakubíček et al., 2013), with approximately 10^10 tokens per language, the largest generic corpora currently available for these and other languages. They are composed of web pages randomly crawled from web top-level domains, which have been later converted to plain text, tokenized, lemmatized and POS-tagged.

In order to extract contexts of occurrence of the input terms that will populate the taxonomy from such large corpora, we need a robust concordancer. The quickest one we are aware of is our dear Kwico concordancer, which works very fast because it indexes the corpus. Kwico, a module of the Jaguar project, is freely available and well documented in the following website:

You will have to download and install the Kwico engine before you continue reading these instructions. Consider that when installing the Kind script in your system you will have to adjust the path to the Kwico script and to the corpus once it has been indexed with Kwico.

5. The pruning module

As already mentioned in the introduction, after adapting the Coreont to the target language, we generated an initial and possibly noisy taxonomy, which was later cleaned up. There are several ways to produce the initial taxonomy. In our case, we used the Dicco module, which, as already mentioned, will be explained later in a different article. Whatever the method used to produce the initial taxonomy, however, it needs to be cleaned up, and this is the purpose of the pruning module.

The input of this module is a list of suspected hypernymy pairs and the output is a true or false value for each pair. As explained earlier, the objective for this pruning step is to obtain a reduced number of assertions with a high probability of being correct, from which to extract features that in turn are used to classify new nouns during the population phase. The assertions that are likely to be true are denoted as matrix Ta,h where a is a member element (a hyponym) and h is the semantic category (a hypernym).

The pruning is conducted according to a measure of syntagmatic association. Thus, the output of this first module for Ta,h is obtained from estimator Ma,h, which measures the frequency of co-occurrence in the corpus of the alleged hypernymy pair a to h. It can be said that Ma,h is a measure of the relative importance of h to a, and is therefore appropriate for asymmetric relations like hypernymy (Equation 1).

Ma,h is computed by the analysis of a random sample of contexts of occurrence of each hyponym a in the corpus, denoted as set C = {c1, ... cn} (with n<=5000, to reduce processing time). The context window is ten words on each side of the target and each context ci is represented as a bag of words, with no order and no repetitions. M can be used to compute a ranking of the most reliable assertions or just to return a Boolean if Ma,h > p, with p as an arbitrary parameter (experimentally set to 0.3). Additionally, a true value for a pair Ma,h is only issued if it passes the asymmetry test (Equation 2).
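Equations 1 and 2 are not reproduced here, but the pruning step can be sketched as follows. This is an illustrative Python reading, not the actual Perl implementation; the toy concordancer and the exact form of the measure are our assumptions:

```python
import random

def cooccurrence_measure(target, candidate, concordancer, n=5000):
    """M_{a,h}: share of the sampled contexts of `target` whose bag of
    words contains `candidate` (one plausible reading of Equation 1)."""
    contexts = concordancer(target)
    if len(contexts) > n:
        contexts = random.sample(contexts, n)   # cap the sample size
    if not contexts:
        return 0.0
    hits = sum(1 for c in contexts if candidate in set(c))
    return hits / len(contexts)

def prune(a, h, concordancer, p=0.3):
    """True if (a, h) passes both the association threshold p and the
    asymmetry test: h attracts a less than a attracts h (Equation 2)."""
    m_ah = cooccurrence_measure(a, h, concordancer)
    m_ha = cooccurrence_measure(h, a, concordancer)
    return m_ah > p and m_ah > m_ha

# toy concordancer: maps a word to a list of bag-of-words contexts
toy = {
    "motorcycle": [["vehicle", "road"], ["vehicle", "engine"], ["race"]],
    "vehicle":    [["road"], ["traffic"], ["motorcycle"], ["car"]],
}
print(prune("motorcycle", "vehicle", lambda w: toy.get(w, [])))  # → True
```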

6. The feature extraction module

Once the initial taxonomy has been cleaned up, we are ready to invoke the feature extraction module. As already anticipated, much of the classification is done by extracting features from the cleaned-up Taxonomy. Different modules use different types of features (e.g. distributional, morphological), but all of them are subjected to a process of weighting in order to select the most discriminant ones. We score the features according to a given semantic category (Equation 3).

Let Fi be some feature (e.g. a morphological one, like the final letter-sequence -itis); Tj a given semantic category (e.g. disease); f(Fi | Tj) the frequency of feature Fi in the category Tj (the number of hyponyms bearing this trait within the category) and D(Fi) the dispersion of said feature Fi, i.e. the number of semantic categories in which it is present. This score thus promotes features that are concentrated in few categories and that, at the same time, are productive in each category. To continue with the same example, this would be the case of finding a great proportion of words ending with the segment -itis among the population of words in the category of diseases.

Best features are kept in a matrix FMm,h,f (Equation 4), where m stands for the name of the module, h for the hypernym (the semantic category), f for some feature and u is an empirically defined parameter, set to 0.001 in our experiments.
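The weighting just described, which promotes features that are frequent within a category and penalizes those dispersed across many categories, can be read as a simple ratio. The following Python sketch is our own reading of Equation 3, not the exact formula, and the toy taxonomy is invented:

```python
from collections import defaultdict

def score_features(taxonomy):
    """Weight each feature per category: its frequency f(Fi|Tj) within the
    category divided by its dispersion D(Fi) across categories
    (one plausible reading of Equation 3)."""
    # taxonomy: {category: {member_word: set_of_features}}
    dispersion = defaultdict(set)                  # feature -> categories using it
    freq = defaultdict(lambda: defaultdict(int))   # category -> feature -> count
    for cat, members in taxonomy.items():
        for feats in members.values():
            for f in feats:
                dispersion[f].add(cat)
                freq[cat][f] += 1
    return {cat: {f: n / len(dispersion[f]) for f, n in fs.items()}
            for cat, fs in freq.items()}

toy = {
    "disease":    {"poliomyelitis": {"itis"}, "bronchitis": {"itis"}, "anemia": {"emia"}},
    "instrument": {"telescope": {"cope"}, "microscope": {"cope"}},
}
scores = score_features(toy)
print(scores["disease"]["itis"])   # → 2.0 ("itis" occurs twice, in one category only)
```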

7. The population modules

The population modules accept one or more words as input and try to classify them by adding them to the Taxonomy. These modules actually do not decide the classification of a noun: they merely propose a classification to the central processor, who is in charge of making the final decision. The population modules are identified with code names (e.g. asymmetric, paradigmatic, morfeo) and described in detail in the following subsections.

7.1. Paradigmatic association module

The paradigmatic association module is based on the notion of distributional similarity, i.e. the idea, first put forward by distributionalists such as Harris (1954), that words that are semantically similar will share common features in the kinds of context in which they appear. In the case of this module, by shared features we mean co-occurring words. Thus, we will start by explaining what a co-occurrence vector is, and the best way to do this is probably by imagining the vector of an input noun a, which could be, for instance, the word piano. Given an input noun such as this, we first extract its concordances from the corpus (using the concordancer) and build a co-occurrence vector a⃗ (Equation 5).

The components of a⃗ are words (adjectives, verbs and nouns) having a significant frequency of co-occurrence with a in the context window. The vector includes a itself as it tends to be the most frequent word in its own contexts of occurrence. For simplicity, we restricted the selection of components to single word units, but nothing would change if word n-grams were also to be included. Also for simplicity, the dimensionality |a⃗| was limited to 100, a threshold defined by testing.

This is how a co-occurrence vector is created for an input noun. But the module also creates co-occurrence vectors for every semantic category in the taxonomy, because the classification of a will be based on the overlap between a⃗ and each of the vectors of the categories. This way, the module creates a co-occurrence vector for each of the hypernyms of the taxonomy, and each of them is a ``category vector''. It is different from the co-occurrence vector of a single word in that it is not simply the co-occurrence vector of the word that designates the category. Instead, it is calculated as the sum of all the components of the vectors of each hyponym of said hypernym. Again, and to use the same example, if a is the word piano, then a⃗ will have as components some semantically related words such as sonata, orchestra, voice, violin, concert, guitar and so on. The hypernym or semantic type to which piano belongs is Musical_instrument. For this hypernym there is a category vector, and this category vector will have as components the sum of all the components of the vectors of each noun pertaining to such category (i.e., all the names of musical instruments that are already members of said category).

The logic of this module, then, is that it compares a with each of the categories in the taxonomy. On average, the co-occurrence vector of a noun will have some overlap with its parent category vector. The result is that for a there will be a ranked list of the most similar category vectors. We operationalize this as a measure of overlap between a word vector and a category vector.

In Section 6 we discussed how features are weighted. This module uses that weighting procedure, thus the features (the components of the vectors) are not selected just by frequency of co-occurrence: they are also selected according to Equation 3, shown in Section 6. There is, then, a matrix FMm,h,f to represent the features that are highly associated with h, a given hypernym, i.e. with words showing significant frequency of co-occurrence with words that are hyponyms of h. This way, given an input noun a, this module decides if h is a good hypernym candidate for a (a → h) by calculating the overlap between a⃗ and h⃗, the vector from FMm,h, as shown in Equation 6.
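The construction of word and category vectors, and the overlap between them, can be sketched as follows. This is an illustrative Python reading (Equation 6 is not reproduced here; the overlap measure, names and toy data are our assumptions, and the actual implementation is in Perl):

```python
from collections import Counter

def word_vector(contexts, dim=100):
    """Top-`dim` co-occurring words of a noun, with their frequencies."""
    return Counter(w for c in contexts for w in c).most_common(dim)

def category_vector(member_vectors):
    """Sum of the co-occurrence vectors of every hyponym in the category."""
    total = Counter()
    for vec in member_vectors:
        total.update(dict(vec))
    return total

def overlap(word_vec, cat_vec):
    """One simple overlap measure: the share of the word's components
    that are also present in the category vector."""
    comps = dict(word_vec)
    if not comps:
        return 0.0
    return sum(1 for w in comps if w in cat_vec) / len(comps)

piano = word_vector([["sonata", "orchestra"], ["violin", "concert"]])
instruments = category_vector([
    word_vector([["orchestra", "violin"]]),
    word_vector([["concert", "sonata", "guitar"]]),
])
print(overlap(piano, instruments))  # → 1.0 (all of piano's components are shared)
```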

7.2. Asymmetric association module

This asymmetric association module uses the concordances of a given input word and applies a syntagmatic association measure to capture asymmetric associations between words. This follows the intuition that, in general, nouns have a tendency to co-occur with their hypernyms in a non-reciprocal manner (Nazar, 2010). For example, examining the co-occurrence vector of a word such as motorcycle, the second most frequent co-occurring noun after motor would probably be vehicle, while motorcycle will most probably not be among the most frequent co-occurring words in the vector of vehicle. Likewise, words also co-occur frequently with their co-hyponyms, and so in the co-occurrence vector of motorcycle we will see frequent items such as automobile, bicycle, car, and so on, each of which will probably also show the same pattern with respect to vehicle.

Having observed this behavior, we derived a simple strategy that uses the same co-occurrence vectors built in the previous step (Subsection 7.1.) to retrieve, for an input noun a, a first order co-occurrence vector and then a second order co-occurrence vector, composed of words that co-occur with the words that co-occur with a. Correct hypernyms for a are often found among the most frequent first and second order co-occurring items of a. We can represent this reasoning in a matrix:

The first row represents the first order co-occurrence vector, corresponding to input word a. The first column displays the same words as in the first row, i.e., the words in the first order co-occurrence with a. The subsequent rows represent the first order co-occurrence of the element in the first column of each row. In this case, given that d is the most frequent word in the matrix for a, and if d happens to be a known category in T, then this module will promote the candidacy of d as a hypernym of a.

7.3. Morfeo: morphological similarity between co-hyponyms

The strategy of this morphological similarity module is different from those of the previous steps in that in this case we do not use corpora or distributional features. Instead, this step consists of analyzing the correspondence of morphological features of the members of a class with the class they belong to. Morphological features are simply defined as sequences of 3 to 5 letters at the end of each word, meaning that there is no need to have morphological analyzers involved. These features are extracted from the Taxonomy and weighted exactly as with the other modules, i.e. with Equation 3 of Section 6. This means that we have already a matrix FMm,h,f with the best scored features.

Consider, for instance, the case of an input word a (e.g. poliomyelitis) and a hypernym candidate h (e.g. disease): we can decide if a → h according to Equation 7. Using the same notation as earlier, the symbol h⃗ denotes a vector of character n-grams associated with hypernym h, and Ma,f denotes a particular feature f (e.g. -itis) in input word a.
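A sketch of the suffix features and of one plausible reading of the decision in Equation 7 follows; the weights and names below are invented for illustration, and the real weights come from the feature extraction of Section 6:

```python
def suffix_features(word, lengths=(3, 4, 5)):
    """Morphological features: the final 3-to-5-letter sequences of a word,
    extracted with no morphological analyzer involved."""
    return {word[-n:] for n in lengths if len(word) > n}

def morfeo_score(word, category_features):
    """Sum the weights of the word's suffixes that are good features of the
    candidate category (one plausible reading of Equation 7)."""
    return sum(category_features.get(f, 0.0) for f in suffix_features(word))

# toy scored features for the category `disease` (weights are invented)
disease_features = {"itis": 2.0, "osis": 1.5, "emia": 1.2}
print(suffix_features("poliomyelitis"))                     # contains 'itis'
print(morfeo_score("poliomyelitis", disease_features) > 0)  # → True
```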

We argue that this particular algorithm operates more like an inference engine than a morphological similarity measure, because it does not test for similarity between hypernym and hyponym. The method can also be considered language-agnostic since it does not rely on any external knowledge source (e.g. lexicons, rules or morphological knowledge). Of course, the selection of character sequences depends on the phylogenetic type of the language being processed (word-initial vs. word-final affixes, Semitic languages, etc.). However, the selected pattern is common to a large subset of languages (English, Spanish and French, among others), and the procedure can be adapted to accommodate other morphological structures. We opted for this pseudo-suffix based approach because it allows us to capture repeated patterns of different sizes without the need for knowledge-intensive morphological analysis of words.

7.4. Morfrules: a set of morpho-semantic rules

Using the output of morfeo and in combination with our own observation, we manually crafted a set of morphological rules with the purpose of hard-coding the association between certain morphological features and certain semantic classes. Of course, we had to limit ourselves to the most general categories (those in the first three levels of the Coreont), as it would be excessively laborious to proceed with each category in this way. The following are some examples of such rules:

	`property' => `ment|ness|ity|sion|[ea]nce|ncy',
	`science' => `ics|logy|omy|graphy|[es]try|mics',
	`treatment' => `therapy|scopy',
	`surgery' => `surgery|ctomy',
	`instrument' => `graph|scope|phone|meter',
	`disease' => `[ao]sis|itis|pathy|emia|oma',

These rules mean that if an input word ends with any of the right-hand-side patterns (e.g., cholecystectomy), then the element on the left-hand side will be promoted as the hypernym. This strategy will only affect a very limited number of terms, but at least it will be effective on that subset.

We have between 20 and 30 rules of this type per language, but we also extend them to the hyponyms of the elements on the left. For instance, in the case of the hypernym disease, hyponyms of said element according to the Coreont will also be accepted as hypernym candidates (e.g. disorder, pathology, affliction, inflammation, infection).
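The rules above translate directly into regular expressions. The following Python sketch applies them to an input word (in the actual script they are Perl hash entries); note that a word may match more than one rule:

```python
import re

# the English rules from the text, as hypernym => word-final pattern
MORFRULES = {
    "property":   r"(ment|ness|ity|sion|[ea]nce|ncy)$",
    "science":    r"(ics|logy|omy|graphy|[es]try|mics)$",
    "treatment":  r"(therapy|scopy)$",
    "surgery":    r"(surgery|ctomy)$",
    "instrument": r"(graph|scope|phone|meter)$",
    "disease":    r"([ao]sis|itis|pathy|emia|oma)$",
}

def morfrules(word):
    """Return every hypernym whose morphological rule matches the word."""
    return [h for h, pat in MORFRULES.items() if re.search(pat, word)]

print(morfrules("happiness"))   # → ['property']
print(morfrules("bronchitis"))  # → ['disease']
```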

7.5. Head: syntactic head of multi-word expressions

If the input term happens to be a multi-word expression, it will be a noun phrase instead of a single noun. In such a case, this module will attempt to identify the syntactic head of the expression. The procedure for this is rudimentary but effective in the vast majority of cases: the last component is the head in the case of English and the first in the case of French and Spanish. The only added rule is, in the case of English, to start the process by eliminating all elements after the first occurrence of the preposition of. For instance, in the case of a term such as acute infectious disease, the head will be disease, but in the case of disease of lung, the head that is returned will again be disease and not lung.

Arguably, to say that the head of the noun phrase of the input is the hypernym is seldom useful. However, it can be useful when the input term cannot be found in corpus or there is insufficient information. In such cases, this module comes into play to obtain the head and then start a new instance of the hypernymy discovery process, this time to obtain the hypernym of the head. It is likely that a valid hypernym of the head will also be a valid hypernym of the whole term.
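The head-identification procedure of this module can be sketched as follows (Python for illustration; the actual implementation is in Perl):

```python
def head(term, language="en"):
    """Syntactic head of a multi-word term: last word in English (after
    cutting everything from the first 'of' onwards), first word in
    French and Spanish."""
    words = term.split()
    if language == "en":
        if "of" in words:
            words = words[:words.index("of")]   # drop 'of lung' etc.
        return words[-1]
    return words[0]    # French and Spanish noun phrases are head-initial

print(head("acute infectious disease"))        # → 'disease'
print(head("disease of lung"))                 # → 'disease'
print(head("maladie du foie", language="fr"))  # → 'maladie'
```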

7.6. Palex: a quantitative implementation of lexical patterns

The use of lexical patterns as a means to extract hypernymy relations from corpus dates back to the work of Hearst (1992), but is rooted in work previously undertaken on dictionary definitions (Chodorow et al., 1985). A lexical pattern is a sequence such as is a type of. In her original paper, Hearst (1992) suggested searching for lexical patterns such as these in order to extract hypernymy relations between the nouns or noun phrases found on each side of the pattern, and many researchers used this approach. However, lexical patterns applied in this way tend to be error prone. It is in this context that efforts were made to try to identify which patterns are more reliable (Potrich & Pianta, 2008; Seitner et al., 2016).

We believe that instead of focusing energy on trying to select a number of patterns, better results could be achieved by a simple change in tactics. Instead of starting with the pattern, i.e. instead of looking up a given pattern in corpus, we propose starting by looking up a given input noun in corpus. Once a large number of contexts of occurrence of a given noun have been extracted, we then search for the occurrence of lexical patterns within these contexts. These patterns will yield a number of hypernym candidates, and the key to this procedure is precisely the frequency of occurrence of such candidates. There will be one hypernym candidate with the highest frequency, and if that frequency is not high enough (at least three occurrences) then the candidate is rejected and the module does not return any result. This means that results will be produced only on some occasions, but when they are, they will have a high probability of being correct.

The following are some examples of the patterns we compiled for English, and similar patterns were used for Spanish and French.

HYPO is a type of HYPER
HYPO, a kind of HYPER
HYPO and other HYPER
HYPER such as HYPO

For economy, many of these patterns are coalesced using regular expression syntax, such that one line of code can include many patterns, like this:

is a (type|form|kind) of

We believe this version of the lexical-pattern approach is more advanced than the classical one in the sense that it uses frequency of co-occurrence as a key factor in the selection of hypernym candidates, and yet it is simple in comparison to other supervised learning strategies, which depend on training material and are computationally expensive.
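The candidate-counting procedure described in this subsection can be sketched as follows; the patterns are simplified versions of those listed above, the contexts are invented, and the Python is for illustration only:

```python
import re
from collections import Counter

# two of the patterns, coalesced with regex alternation; the group
# captures the hypernym candidate that follows the pattern
PATTERN = re.compile(r"\b(?:is a (?:type|form|kind) of|and other)\s+(\w+)")

def palex(contexts, min_freq=3):
    """Count hypernym candidates found by lexical patterns inside the
    contexts of the input noun; return the most frequent one only if it
    occurs at least `min_freq` times, otherwise nothing."""
    counts = Counter(m.group(1) for c in contexts for m in PATTERN.finditer(c))
    if not counts:
        return None
    best, freq = counts.most_common(1)[0]
    return best if freq >= min_freq else None

contexts = [
    "the airplane is a type of aircraft used for travel",
    "airplanes and other aircraft were grounded",
    "an airplane is a kind of aircraft",
    "the airplane is a form of transport",
]
print(palex(contexts))  # → 'aircraft' (three hits reach the threshold)
```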

8. The central processor

8.1. Flow diagram

In the population process, a final decision is needed for the construction of a taxonomy chain for any given noun. This is where the central processor comes into play. It controls the calls to each of the modules explained earlier. The general structure of this module is represented as a flow diagram in Figure 2.

Figure 2: A flow diagram of the central processor.

According to the flow chart, given one or more input terms, the algorithm starts by taking and analyzing them one by one. As before, the symbol a represents the input term. At this stage, one could argue that it is necessary to check, in advance, whether the input term a is already in the Taxonomy T, but that would prevent us from discovering new meanings of an already known term. We will assume, however, that a ∉ T.

Following the chart, we see that it takes the syntactic head c of term a and, if c ∈ T, a result is found and the system moves to the next input term. Otherwise, the td( ) function is called, short for top-down. This function, as well as its counterpart bu( ) (bottom-up), is explained in detail in what follows, but they are nothing more than a particular arrangement of the already presented modules. The symbol h represents a set of hypernym candidates that are obtained as the result of these operations, conducted on the input term a as well as on its head c and on the hypernyms hi that have already started to appear.

At each step of the process, the system checks whether any of the obtained hypernyms are known entities (i.e., hi ∈ T). The final decision is taken by a ranking function best(h), which selects the best hypernym-candidate chain according to how many modules voted for it. The system then computes those values in a matrix R to manage the contribution of each module, and obtains the final score of a pair a and h (Z(a,h)) as a weighted sum of each contribution (Equation 8).
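One plausible reading of the weighted voting in Equation 8 can be sketched as follows; the module weights and votes below are invented for illustration, not the values used by the actual Perl script:

```python
def best_hypernym(candidates, weights):
    """Z(a,h): weighted sum of the votes each module casts for each
    hypernym candidate; the winner is the highest-scoring candidate."""
    scores = {}
    for module, hyps in candidates.items():
        for h in hyps:
            scores[h] = scores.get(h, 0.0) + weights.get(module, 1.0)
    return max(scores, key=scores.get) if scores else None

# toy example: which hypernyms each module proposed for one input noun
votes = {
    "paradigmatic": ["vehicle", "machine"],
    "asymmetric":   ["vehicle"],
    "morfeo":       ["machine"],
    "palex":        ["vehicle"],
}
weights = {"paradigmatic": 1.0, "asymmetric": 1.0, "morfeo": 0.5, "palex": 2.0}
print(best_hypernym(votes, weights))  # → 'vehicle' (score 4.0 vs 1.5)
```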

Finally, and before deciding to promote the candidacy of a given hypernym, the system calls the function prune(a,b), which takes as arguments the input term and the best hypernym candidate. This function returns a Boolean by calling the pruning algorithm already explained in Section 5.

If the input consists of a number of terms, including multi-word expressions, and there are reasons to believe that these terms belong to the same specialized domain, there is then room for the application of some forms of pre-processing instead of working with the input terms one by one. For instance, the repetition, among this set, of certain head units would be a way to obtain relevant domain-specific semantic categories (e.g. disorder in a medical domain). This, however, is a possibility we have not implemented yet.

8.2. The top-down procedure: td( )

This procedure of the central processor consists of navigating the Coreont from its most general nodes to the most specific. That is to say, its output will be a category of the Coreont which is as specific as possible, but it never reaches a great level of specificity because the Coreont is general by definition.

For instance, if the input term is fluoxetine, a successful result for this procedure would be drug or even antidepressant. It will start by asking, as we said, the most general questions: Is this an entity, a property, or an event? And if it is an entity, is it abstract or concrete? And so on. This means that, as one travels downwards through the taxonomy, one has to decide which path to traverse next. With the exception of the last one (Palex), all the modules are called for this task. At each step they are called with the input term and the options for categories available at that moment. Each module only votes for one of the categories it is presented with.
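The descent just described can be sketched as follows; the voting function and the Coreont fragment are toy examples of our own, standing in for the combined votes of the modules:

```python
def td(noun, coreont, root, vote):
    """Top-down descent through the Coreont: at each node the modules vote
    among the children, and the walk follows the winner until no child
    receives any vote or a leaf is reached."""
    node = root
    while coreont.get(node):
        winner = max(coreont[node], key=lambda child: vote(noun, child))
        if vote(noun, winner) == 0:
            break                  # no module supports going deeper
        node = winner
    return node

# toy fragment of the Coreont and a toy voting function
coreont = {"entity": ["abstract entity", "physical object"],
           "physical object": ["animate", "inanimate"]}
toy_votes = {("fluoxetine", "abstract entity"): 0, ("fluoxetine", "physical object"): 3,
             ("fluoxetine", "animate"): 0, ("fluoxetine", "inanimate"): 2}
print(td("fluoxetine", coreont, "entity", lambda n, c: toy_votes.get((n, c), 0)))
# → 'inanimate'
```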

8.3. The bottom-up procedure: bu( )

This procedure of the central processor has the purpose of finding the most immediate or domain specific hypernym, and this is the task of the last module, Palex. As shown in the flow-chart, the bottom-up procedure will be robust enough to produce results even if the top-down procedure fails, because from an immediate hypernym we can continue to build up a taxonomy chain by recursively submitting the obtained hypernym to a new instance of the top-down procedure. To continue with the same example, if the input term is fluoxetine, the result of this procedure should be selective serotonin reuptake inhibitor, SSRI or, here too, antidepressant. And from there, we can start the process again in order to link-up such result with a more general term in the taxonomy.

9. Ok, enough with the theory: what are you supposed to do now

We thought it was important to provide an overall description of the process before dealing with the more concrete task of trying to reenact the Kind algorithm on another server. The following are very concrete step-by-step instructions to make it work in any of the languages in which it is already available (at the time of writing, English, Spanish and French). Eventually, more instructions will be published here in order to try to replicate experiments in other languages as well.

It will become apparent that we assume a user familiar with Linux operating systems and web servers (such as Apache), and with some degree of familiarity with the Perl programming language.

So, the steps are the following:

  1. Set up a web server if you don't already have one.
  2. Get a large (a very large) corpus of the language you want to analyze. As we said earlier in the documentation, we used the TenTen corpora but that is not freely available. Alternatively, you can use Wikipedia or any other large corpus.
  3. Download the Kwico Engine ( ), which will be needed by Kind to extract concordances of the input terms. You could, in principle, use another concordancer, but Kwico is fast and free, so why not?
  4. Index your corpus using Kwico (follow the instructions in Kwico's website in order to do that).
  5. Create a folder contexts somewhere in your file system that your server has access to, in order to store the contexts of occurrence of the searched terms (retrieved by Kwico).
  6. Set permissions so your server can also write files to this folder (chown -R www-data:www-data contexts, etc.).
  7. Create a folder chains to store the taxonomy (one for each language) and grant your server the same read and write permissions, exactly as with contexts.
  8. Create a folder kind in your cgi-bin folder or anywhere you can execute perl scripts.
  9. Copy the Kind Perl script ( ) into said folder.
  10. Give 755 permissions to said file (chmod 755).
  11. Create folders for the co-occurrence vectors for the languages you will be working on (vectors-english, vectors-spanish, vectors-french: purists will have to excuse the all-lower-case spelling of the languages). These folders can be placed in the same folder as the main script.
  12. Change ownership of the vector folders to the server (chown -R www-data:www-data vectors-spanish, etc.).
  13. Edit the index file, providing the correct paths to all needed data files.
  14. Open a browser and point it to where you have installed the kind/ file.
  15. For initial tests, activate the `verbose' option to make sure everything is working properly. There is also a generous amount of in-line commentary in the code itself to help users understand what happens in each part.
  16. Voilà! Enjoy. And if you run into trouble, send a message to rogelio (dot) nazar (at) gmail (dot) com
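Steps 5 to 12 above can be condensed into a short shell sketch; the relative paths and the www-data user are assumptions that you should adjust to your own installation:

```shell
#!/bin/sh
# Sketch of steps 5-12; adjust the paths to your own installation.
mkdir -p contexts chains              # contexts of occurrence and taxonomy chains
mkdir -p kind                         # folder for the Kind Perl script
mkdir -p kind/vectors-english kind/vectors-spanish kind/vectors-french
chmod 755 kind                        # the script file itself also needs 755 (step 10)
# The web server must own the writable folders (run these as root):
# chown -R www-data:www-data contexts chains kind/vectors-*
```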

10. Datasets for evaluation

The following are some datasets that we propose for evaluation purposes. To download them, right-click and choose `download as...' so that you can open them in a code editor; otherwise your browser may not be able to handle the special characters they contain.

Datasets in Spanish:

These others are single nouns of various categories, not tagged:
And these are some terms, also not tagged:
+ 1710 psychiatry terms

Dataset in French:

Dataset in English:

