The corpus-based taxonomy induction project

UPDATE on the Taxonomy Project, 30 September 2018: a new version is now available

The new version, which already works in French, English and Spanish, now also runs a continuous automatic check on the results for each category and flags any errors it detects.

A further version, able to process multi-word specialized terminology, will follow soon.

The interface is still somewhat rough, but the idea is that you provide a noun (single-word nouns only, for the moment) and the program tries to assign the best semantic categories to that noun. You can also provide a list of nouns (one per line), and the program will treat each noun as an independent trial. At the moment it does not learn from its own output, i.e., the taxonomy does not change. Eventually, however, it will begin to modify itself, correcting wrong information.


We designed a statistical taxonomy induction algorithm that combines different strategies, none of which involves explicit linguistic knowledge. Although all of them are quantitative, the strategies are of different kinds. Some are based on the computation of distributional similarity coefficients, which identify pairs of sibling words (co-hyponyms), while others are based on asymmetric co-occurrence and identify pairs of parent-child words (hypernym-hyponym relations). A decision-making process then combines the results of the previous steps and finally connects lexical units to a basic structure containing the most general categories of the language. We are currently evaluating the latest results.
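As a rough illustration of the two families of strategies described above, the following Python sketch computes a distributional similarity coefficient (here cosine similarity over co-occurrence counts) and an asymmetric co-occurrence measure (here a pair of conditional probabilities). This is not the project's actual implementation: the toy contexts, the function names and the choice of cosine and conditional probability are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy corpus: each inner list is one context (e.g. a sentence)
# reduced to its content words. Real input would be a large text corpus.
CONTEXTS = [
    ["dog", "animal", "bark"],
    ["dog", "animal", "tail"],
    ["cat", "animal", "tail"],
    ["cat", "animal", "purr"],
    ["dog", "bark", "tail"],
    ["animal", "zoo"],
    ["animal", "farm"],
    ["car", "vehicle", "road"],
]


def cooccurrence_vectors(contexts):
    """Map each word to a Counter of the words it shares a context with."""
    vectors = {}
    for ctx in contexts:
        for a, b in combinations(set(ctx), 2):
            vectors.setdefault(a, Counter())[b] += 1
            vectors.setdefault(b, Counter())[a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (symmetric)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def conditional(contexts, x, given):
    """Share of the contexts containing `given` that also contain `x`."""
    with_given = [c for c in contexts if given in c]
    if not with_given:
        return 0.0
    return sum(1 for c in with_given if x in c) / len(with_given)


vectors = cooccurrence_vectors(CONTEXTS)

# Symmetric measure: words with similar neighbours are sibling candidates.
sim_dog_cat = cosine(vectors["dog"], vectors["cat"])
sim_dog_car = cosine(vectors["dog"], vectors["car"])

# Asymmetric measure: if "animal" appears in most contexts of "dog" but not
# the other way round, "animal" is a candidate parent of "dog".
p_animal_given_dog = conditional(CONTEXTS, "animal", "dog")
p_dog_given_animal = conditional(CONTEXTS, "dog", "animal")

print(f"sim(dog, cat) = {sim_dog_cat:.3f}, sim(dog, car) = {sim_dog_car:.3f}")
print(f"P(animal|dog) = {p_animal_given_dog:.3f}, P(dog|animal) = {p_dog_given_animal:.3f}")
```

In this toy data, "dog" and "cat" come out as more similar to each other than "dog" and "car", while "animal" co-occurs with "dog" in two out of three "dog" contexts and "dog" appears in only two out of six "animal" contexts. A decision step of the kind mentioned above would then have to weigh such symmetric and asymmetric signals against each other.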

SOURCE CODE & DOCUMENTATION: We are working at full speed to finish the new version of the source code and documentation of the Taxonomy Project. Hang on. We will get there!


From 2014 to 2017, this research was supported by a grant from the Chilean Government: Conicyt-Fondecyt 11140686, “Inducción automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar.
Since 2017, it has also received support from the Ecos-Sud Project C16H02 “Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus” (Automatic taxonomy induction from corpora for Spanish and French using quantitative corpus analysis).


Various researchers are involved in the project. The following are the main contributors:

  • Rogelio Nazar
  • Irene Renau
  • Gabriela Ferraro
  • Rafael Marín
  • Antonio Balvet

  • Rogelio Nazar (developer)
  • Nicolás Acosta (developer)
  • Sebastián Márques (designer)

Related concepts:

corpus statistics, distributional semantics, Spanish, taxonomy induction