The corpus-based taxonomy induction project

Web demo:

This is the web demo for the Spanish general vocabulary Taxonomy Project.
The demo for the Spanish Psychiatry Taxonomy Project is also available.

Disclaimer: this taxonomy was built entirely by automatic means. We have not yet manually annotated or corrected any of the results. The algorithm itself estimates the probability that each result is correct.

Have comments? Drop a line.

Try the Spanish taxonomy of general vocabulary (nouns):

The demo accepts a Spanish noun as input and, if the noun is in the taxonomy, returns its different hypernymy chains.

You can search for a word, browse the taxonomy in alphabetical order, or select a random word.

(Browsing by semantic category is coming soon.)
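To make the notion of "hypernymy chains" concrete, here is a minimal sketch of how such chains could be enumerated once a taxonomy exists. It is not the project's code: the `HYPERNYMS` table, its entries and the function name `hypernym_chains` are hypothetical, and the real demo may store and traverse the taxonomy quite differently.

```python
from typing import Dict, List

# Hypothetical toy data: each noun maps to its direct hypernyms (parents).
# These entries are illustrative only and are not taken from the project's taxonomy.
HYPERNYMS: Dict[str, List[str]] = {
    "perro": ["mamífero", "mascota"],
    "mamífero": ["animal"],
    "mascota": ["animal"],
    "animal": ["ser vivo"],
    "ser vivo": [],
}


def hypernym_chains(noun: str, parents: Dict[str, List[str]]) -> List[List[str]]:
    """Return every chain from `noun` up to a root, or [] if the noun is unknown."""
    if noun not in parents:
        return []

    chains: List[List[str]] = []

    def walk(word: str, path: List[str]) -> None:
        path = path + [word]
        # Skip parents already on the path so a cyclic entry cannot loop forever.
        uppers = [p for p in parents.get(word, []) if p not in path]
        if not uppers:
            chains.append(path)  # reached a root: the chain is complete
            return
        for upper in uppers:
            walk(upper, path)

    walk(noun, [])
    return chains


if __name__ == "__main__":
    for chain in hypernym_chains("perro", HYPERNYMS):
        print(" -> ".join(chain))
```

A noun with more than one parent yields more than one chain, which is why the demo speaks of "different" chains for a single input word.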

Video presentation of the project

Watch a short video presentation of the project by Rogelio Nazar, the main researcher. The video is 10 minutes long (MP4 file of 98 MB).


We designed a statistical taxonomy induction algorithm that combines several strategies, none of which involves explicit linguistic knowledge. All the strategies are quantitative, but they differ in nature. Some compute distributional similarity coefficients to identify pairs of sibling words (co-hyponyms), while others rely on asymmetric co-occurrence to identify parent-child pairs (hypernym-hyponym relations). A decision-making process then combines the results of these steps and connects lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and using the Spanish WordNet as a gold standard. We estimate an average of 89.07% precision and 25.49% recall when considering only the results that the algorithm presents with a high degree of certainty, or 77.86% precision and 33.72% recall when considering all results.
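The two kinds of signal described above can be illustrated with a small sketch. This is only an approximation under stated assumptions: the toy contexts in `DOCS` are invented, cosine similarity over co-occurrence counts stands in for the project's distributional similarity coefficients, and a comparison of conditional co-occurrence rates stands in for its asymmetric co-occurrence measure. The coefficients actually used by the algorithm are documented in the publications listed below and may differ.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical toy corpus: each inner list holds the nouns found in one context.
DOCS: List[List[str]] = [
    ["perro", "animal", "hueso"],
    ["gato", "animal", "leche"],
    ["perro", "animal", "correa"],
    ["gato", "animal", "hueso"],
    ["animal", "zoológico"],
    ["perro", "gato"],
]


def cooccurrence_vectors(docs: List[List[str]]) -> Dict[str, Counter]:
    """Count, for every word, how often each other word shares a context with it."""
    vectors: Dict[str, Counter] = {}
    for doc in docs:
        words = set(doc)
        for word in words:
            vec = vectors.setdefault(word, Counter())
            for other in words:
                if other != word:
                    vec[other] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Symmetric similarity: high values suggest sibling words (co-hyponyms)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def conditional(docs: List[List[str]], target: str, given: str) -> float:
    """Fraction of the contexts containing `given` that also contain `target`."""
    with_given = [d for d in docs if given in d]
    if not with_given:
        return 0.0
    return sum(1 for d in with_given if target in d) / len(with_given)


if __name__ == "__main__":
    vecs = cooccurrence_vectors(DOCS)
    print("sim(perro, gato) =", round(cosine(vecs["perro"], vecs["gato"]), 3))
    # Asymmetry: "perro" brings "animal" along more often than the reverse,
    # which in this sketch is read as a hint that "animal" is the hypernym.
    print("P(animal | perro) =", round(conditional(DOCS, "animal", "perro"), 3))
    print("P(perro | animal) =", round(conditional(DOCS, "perro", "animal"), 3))
```

The symmetric measure cannot tell which of two related words is more general, whereas the asymmetric one gives a direction; combining both kinds of evidence is the role the text assigns to the decision-making step.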


This research is supported by a grant from the Chilean Government: Conicyt-Fondecyt 11140686, “Inducción automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar.

Documentation and source code:

Click here for the source code and documentation of the taxonomy project.


A lot of people are involved in the project in one way or another, but the following are those who contributed the most:

  • Rogelio Nazar (PI)
  • Irene Renau (Research)
  • Gabriela Ferraro (Research)
  • Rafael Marín (Evaluation of the French taxonomy)
  • Hernán Robledo (Collaboration on the Spanish Psychiatry taxonomy)
  • Nicolás Acosta (Collaboration on web development)
  • Sebastián Márques (Web design)

Related publications:

+ Nazar, R.; Soto, R.; Urrejola, K. (In press). Detección automática de nombres eventivos no deverbales en castellano: un enfoque cuantitativo basado en corpus. Revista Linguamatica. v. 9, n. 2.

+ Nazar, R.; Arriagada, P. (2017). POL: un nuevo sistema para la detección y clasificación de nombres propios. Procesamiento del Lenguaje Natural, n. 58, pp. 13-20.

+ Nazar, R.; Renau, I. (2016). “A Quantitative analysis of the semantics of verb-argument structures”. In S. Torner and E. Bernal (eds.), Collocations and other lexical combinations in Spanish. Theoretical and Applied approaches. New York: Routledge, pp. 92-109.

+ Nazar, R. (2016). Distributional analysis applied to terminology extraction: example in the domain of psychiatry in Spanish. Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication, 22(2):142-170.

+ Nazar, R.; Renau, I. (2016). A taxonomy of Spanish nouns, a statistical algorithm to generate it and its implementation in open source code. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), May 2016.

+ Nazar, R.; Renau, I. (2016). Automatic extraction of lexico-semantic patterns from corpora. Proceedings of the XVII EURALEX International Congress: Lexicography and Linguistic Diversity. Tinatin Margalitadze and George Meladze (eds). Tbilisi, Georgia: Ivane Javakhishvili Tbilisi State University, pp. 823-830.

+ Nazar, R.; Renau, I. (2015). Agrupación semántica de sustantivos basada en similitud distribucional: implicaciones lexicográficas. In María Pilar Garcés Gómez (ed.): "Lingüística y diccionarios" (Anexos Revista de Lexicografía, vol. 2: 272-295). Universidade da Coruña.

+ Nazar, R.; Renau, I. (2015). Ontology Population Using Corpus Statistics. Proceedings of the Joint Ontology Workshops 2015 co-located with the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015). Buenos Aires, Argentina, July 25-27, 2015.

Related concepts:

corpus statistics, distributional semantics, Spanish, taxonomy induction