Kind - The Taxonomy Project

You are now on the Spanish side of the taxonomy

You can also go to the other sides:

--> English
--> French

New! (April 23, 2021): now you can navigate the categories created so far

Alternatively, as before, you can enter any word or list of words in Spanish (one per line) to classify them:

Verbose
(this makes the output very verbose, so don't tick it if you are
entering many input words, otherwise you will be flooded with text).

Ignore cache
(this makes the system skip the cache, so it will ignore
any previous classification of the same noun that it may
have made in the past).

You can also try with these precompiled examples.
They have already been tagged by humans, so the evaluation can be made automatically.

+ 358 single nouns in the categories of weapons, cheese, fish and machines
+ 674 terms of various classes (SemEval 2018) (this one is very noisy)
+ 1C.spanish.test.gold.txt
+ 1767 names of drugs and pharmaceutical products

These others are single nouns of various categories, not tagged:

And these are some terms, also not tagged:
+ 1710 psychiatry terms

Abstract

We designed a statistical taxonomy induction algorithm that combines several strategies, none of which involves explicit linguistic knowledge. Although all of them are quantitative, the strategies differ in nature. Some compute distributional similarity coefficients to identify pairs of sibling words (co-hyponyms), while others rely on asymmetric co-occurrence to identify pairs of parent-child words (hypernym-hyponym relations). A decision-making process then combines the results of the previous steps and finally connects lexical units to a basic structure containing the most general categories of the language. We are currently in the process of evaluating the latest results.
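The two kinds of signal mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the project's implementation: the toy corpus, the function names (`context_vectors`, `cosine`, `asymmetry`) and the specific measures (cosine similarity over co-occurrence counts, and a difference of conditional probabilities) are assumptions chosen only to show the general idea of a symmetric sibling score versus an asymmetric parent-child score.

```python
from collections import Counter
import math

# Hypothetical toy corpus: each context is a list of tokens (e.g. one sentence).
CONTEXTS = [
    ["tarantula", "spider", "venom", "legs"],
    ["tarantula", "spider", "web", "legs"],
    ["scorpion", "venom", "legs", "sting"],
    ["spider", "web", "animal"],
    ["scorpion", "animal", "sting"],
    ["spider", "animal", "legs"],
]


def context_vectors(contexts):
    """Map each word to a Counter of the words it co-occurs with."""
    vectors = {}
    for tokens in contexts:
        unique = set(tokens)
        for word in unique:
            vec = vectors.setdefault(word, Counter())
            vec.update(t for t in unique if t != word)
    return vectors


def cosine(u, v):
    """Symmetric distributional similarity: high values suggest sibling words."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def asymmetry(child, parent, contexts):
    """P(parent | child) - P(child | parent), estimated over contexts.

    A positive value means the candidate parent shows up wherever the child
    does, but not the other way round, which hints at a hypernym relation.
    """
    n_child = sum(1 for c in contexts if child in c)
    n_parent = sum(1 for c in contexts if parent in c)
    n_joint = sum(1 for c in contexts if child in c and parent in c)
    if n_child == 0 or n_parent == 0:
        return 0.0
    return n_joint / n_child - n_joint / n_parent


if __name__ == "__main__":
    vectors = context_vectors(CONTEXTS)
    # Sibling signal (symmetric): are "tarantula" and "scorpion" used alike?
    print(cosine(vectors["tarantula"], vectors["scorpion"]))
    # Parent-child signal (asymmetric): is "spider" a likely hypernym?
    print(asymmetry("tarantula", "spider", CONTEXTS))
```

In this toy data every context containing "tarantula" also contains "spider", while only half of the "spider" contexts mention "tarantula", so the asymmetric score is positive in the tarantula-to-spider direction and negative in the reverse one. A real system would need a much larger corpus and a further step to reconcile the two kinds of evidence, as the abstract describes.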

Documentation and source code

Today is Friday, December 11, 2020. The source code and documentation have been changing very rapidly and we are still working on the details. However, you can have a look at what we've got so far, which is a stable version.
We will not maintain older versions.
If you would like to send inquiries, you are welcome to do so at rogelio (dot) nazar (at) gmail (dot) com.

Funding

This project has been supported by two successive grants:

Conicyt-Fondecyt 11140686, “Inducción automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar. (2014-2017).
Ecos Sud-Conicyt Project C16H02 “Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus” (Automatic taxonomy induction from corpora for Spanish and French using quantitative corpus analysis). Lead researcher: Irene Renau. (2016-2019).

Credits

Researchers:

Rogelio Nazar
Antonio Balvet
Gabriela Ferraro
Rafael Marín
Irene Renau

Related publications:

+ Nazar, R.; Balvet, A.; Ferraro, G.; Marín, R.; Renau, I. (2020). Pruning and repopulating a lexical taxonomy: experiments in Spanish, English and French. Journal of Intelligent Systems, vol. 30 num. 1, pp. 376-394.
+ Nazar, R.; Obreque, J.; Renau, I. (2020). Tarántula –> araña –> animal : asignación de hiperónimos de segundo nivel basada en métodos de similitud distribucional. Procesamiento del Lenguaje Natural, núm 64, pp. 29-36.
+ Nazar, Rogelio. (2019). El análisis cuantitativo de la coocurrencia léxica en la lexicografía especializada. In Sanmartín Sáez, Julia y Quilis Merín, Mercedes (eds.). Retos y avances en lexicografía: los diccionarios del español en el eje de la variación lingüística. Anejo 10 de Normas. Valencia: Asociación Española de Estudios Lexicográficos.
+ Nazar, R., Renau, I., Marín, R. (2017). Experiments in taxonomy induction in Spanish and French. Proceedings of Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017), pp. 66-75.
+ Nazar, R., Renau, I., Marín, R. (2017). Taxonomía automatizada de sustantivos del castellano y del francés: hacia el etiquetado semántico automático multilingüe. In: Sariego López, Ignacio, Juan Gutiérrez Cuadrado y Cecilio Garriga Escribano (eds.), El diccionario en la encrucijada: de la sintaxis y la cultura al desafío digital. Santander: AELEX, 731-745.
+ Renau, I., Nazar, R. (2017). Verbos en contexto: una propuesta para la detección automática de patrones léxicos en corpus. In: Sariego López, Ignacio, Juan Gutiérrez Cuadrado y Cecilio Garriga Escribano (eds.), El diccionario en la encrucijada: de la sintaxis y la cultura al desafío digital. Santander: AELEX, 879-897.
+ Nazar, R.; Arriagada, P. (2017). POL: un nuevo sistema para la detección y clasificación de nombres propios. Procesamiento del Lenguaje Natural, n. 58, pp. 13-20.
+ Nazar, R.; Soto, R.; Urrejola, K. (2017). Detección automática de nombres eventivos no deverbales en castellano: un enfoque cuantitativo basado en corpus. Revista Linguamatica, vol. 9, num. 2, pp. 21-31.
+ Nazar, R. (2016). Distributional analysis applied to terminology extraction: example in the domain of psychiatry in Spanish. Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication, 22(2):142-170.
+ Nazar, R., Renau, I. (2016). A Quantitative analysis of the semantics of verb-argument structures. In S. Torner and E. Bernal (eds.), Collocations and other lexical combinations in Spanish. Theoretical and Applied approaches. New York: Routledge, pp. 114-136.
+ Nazar, R.; Renau, I. (2016). Automatic extraction of lexico-semantic patterns from corpora. Proceedings of the XVII EURALEX International Congress: Lexicography and Linguistic Diversity. Tinatin Margalitadze and George Meladze (eds). Tbilisi, Georgia: Ivane Javakhishvili Tbilisi State University, pp. 823-830.
+ Nazar, R.; Renau, I. (2016). A taxonomy of Spanish nouns, a statistical algorithm to generate it and its implementation in open source code. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), May 2016, pp. 1485-1492.
+ Nazar, R.; Renau, I. (2015). Agrupación semántica de sustantivos basada en similitud distribucional: implicaciones lexicográficas. In María Pilar Garcés Gómez (ed.): Lingüística y diccionarios (Anexos Revista de Lexicografía, vol. 2: 272-295). Universidade da Coruña.
+ Nazar, R.; Renau, I. (2015). Ontology Population Using Corpus Statistics. Proceedings of the Joint Ontology Workshops 2015 co-located with the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015). Buenos Aires, Argentina, July 25-27, 2015.
Related concepts:
corpus statistics, distributional semantics, Spanish, taxonomy induction