Tecling logo   Technologies for Linguistic Analysis
»The World is automatic
KIND (aka The Taxonomy Project)
Example of taxonomy induction algorithm

We designed a statistically-based taxonomy induction algorithm consisting of a combination of different strategies not involving explicit linguistic knowledge. Being all quantitative, the strategies we present are however of different nature. Some of them are based on the computation of distributional similarity coefficients which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and identify pairs of parent-child words or hypernym-hyponym relations. A decision making process is then applied to combine the results of the previous steps, and finally connect lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and also using Spanish Wordnet as a gold-standard. We estimate an average of 89.07% precision and 25.49% recall considering only the results which the algorithm presents with high degree of certainty, or 77.86% precision and 33.72% recall considering all results.

Website: http://www.tecling.com/kind

Funding: This research is supported by a grant from the Chilean Government: Conicyt-Fondecyt 11140686, “Inducción automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar.

Related publications:

+ Nazar, R.; Soto, R.; Urrejola, K. (En prensa). Detección automática de nombres eventivos no deverbales en castellano: un enfoque cuantitativo basado en corpus. Revista Linguamatica. v. 9, n. 2.

+ Nazar, R.; Arriagada, P. (2017). POL: un nuevo sistema para la detección y clasificación de nombres propios. Procesamiento del Lenguaje Natural, n. 58, pp. 13-20.

+ Nazar, R., Renau, I. (2016). “A Quantitative analysis of the semantics of verb-argument structures”. In S. Torner and E. Bernal (eds.), Collocations and other lexical combinations in Spanish. Theoretical and Applied approaches. New York: Routledge, pp. 92-109.

+ Nazar, R. (2016). Distributional analysis applied to terminology extraction: example in the domain of psychiatry in Spanish. Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication, 22(2):142-170.

+ Nazar, R.; Renau, I. (2016). A taxonomy of Spanish nouns, a statistical algorithm to generate it and its implementation in open source code. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), May 2016.

+ Nazar, R.; Renau, I. (2016). Automatic extraction of lexico-semantic patterns from corpora. Proceedings of the XVII EURALEX International Congress: Lexicography and Linguistic Diversity. Tinatin Margalitadze and George Meladze (eds). Tbilisi, Gergia: Ivane Javakhishvili Tbilisi State University, pp. 823-830.

+ Nazar, R.; Renau, I. (2015). Agrupación semántica de sustantivos basada en similitud distribucional: implicaciones lexicográficas. In María Pilar Garcés Gómez (ed.): "Lingüística y diccionarios" (Anexos Revista de Lexicografía, vol. 2: 272-295). Universidade da Coruña.

+ Nazar, R.; Renau, I. (2015). Ontology Population Using Corpus Statistics. Proceedings of the Joint Ontology Workshops 2015 co-located with the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015). Buenos Aires, Argentina, July 25-27, 2015.

Related concepts: corpus statistics, distributional semantics, Spanish, taxonomy induction

      LogoAlt Contact