KIND (aka The Taxonomy Project)
We designed a statistical taxonomy induction algorithm that combines several strategies, none of which relies on explicit linguistic knowledge. Although all of the strategies are quantitative, they differ in nature. Some are based on distributional similarity coefficients, which identify pairs of sibling words (co-hyponyms), while others are based on asymmetric co-occurrence and identify pairs of parent-child words (hypernym-hyponym relations). A decision-making process then combines the results of the previous steps and finally connects lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and using the Spanish WordNet as a gold standard. We estimate an average of 89.07% precision and 25.49% recall when considering only the results that the algorithm presents with a high degree of certainty, or 77.86% precision and 33.72% recall when considering all results.
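The two families of strategies described above can be illustrated with a minimal sketch. It is not the KIND implementation: the toy corpus, the function names, the choice of the Jaccard coefficient for distributional similarity, and the difference of conditional frequencies for asymmetric co-occurrence are all assumptions made for illustration only.

```python
# Illustrative sketch only; the measures and data are assumptions,
# not the coefficients or corpus used by the KIND project.

# Hypothetical toy corpus: each context is the set of lemmas in one text window.
CONTEXTS = [
    {"tarantula", "spider", "venom", "leg"},
    {"tarantula", "spider", "web"},
    {"spider", "animal", "web", "leg"},
    {"scorpion", "venom", "leg", "animal"},
    {"spider", "animal"},
]


def profile(word, contexts):
    """Return the set of words that co-occur with `word` in any context."""
    neighbours = set()
    for context in contexts:
        if word in context:
            neighbours |= context - {word}
    return neighbours


def sibling_score(w1, w2, contexts):
    """Distributional similarity: Jaccard overlap of co-occurrence profiles.

    A high value suggests the two words appear in similar company,
    which is the kind of evidence used for co-hyponym candidates.
    """
    p1 = profile(w1, contexts) - {w2}
    p2 = profile(w2, contexts) - {w1}
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 0.0


def conditional(x, y, contexts):
    """Fraction of the contexts containing `x` that also contain `y`."""
    with_x = [context for context in contexts if x in context]
    if not with_x:
        return 0.0
    return sum(y in context for context in with_x) / len(with_x)


def parent_score(child, parent, contexts):
    """Asymmetric co-occurrence: positive when `child` tends to appear
    with `parent` more often than `parent` appears with `child`."""
    return conditional(child, parent, contexts) - conditional(parent, child, contexts)


if __name__ == "__main__":
    print("sibling(tarantula, scorpion):", sibling_score("tarantula", "scorpion", CONTEXTS))
    print("parent(tarantula -> spider):", parent_score("tarantula", "spider", CONTEXTS))
    print("parent(spider -> tarantula):", parent_score("spider", "tarantula", CONTEXTS))
```

In this sketch the sibling score is symmetric while the parent score changes sign when its arguments are swapped, which is why the two kinds of evidence are kept apart before a later decision step combines them.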
Website: http://www.tecling.com/kind
Funding:
This research is supported by a grant from the Chilean Government: Conicyt-Fondecyt 11140686, “Inducción automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar.
Related publications:
+ Nazar, R.; Balvet, A.; Ferraro, G.; Marín, R.; Renau, I. (2020). "Pruning and repopulating a lexical taxonomy: experiments in Spanish, English and French". Journal of Intelligent Systems, vol. 30 num. 1, pp. 376-394.
+ Nazar, R.; Obreque, J.; Renau, I. (2020). "Tarántula –> araña –> animal : asignación de hiperónimos de segundo nivel basada en métodos de similitud distribucional". Procesamiento del Lenguaje Natural, núm 64, pp. 29-36.
+ Nazar, R.; Soto, R.; Urrejola, K. (2017). Detección automática de nombres eventivos no deverbales en castellano: un enfoque cuantitativo basado en corpus. Revista Linguamatica, vol. 9, num. 2, pp. 21-31.
+ Nazar, R.; Arriagada, P. (2017). POL: un nuevo sistema para la detección y clasificación de nombres propios. Procesamiento del Lenguaje Natural, n. 58, pp. 13-20.
+ Nazar, R., Renau, I. (2016). “A Quantitative analysis of the semantics of verb-argument structures”. In S. Torner and E. Bernal (eds.), Collocations and other lexical combinations in Spanish. Theoretical and Applied approaches. New York: Routledge, pp. 92-109.
+ Nazar, R. (2016). Distributional analysis applied to terminology extraction: example in the domain of psychiatry in Spanish. Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication, 22(2):142-170.
+ Nazar, R.; Renau, I. (2016). A taxonomy of Spanish nouns, a statistical algorithm to generate it and its implementation in open source code. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), May 2016.
+ Nazar, R.; Renau, I. (2016). Automatic extraction of lexico-semantic patterns from corpora. Proceedings of the XVII EURALEX International Congress: Lexicography and Linguistic Diversity. Tinatin Margalitadze and George Meladze (eds). Tbilisi, Georgia: Ivane Javakhishvili Tbilisi State University, pp. 823-830.
+ Nazar, R.; Renau, I. (2015). Agrupación semántica de sustantivos basada en similitud distribucional: implicaciones lexicográficas. In María Pilar Garcés Gómez (ed.): "Lingüística y diccionarios" (Anexos Revista de Lexicografía, vol. 2: 272-295). Universidade da Coruña.
+ Nazar, R.; Renau, I. (2015). Ontology Population Using Corpus Statistics. Proceedings of the Joint Ontology Workshops 2015 co-located with the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015). Buenos Aires, Argentina, July 25-27, 2015.
Related concepts: corpus statistics, distributional semantics, Spanish, taxonomy induction