The corpus-based taxonomy induction project

UPDATE on the Taxonomy Project: 30 September, 2018: a new version is now available

http://www.tecling.com/cgi-bin/kind

The new version, which already works in French, English and Spanish, now also runs a continuous self-check on the results for each category and flags any errors it detects.

Soon we will also have another version able to process multi-word specialized terminology.

The interface is still somewhat rough, but the idea is that you provide a noun (single nouns only, at the moment) and the program tries to assign the best semantic categories to that noun. You can also provide a list of nouns (one per line), and the program will treat each noun as an independent trial. At the moment, it does not learn from what it does, i.e., the taxonomy does not change. Eventually, though, it will begin to modify itself, correcting wrong information.



Abstract

We designed a statistically based taxonomy induction algorithm that combines different strategies, none of which involves explicit linguistic knowledge. Although all the strategies we present are quantitative, they differ in nature. Some are based on the computation of distributional similarity coefficients, which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and identify pairs of parent-child words or hypernym-hyponym relations. A decision-making process is then applied to combine the results of the previous steps and, finally, to connect lexical units to a basic structure containing the most general categories of the language. We are currently in the process of evaluating the latest results.
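
To make the two kinds of strategy more concrete, here is a minimal sketch in Python. It is not the project's code. The abstract does not say which coefficients are used, so the cosine measure, the conditional-probability difference, the toy corpus and all function names below are assumptions chosen only for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: each inner list is one tokenized context (e.g. a sentence).
CONTEXTS = [
    ["dog", "animal", "bark", "tail"],
    ["cat", "animal", "purr", "tail"],
    ["dog", "animal", "tail", "fur"],
    ["cat", "animal", "fur", "tail"],
    ["animal", "plant", "life"],
    ["animal", "zoo"],
]


def context_vector(word, contexts):
    """Count the words that share a context with `word`."""
    vec = Counter()
    for ctx in contexts:
        if word in ctx:
            vec.update(w for w in ctx if w != word)
    return vec


def cosine(u, v):
    """Symmetric distributional similarity (assumed measure: cosine)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cond_prob(a, b, contexts):
    """P(a | b): share of the contexts containing b that also contain a."""
    with_b = [ctx for ctx in contexts if b in ctx]
    if not with_b:
        return 0.0
    return sum(1 for ctx in with_b if a in ctx) / len(with_b)


def asymmetry(child, parent, contexts):
    """Asymmetric co-occurrence score (assumed measure).

    Positive when `parent` shows up in the contexts of `child` more
    reliably than `child` shows up in the contexts of `parent`.
    """
    return cond_prob(parent, child, contexts) - cond_prob(child, parent, contexts)


# Sibling candidates: similar context vectors suggest co-hyponyms.
sibling_score = cosine(context_vector("dog", CONTEXTS), context_vector("cat", CONTEXTS))

# Parent-child candidates: a one-directional dependence suggests a hypernym.
parent_score = asymmetry("dog", "animal", CONTEXTS)
```

In this toy corpus, "dog" and "cat" share most of their contexts, so the symmetric score groups them as sibling candidates. "animal" appears in every context of "dog" but not the other way round, so the asymmetric score points from "dog" up to "animal". A real decision-making step would have to weigh many such signals against each other before attaching a word to the taxonomy.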

SOURCE CODE & DOCUMENTATION: We are working at full speed to finish the new version of the source code and documentation of the Taxonomy Project. Hang on. We will get there!

Funding:

From 2014 to 2017, this research was supported by a grant from the Chilean Government: Conicyt-Fondecyt 11140686, “Inducción automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar.
Since 2017, it has also received support from the Ecos-Sud Project C16H02 “Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus” (Automatic taxonomy induction from corpora for Spanish and French using quantitative corpus analysis).

Credits

Various researchers are involved in the project. The following are the main contributors:

  • Rogelio Nazar
  • Irene Renau
  • Gabriela Ferraro
  • Rafael Marín
  • Antonio Balvet

Developers:
  • Rogelio Nazar (developer)
  • Nicolás Acosta (developer)
  • Sebastián Márques (designer)

Related publications:



+ Nazar, R.; Soto, R.; Urrejola, K. (2017). Detección automática de nombres eventivos no deverbales en castellano: un enfoque cuantitativo basado en corpus. Revista Linguamatica, vol. 9, num. 2, pp. 21-31.

+ Nazar, R.; Arriagada, P. (2017). POL: un nuevo sistema para la detección y clasificación de nombres propios. Procesamiento del Lenguaje Natural, n. 58, pp. 13-20.

+ Nazar, R., Renau, I. (2016). “A Quantitative analysis of the semantics of verb-argument structures”. In S. Torner and E. Bernal (eds.), Collocations and other lexical combinations in Spanish. Theoretical and Applied approaches. New York: Routledge, pp. 92-109.

+ Nazar, R. (2016). Distributional analysis applied to terminology extraction: example in the domain of psychiatry in Spanish. Terminology: International Journal of Theoretical and Applied Issues in Specialized Communication, 22(2):142-170.

+ Nazar, R.; Renau, I. (2016). A taxonomy of Spanish nouns, a statistical algorithm to generate it and its implementation in open source code. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), May 2016.

+ Nazar, R.; Renau, I. (2016). Automatic extraction of lexico-semantic patterns from corpora. Proceedings of the XVII EURALEX International Congress: Lexicography and Linguistic Diversity. Tinatin Margalitadze and George Meladze (eds). Tbilisi, Georgia: Ivane Javakhishvili Tbilisi State University, pp. 823-830.

+ Nazar, R.; Renau, I. (2015). Agrupación semántica de sustantivos basada en similitud distribucional: implicaciones lexicográficas. In María Pilar Garcés Gómez (ed.): "Lingüística y diccionarios" (Anexos Revista de Lexicografía, vol. 2: 272-295). Universidade da Coruña.

+ Nazar, R.; Renau, I. (2015). Ontology Population Using Corpus Statistics. Proceedings of the Joint Ontology Workshops 2015 co-located with the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015). Buenos Aires, Argentina, July 25-27, 2015.

Related concepts:

corpus statistics, distributional semantics, Spanish, taxonomy induction