The corpus-based taxonomy induction project
UPDATE on the Taxonomy Project, 30 September 2018: a new version is now available.
The new version, which already works in French, English and Spanish,
now also checks the results of each category automatically
and flags any errors it detects.
A further version able to process multi-word specialized terminology will follow soon.
The interface is still somewhat rough, but the idea is that you provide a noun (single nouns only, at the moment) and the program
tries to assign the best semantic categories to that noun. You can also provide a list of nouns (one per line),
and the program will treat each noun as an independent trial. At the moment it does not learn from its own output, i.e., the
taxonomy does not change. Eventually, however, it will begin to update itself, correcting wrong information.
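As a rough illustration of the input format described above, the following Python sketch reads a list with one noun per line and runs each entry as a separate trial. It is not the project's code: `parse_noun_list`, `run_trials` and the `categorize` callback are hypothetical names, and skipping multi-word lines is an assumption based on the current single-noun restriction.

```python
from typing import Callable, List, Tuple


def parse_noun_list(text: str) -> List[str]:
    """Return one noun per non-empty line of `text`.

    Assumption: lines with more than one word are skipped, since only
    single nouns are accepted at the moment.
    """
    nouns = []
    for line in text.splitlines():
        entry = line.strip()
        if entry and len(entry.split()) == 1:
            nouns.append(entry)
    return nouns


def run_trials(
    nouns: List[str],
    categorize: Callable[[str], List[str]],
) -> List[Tuple[str, List[str]]]:
    """Categorize each noun independently.

    `categorize` is a hypothetical stand-in for the real program. No state
    is shared between calls, so one trial cannot influence another and the
    taxonomy itself is never modified.
    """
    return [(noun, categorize(noun)) for noun in nouns]


if __name__ == "__main__":
    sample = "dog\nguitar\n\nhammer\n"
    # Placeholder categorizer, used only to show the call pattern.
    for noun, categories in run_trials(parse_noun_list(sample), lambda n: ["?"]):
        print(noun, categories)
```

The result is a list of pairs rather than a dictionary, so a noun that appears twice in the input is simply run twice.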
We designed a statistically based
taxonomy induction algorithm that combines several strategies, none of which involves explicit linguistic knowledge. Although all of them are
quantitative, the strategies differ in nature. Some are based on the computation of distributional
similarity coefficients, which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and
identify pairs of parent-child words or hypernym-hyponym relations. A decision-making process is then applied to combine the results
of the previous steps and finally connect lexical units to a basic structure containing the most general categories of the language.
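The two kinds of evidence described above can be illustrated with a minimal Python sketch on toy data. It is not the project's implementation. The corpus, the cosine measure, the difference of conditional probabilities and both thresholds are assumptions chosen for illustration, and the final step of attaching words to the top-level categories is left out.

```python
from collections import Counter
from math import sqrt

# Toy corpus: each document is reduced to a set of lemmas (illustrative data).
DOCS = [
    {"dog", "animal", "bark", "pet"},
    {"cat", "animal", "meow", "pet"},
    {"dog", "animal", "bone"},
    {"cat", "animal", "mouse"},
    {"animal", "zoo", "lion"},
    {"animal", "farm", "cow"},
    {"dog", "pet", "bark"},
    {"cat", "pet", "meow"},
]


def context_vector(word, docs):
    """Count the words that share a document with `word`."""
    vec = Counter()
    for doc in docs:
        if word in doc:
            vec.update(w for w in doc if w != word)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cond_prob(a, b, docs):
    """P(b | a): share of the documents containing `a` that also contain `b`."""
    with_a = [d for d in docs if a in d]
    if not with_a:
        return 0.0
    return sum(1 for d in with_a if b in d) / len(with_a)


def asymmetry(a, b, docs):
    """Positive when `a` predicts `b` better than `b` predicts `a`."""
    return cond_prob(a, b, docs) - cond_prob(b, a, docs)


def decide(a, b, docs, sim_min=0.5, asym_min=0.2):
    """Combine both signals into one label (thresholds are arbitrary)."""
    diff = asymmetry(a, b, docs)
    if diff >= asym_min:
        return ("parent-child", b, a)  # b behaves like a hypernym of a
    if -diff >= asym_min:
        return ("parent-child", a, b)  # a behaves like a hypernym of b
    if cosine(context_vector(a, docs), context_vector(b, docs)) >= sim_min:
        return ("siblings", a, b)      # similar contexts: co-hyponym candidates
    return ("unrelated", a, b)


if __name__ == "__main__":
    for pair in [("dog", "animal"), ("dog", "cat"), ("dog", "zoo")]:
        print(pair, decide(*pair, DOCS))
```

In this toy corpus "dog" and "cat" never occur together, yet they share contexts such as "animal" and "pet", so only the similarity measure links them. "Animal" appears in most documents that mention "dog" but not the other way round, which is the asymmetry used here as a hint of a parent-child relation.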
We are currently in the process of evaluating the latest results.
SOURCE CODE & DOCUMENTATION: We are working at full speed to finish the new version of the source code and documentation of the Taxonomy Project. Hang on. We will get there!
From 2014 to 2017, this research was supported by a grant from the Chilean
Government: Conicyt-Fondecyt 11140686, “Inducción
automática de taxonomías de sustantivos generales y especializados a partir de corpus textuales desde el enfoque de
la lingüística cuantitativa” (Automatic taxonomy induction from corpora for terminology and general vocabulary using quantitative measures). Lead researcher: Rogelio Nazar.
Since 2017, it has also received support from the Ecos-Sud Project C16H02, “Inducción automática de taxonomías del español y el francés mediante técnicas cuantitativas y estadística de corpus” (Automatic taxonomy induction from corpora for Spanish and French using quantitative corpus analysis).
Various researchers are involved in the project. The following are the main contributors:
- Rogelio Nazar
- Irene Renau
- Gabriela Ferraro
- Rafael Marín
- Antonio Balvet
- Rogelio Nazar (developer)
- Nicolás Acosta (developer)
- Sebastián Márques (designer)
Related concepts: corpus statistics, distributional semantics, Spanish, taxonomy induction