Tecling: Technologies for Linguistic Analysis
»The World is automatic

Poppins is a very simple yet effective algorithm for document categorization. Text categorization has become a very popular topic in computational linguistics; it has grown considerably in complexity and motivated a large body of literature. Document categorization can be used in many scenarios. For instance, an authorship attribution experiment can be treated as a text categorization problem: each author represents a category, and the documents are the elements to be classified. The system can also serve as a general-purpose document classifier, for example by content instead of authorship, because it only reproduces the criterion it learned during the training phase. The program is language independent because it relies on purely mathematical knowledge: an n-gram model of texts. It works in a very simple way and is therefore easy to modify. In spite of its simplicity, the program is capable of classifying documents by author with more than 90% accuracy.
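
To make the idea of an n-gram classifier concrete, here is a minimal Python sketch. It is not the Poppins implementation: the description above does not say whether the n-grams are over characters or words, nor which similarity measure is used. The sketch assumes character trigrams and cosine similarity, and the function names (ngram_profile, cosine, train, classify) and the toy training texts are made up for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams (assumption: characters, n=3)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_docs, n=3):
    """Training phase: build one aggregated n-gram profile per category."""
    profiles = {}
    for label, text in labelled_docs:
        profiles.setdefault(label, Counter()).update(ngram_profile(text, n))
    return profiles


def classify(text, profiles, n=3):
    """Assign the text to the category with the most similar profile."""
    target = ngram_profile(text, n)
    return max(profiles, key=lambda label: cosine(target, profiles[label]))


# Hypothetical toy data; real use would need much longer training texts.
training = [
    ("author_a", "the cat sat on the mat and the cat slept on the mat"),
    ("author_b", "stock markets rallied as investors bought shares in banks"),
]
profiles = train(training)
print(classify("the cat is on the mat", profiles))
```

Because the profiles are built only from the labels supplied at training time, the same code classifies by author, topic, or any other criterion, and nothing in it depends on a particular language.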

Web demo: http://poppinsweb.com/

Document related to this project:

  • Nazar, R. & Sánchez Pol, M. (2006). "An Extremely Simple Authorship Attribution System" (PDF).
    Proceedings of the Second European IAFL Conference on Forensic Linguistics / Language and the Law, Barcelona 2006.

    Contact: rogelio.nazar at gmail.com