Poppins is a very simple yet effective algorithm for document categorization.
Text categorization has become a very popular
topic in computational linguistics and has grown considerably in complexity, giving rise to a large
body of literature.
Document categorization can be used in many scenarios. For instance,
an experiment on authorship attribution can be seen as a text categorization problem.
That is to say, each author represents a category and the
documents are the elements to be classified.
The system can also be
used as a general-purpose document classifier, for example by content instead of authorship,
because it only reproduces the criterion that it learned during the training phase.
This program is language independent because it uses purely mathematical
knowledge: an n-gram model of texts. It works in a very simple way and is therefore easy to
modify. In spite of its simplicity, this program is capable of classifying documents by author
with more than 90% accuracy.
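To make the n-gram idea concrete, here is a minimal sketch of how such a classifier could work. It is not the actual Poppins code: the use of word bigrams, the overlap-count scoring, and the function names `ngrams`, `train`, and `classify` are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(text, n=2):
    """Count word n-grams in a text (assumption: lowercased, whitespace-split words)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def train(labeled_docs, n=2):
    """Build one n-gram profile per category from (category, text) pairs."""
    profiles = {}
    for category, text in labeled_docs:
        profiles.setdefault(category, Counter()).update(ngrams(text, n))
    return profiles


def classify(text, profiles, n=2):
    """Return the category whose profile shares the most distinct n-grams with the text.

    Hypothetical scoring rule: a plain overlap count, chosen only for simplicity.
    """
    doc = ngrams(text, n)
    return max(profiles, key=lambda c: sum(1 for g in doc if g in profiles[c]))
```

In this sketch, training amounts to calling `train` with a list of `(category, text)` pairs, where the category may be an author or a topic, and then passing the resulting profiles to `classify` together with an unseen document. Because nothing in it depends on a dictionary or a grammar, the same code applies to any language that separates words with whitespace.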
Web demo: http://poppinsweb.com/
Document related to this project:
- Nazar, R & Sánchez Pol, M. (2006). "An Extremely Simple Authorship Attribution System",
Proceedings of the Second European IAFL Conference on Forensic Linguistics / Language and the Law, Barcelona 2006.
Contact: rogelio.nazar at gmail.com