Poppins is a very simple yet effective algorithm for document categorization.
Text categorization has become a very popular
issue in computational linguistics, and it has grown considerably in complexity, motivating a large
body of literature.
Document categorization can be used in many scenarios. For instance,
an experiment on authorship attribution can be seen as a text categorization problem.
That is to say, each author represents a category and the
documents are the elements to be classified.
This system can also be
used as a general-purpose document classifier, for example by content instead of authorship,
because it simply reproduces whatever criterion it learned during the training phase.
This program is language independent because it uses purely mathematical
knowledge: an n-gram model of texts. It works in a very simple way and is therefore easy to
modify. In spite of its simplicity, this program is capable of classifying documents by author
with more than 90% accuracy.
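
The general idea of classifying with an n-gram model can be illustrated with a
minimal sketch in Python. This is not the actual Poppins code: the use of
character trigrams, the scoring rule, the function names (ngrams, train,
classify) and the toy training texts are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(text, n=3):
    """Count the character n-grams of a text (lowercased, whitespace collapsed)."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def train(labelled_docs, n=3):
    """Build one n-gram profile per category from (category, text) pairs."""
    profiles = {}
    for category, text in labelled_docs:
        profiles.setdefault(category, Counter()).update(ngrams(text, n))
    return profiles


def classify(text, profiles, n=3):
    """Return the category whose profile best matches the n-grams of the text.

    Assumed scoring rule: each n-gram of the document contributes its
    relative frequency in the category profile, weighted by how often it
    occurs in the document.
    """
    doc = ngrams(text, n)

    def score(profile):
        total = sum(profile.values()) or 1  # avoid division by zero
        return sum(count * profile[gram] / total for gram, count in doc.items())

    return max(profiles, key=lambda category: score(profiles[category]))


# Hypothetical training set: one short text per author.
training = [
    ("austen", "it is a truth universally acknowledged that a single man "
               "in possession of a good fortune must be in want of a wife"),
    ("melville", "call me ishmael some years ago never mind how long "
                 "precisely having little or no money in my purse"),
]
profiles = train(training)
predicted = classify("a single man of good fortune", profiles)
```

Because the sketch only counts character sequences, nothing in it depends on
the language of the texts, and the same functions would classify by topic
instead of author if the training labels were topics.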