Automatic Document Classification
also: unsupervised document classification
CategoryOrganizingPrinciples
Motivation
The weakest point of document classification (for example to enable
FacetedNavigation, related content navigation or relevance measures) is that
it relies on the authors to classify their documents themselves, so far without any extra support. Doing so fully manually (supervised)
has these drawbacks:
- authors don't agree on ontologies
- it's a daunting task
- ontologies need maintenance themselves
- they tend to be baffling and sometimes counterintuitive or even artificial for several tasks
- authors are not interested in ontologies; they want to write
- the "surprise" factor is low using manual document classification; for the author himself, who bravely classifies his documents, the value in return is very low
Supervised document classification nevertheless has its place when you
must be certain about document relations, e.g. related products on a shopping site.
OK, so let's look at the research literature to see which methods are
available to (semi-)automate the task of content classification.
--
MichaelDaum - 03 Jul 2005
Techniques
Publications
Rafael A. Calvo, Jae-Moon Lee and Xiaobo Li (2004),
Journal of Digital Information,
pdf
Abstract:
News articles and Web directories represent some of the most popular and commonly accessed content on the Web. Information designers normally define categories that model these knowledge domains (i.e. news topics or Web categories) and domain experts assign documents to these categories. The paper describes how machine learning and automatic document classification techniques can be used for managing large numbers of news articles, or Web page descriptions, lightening the load on domain experts. The paper uses two datasets, one with more than 800,000 Reuters news stories and another with over 41,000 Web sites, and classifies them using a Naïve Bayes algorithm, into predefined categories. We discuss the different parameters and design decisions that normally appear when building automatic classifiers, including, stemming, stop-words, thresholding, amount of data and approaches for improving performance using the structure in XML documents. The methodology developed would enable Web based applications or workflow systems to manage information more efficiently, i.e. by assigning documents to topics automatically or assisting humans in the process of doing so.
see also:
http://www.steptwo.com.au/columntwo/archives/001306.html
(online forum)
keywords: naive bayesian
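To make the abstract above a bit more concrete, here is a minimal sketch of a Naive Bayes text classifier with stop-word removal and thresholding. It is not the paper's implementation: the stop-word list, the class name NaiveBayes, the 0.6 threshold and the add-one smoothing are all assumptions chosen for illustration, and stemming is left out to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop-words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # training documents per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # all words seen during training

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def posteriors(self, text):
        """Return {category: P(category | text)}; unseen words are ignored."""
        words = [w for w in tokenize(text) if w in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            log_scores[category] = score
        # Normalize in log space so long documents do not underflow.
        top = max(log_scores.values())
        scaled = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(scaled.values())
        return {c: v / norm for c, v in scaled.items()}

    def classify(self, text, threshold=0.6):
        """Return the most probable category, or None if it is below the threshold."""
        probs = self.posteriors(text)
        best = max(probs, key=probs.get)
        return best if probs[best] >= threshold else None


nb = NaiveBayes()
nb.train("stocks fall as markets react to interest rates", "finance")
nb.train("central bank raises interest rates", "finance")
nb.train("team wins the championship match", "sports")
nb.train("striker scores in final match", "sports")

print(nb.classify("bank interest rates"))
print(nb.classify("championship match"))
print(nb.classify("xyzzy"))
```

The threshold is what lets a human stay in the loop: a document whose best category stays below it comes back as None and can be handed to a person instead of being filed automatically.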
Second Edition,
keith@dcsPLEASENOSPAM.gla.ac.uk
Conferences & Journals
Research Groups
Implementations
Online Demos
Discussion
Harrr, citeseer is currently down. Please follow Melluci's homepage, there are tons of publications available online.
--
MichaelDaum - 03 Jul 2005
Interesting overview.
Glancing over the article I see that automatic classification is gradually getting better. Well, the last time I did some serious research on this was in 1998. The progress is not revolutionary, but it is encouraging.
While describing the content is the most important and difficult task, classification is more than describing subject matter. These other aspects can largely be automated too, but they need a different approach.
FacetedNavigation uses all kinds of attributes of a document. Attributes may be the authors, the first publication date, the latest modification date, the length of the document, and the kind of document. The latter is what we now use webs for: a support topic, a dev topic, a doc topic. Automatically assigning a 'kind of topic' is difficult if the topic contains little text.
So yes, we need automatic classification for subject matter. But still we need other tools to categorize topics in other ways.
--
ArthurClemens - 03 Jul 2005
Good link to AI::Categorizer documentation: http://search.cpan.org/~kwilliams/AI-Categorizer-0.07/lib/AI/Categorizer.pm
--
PeterNixon - 03 Jul 2005
http://en.wikipedia.org/wiki/Document_classification
--
MichaelDaum - 03 Jul 2005
This work will become part of the upcoming ClassificationPlugin.
--
MichaelDaum - 25 Aug 2008