Automatic Document Classification

also: unsupervised document classification

Motivation
Techniques
Publications
- Managing Content with Automatic Document Classification
- Information Retrieval
Conferences & Journals
Research Groups
Implementations
Online Demos
Discussion

Motivation

The weakest point of document classification (for example to enable FacetedNavigation, related-content navigation or relevance measures) is that it relies on authors to classify their own documents, so far without any supporting tools. Doing so fully manually (supervised) has the following drawbacks:

authors don't agree on ontologies
it's a daunting task
ontologies need maintenance themselves
they tend to be baffling and sometimes counterintuitive or even artificial for several tasks
authors are not interested in ontologies; they want to write
the "surprise" factor is low using manual document classification; for the author himself, who bravely classifies his documents, the value in return is very low

Supervised document classification nevertheless has its place when you must be certain about document relations, e.g. related products on a shopping site.

OK, so let's look at the research literature to see which methods are available to (semi-)automate the task of content classification.

-- MichaelDaum - 03 Jul 2005

Techniques

Naive Bayes Classifier
Latent Semantic Analysis (LSA)
Probabilistic Latent Semantic Analysis (alternative to LSA): pdf
Support Vector Machines

Publications

Managing Content with Automatic Document Classification

Rafael A. Calvo, Jae-Moon Lee and Xiaobo Li (2004), Journal of Digital Information, pdf

Abstract: News articles and Web directories represent some of the most popular and commonly accessed content on the Web. Information designers normally define categories that model these knowledge domains (i.e. news topics or Web categories) and domain experts assign documents to these categories. The paper describes how machine learning and automatic document classification techniques can be used for managing large numbers of news articles, or Web page descriptions, lightening the load on domain experts. The paper uses two datasets, one with more than 800,000 Reuters news stories and another with over 41,000 Web sites, and classifies them using a Naïve Bayes algorithm, into predefined categories. We discuss the different parameters and design decisions that normally appear when building automatic classifiers, including, stemming, stop-words, thresholding, amount of data and approaches for improving performance using the structure in XML documents. The methodology developed would enable Web based applications or workflow systems to manage information more efficiently, i.e. by assigning documents to topics automatically or assisting humans in the process of doing so.

see also: http://www.steptwo.com.au/columntwo/archives/001306.html (online forum)

keywords: naive bayesian
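
To make the design decisions named in the abstract a bit more concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with stop-word removal and a confidence threshold. It is not the paper's implementation: the stop-word list, the class and method names, the add-one smoothing and the 0.6 threshold are all assumptions chosen for illustration, and stemming is left out entirely.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, tiny stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop-words (no stemming)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()              # number of training documents per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocab = set()                       # all words seen during training

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def scores(self, text):
        """Return the normalized posterior probability of each category."""
        # Words never seen in training carry no evidence, so they are skipped.
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for cat, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[cat].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = self.word_counts[cat][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[cat] = score
        # Normalize in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}

    def classify(self, text, threshold=0.6):
        """Return the best category, or None if it is not confident enough."""
        probs = self.scores(text)
        best = max(probs, key=probs.get)
        return best if probs[best] >= threshold else None


nb = NaiveBayes()
nb.train("stocks fell as markets closed lower", "finance")
nb.train("bank shares and bond markets rallied", "finance")
nb.train("the team won the match in extra time", "sports")
nb.train("striker scores twice as team wins cup", "sports")

print(nb.classify("markets and shares rallied"))  # finance
print(nb.classify("hello world"))                 # None
```

The threshold is what allows the "assisting humans" mode mentioned in the abstract: a document whose best category falls below it is returned as None and can be handed to a person instead of being filed automatically.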

Information Retrieval

Second Edition, keith@dcsPLEASENOSPAM.gla.ac.uk

Conferences & Journals

Journal of Digital Information: Publishing papers on the management, presentation and uses of information in digital environments
2nd International Semantic Web Conference (ISWC2003): online proceedings available
Semantic Web Technologies for Searching and Retrieving Scientific Data, Workshop at the ISWC2003.
UserSWeb 2005: Workshop on End User Aspects of the Semantic Web (2005)

Research Groups

Information Management Systems at the University of Padova (Massimo Melucci)
Information Retrieval Group at the University of Glasgow

Implementations

Automatic Document Classification With Perl, AI::Categorize by Ken Williams ken@mathforumPLEASENOSPAM.org, sources
TACHIR: a Tool for the Automatic Construction of Hypertexts for Information Retrieval, developed by Massimo Melucci as part of his Ph.D. thesis (1996), massimo.melucci@unipdPLEASENOSPAM.it

Online Demos

http://www.dcs.gla.ac.uk/~iain/keith/data/

Discussion

Harrr, CiteSeer is currently down. Please follow the link to Melucci's homepage; there are tons of publications available online.

-- MichaelDaum - 03 Jul 2005

Interesting overview.

Glancing over the article I see that automatic classification is gradually getting better. Well, the last time I did some serious research on this was in 1998. The progress is not revolutionary, but it is encouraging.

Describing the content is the most important and most difficult task, but classification is more than describing subject matter. These other aspects can largely be automated too, but they need a different approach.

FacetedNavigation uses all kinds of document attributes: the authors, the first publication date, the latest modification date, the length of the document, and the kind of document. The latter is what we now use webs for: a support topic, a dev topic, a doc topic. Automatically assigning a 'kind of topic' is difficult if the topic contains little text.

So yes, we need automatic classification for subject matter. But still we need other tools to categorize topics in other ways.

-- ArthurClemens - 03 Jul 2005

Good link to the AI::Categorizer documentation: http://search.cpan.org/~kwilliams/AI-Categorizer-0.07/lib/AI/Categorizer.pm

-- PeterNixon - 03 Jul 2005

http://en.wikipedia.org/wiki/Document_classification

-- MichaelDaum - 03 Jul 2005

This work will become part of the upcoming ClassificationPlugin.

-- MichaelDaum - 25 Aug 2008

BasicForm
TopicClassification	BrainstormingIdea
TopicSummary	Resources about Automatic Document Classification: essential readings and how to integrate into a wiki
InterestedParties
RelatedTopics	WhyWebsAreABadIdea, FacetedNavigation, MultiLevelWikiWebs

Topic revision: r8 - 2008-08-25 - MichaelDaum
