Web Data Commons: Extracting Structured Data from the Common Web Crawl
More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata, and RDFa. The Web Data Commons project extracts all Microformat, Microdata, and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus currently available to the public, and provides the extracted data for download in the form of RDF quads as well as CSV tables for common entity types (e.g. product, organization, location).
Web Data Commons thus enables you to use structured data originating from hundreds of millions of web pages within your applications without having to crawl the Web yourself.
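The extracted data is published as RDF quads, where the fourth element records the page a statement was extracted from. As a minimal sketch of how such a download could be consumed, the Python snippet below parses two illustrative quads with the rdflib library; the page URL, subject, and property values are made-up examples rather than actual Web Data Commons output.
<verbatim>
# Minimal sketch: reading RDF quads (N-Quads) with rdflib.
# The URLs and values below are illustrative, not real Web Data Commons data.
from rdflib import ConjunctiveGraph

nquads = """
<http://example.com/widget.html#product> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/widget.html> .
<http://example.com/widget.html#product> <http://schema.org/name> "Example Widget" <http://example.com/widget.html> .
"""

graph = ConjunctiveGraph()
graph.parse(data=nquads, format="nquads")

# Each named graph corresponds to the page the statements were extracted from.
for page in graph.contexts():
    print("Statements extracted from:", page.identifier)
    for subject, predicate, obj in page:
        print(" ", subject, predicate, obj)
</verbatim>
rdflib is used here only as one convenient way to parse N-Quads; any RDF toolkit that understands named graphs would work equally well.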
Pages are included in the Common Crawl corpora based on their PageRank score, which makes the crawls snapshots of the popular part of the current Web.
Web Data Commons is a joint effort of the Web-based Systems Group at Freie Universität Berlin and the Institute AIFB at the Karlsruhe Institute of Technology.
Related links:
--
Contributors: PeterThoeny - 2012-04-11
Discussion
TWiki's data should not be siloed.
RDF and Web Data Commons are ways to expose and consume content.
--
PeterThoeny - 2012-04-11