Web Data Commons: Extracting Structured Data from the Common Web Crawl
More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata, and RDFa. The Web Data Commons project extracts all Microformat, Microdata, and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus currently available to the public, and provides the extracted data for download in the form of RDF quads as well as CSV tables for common entity types (e.g. product, organization, location).
Web Data Commons thus enables you to use structured data originating from hundreds of millions of web pages within your applications without having to crawl the Web yourself.
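The extracted data is published as RDF quads, where the fourth element records the page a statement was extracted from. As a minimal sketch of how such a download could be consumed, the Python snippet below parses two illustrative quads with the rdflib library; the page URL, subject, and property values are made-up examples rather than actual Web Data Commons output.
<verbatim>
# Minimal sketch: reading RDF quads (N-Quads) with rdflib.
# The URLs and values below are illustrative, not real Web Data Commons data.
from rdflib import ConjunctiveGraph

nquads = """
<http://example.com/widget.html#product> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/widget.html> .
<http://example.com/widget.html#product> <http://schema.org/name> "Example Widget" <http://example.com/widget.html> .
"""

graph = ConjunctiveGraph()
graph.parse(data=nquads, format="nquads")

# Each named graph corresponds to the page the statements were extracted from.
for page in graph.contexts():
    print("Statements extracted from:", page.identifier)
    for subject, predicate, obj in page:
        print(" ", subject, predicate, obj)
</verbatim>
rdflib is used here only as one convenient way to parse N-Quads; any RDF toolkit that understands named graphs would work equally well.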
Pages are included in the Common Crawl corpora based on their PageRank score, which makes the crawls snapshots of the popular part of the current Web.
Web Data Commons is a joint effort of the Web-based Systems Group at Freie Universität Berlin and the Institute AIFB at the Karlsruhe Institute of Technology.
Related links:
--
Contributors: PeterThoeny - 2012-04-11
Discussion
TWiki's data should not be siloed.
RDF and Web Data Commons are ways to expose and consume content.
--
PeterThoeny - 2012-04-11