Web Data Commons: Extracting Structured Data from the Common Web Crawl

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata, and RDFa. The Web Data Commons project extracts all Microformat, Microdata, and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus currently available to the public, and provides the extracted data for download in the form of RDF quads and also as CSV tables for common entity types (e.g. product, organization, location, ...).
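As a rough illustration, the RDF-quad dumps can be loaded with a few lines of Python. This is a minimal sketch that assumes an N-Quads serialization and uses the third-party rdflib library; the local file name is a hypothetical placeholder for a real download.

from rdflib import ConjunctiveGraph

# Minimal sketch: load a (hypothetical) Web Data Commons N-Quads file.
g = ConjunctiveGraph()
g.parse("wdc-extraction-sample.nq", format="nquads")

# Each statement is a (subject, predicate, object, context) quad;
# the context names the page the triple was extracted from.
for subj, pred, obj, ctx in g.quads((None, None, None, None)):
    print(subj, pred, obj, ctx.identifier)
    break  # show just the first quad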

Web Data Commons thus enables you to use structured data originating from hundreds of millions of web pages within your applications without needing to crawl the Web yourself.
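For the CSV tables, Python's standard csv module is enough. In this sketch the file name is a hypothetical placeholder, so check the actual headers of the table you download.

import csv

# Minimal sketch: read a (hypothetical) per-entity-type CSV table.
with open("wdc-products.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)   # one extracted entity per row
        break        # show just the first record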

Pages in the Common Crawl corpora are selected based on their PageRank score, making each crawl a snapshot of the currently popular part of the Web.

Web Data Commons is a joint effort of the Web-based Systems Group at Freie Universität Berlin and the Institute AIFB at the Karlsruhe Institute of Technology.

Related links:

   * Web Data Commons: http://webdatacommons.org/
   * Common Crawl: http://commoncrawl.org/

-- Contributors: PeterThoeny - 2012-04-11

Discussion

TWiki's data should not be siloed. RDF and Web Data Commons are ways to expose and consume content.

-- PeterThoeny - 2012-04-11
