W3C

GRDDL Primer

Editor's Draft 27 September 2006

This version:
$Id: primer.html 6180 2006-09-27 13:29:57Z id $
(see change log below)
Latest version:
http://research.talis.com/2006/grddl-wg/primer
Editor:
Ian Davis, Talis
Authors:
Brian Suda
Chimezie Ogbuji, Cleveland Clinic Foundation
Fabien Gandon, INRIA

Abstract

This document serves as an introduction to GRDDL (Gleaning Resource Descriptions from Dialects of Languages), a mechanism for obtaining RDF data from XML documents and in particular XHTML pages using explicitly associated transformation algorithms, typically represented in XSLT. It uses a number of examples from the GRDDL Use Cases document to illustrate the techniques GRDDL provides for associating documents with appropriate instructions for extracting any embedded data.

Status of this Document

This document has been developed by the GRDDL Working Group as part of the W3C Semantic Web Activity (Activity Statement, Group Charter).

Publication as a draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Please send review comments and feedback to public-grddl-wg@w3.org, the mailing list of the GRDDL Working Group; the mailing list has a public archive.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Introduction

XML and RDF technologies address separate and often orthogonal problem spaces: message and structured document formats, metadata, and knowledge representation. Publishers of distributed web content meant for both human and machine consumption have much to gain from standards that enable both technologies.

GRDDL provides a relatively inexpensive set of mechanisms for bootstrapping RDF content from uniform XML dialects, shifting the burden of formulating RDF to transformation algorithms written specifically for these dialects. XML transformation languages such as XSLT are quite versatile in their ability to process, manipulate, and generate XML. The use of XSLT to generate XHTML from single-purpose XML vocabularies is historically celebrated as a powerful idiom for separating structured content from presentation.

GRDDL shifts this idiom to a different end: separating structured content from its authoritative meaning (or semantics). The way in which GRDDL empowers authors of web content can be considered somewhat analogous to allowing a non-native speaker to learn the spoken form of a new language first, before attempting to master its written form - rather than trying to learn both simultaneously.

GRDDL works through associating transformations with an individual document either through direct inclusion of references or indirectly through profile documents. Content authors can nominate the transformations for producing RDF from their content and use GRDDL to refer to them. For XML formats the transformations are commonly expressed using XSLT 1.0, although other methods are permissible. Generally, if the transformation can be fully expressed in XSLT 1.0 then it is preferable to use that format since all GRDDL processors should be capable of interpreting an XSLT 1.0 document.

For a collection of scenarios that demonstrate how GRDDL enables common patterns in the management of distributed web data, the reader should refer to the GRDDL Use Cases.

In this document the term HTML is used to refer to the XHTML dialect of HTML.

Scheduling Example

To introduce GRDDL concepts, the following section explores how GRDDL can be used to satisfy the scheduling use case. In this use case Jane, a frequent traveller, is trying to schedule a meeting with three of her friends.

Linking to a GRDDL Transform

GRDDL provides a number of ways for GRDDL transformations to be associated with content, each of which is appropriate in different situations. The simplest method for authors of HTML content is to embed a reference to the transformations using a link element in the head of the document.

Microformats are simple conventions for embedding semantic markup for a specific domain in human-readable documents. In our example one of Jane's friends has marked up their schedule using the hCalendar microformat. The hCalendar microformat uses HTML class attributes to associate event related semantics with elements in the markup:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Robin's Schedule</title>
  </head>
  <body>
    <ol class="schedule">
      <li>2006
        <ol>
          <li class="vevent">
            <strong class="summary">Fashion Expo</strong> in 
            <span class="location">Paris, France</span>:
            <abbr class="dtstart" title="2006-10-20">Oct 20</abbr> to 
            <abbr class="dtend" title="2006-10-22">22</abbr>
           </li>
        
          <li class="vevent">
            <strong class="summary">New line review</strong> in 
            <span class="location">Köln, Germany</span>:
            <abbr class="dtstart" title="2006-10-26">Oct 26</abbr> to 
            <abbr class="dtend" title="2006-10-27">27</abbr>
           </li>
    
          <li class="vevent">
            <strong class="summary">Clothing 2006</strong> in 
            <span class="location">Rome, Italy</span>:
            <abbr class="dtstart" title="2006-12-01">Dec 1</abbr> to 
            <abbr class="dtend" title="2006-12-05">5</abbr>
           </li>
        </ol>
      </li>
      <li>2007
        <ol>
          <li class="vevent">
            <strong class="summary">Diva Awards</strong> in 
            <span class="location">Los Angeles, USA</span>:
            <abbr class="dtstart" title="2007-01-06">Jan 6</abbr> to 
            <abbr class="dtend" title="2007-01-08">8</abbr>
           </li>
        
          <li class="vevent">
            <strong class="summary">Board Review</strong> in 
            <span class="location">New York, USA</span>:
            <abbr class="dtstart" title="2007-02-23">Feb 23</abbr> to 
            <abbr class="dtend" title="2007-02-24">24</abbr>
           </li>
    
        </ol>
      </li>
    </ol>
  </body>
</html>

To explicitly relate the data in this document to the RDF data model, the author, Robin, needs to make two changes. First, she needs to add a profile attribute to the head element to denote that her document contains GRDDL metadata. In HTML, profiles are used to link documents to descriptions of the metadata schemes they employ. The profile URI for GRDDL is http://www.w3.org/2003/g/data-view, and by including this URI in her document Robin is declaring that the metadata in her markup can be interpreted using GRDDL.

The resulting HTML might look like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Robin's Schedule</title>
  </head>
  <body>
  ...

Then she needs to add a link element containing the reference to the specific instructions for converting HTML containing hCalendar patterns into RDF. She can either write her own instructions or re-use an existing set. The link element contains the token transformation in the rel attribute and the URI of the instructions for extracting RDF in the href attribute:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Robin's Schedule</title>
    <link rel="transformation" href="http://www.w3.org/2002/12/cal/glean-hcal"/>
  </head>
  <body>
  ...

The profile URI in the resulting document signals that the receiver of the document may look for link elements with a rel attribute containing the token transformation and use any or all of those links to determine how to extract the data as RDF.

[Figure: the sequence of steps for obtaining RDF from a document using an explicit link to the transformation, as described above.]
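
The lookup that a GRDDL-aware client performs on Robin's page can be sketched in Python. This is only an illustration under simplifying assumptions: the function name find_transformations is hypothetical, the DOCTYPE is left out of the sample so the standard-library parser has nothing to resolve, and only link elements in the head are inspected, as in the example above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def find_transformations(xhtml_text):
    """Return the href of each head link whose rel includes 'transformation'.

    An empty list is returned when the head does not name the GRDDL profile.
    (Hypothetical helper, not part of any GRDDL library.)
    """
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute holds a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        # rel is a space-separated token list, so test membership.
        if "transformation" in link.get("rel", "").split() and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


ROBIN_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Robin's Schedule</title>
    <link rel="transformation" href="http://www.w3.org/2002/12/cal/glean-hcal"/>
  </head>
  <body/>
</html>"""

print(find_transformations(ROBIN_PAGE))
# prints ['http://www.w3.org/2002/12/cal/glean-hcal']
```

A real client would then fetch each returned URI and apply the transformation to the page; that step is outside this sketch.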

Referencing Via Profile Documents

Another way to associate GRDDL instructions with a document is by referencing those transformations from a profile document referenced in the head of the HTML. This method can be more convenient for the content author but requires that the profile document contain GRDDL metadata and be accessible to the GRDDL client.

In our example another of Jane's friends, David, has chosen to mark up his schedule using Embedded RDF:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head profile="http://purl.org/NET/erdf/profile">
    <title>Where Am I</title>
    <link rel="schema.cal" href="http://www.w3.org/2002/12/cal#" />
  </head>
  <body>
    <p class="-cal-Vevent" id="tiddlywinks">
      From <span class="cal-dtstart" title="2006-10-07">7 October, 2006</span>
      to <span class="cal-dtend"  title="2006-10-13">12 October, 2006</span> 
      I will be attending the <span class="cal-summary">National Tiddlywinks
      Championship</span> in 
      <span class="cal-location">Bognor Regis, England</span>
    </p>
    
    <p class="-cal-Vevent" id="holiday">
      Then I'm <span class="cal-summary">on holiday</span> in the 
      <span class="cal-location">Cayman Islands</span> between
      <span class="cal-dtstart" title="2006-11-14">14 November, 2006</span>
      and <span class="cal-dtend"  title="2007-01-02">1 January, 2007</span> 
    </p>

    <p class="-cal-Vevent" id="award">
      I'm back in the US on <span class="cal-dtstart" title="2007-01-08">the 8th
      January</span> to <span class="cal-summary">pick up a lifetime
      achievement award from the world gamers association</span>. This time
      the ceremony is in <span class="cal-location">Los Angeles</span>. I'll be
      flying home on the <span class="cal-dtend"  title="2007-01-11">10th</span> 
    </p>
  </body>
</html>

Note that in this document the profile attribute does not contain a reference to the GRDDL profile. Instead it references the standard profile URI for Embedded RDF which does contain the GRDDL metadata. Anyone wishing to get the RDF data out of David's page can fetch the Embedded RDF profile URI to obtain the following profile document:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Embedded RDF HTML Profile</title>
    <link rel="transformation" href="http://www.w3.org/2003/g/glean-profile" />
  </head>
  <body>
    <p>
      <a rel="profileTransformation" 
          href="http://purl.org/NET/erdf/extract-rdf">GRDDL transform</a>
    </p>
  </body>
</html>

This document contains a reference to the GRDDL profile which again indicates that it may contain link elements with references to GRDDL instructions that can be applied. Note that these instructions are applied to this profile document, not David's document. Because the client is inspecting a profile document it expects that the instructions identified by http://www.w3.org/2003/g/glean-profile are for producing a list of URIs identifying instructions to be applied to David's HTML document. Those instructions are identified in the profile document using links with a rel attribute of profileTransformation.

In this case the profile transformation refers to a stylesheet that can convert HTML containing Embedded RDF into RDF/XML. This stylesheet can be applied to David's document to obtain the equivalent RDF triples.

[Figure: the sequence of steps for obtaining RDF from a document using the profile URI, as described above.]
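
The second step of this indirect route, reading the profile document to learn which instructions apply to David's page, can be approximated in Python. This sketch does not run the glean-profile transformation itself; it assumes that the outcome of that transformation is simply the list of profileTransformation links, and the function name profile_transformations is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

PROFILE_DOC = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Embedded RDF HTML Profile</title>
    <link rel="transformation" href="http://www.w3.org/2003/g/glean-profile" />
  </head>
  <body>
    <p>
      <a rel="profileTransformation"
          href="http://purl.org/NET/erdf/extract-rdf">GRDDL transform</a>
    </p>
  </body>
</html>"""


def profile_transformations(profile_text):
    """List the URIs named by rel="profileTransformation" in a profile document.

    These URIs identify instructions for documents that reference the
    profile, not for the profile document itself.
    """
    wanted_tags = {f"{{{XHTML_NS}}}a", f"{{{XHTML_NS}}}link"}
    found = []
    for element in ET.fromstring(profile_text).iter():
        if element.tag not in wanted_tags:
            continue
        rel_tokens = element.get("rel", "").split()
        if "profileTransformation" in rel_tokens and element.get("href"):
            found.append(element.get("href"))
    return found


print(profile_transformations(PROFILE_DOC))
# prints ['http://purl.org/NET/erdf/extract-rdf']
```

Note that the link with rel="transformation" is deliberately not returned: it applies to the profile document, while the returned URI applies to David's document.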

Buying a Guitar Example

In this section the guitar review use case is used to explain more fully the role of GRDDL in aggregating data from a variety of different sources.

Stephan wishes to buy a guitar, so decides to check reviews. There are various special interest publications online which feature musical instrument reviews and there are also blogs which contain reviews by individuals. Among the reviewers there may be friends of Stephan and people whose opinion Stephan values (e.g. well-known musicians and people whose reviews Stephan has found useful in the past). There may also be reviews planted by instrument manufacturers which offer very biased views.

First, Stephan needs to get a list of people he considers trusted sources into some sort of machine-readable document. FOAF and vCard/RDF are both suitable sources from which to extract the data. The question is how to obtain it. Microformats define simple conventions whose HTML markup can easily be converted to RDF through the use of GRDDL. To extract vCard/RDF from HTML, he uses an XSLT stylesheet to transform the hCard-encoded HTML document.

<address class="vcard" id="smith-stephan">
<a href="http://example.org/ssmith" class="fn url">Stephan Smith</a>
</address>

Applying the XSLT stylesheet to this snippet of HTML produces the following RDF:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF 
  xmlns:rdf  ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#"
>	
 <rdf:Description rdf:about="http://example.org/ssmith">
  <vCard:FN>Stephan Smith</vCard:FN>
  <vCard:URL>http://example.org/ssmith</vCard:URL>

 </rdf:Description>
</rdf:RDF>
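
To make the mapping from hCard class names to vCard properties concrete, the following Python sketch pulls the same two values out of the snippet. It is not the XSLT stylesheet referred to above and covers only the fn and url classes; the function name glean_hcards is hypothetical.

```python
import xml.etree.ElementTree as ET

HCARD = """<address class="vcard" id="smith-stephan">
<a href="http://example.org/ssmith" class="fn url">Stephan Smith</a>
</address>"""


def glean_hcards(markup):
    """Return one dict per element carrying the 'vcard' class.

    Only the 'fn' and 'url' class names are handled in this sketch.
    """
    cards = []
    for card in ET.fromstring(markup).iter():
        if "vcard" not in card.get("class", "").split():
            continue
        entry = {}
        for element in card.iter():
            classes = element.get("class", "").split()
            if "fn" in classes:
                # The formatted name is the element's text content.
                entry["FN"] = "".join(element.itertext()).strip()
            if "url" in classes and element.get("href"):
                entry["URL"] = element.get("href")
        cards.append(entry)
    return cards


print(glean_hcards(HCARD))
# prints [{'FN': 'Stephan Smith', 'URL': 'http://example.org/ssmith'}]
```

The two keys correspond to the vCard:FN and vCard:URL properties in the RDF above.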

Another microformat that allows more information to be gleaned from the document is XFN, the XHTML Friends Network. Using values in the rel attributes of links, it is possible to assert the types of relationships between the site owner and their friends, colleagues, co-workers, etc. Since XFN values are found on 'a' elements, each link gives us another resource to follow and search for more hCards and more XFN values. This allows us to extend the circle of trust from our direct friends to the friends of our friends.

<ul>
  <li><a href="http://peter.example.org/" rel="met friend colleague">Peter Smith</a></li>
  <li><a href="http://john.example.org/" rel="met">John Doe</a></li>
  <li><a href="http://paul.example.org/" rel="met">Paul Revere</a></li>
</ul>

Given a seed URI with XFN data, a GRDDL transformation can extract FOAF data about all of these people. That FOAF file will then give us an additional list of URIs that can be spidered for additional GRDDL vCard-RDF data about each friend.
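
The first half of that process, reading the rel values and choosing which URIs to follow next, might look like this in Python. The helper names xfn_relationships and next_uris are hypothetical, and the choice to follow only links marked friend is an assumption made for the sake of the example.

```python
import xml.etree.ElementTree as ET

XFN_LIST = """<ul>
  <li><a href="http://peter.example.org/" rel="met friend colleague">Peter Smith</a></li>
  <li><a href="http://john.example.org/" rel="met">John Doe</a></li>
  <li><a href="http://paul.example.org/" rel="met">Paul Revere</a></li>
</ul>"""


def xfn_relationships(markup):
    """Map each linked URI to the set of rel tokens on its 'a' element."""
    relationships = {}
    for anchor in ET.fromstring(markup).iter("a"):
        href = anchor.get("href")
        tokens = anchor.get("rel", "").split()
        if href and tokens:
            relationships.setdefault(href, set()).update(tokens)
    return relationships


def next_uris(relationships, wanted="friend"):
    """Return, in sorted order, the URIs whose rel tokens include `wanted`."""
    return sorted(uri for uri, rels in relationships.items() if wanted in rels)


relationships = xfn_relationships(XFN_LIST)
print(next_uris(relationships))
# prints ['http://peter.example.org/']
```

Passing wanted="met" instead would return all three URIs, which shows how the choice of relationship widens or narrows the circle of trust.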

Another value in XFN is 'me', which is used for identity consolidation. With this value it is possible to say that the data on another site is also about me and should be considered as if it were from my own site. This extends our ability to use different resources. For instance:

<ul>
  <li><a href="http://del.icio.us/guitar-rocker45" rel="me">My Del.icio.us Link</a></li>
  <li><a href="http://claimid.com/guitar-rocker" rel="me">Me on ClaimID</a></li>
  <li><a href="http://guitar-rocker.com" rel="me">I love guitars</a></li>
</ul>

The power of rel="me" and identity consolidation is that it allows us to glean data from multiple sources and merge it all into a single RDF document about a single individual. For example, the del.icio.us link could be encoded into RDF and associated with a user "guitar-rocker45". Because of the rel="me" link and any reciprocal link back to example.org, assertions can be made that the bookmarks have an owner "Stephan Smith" who has an RDF vCard at "example.org" and has data in other places on other services such as claimid.com and guitar-rocker.com. All of these can be merged to form a bigger picture of "Stephan Smith" at "http://example.org/stephan".
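
One way to picture identity consolidation is as grouping URIs that point at each other with rel="me". The Python sketch below assumes that a link counts only when it is reciprocated; the link data, including the unreciprocated impostor.example.net entry, is invented for illustration, and the function name consolidate is hypothetical.

```python
def consolidate(me_links):
    """Group URIs joined by reciprocal rel="me" links.

    me_links is a set of (source, target) pairs. A pair merges two URIs
    only when the reverse pair is present as well.
    """
    parent = {}

    def find(uri):
        # Union-find lookup with path halving.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for source, target in me_links:
        find(source)
        find(target)
        if (target, source) in me_links:
            parent[find(source)] = find(target)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical link data: the first four pairs are reciprocal, the last is not.
ME_LINKS = {
    ("http://example.org/ssmith", "http://del.icio.us/guitar-rocker45"),
    ("http://del.icio.us/guitar-rocker45", "http://example.org/ssmith"),
    ("http://example.org/ssmith", "http://claimid.com/guitar-rocker"),
    ("http://claimid.com/guitar-rocker", "http://example.org/ssmith"),
    ("http://impostor.example.net/", "http://example.org/ssmith"),
}

for identity in consolidate(ME_LINKS):
    print(identity)
```

Requiring reciprocity is a design choice: it prevents a third party from attaching itself to someone's identity merely by publishing a rel="me" link.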

On the Guitar site, there are product reviews for each guitar. The guitars are also marked up with microformats so it is possible to extract machine-readable data about each item. Along with manufacturer data, each member of the site can also leave feedback about the item in the form of a review.

Stephan's friend Peter Smith writes several reviews of new guitars. Each review has a link to the reviewer, which in this case is a link back to Peter's profile page on the guitar site. Stephan knows that the profile page belongs to Peter by visual inspection, but a machine does not. Luckily, Peter's profile page on the guitar site allows him to link back to his own personal site, and this link has a rel="me" value. Now a machine can assert that the Peter on the guitar site is the same Peter that is listed in Stephan's XFN list, which was converted to FOAF, because the URIs resolve to the same resource.

With all of these tools it is possible to find Stephan's friends and to find additional resources that we know those friends created. Using GRDDL it is possible to glean information about the guitar in the form of product specifications supplied by the manufacturer and reviews from site members. Once we have this data as RDF, it can be passed into a SPARQL engine and queries can be run on it.

If Stephan were looking for a guitar with a specific review rating or higher, from a selected group of friends, we now have enough data in RDF to do just that.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rev: <http://purl.org/stuff/rev#>

SELECT DISTINCT ?name ?rating

FROM <http://example.org/guitar/1234/>

WHERE {
  ?x rev:reviewer ?reviewer ;
     rev:rating ?rating . 
  FILTER (?rating > "2") .
  ?reviewer foaf:name ?name .
}

The first restriction on the data can be a check on review data such as the rating. Once we have all the matching reviews, we can then restrict them to those written by Stephan's friends. Using a seed list of XFN URIs given by Stephan and converted to FOAF, we can match those URIs against the URIs in the FOAF generated from the guitar reviews. This gives us a list of the guitar site's members that Stephan trusts, and the reviewers' URIs can then be matched to the URIs in the XFN list.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rev: <http://purl.org/stuff/rev#>
PREFIX xfn: <http://gmpg.org/xfn/11#>

SELECT DISTINCT ?name ?rating ?xfnhomepage ?foafhomepage

FROM <http://example.org/guitar/1234/>
FROM <http://stephans-homepage.org/blogroll/xfn/>

WHERE {
  ?x rev:reviewer ?reviewer ;
     rev:rating ?rating .
  FILTER (?rating > "2") . 

  ?reviewer foaf:name ?name ;
            foaf:homepage ?foafhomepage .

  ?y xfn:friend ?xperson .
  ?xperson foaf:homepage ?xfnhomepage . 
  FILTER (?xfnhomepage = ?foafhomepage) 
}

SPARQL results can be obtained as XML or JSON and can easily be consumed by another application, which can display the results on screen, email them to Stephan, or search the web for the best prices on the short list of guitars.
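
As an illustration of how little code a consuming application needs, the Python sketch below flattens a result set in the SPARQL JSON results layout (a head listing the variables and a results.bindings array). The sample values are invented and do not come from running the queries above; the function name rows is hypothetical.

```python
import json

# Hypothetical result set for the ?name and ?rating variables.
SAMPLE_RESULTS = """{
  "head": {"vars": ["name", "rating"]},
  "results": {"bindings": [
    {"name": {"type": "literal", "value": "Peter Smith"},
     "rating": {"type": "literal", "value": "4"}}
  ]}
}"""


def rows(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts.

    Variables left unbound in a solution are simply omitted from its dict.
    """
    document = json.loads(results_json)
    variables = document["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in document["results"]["bindings"]
    ]


for row in rows(SAMPLE_RESULTS):
    print(row["name"], row["rating"])
# prints: Peter Smith 4
```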

References

[GRDDL Draft]
Gleaning Resource Descriptions from Dialects of Languages (GRDDL), Dominique Hazaël-Massieux, Dan Connolly, Authors' draft, 2006/03/09 15:45:31, http://www.w3.org/2004/01/rdxh/spec. Latest version available at http://www.w3.org/TR/grddl/.
[Microformats]
Microformats.org, 2006/08/30 11:05:31, http://microformats.org/ .
[RDF]
Resource Description Framework (RDF) Model and Syntax Specification, Ora Lassila, Ralph R. Swick, Editors. World Wide Web Consortium Recommendation, 1999,
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/.
Latest version available at http://www.w3.org/TR/REC-rdf-syntax/.
RDF Vocabulary Description Language 1.0: RDF Schema, Dan Brickley and R.V. Guha, Editors. W3C Recommendation, 10 February 2004,
http://www.w3.org/TR/2004/REC-rdf-schema-20040210/ .
Latest version available at http://www.w3.org/TR/rdf-schema/.
[SPARQL]
SPARQL Query Language for RDF, Eric Prud'hommeaux and Andy Seaborne, Editors. W3C Candidate Recommendation 6 April 2006,
http://www.w3.org/TR/2006/CR-rdf-sparql-query-20060406/ .
Latest version available at http://www.w3.org/TR/rdf-sparql-query/.

Acknowledgements

The editor would like to thank the following Working Group members for their contributions to this document: A, B ,C .

This document is a product of the GRDDL Working Group.

Change Log

$Log:  $
