A how-to guide for creating a Linked Data site

“That sounds fantastic,” you say, “...but how do I create a Linked Data site?”

In this article, I will try to address this by walking you through the whole process: from zero to look at my linked data bling. Well, at least one way of going at it. I will conveniently dodge the questions of what constitutes a Linked Data site, which technologies are involved, and how things ought to be built. However, many would agree that TimBL's Design Issues on Linked Data is the authoritative outline.

For our purposes, we will assume that you have a basic understanding of the RDF data model, have seen a few triple statements in one of the serializations, know what SPARQL is for, can read some PHP and HTML code, are not too shy on the Linux command line (Ubuntu/Debian as far as these examples go), and have Apache and PHP ready to go in your environment.

Unfortunately, there are lots of steps at the moment, but I will do my best to make it painless. Bear with the Linked Data community as improvements are made on a daily basis. I’ll do my best to keep the tutorial steps as up to date as possible, but don’t be shocked if something is slightly off. Please let me know, and I’ll update here. With that out of the way, let’s dive in.

What’s on the menu?

  • Set up a SPARQL server to store and query our data, and import RDF triples into that store
  • Install a bunch of tools which will interact with the queried data
  • Create templates to output stuff from our RDF store

Setting up Fuseki

We will use Fuseki as our SPARQL server. If you wish to use a different server (see also SPARQL Query Engines), you can skip this part of the how-to. Before we get Fuseki, let’s make sure the essentials to run it are in place:

sudo apt-get install maven
sudo apt-get install ruby
sudo apt-get install subversion

And, now, Fuseki:

sudo svn co https://svn.apache.org/repos/asf/jena/trunk/jena-fuseki/ /usr/lib/fuseki
cd /usr/lib/fuseki
sudo mvn clean package

Note: If you get build errors of some sort (after all, this is the development version), you could either try to roll back to some version that builds without errors, or instead use the official Fuseki builds which are more stable.

Let’s configure the way we want to run Fuseki.

Although the following depends on your needs, it is worth pointing out an example custom configuration for your query results. For instance, if we want DESCRIBE queries to look for the resource as a subject (default) as well as the object in the triple, there is a class that does that for us. Keep in mind that this query is a bit more resource intensive, hence you should only enable the following if you are sure about it. In any case, it can be used by copying the required file over:

sudo cp src-dev/dev/BackwardForwardDescribeFactory.java src/main/java/org/openjena/fuseki/

Update the namespace to use org.openjena.fuseki instead of dev for package at /usr/lib/fuseki/src/main/java/org/openjena/fuseki/BackwardForwardDescribeFactory.java:

package org.openjena.fuseki;

We have one more change to make, and that's in /usr/lib/fuseki/tdb2.ttl, which configures the TDB settings for the Fuseki server. In here, we use the same namespace that we used earlier, i.e., org.openjena.fuseki for BackwardForwardDescribeFactory. Additionally, we can uncomment tdb:unionDefaultGraph true ; to treat all graphs in the dataset as one default graph. We can of course always refer to graphs individually.

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
[] ja:loadClass "org.openjena.fuseki.BackwardForwardDescribeFactory" .

tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "/usr/lib/fuseki/DB" ;
    tdb:unionDefaultGraph true .

A minor note here about the dataset name. It is currently set to dataset by default, but you can change this easily from here.

Let’s factor in our changes by rebuilding:

sudo mvn install

By default, Fuseki starts in read-only mode:

sudo ./fuseki-server --desc tdb2.ttl /dataset

This is good because if we make our SPARQL endpoint public, we have one less worry about access controls. Whenever we need to update (write), we can simply restart the server by adding in the --update option. By the way, the server runs on http://localhost:3030/ by default.

Some quick tests to make sure we got everything up and running okay. Let’s import some RDF triples into the store.

sudo ./s-put --verbose http://localhost:3030/dataset/data default books.ttl

That simply imports the triples in a Turtle file named books.ttl into the default graph of the dataset named dataset. Remember that every time we put triples into the store under the same graph name, the existing graph is deleted before the new one is inserted. If you wish to add triples to the existing graph, use the s-post command instead.
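As far as I understand, s-put and s-post are thin wrappers over the SPARQL Graph Store HTTP Protocol: replacing a graph is an HTTP PUT and appending to it is an HTTP POST. A rough curl equivalent might look like the sketch below. The function name load_graph and the CURL override variable are made up for illustration, and the graph IRI is assumed to contain no characters (such as # or &) that would need percent-encoding in a query string.

```shell
# load_graph MODE GRAPH FILE [SERVICE]
#   MODE    replace (HTTP PUT, like s-put) or append (HTTP POST, like s-post)
#   GRAPH   "default" or a graph IRI such as http://site/graph/people
#   FILE    Turtle file to upload
#   SERVICE graph store URL; defaults to the one used in this how-to
# Hypothetical helper, not part of Fuseki. Set CURL=echo for a dry run.
load_graph() {
    mode=$1 graph=$2 file=$3
    service=${4:-http://localhost:3030/dataset/data}

    case $mode in
        replace) method=PUT ;;
        append)  method=POST ;;
        *) echo "mode must be 'replace' or 'append'" >&2; return 2 ;;
    esac

    # The Graph Store Protocol addresses the default graph with ?default
    # and a named graph with ?graph=<IRI>.
    if [ "$graph" = default ]; then
        target="$service?default"
    else
        target="$service?graph=$graph"
    fi

    "${CURL:-curl}" -sS -X "$method" \
        -H 'Content-Type: text/turtle' \
        --data-binary "@$file" \
        "$target"
}
```

With that in place, load_graph replace default books.ttl mirrors the s-put call above, and load_graph append default books.ttl mirrors s-post. Either way, the server has to be running with --update for writes to be accepted.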

A simple query to see the triples we’ve just imported:

sudo ./s-query --service http://localhost:3030/dataset/query 'SELECT * WHERE {?s ?p ?o .}'

Note here that the service has the SPARQL endpoint URI at http://localhost:3030/dataset/query. If you wish to offer a public SPARQL endpoint e.g., http://example.org/sparql, then you might want to use a reverse proxy in Apache.

And now we can clear the default graph:

sudo ./s-update --service http://localhost:3030/dataset/update 'CLEAR default'

A little note here about the system architecture that Fuseki runs on. Fuseki uses TDB (which handles the RDF storage and query) out of the box, but it could be configured to use SDB as well. This is not something you have to worry about right now unless you have particular performance needs you need to address. For the time being, just know that TDB performs well if you are doing a lot of reads.

Even though TDB can run on both 32-bit and 64-bit Java Virtual Machines, 64-bit is highly recommended, with a minimum of 1GB of RAM for reasonable performance on a small number of triples. In production environments, you should dedicate a lot more RAM. More memory helps Java especially during SPARQL Updates (writes), more so than during queries (reads). For importing a significant amount of data, I recommend upping the Java max memory to at least 8GB, even if only temporarily during the import. That should be fine for, say, 10-50 million triples. For anything above that, consider 16GB+ of RAM on that machine.

If you have a large dataset (say, more than 100k triples), and you want to perform DESCRIBE queries in both directions (resource as subject as well as object), then try to run Fuseki on a 64-bit machine. For anything small-scale like running it for a personal blog or just testing on your local machine, 32-bit with 1GB of RAM is sufficient – famous last words.

By default, Fuseki server (fuseki-server) runs Java with -Xmx1200M. Simply use -Xmx8192M instead if you want to assign ~8GB of RAM.
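One way to make that change without opening an editor is a small sed wrapper. This is only a sketch: set_heap is a made-up helper name, and it assumes the launcher script contains an -Xmx flag written as a number followed by a unit suffix, as in the default.

```shell
# set_heap SCRIPT SIZE
# Rewrites every -Xmx<number><unit> flag in SCRIPT to -Xmx<SIZE>, e.g. 8192M.
# Hypothetical helper; writes back into the same file so that the script
# keeps its permissions.
set_heap() {
    script=$1 size=$2
    tmp=$(mktemp) || return 1
    sed "s/-Xmx[0-9][0-9]*[kKmMgG]/-Xmx$size/g" "$script" > "$tmp" &&
        cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, set_heap /usr/lib/fuseki/fuseki-server 8192M, run as a user who can write to that file.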

If you are importing a lot of data, it is a good idea to do it via TDB directly instead of through Fuseki. The command-line tooling for TDB works in a similar way to Fuseki's. For instance, take a look at the --help options for each TDB command available to you, i.e., type java tdb. and press tab to get a list in your shell. Make sure to have the following in your CLASSPATH:

/usr/lib/fuseki/target/jena-fuseki-{version}-SNAPSHOT-server.jar

Change the {version} part to the version you are currently using.
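Rather than typing the version by hand, the shell can pick up whichever server jar the build produced. A minimal sketch, assuming the jar sits under target/ as shown above; find_server_jar is a made-up helper name.

```shell
# find_server_jar FUSEKI_DIR
# Prints the first target/jena-fuseki-*-server.jar found under FUSEKI_DIR,
# or returns non-zero if the build has not produced one.
find_server_jar() {
    for jar in "$1"/target/jena-fuseki-*-server.jar; do
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    return 1
}
```

It could then be used like this:

CLASSPATH=$(find_server_jar /usr/lib/fuseki) && export CLASSPATH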

So, now take a look at java tdb.tdbloader --help. The importing that we did earlier with s-put can also be done as follows:

java tdb.tdbloader --desc=/usr/lib/fuseki/tdb2.ttl default books.ttl

Note that the Fuseki service does not need to be running for you to work with TDB. If Fuseki is running while you make updates with TDB (not a good idea, but you could get away with it), you should stop the service and start it again to see the changes.

Logging requests to Fuseki server is done through Apache’s log4j. There should be a log4j.properties file at the root of your Fuseki install.

Here is a log configuration that I use in log4j.properties:

log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=query.log
log4j.appender.R.MaxFileSize=100KB
log4j.appender.R.MaxBackupIndex=10
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=[%d{yyyy-MM-dd HH:mm:ss}] %-5p %-20c{1} :: %m%n

and the fuseki-server would contain:

exec java -Xmx3048m -Dlog4j.configuration=file:log4j.properties -jar "$JAR" "$@"

Remember to always stop the Fuseki server before doing any updates directly with TDB.

Creating a public SPARQL endpoint

If you wish to create a public SPARQL endpoint i.e., allowing your data to be queried (and maybe even updated), here are a few steps you could take.

If our site is at http://site/, then http://site/sparql could be our public SPARQL endpoint. That means we take requests at that endpoint and pass them on to http://localhost:3030/dataset/query. To accomplish this, all we need is a reverse proxy with Apache (the [P] flag below relies on mod_proxy and mod_proxy_http being enabled, e.g., with a2enmod proxy proxy_http). Add the following to your Apache configuration:

RewriteRule ^/sparql$ /usr/lib/fuseki/pages/sparql.html [L]

RewriteCond %{QUERY_STRING} query=
RewriteRule . http://localhost:3030/dataset/query [P]

The first rewrite rule is simple. It just takes a request like http://site/sparql and does an internal rewrite to load the SPARQL query form at /usr/lib/fuseki/pages/sparql.html. If you wish to make changes to this form, e.g., to add PREFIXes, you could edit that file.

The RewriteCond and RewriteRule handle the proxy bit we need. When the SPARQL query form is submitted, it does a GET request. You might be wondering why the form does a GET request when it could be a POST. One reason is that the resulting GET URI can be used as a dereferenceable resource. It has its limitations, e.g., the number of characters, so you could change it to POST if you feel that other people may make long requests. In any case, when the form is submitted, the request URI would be something like http://site/sparql?query=...&output=xml&stylesheet=xml-to-html-links.xsl. Hence, the RewriteCond picks out requests that carry a query and sends them off to http://localhost:3030/dataset/query, which is the same endpoint we'll use for the Linked Data Pages in the next section.
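To make the GET form of a request concrete, here is a sketch that builds such a URI from a query string by percent-encoding it. The helper names urlencode and sparql_get_url are made up for illustration, and the input is assumed to be plain ASCII.

```shell
# urlencode STRING
# Percent-encodes everything except RFC 3986 unreserved characters.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        s=$rest
        case $c in
            [a-zA-Z0-9.~_-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

# sparql_get_url ENDPOINT QUERY
# Prints ENDPOINT?query=<encoded QUERY>, ready to paste into a browser or curl.
sparql_get_url() {
    printf '%s?query=%s\n' "$1" "$(urlencode "$2")"
}
```

Once the proxy is in place, something like the following should return results from the public endpoint:

curl "$(sparql_get_url http://site/sparql 'SELECT * WHERE {?s ?p ?o .} LIMIT 10')"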

Alright, so, are you all Fusekied out? Good. Me too. Let’s move on to something else.

Setting up Linked Data Pages

Again, let’s make sure the essentials are set up in order to grab Moriarty, Paget, ARC2, and Linked Data Pages:

sudo apt-get install git

You can place the libraries anywhere you like, but I find it convenient to have it all under /var/www/lib/. The Linked Data Pages package is what we’ll work with and it requires the following libraries:

cd /var/www/
svn checkout http://moriarty.googlecode.com/svn/trunk/ lib/moriarty
svn checkout http://paget.googlecode.com/svn/branches/2.0/ lib/paget
git clone git://github.com/semsol/arc2.git lib/arc2

And now we can grab the main framework for our site:

git clone git://github.com/csarven/linked-data-pages.git lib/linked-data-pages

Until some of the issues are fixed in Moriarty and Paget, we’ll copy over the fixes in this package:

cp -R lib/linked-data-pages/patches/moriarty/* lib/moriarty/
cp -R lib/linked-data-pages/patches/paget/* lib/paget/

In this how-to we have used Fuseki as our SPARQL server. However, as mentioned earlier, the Linked Data Pages framework can use any other server for its baseline dataset, so you can hook up a local or a remote SPARQL endpoint instead.

Setting up our site

At this point we’ll assume that you have a site enabled at /var/www/site/. But, just to make sure you avoid running into problems with your server not permitting access, use a simple configuration like the following in your /etc/apache2/sites-available/site.conf:

<VirtualHost *:80>
    ServerName site
    ServerAlias *.site

    DocumentRoot /var/www/site/
    <Directory /var/www/site>
        Options Indexes FollowSymLinks MultiViews
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>

Enable the Apache rewrite module:

sudo a2enmod rewrite
sudo service apache2 restart

Install the cURL library for PHP:

sudo apt-get install php5-curl

The Linked Data Pages package comes with an installation script to set up the directories and SPARQL endpoint URI. We’ll copy that file over and go from there.

cp /var/www/lib/linked-data-pages/install.php /var/www/site/

We need to make sure that this installation script (temporarily) has write access to the site directory:

chmod a+w /var/www/site

Now we simply load http://site/install.php (replace site with whatever host is pointing to /var/www/site/) in our browser. Enter the form values that correspond to the locations where we installed the libraries earlier. If you went ahead with the defaults in this how-to, then the following is what we want:

Site:     /var/www/site/
LDP:      /var/www/lib/linked-data-pages/
Paget:    /var/www/lib/paget/
Moriarty: /var/www/lib/moriarty/
ARC2:     /var/www/lib/arc2/

SPARQL Endpoint: http://localhost:3030/dataset/query

That is it! When you submit this form, you should be ready to go. If you now load http://site/ in your browser, you should see the default homepage. If you have your host set to something other than site, then we need to revisit /var/www/site/config.php and change the value at:

$config['site']['server'] = 'site';

Since site corresponds to http://site.

While we are here, we can update the following as well:

$config['site']['name'] = 'My Linked Data site';

That simply sets the name of the site as it appears in page title, address etc.

The following is used to set the path of our site if it is somewhere other than base e.g., /foo in http://site/foo.

$config['site']['path'] = '';

We can set the theme here too, where default points to /var/www/site/theme/default:

$config['site']['theme'] = 'default';

If you have your own theme, simply copy your theme files into a directory under /var/www/site/theme/.

And finally, we can specify the site logo file, where logo_latc.png is at /var/www/site/theme/default/images/logo_latc.png

$config['site']['logo'] = 'logo_latc.png';
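Once the installer has done its job, the temporary write access granted earlier can be withdrawn, and the installer removed if you no longer need it. A small sketch of that cleanup; the helper name lock_down_site is made up, and it assumes that only the group and other write bits should go while the owner keeps write access.

```shell
# lock_down_site SITE_DIR
# Removes the installer and drops group/other write access on the directory.
# Hypothetical helper; config.php and everything else is left untouched.
lock_down_site() {
    rm -f "$1/install.php"
    chmod go-w "$1"
}
```

For the layout used in this how-to, that would be lock_down_site /var/www/site.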

Creating templates

We can finally get down to doing cool stuff. Let’s say we want to create a template that renders a FOAF profile of a person. We’ll first import the RDF data into our store, create a query to get it out, and finally create a template where we can process the data and render it back out to the user.

We’ll place the following in /usr/lib/fuseki/people.ttl:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://csarven.ca/#i>
    rdf:type foaf:Person ;
    foaf:homepage <http://csarven.ca/> ;
    foaf:interest <http://en.wikipedia.org/wiki/Design> ;
    foaf:knows <http://richard.cyganiak.de/#me> .

We can now start Fuseki in update mode:

sudo ./fuseki-server --update --desc tdb2.ttl /dataset

And, finally, let’s import the Turtle file:

/usr/lib/fuseki/./s-put --verbose http://localhost:3030/dataset/data http://site/graph/people people.ttl

Note here that http://site/graph/people is the name of the graph where we’ve put our people data. Note also that since the default graph is the union of all named graphs, we don’t need to use GRAPH <http://site/graph/people> in our SPARQL queries if we don’t want to.

If you would like to use different names for your datasets, simply update tdb2.ttl and run the fuseki server using that dataset name.

It is now time to create a template where we can process this data. Before we do that, however, it is important to give an overview of how entities are managed. This framework uses entity sets, each identified by a unique id. Each entity set contains the following information:

Path
Path value is used to do URI pattern matching in order to identify which entity set to initiate.
Query
Query value is sent directly to the SPARQL endpoint based on entity path match.
Template
Template value specifies the template to load based on entity path match.

Let’s start by writing our SPARQL query, and for that we’ll head over to /var/www/site/config.php and add the following:

$config['sparql_query']['people'] = "
    CONSTRUCT {
        ?s ?p ?o .
    }
    WHERE {
        GRAPH <http://site/graph/people> {
            ?s ?p ?o .
        }
    }
";

This is our SPARQL query with the key people. All we are doing here is constructing an RDF graph from all the triples in the named graph http://site/graph/people. Now, we’ll tie this key to the query of our entity set, whose id is site_people:

$config['entity']['site_people']['query'] = 'people';

Similarly, we set the URI path where everything will be initiated when we visit http://site/people:

$config['entity']['site_people']['path'] = '/people';

And finally, we assign the template we want to render:

$config['entity']['site_people']['template'] = 'page.people.html';
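Before building the template, it can be worth checking that the query actually returns triples. A sketch using curl against the endpoint from this how-to; the helper name construct_graph and the CURL override variable are made up for illustration, and asking for Turtle via the Accept header assumes the server can serialize CONSTRUCT results that way.

```shell
# construct_graph GRAPH_IRI [ENDPOINT]
# Sends the same CONSTRUCT query as the 'people' entry in config.php,
# parameterized by graph IRI. Set CURL=echo for a dry run.
construct_graph() {
    endpoint=${2:-http://localhost:3030/dataset/query}
    query="CONSTRUCT { ?s ?p ?o . } WHERE { GRAPH <$1> { ?s ?p ?o . } }"

    # -G turns the --data-urlencode payload into a GET query string.
    "${CURL:-curl}" -sS -G \
        --data-urlencode "query=$query" \
        -H 'Accept: text/turtle' \
        "$endpoint"
}
```

Running construct_graph http://site/graph/people should print back the triples that were imported from people.ttl.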

Now, let’s create our template at /var/www/site/templates/page.people.html.

<?php
require_once SITE_DIR . 'templates/html.html';
require_once SITE_DIR . 'templates/head.html';
?>
    </head>
    <body id="<?php echo $entitySetId; ?>">
        <div id="wrap">
<?php require_once SITE_DIR . 'templates/header.html'; ?>
            <div id="core">
                <div id="content">
                    <h1>This page is about <a href="<?php e($resource_uri);?>"><?php e($title);?></a></h1>
                    <div id="content_inner">
                        <div id="resource_data">

<?php $triples = $this->getTriples('http://csarven.ca/#i'); ?>
                            <table>
                                    <tbody>
<?php
    foreach($triples as $s => $po) {
        foreach($po as $p => $o) {
            echo "\n".'<tr><td>'.$s.'</td><td>'.$p.'</td><td>'.$o[0]['value'].'</td></tr>';
        }
    }
?>
                                    </tbody>
                            </table>
                        </div>
                    </div>
                </div>
            </div>
<?php require_once SITE_DIR . 'templates/footer.html'; ?>
        </div>
<?php require_once SITE_DIR . 'templates/foot.html'; ?>

This is a pretty simple template which we can reuse. It is simply rendering the resulting triples in a table. The most noteworthy line here is:

$triples = $this->getTriples('http://csarven.ca/#i');

The getTriples function gets all the triples for us from our SPARQL query result and we place it in a multi-dimensional array. In this example, we are getting all the triples with subject http://csarven.ca/#i. It should give us the same triples we’ve imported into our RDF store. But we could also get other triples that match the pattern for parameters (subject, property, object) e.g.,

$this->getTriples('http://csarven.ca/#i', 'http://xmlns.com/foaf/0.1/knows');

That would get us all the triples with subject http://csarven.ca/#i and property http://xmlns.com/foaf/0.1/knows. An alias to this is the getValue function where we can use qnames for the property position:

$this->getValue('http://csarven.ca/#i', 'foaf:knows');

You can define more qnames in $config['prefixes'] at /var/www/site/config.php. See also /var/www/lib/linked-data-pages/README for more uses of getTriples like wildcards.

For complex templating, that is, if you’d like to do more data processing with PHP, you can dive into the SITE_Template class in /var/www/classes/SITE_Template.php instead of creating your functions directly inside HTML templates. The functions in here can be called directly from your templates.

If you are curious about internals of Linked Data Pages, see this article section.

Conclusions

If you have made it this far, congratulations! But don't stop here. Take your site even further by building useful interactions for your consumers. Consider the following items:

  • Create a VoID file to describe your Linked Datasets.
  • Create applications that make use of your data.
  • Create data visualisations to help your users to get insights into the data that you are publishing.

The setup outlined in this article is more or less used at 270a.info. One of the goals of Linked Data Pages is to have a framework where a Linked Data site can be created with minimal “development” work. This framework relies heavily on Paget, which in turn relies heavily on Moriarty and ARC2. Even though Paget has some quirks, with Moriarty and ARC2, it worked out quite well and got me at least 80% of the way there. There are probably a few more things I could iron out (i.e., fix bugs, not reinvent) once I address the finer details of Paget and Moriarty. Here is my to-do list for Linked Data Pages:

  • An additional administration user interface for site configuration, instead of having to edit config files directly.
  • Improve templating by adding more common functions.
  • Integration of data visualisations for common data dimensions.

Let's wrap this up here. All feedback is most welcome. Let me know how all this works out for you, especially if you would like me to clarify anything further. Happy Linking and stuff =)

Published
2011-02-03
Replies
12

Reader Comments (12)

  1. Alex Genadinik replied on #2011-05-19 13:41:05

    Thanks - this was good to follow. How could I actually test that the inferencing is working? Is there an easy way to do that?

  2. Christopher Gutteridge replied on #2012-03-13 03:27:45

    Hi, I've been working on a Library which makes some of the output end of this even easier. http://graphite.ecs.soton.ac.uk/

    It allows you to pull data about a resource quite easily from SPARQL, either by a query or listing the predicates of interest.

    It lets you work with the loaded graph in a much friendlier way than with raw triples.

    It handles providing alternate views on the data required to build a page for a resource.

  3. Sarven Capadisli replied on #2012-03-17 20:49:33

    Hi Christopher. Thanks for mentioning Graphite. I've came across Graphite a while back and find it pretty neat. I'd like to take a look at it at again at some point.

    I didn't get too into Linked Data Pages' features in this article because it was intended to be a quick summary on how to get something up and running. Perhaps an update is in order.

    Like Graphite, LDP comes with a bunch of function calls that let's you easily dissect whatever is in the query result and have it ready in the templates. Generally speaking, helper functions which do more than the common data manipulation probably gets to be domain specific. Similarly, that line of development (or thinking) results in something along the lines of Fresnel - which I consider to be quite useful as well.

    Personally, what I find really handy in LDP is to be able to trigger a SPARQL query, and an accompanying template based on the requested URL pattern.

  4. davenzhang replied on #2012-11-25 02:26:59

    I have been working on publishing linked data,i stored my rdf data in TDB,I Wonder if i can do this in windows ,by fuseki?

  5. Tarje Lavik replied on #2013-03-10 07:05:45

    I am testing out LDP, but i am having trouble the URI mapping. My settings are below.

    $config['site']['server'] = 'localhost'; /* 'site' in http://site */ $config['site']['path'] = '/site';

    $config['server']['example.org'] = 'localhost'; /* URI maps e.g., http://dbpedia.org/resource/Montreal to http://site/resource/Montreal */

    This results queries like this: <http://example.org/site/book/book2> on http://localhost/site/book/book2>. This returns nothin as the resource in Fuseki is <http://example.org/book/book2>.

    When i change the settings in config.php i can not set it so "/site" is removed from the query. Is there something i can do to make this work?

    Thanks in advance!

  6. Sarven Capadisli replied on #2013-03-11 09:24:55

    @Tarje , if I understand you correctly, you should have this:

    $config['site']['path'] = '';

    That also means that you want to request:

    http://example.org/book/book2

    Send me an email if that doesn't work.

  7. Tarje Lavik replied on #2013-03-11 11:18:44

    You are of course correct. I had forgot to enable the site in apache. Sorry about that. Now I can get to explore templating with LDP. Looks like it could fit my needs :-).

  8. vali replied on #2013-06-12 11:33:57

    Hey, thank you for this tutorial! I seem to have a problem, i can't start the Fuseki server (can't find the jar file). It is a bit odd... cand you help me?

  9. jdk replied on #2013-07-01 19:51:31

    The tutorial is very helpful. You might want to add to your tutorial a hint on two issues that sidetracked me for a long time: (1) If you find it impossible to run ./s-put or other ./s commands in your fuseki directory, make sure that they are enabled for execution using, e.g., chmod 744 s-* (2) On the LDP install.php page, you'll get examples that have a trailing / at the end of directory paths. Be certain *not* to terminate a path with that /

    Also, I am uncertain what steps a person would follow to set up a reverse proxy for this particular example. Could you provide something more on this?

    I made it all the way to the install.php script provided by LDP, but the resultant index.php file renders blank in my browser. So I'm a bit stuck, wondering if the answer is in the reverse proxy.

  10. sID replied on #2013-10-12 02:46:39

    Godlike tutorial ! Really Helpful

  11. Stefan replied on #2013-10-23 06:55:03

    Hi Saven,

    The tutorial looks great! Unfortunately, I am experiencing problems at the very start... On trying to build the fuseki server [mvn clean package] - I get dependency resolution problems:

    [ERROR] Failed to execute goal on project jena-fuseki: Could not resolve dependencies for project org.apache.jena:jena-fuseki:jar:1.0.1-SNAPSHOT: Could not find artifact org.apache.jena:jena-spatial:jar:1.0.1-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots) -> [Help 1]

    Perhaps the given repository does not contain that jar any more. Could you suggest a different repo, that would sort my problem out?

  12. Colin replied on #2014-01-11 06:06:04

    Same problem as Stefan, that's a pity.

    I'll give a try to the packaged distrib.
