Document Identifier
http://csarven.ca/linked-sdmx-data
Keywords
Acknowledgements
A special thanks to Richard Cyganiak for his ongoing support with this
effort, for being on the same wavelength in pushing this contribution
out there, and for graciously offering to host the dataspaces on a server at
Digital Enterprise Research Institute. Much appreciation to
Bern University of Applied Sciences for partially funding
the transformation effort for the pilot Swiss Statistics Linked Data project.
And thanks to the Swiss Federal Statistical Office for an excellent
collaboration from the very beginning.
-Sarven
Author's Notes
Request for comments.
Table of Contents
- Introduction
- Background
- SDMX-ML to Linked Data
- Linked SDMX Data Transformation
- Linked Datasets
- Publication
- Conclusions
Introduction
And so we begin.
Problem Statement
While access to statistical data in the public sector has increased in recent years, a range of technical challenges makes it difficult for data consumers to tap into this data with ease. These are particularly related to the following two areas:
- Automation of the transformation of data from high-profile statistical organizations.
- Minimization of third-party interpretation of the source data and metadata and lossless transformations.
Development teams often face low-level repetitive data management tasks to deal with someone else's data. Within the context of Linked Data, one aspect is to transform this raw statistical data (e.g., SDMX-ML) into an RDF representation in order to be able to start tapping into what's out there in a uniform way.
Contributions
The contributions of this article are two-fold. The primary work herein is based on the XSLT 2.0 templates and small tooling which transform SDMX-ML data to RDF/XML. Following this, SDMX-ML data from OECD (Organisation for Economic Co-operation and Development), BFS (Bundesamt für Statistik@de, Swiss Federal Statistical Office@en), FAO (Food and Agriculture Organization of the United Nations), and ECB (European Central Bank) are retrieved, transformed, and published as Linked Data.
Background
Looking back but not too far back.
Linked Statistics
As pointed out in Statistical Linked Dataspaces (Capadisli, S., 2012), what linked statistics provide, and in fact enable, are queries across datasets: Given that the dimension concepts are interlinked, one can learn from a certain observation's dimension value, and enable the automation of cross-dataset queries.
Moreover, a number of approaches have been undertaken in the past to go from raw statistical data from the publisher to linked statistical data, as discussed in great detail in Official statistics and the Practice of Data Fidelity (Cyganiak, R., 2011). These approaches range from retrieval of the data, for the most part in tabular formats (Microsoft Excel or CSV) or tree formats (XML with a custom schema, SDMX-ML, PC-Axis), to transformation into different RDF serialization formats. As far as graph formats go, the majority of them are not published by the original source. However, there are already a number of statistical linked dataspaces in the LOD Cloud.
Related Work
A number of transformation efforts are performed by the Linked Data community based on various formats. For example, the World Bank Linked Dataspace is based on custom XML that the World Bank provides through their APIs with the application of XSL Templates. The Transparency International Linked Dataspace's data is based on CSV files with the transformation step through Google Refine and the RDF Extension. That is, data sources provide different data formats for the public, with or without accompanying metadata e.g., vocabularies, provenance. Hence, Linked Data teams are not spared this repetitive work: they have to be constantly involved, either by way of manual transformation efforts or, in best-case scenarios, semi-automatically. Currently, there is no automation of the transformation step to the best of my knowledge. This is generally due to the difficulty of the task when dealing with the quality and consistency of the statistical data that is published on the Web, as well as the data formats that are typically focused on consumption. Although SDMX-ML is the primary format of the high-profile statistical data organizations, it is yet to be taken advantage of.
SDMX-ML to Linked Data
SDMX was recently approved by ISO as an International Standard. It is a standard which provides the possibility to consistently carry out data flows between publishers and consumers. SDMX-ML (using XML syntax) is considered to be the gold standard for expressing statistical data. It has a highly structured mechanism to represent statistical observations, classifications, and data structures. The organizations behind SDMX are BIS (Bank for International Settlements), OECD (Organisation for Economic Co-operation and Development), UN (United Nations), ECB (European Central Bank), World Bank, IMF (International Monetary Fund), FAO (Food and Agriculture Organization of the United Nations), and Eurostat.
Thus, a side proposal here is that the path to high-fidelity statistical data representation in Linked Data should take advantage of what SDMX-ML offers. As it is widely adopted by data producers with rich data about our societies, the need for transforming SDMX-ML to RDF, and publishing accompanying Linked Dataspaces, is hard to argue against. So, let us have a go at it by giving the complete account of the system, hopefully the master-builder for statistical data. Example datasets which use this transformation, and their published dataspaces, are discussed herein.
Data Sources
As a demonstration of the SDMX-ML to RDF transformations, the datasets selected here are from the following organizations. Instead of regurgitating the publishers' profiles about themselves, and to keep this part brief, I'll quote the first sentences from their about pages:
- OECD
The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world
from OECD Our mission.
- BFS
Swiss Statistics, the Federal Statistical Office’s web portal. Offering a modern and appealing interface, our website proposes a wide range of statistical information on the most important areas of life: population, health, economy, employment, education and much more
from BFS Welcome.
- FAO
Achieving food security for all is at the heart of FAO's efforts - to make sure people have regular access to enough high-quality food to lead active, healthy lives
from FAO's mandate.
- ECB
Whose main task is to maintain the euro's purchasing power and thus price stability in the euro area
from ECB home.
The OECD, FAO, and ECB datasets consisted of observational and structural data. The OECD and ECB data provided complete coverage (to the best of our knowledge), whereas FAO had partial fishery related data. BFS had all of their classifications available, with no observational data in SDMX-ML.
Data Retrieval
As SDMX-ML publishers have their own publishing processes, availability and accessibility of the data varied. After obtaining the dataset codes, names, and URLs with common Linux command-line work, a Bash script was created to retrieve the data.
OECD
How to retrieve all of the datasets from the OECD website was not entirely clear. In order to automatically get hold of the list of datasets, I copied the innerHTML of the DOM tree that contained all the dataset codes from OECD.StatExtracts to a temporary file. This was done because a simple scrape with an HTTP GET wasn't possible, as the data on the page was populated via JavaScript on document ready. After constructing a list of datasets and structures to get, two REST API endpoints were called.
BFS
BFS offered a Microsoft Excel document which contained a catalog of their classifications and URLs for retrieval.
FAO
After searching for keywords along the lines of SDMX site:fao.org at a search-engine nearby, FAO Fisheries and data.fao.org SDMX Registry and Repository, and its children pages were marked for SDMX-ML retrieval.
ECB
ECB had a similar REST API to OECD. Additionally, SDMX Dataflows were retrieved to get a primary list of datasets to retrieve. Some of the large datasets were retrieved by making multiple smaller calls to the API, using one call per reference area.
Provenance
Provenance at Retrieval
At the time of data retrieval, information pertaining to provenance was captured using the PROV Ontology in order to further enrich the data.
This RDF/XML document contains prov:Activity information which indicates the location of the XML document on the local filesystem. It contains other provenance data like when it was retrieved, with what tools, etc.
This provenance data from retrieval may be provided to the XSL Transformer during the transformation phase and VoID enrichment.
Provenance at Transformation
Resources of type qb:DataStructureDefinition, qb:DataSet, and skos:ConceptScheme are also typed with the prov:Entity class. In addition, prov:wasAttributedTo properties were added to these resources, with the creator as the value, which is of type prov:Agent and obtained from the XSLT configuration. There is a unique prov:Activity for each transformation. It has a dcterms:title, and contains values for prov:startedAtTime, prov:wasAssociatedWith (the creator), prov:used (i.e., source XML, XSL to transform), and prov:generated (what was generated, along with the source data URI that it prov:wasDerivedFrom). It also declares a dcterms:license, whose value is taken from the XSLT configuration. The provenance document from the retrieval phase may be provided to the transformer. In this case, it establishes a link between the current provenance activity (i.e., the transformation) and the earlier provenance activity (i.e., the retrieval) using the prov:wasInformedBy property.
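To make the shape of the transformation provenance more concrete, here is a minimal sketch in Python that writes the statements described above as N-Triples. It is not the XSLT from the toolkit: the function name, its parameters, and the example.org URIs in the usage note are assumptions for illustration only, and dcterms:title and dcterms:license are left out for brevity.

```python
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def iri(value):
    """Wrap an IRI in angle brackets for N-Triples."""
    return f"<{value}>"


def transformation_provenance(activity, creator, source_xml, xsl, generated,
                              started, retrieval_activity=None):
    """Return N-Triples lines for one transformation activity (hypothetical helper)."""
    started_literal = f'"{started.isoformat()}"^^{iri(XSD_DATETIME)}'
    triples = [
        (activity, RDF_TYPE, iri(PROV + "Activity")),
        (activity, PROV + "startedAtTime", started_literal),
        (activity, PROV + "wasAssociatedWith", iri(creator)),
        (activity, PROV + "used", iri(source_xml)),
        (activity, PROV + "used", iri(xsl)),
        (activity, PROV + "generated", iri(generated)),
        (generated, PROV + "wasDerivedFrom", iri(source_xml)),
        (generated, PROV + "wasAttributedTo", iri(creator)),
    ]
    if retrieval_activity:
        # Link the transformation back to the earlier retrieval activity.
        triples.append((activity, PROV + "wasInformedBy", iri(retrieval_activity)))
    return [f"{iri(s)} {iri(p)} {o} ." for s, p, o in triples]
```

Calling it with, say, `datetime(2013, 1, 1, tzinfo=timezone.utc)` and a retrieval activity URI yields nine statements, the last of which is the prov:wasInformedBy link between the two activities.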
Provenance at Post-processing
The post-processing step for provenance is intended to retain provenance data for future use.
As datasets get updated, it is important to preserve information about past activities by way of exporting all instances of the prov:Activity class from the RDF store. Activities are unique artifacts, on a conceptual level as well as with regard to referencing them. Since one of the main concerns of provenance is to keep track of activities, this post-processing step also allows us to retain a historical account of all activities during the data lifecycle, and to preserve all previously published URIs (cf. Cool URIs don't change).
Data Preprocessing
By and large, there was no need to pre-process the data as the transformation dealt with the data as it was. However, some non-vital SDMX components were omitted from the output. For instance, one type of attribute in OECD and ECB observations contained free-text as opposed to its corresponding code from a codelist. Since the RDF Data Cube required codes as opposed to free-text for dimension values, some attributes were excluded. The decision here was to trade off some precision in favour of retaining the dataset.
Data Modeling
This section goes over several areas which are at the heart of representing statistical data in SDMX-ML as Linked Data. The approach taken was to provide a level of consistency for data consumers and tool builders for all statistical Linked Data with its origins in SDMX-ML.
Vocabularies
Besides the common vocabularies (RDF, RDFS, OWL, XSD), the RDF Data Cube vocabulary is used to describe multi-dimensional statistical data, and SDMX-RDF for the statistical information model. PROV-O is used for provenance coverage. SKOS and XKOS are used to cover concepts, concept schemes, and their relationships to one another.
Versioning
SDMX data publishers version their classifications, and the generated cubes refer to particular versions of those classifications. Consequently, versions need to be explicitly part of classification URIs in order to uniquely identify them. Although including version information in the URI is disputed by some authors, it is a good exception for identifying different concepts and data structures. Jeni Tennison et al. discussed Versioning URIs, and concluded that there was no one-size-fits-all solution. An alternative approach using named graphs for a series of changes was proposed in Linking UK Government Data.
URI Patterns
An outline of the URI patterns is given in the table below: authority is replaced with the domain (see also: Agency identifiers and URIs), followed by class, code, concept, dataset, property, provenance, or slice, for example. These tokens, as well as the / that separates the dimension concepts in URIs, can be configured in the toolkit.
In order to construct the URIs for these patterns, some of the data values are normalized to make them URI safe, but are not altered in other ways (e.g., they are not lower-cased). The rationale for this was to keep the terms consistent between SDMX and RDF.
| Entity type | URI Pattern |
|---|---|
| qb:DataStructureDefinition | http://{authority}/structure/{KeyFamilyID} |
| qb:DataSet | http://{authority}/dataset/{datasetID} |
| qb:Observation | http://{authority}/dataset/{datasetID}/{dimension-1}/../{dimension-n} |
| qb:Slice | http://{authority}/slice/{KeyFamilyID}/{dimension-1}/../{dimension-n-no-FREQ} |
| skos:Collection | http://{authority}/code/{version}/{hierarchicalCodeListID}, http://{authority}/code/{version}/{hierarchyID} |
| sdmx:CodeList | http://{authority}/code/{version}/{codeListID} |
| skos:ConceptScheme | http://{authority}/concept/{version}/{conceptSchemeID} |
| skos:Concept, sdmx:Concept | http://{authority}/code/{version}/{codeListID}/{codeID}, http://{authority}/concept/{version}/{conceptSchemeID}/{conceptID} |
| owl:Class, rdfs:Class | http://{authority}/class/{version}/{codeListID} |
| rdf:Property, qb:DimensionProperty, qb:MeasureProperty, qb:AttributeProperty | http://{authority}/property/{conceptID} |
Datatypes
XSD datatypes assigned to literals are based on the value of the measure component (e.g., decimal, year). In the absence of this datatype, observation values are checked to see whether they can be cast to xsd:decimal. Otherwise, they are left as plain literals.
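The fallback order for datatypes can be sketched roughly as follows. This is a Python illustration rather than the XSLT itself; the function name is made up, and checking castability against the xsd:decimal lexical form with a regular expression is my assumption about what "can be cast" amounts to.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Lexical form of xsd:decimal: optional sign, digits, optional fraction.
DECIMAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def observation_literal(value, measure_datatype=None):
    """Return (lexical value, datatype IRI or None) for an observation value."""
    if measure_datatype:
        # The measure component declares a datatype: use it as-is.
        return value, XSD + measure_datatype
    if DECIMAL.fullmatch(value):
        # No declared datatype, but the value is a valid xsd:decimal.
        return value, XSD + "decimal"
    # Otherwise leave it as a plain literal.
    return value, None
```

For example, `observation_literal("71.9")` is typed as xsd:decimal, whereas `observation_literal("n/a")` stays a plain literal.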
Linked SDMX Data Transformation
The Linked SDMX XSLT 2.0 templates and scripts are developed to transform SDMX-ML data and metadata to RDF/XML. Its goals are:
- To improve access and discovery of cross-domain statistical data.
- To perform the transformation in a lossless and semantics-preserving way.
- To support and encourage statistical agencies to publish their data using RDF and to integrate the transformation into their workflow.
The key advantage of this transformation approach is that additional interpretations are not required of the data modeler, especially in comparison to alternative transformations (e.g., CSV or XML to RDF serialization). Since the SDMX-RDF vocabulary is based on the SDMX-ML standard, and the RDF Data Cube vocabulary is closely aligned with the SDMX information model, the transformation is to a large extent a matter of mapping the source SDMX-ML data to its counterparts in RDF.
Features of the transformation
- Transforms SDMX KeyFamilies, ConceptSchemes and Concepts, CodeLists and Codes, Hierarchical CodeLists, DataSets.
- Configurability for SDMX publisher's needs.
- Detection and referencing of CodeLists and Codes of external agencies.
- Support of interlinking publisher-specific annotation types.
- Support for omission of components.
- Inclusion of provenance data.
Figure: Transformation Process
What is inside?
It comes with scripts and sample data:
- XSLT 2.0 templates to transform Generic SDMX-ML data and metadata. It includes the main XSL template for generic SDMX-ML, an XSL for common templates and functions, and an RDF/XML configuration file to set preferences like base URIs, delimiters in URIs, how to map annotation types.
- Bash script that transforms sample data using saxonb-xslt.
- Sample SDMX Message and Structure retrieved from those organizations that are initially involved in the SDMX standard, as well as from BFS.
Requirements
The requirements for the Linked SDMX toolkit are an XSLT 2.0 processor to run the transformation and, optionally, configuring some of the settings in the provided config.ttl (in RDF Turtle) and transforming that to an abbreviated version of RDF/XML. In the sequel, some of the key features are described in more detail.
The transformation follows some common Linked Data practices as well as other ones out of thin air.
Configuration
The config file is used to pre-configure some of the decisions that are made in XSL templates. Here is an outline for some of the noteworthy things.
Agency identifiers and URIs
agencies.ttl is used to track some of the mappings for maintenance agencies. It includes the maintenance agency's i.e., the SDMX publisher's, identifier that's in the SDMX Registry, as well as the base URI for that agency. This file allows references to external agency identifiers to be looked up for their base URI and used in the transformations. Currently this agency recognition is treated as either "SDMX" or some agency that's publishing the actual statistics.
In the case of SDMX, when there is a reference to SDMX CodeLists and Codes, it is typically indicated by the component agency being set to SDMX e.g., codelistAgency="SDMX" of a structure:Component and/or agencyID="SDMX" of a CodeList with id="CL_FREQ". When this is detected, the corresponding URIs from the SDMX-RDF vocabulary are used e.g., for metadata: http://purl.org/linked-data/sdmx/2009/code#freq, and for data: http://purl.org/linked-data/sdmx/2009/code#freq-A.
Similarly, an agency might use some other agency's codes. By following the same URI pattern conventions, the agency file is used to find the corresponding base URI in order to make a reference. For example, here is a coded property that's used by European Central Bank (ECB) to associate a code list that's defined by Eurostat (eurostat):
<http://ecb.270a.info/property/OBS_STATUS>
<http://purl.org/linked-data/cube#codeList>
<http://eurostat.270a.info/code/1.0/CL_OBS_STATUS>
Naturally, the transformation does not re-define metadata that's from an external agency as the owners of the data would define them under their authority.
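The lookup behind the example above can be pictured with a small Python sketch. The dictionary stands in for agencies.ttl and only holds the two base URIs that appear in the example; the function names are hypothetical and the real toolkit does this inside the XSL templates.

```python
# Stand-in for agencies.ttl: agency identifier -> base URI (from the example above).
AGENCY_BASE_URIS = {
    "ECB": "http://ecb.270a.info",
    "eurostat": "http://eurostat.270a.info",
}

QB_CODELIST = "http://purl.org/linked-data/cube#codeList"


def codelist_uri(agency_id, version, codelist_id):
    """Build a code list URI under the authority of the agency that maintains it."""
    return f"{AGENCY_BASE_URIS[agency_id]}/code/{version}/{codelist_id}"


def coded_property_triple(publisher_id, concept_id,
                          codelist_agency_id, version, codelist_id):
    """Return (subject, predicate, object) linking a property to an external code list."""
    prop = f"{AGENCY_BASE_URIS[publisher_id]}/property/{concept_id}"
    return prop, QB_CODELIST, codelist_uri(codelist_agency_id, version, codelist_id)
```

With these assumptions, `coded_property_triple("ECB", "OBS_STATUS", "eurostat", "1.0", "CL_OBS_STATUS")` reproduces the three URIs shown in the example.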
URI configurations
Base URIs can be set for classes, codelists, concept schemes, datasets, slices, properties, provenance, as well as for the source SDMX data.
The value for uriThingSeparator e.g., /, lets one set the delimiter to separate the "thing" from the rest of the URI. This is typically either a / or #. For example, if slash is used, a URI would end up like http://{authority}/code/{version}/CL_GEO (note the last slash before CL_GEO). If hash is used, a URI would end up like http://{authority}/code/{version}#CL_GEO.
Similarly, uriDimensionSeparator can be set to separate the dimension values that are used in RDF Data Cube observation URIs. As each observation should have its own unique URI, URIs are constructed by taking the dimension values as safe terms and separating them with the value in uriDimensionSeparator. For example, here is a crazy looking observation URI where uriDimensionSeparator is set to /: http://{authority}/dataset/HEALTH_STAT/EVIEFE00/EVIDUREV/AUS/1960. But with uriThingSeparator set to # and uriDimensionSeparator set to -, it could end up like http://{authority}/dataset/HEALTH_STAT#EVIEFE00-EVIDUREV-AUS-1960. HEALTH_STAT is the dataset id.
The creator's URI can also be set; it is used for provenance data.
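As a rough illustration of how the two separators combine, here is a Python sketch. The function and parameter names are mine, and percent-encoding is only a stand-in for whatever normalization the toolkit applies to make values URI safe.

```python
from urllib.parse import quote


def observation_uri(authority, dataset_id, dimension_values,
                    thing_separator="/", dimension_separator="/"):
    """Build an observation URI from a dataset id and its dimension values."""
    # Assumption: percent-encode each value so it is safe inside a URI.
    safe_terms = [quote(value, safe="") for value in dimension_values]
    return (f"http://{authority}/dataset/{dataset_id}"
            f"{thing_separator}{dimension_separator.join(safe_terms)}")
```

Called with the dimension values EVIEFE00, EVIDUREV, AUS, and 1960 for HEALTH_STAT, the defaults give the slash-separated form, while `thing_separator="#"` and `dimension_separator="-"` give the hash-and-hyphen form described above.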
Default language
From the configuration, it is possible to force a default xml:lang on skos:prefLabel and skos:definition when lang is not originally in the data. If config.rdf contains a non-empty lang value, it will be used. The default language may also be applied in the case of Annotations. See Interlinking SDMX Annotations for an example.
Interlinking SDMX Annotations
SDMX Annotations contain important information that can be put to use by the publisher. Data in AnnotationTypes are typically used as the publisher's internal conventions. Hence, there is no standardization on how they are used across all SDMX publishers. In order not to leave this information behind in the final transformation, the configuration allows publishers to define the way they should be transformed. This is done by setting interlinkAnnotationTypes: the AnnotationType to detect (in rdfs:type), the predicate (as an XML QName) to use (in rdf:predicate), whether to apply it to instances of Concepts or Codes, or as Literals (in rdf:range), and whether to target AnnotationText or AnnotationTitle (in rdfs:label). Currently this feature is only applied to Annotations in Concepts and Codes. Only the AnnotationTypes with a corresponding configuration will be applied, and unspecified ones will be skipped.
Omitting components
There are cases in which certain data parts contain errors. To get around this until the data is fixed at source, and without giving up on rest of the data at hand, as well as without making any significant assumptions or changes to the remaining data, omitComponents is a configuration option to explicitly skip over those parts. For example, if the Attribute values in a DataSet don't correspond to coded values - where they may contain whitespace - they can be skipped without damaging the rest of the data. This obviously gives up on precision in favour of still making use of the data.
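Conceptually, the effect of omitComponents is a filter over the components of each observation. The Python sketch below only illustrates that idea: the dictionary representation and the component id OBS_COM are hypothetical, and the toolkit applies the option within the XSL templates.

```python
def drop_omitted(observation, omit_components):
    """Return the observation without the components listed for omission."""
    return {
        component: value
        for component, value in observation.items()
        if component not in omit_components
    }
```

For instance, `drop_omitted({"OBS_VALUE": "71.9", "OBS_COM": "free text note"}, {"OBS_COM"})` keeps the observation value and skips the free-text attribute, leaving the rest of the data untouched.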
Linked Datasets
This section provides information on the publication of OECD, BFS, FAO, and ECB datasets.
The original SDMX-ML files were transformed to RDF/XML using XSLT 2.0. Saxon’s command-line XSLT and XQuery Processor tool was used for the transformations, and employed as part of Bash scripts to iterate through all the files in the datasets.
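The iteration itself was done in Bash; the Python sketch below only shows the general shape of such a loop. The directory layout, the stylesheet path, and the .rdf output extension are assumptions. The -s:, -xsl:, and -o: options are the Saxon 9 command-line switches for source, stylesheet, and output.

```python
from pathlib import Path


def saxon_command(source, stylesheet, output):
    """Build the saxonb-xslt argument list for one SDMX-ML file."""
    return ["saxonb-xslt", f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}"]


def plan_transformations(input_dir, stylesheet, output_dir):
    """List one saxonb-xslt command per XML file found in input_dir."""
    commands = []
    for source in sorted(Path(input_dir).glob("*.xml")):
        output = Path(output_dir) / (source.stem + ".rdf")
        commands.append(saxon_command(source, stylesheet, output))
    return commands
```

Each command can then be handed to `subprocess.run(command, check=True)`; building the list first makes it easy to inspect what would run before committing to hours of transformation time.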
RDF Datasets
Here are some statistics for the transformations.
The command-line tool saxonb-xslt was used to conduct the XSL transformations. 12000M of memory was allocated on a machine with Intel(R) Xeon(R) CPU E5620 @ 2.40GHz. Linux kernel 3.2.0-33-generic was used. Table [Transformation time] provides information on the datasets: input SDMX-ML size, output RDF/XML size, the ratio between the two, and the total transformation time.
| Dataset | Input size | Output size | Ratio | Time |
|---|---|---|---|---|
| OECD | 3430 MB | 23000 MB | 1:6.7 | 131m25.795s |
| BFS | 87 MB | 139 MB | 1:1.6 | 2m38.225s |
| FAO | 902 MB | 5000 MB | 1:5.5 | 31m48.207s |
| ECB | 5670 MB | 24000 MB | 1:4.2 | 181m3.313s |
Input size (rounded) refers to the original data in XML, and the output (rounded) is the RDF/XML size. Time is the real process time.
Table [Transformed data] provides data on the transformed data; number of triples it contains, as well as the number of qb:Observations, and the ratio.
| Data | Number of triples | Number of observations | Ratio |
|---|---|---|---|
| OECD Dataset | 225 million | 24 million | 9.4:1 |
| OECD Metadata | 0.77 million | N/A | N/A |
| BFS Metadata | 1 million | N/A | N/A |
| FAO Dataset | 53 million | 7.2 million | 7.4:1 |
| FAO Metadata | 0.36 million | N/A | N/A |
| ECB Dataset | 241 million | 12.5 million | 19.3:1 |
| ECB Metadata | 0.45 million | N/A | N/A |
Metadata (from graph/meta) includes dataset structures and classifications. Ratio refers to the rounded ratio of the total number of triples (rounded) to the number of observations (rounded) in the dataset.
Table [Resource counts] provides further statistics on prominent resources. It gives a contrast between the classifications and the dataset.
| Resource | OECD | BFS | FAO | ECB |
|---|---|---|---|---|
| skos:ConceptScheme | 1212 | 185 | 32 | 149 |
| skos:Concept | 43368 | 106233 | 28115 | 54389 |
| rdf:Property | 126 | 0 | 12 | 209 |
| qb:Observation | 24381106 | 0 | 7186764 | 12513494 |
Count of resources in datasets.
Interlinking
Initial interlinking is done among the classifications themselves in the datasets. The OECD classifications in particular contained highly similar codes (in some cases the same) throughout their codelists. Hence, the majority of the codes within the CL_*_LOCATION codelists were interlinked with one another using skos:exactMatch.
Subsequent interlinking was done with DBpedia, World Bank Linked Data, Transparency International Linked Data, and EUNIS using LInk discovery framework for MEtric Spaces (LIMES): A Time-Efficient Hybrid Approach to Link Discovery (Ngonga Ngomo, A.-C., 2011).
| Source dataset | Target dataset | Entity type | Link relation | Link count |
|---|---|---|---|---|
| OECD | World Bank | dbo:Country, skos:Concept | skos:exactMatch | 3487 |
| OECD | Transparency International | dbo:Country, skos:Concept | skos:exactMatch | 3335 |
| OECD | DBpedia | dbo:Country, skos:Concept | skos:exactMatch | 3391 |
| OECD | BFS | skos:Concept, code:CL_STATES_AND_TERRITORIES, skos:Concept | skos:exactMatch | 3383 |
| BFS | World Bank | dbo:Country, skos:Concept | skos:exactMatch | 185 |
| BFS | DBpedia | dbo:Country, skos:Concept | skos:exactMatch | 261 |
| FAO | DBpedia | dbo:Species, skos:Concept | skos:exactMatch | 673 |
| FAO | EUNIS | e:sameSynonymFIFAO, skos:Concept | skos:exactMatch | 359 |
| ECB | World Bank | dbo:Country, skos:Concept | skos:exactMatch | 188 |
| ECB | Transparency International | dbo:Country, skos:Concept | skos:exactMatch | 167 |
| ECB | DBpedia | dbo:Country, skos:Concept | skos:exactMatch | 239 |
| ECB | BFS | skos:Concept, code:CL_STATES_AND_TERRITORIES | skos:exactMatch | 221 |
Figure [SDMX concept links] gives an overview of the complete connectivity of a concept that's linked internally, externally, and with sdmx-codes where applicable, as well as the interlinking that was done to an external concept.
Figure: SDMX Concept links
RDF Data Storage
Apache Jena’s TDB storage system is used to load the RDFized data using TDB’s incremental tdbloader utility. tdbstats, the tool for the TDB Optimizer, is executed after a complete load to internally update the count of resources, so that TDB can make better decisions when answering future queries.
Individual datasets from each organization were transformed to N-Triples format before loading into the store. Each RDF Data Cube dataset was imported into its own named graph in the store. Given the significant load speed on an empty database, the N-Triples files were ordered from largest to smallest, and then loaded.
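The largest-first ordering is simple enough to sketch in Python. The function name and the graph naming scheme (a base URI followed by the file name without its extension) are assumptions for illustration; the actual loading was scripted around tdbloader.

```python
from pathlib import Path


def load_order(directory, graph_base):
    """Pair each N-Triples file with a named graph, largest file first."""
    files = sorted(
        Path(directory).glob("*.nt"),
        key=lambda path: path.stat().st_size,
        reverse=True,
    )
    # Assumption: one named graph per file, named after the file's stem.
    return [(f"{graph_base}{path.stem}", path) for path in files]
```

Each resulting pair can then be passed on to the loader in turn, so that the biggest dataset goes in while the database is still empty.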
Publication
The publication steps are described in this section.
Dataset Discovery and Statistics
As a VoID file is generally intended to give an overview of the dataset metadata i.e., what it contains, ways to access it or query it, each dataspace contains Vocabulary of Interlinked Datasets (VoID) files accessible through their .well-known/void. Each OECD, BFS, FAO, and ECB VoID contains locations of the RDF datadumps, named graphs that are used in the SPARQL endpoint, used vocabularies, size of the datasets, interlinks to external datasets, as well as the provenance data which was gathered through the retrieval and transformation process. The VoID files were generated automatically by first importing the LODStats information into the respective graph/void in the store, and then running a SPARQL CONSTRUCT query to include all triples as well as additional ones which could be derived from the available information in all graphs.
Dataset statistics are generated and are included in the VoID file using LODStats, LODStats – An Extensible Framework for High-performance Dataset Analytics (Demter, J., 2012).
User-interface
The HTML pages are generated by the Linked Data Pages framework, where Moriarty, Paget, and ARC2 do the heavy lifting for it. Given the lessons learned over the years about Linked Data publishing, there is a consideration to either take Linked Data Pages further (originally written in 2010), or to adapt one of the existing frameworks after careful analysis.
SPARQL Endpoint
Apache Jena Fuseki is used to run the SPARQL server for the datasets. SPARQL Endpoints are publicly accessible and read-only at their respective /sparql and /query locations for OECD, BFS, FAO, and ECB. Currently, 12000MB of memory is allocated for the single Fuseki server running all datasets.
Data Dumps
The data dumps for the datasets are available from their respective /data/ directories: OECD, BFS, FAO, and ECB. Additionally, they are mentioned in the VoID files. The Data Hub entries (see below) also contain links to the dumps.
Source Code
The code for transformations is at csarven/linked-sdmx, and for retrieval and data loading to RDF store for OECD is at csarven/oecd-linked-data, for BFS is at csarven/bfs-linked-data, for FAO is at csarven/fao-linked-data, and for ECB is at csarven/ecb-linked-data. All of it is released under the Apache License 2.0.
Data License
All published Linked Data adheres to original data publisher’s data license and terms of use. Additionally attributions are given on the websites. The Linked Data version of the data is licensed under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
Announcing the Datasets
For other ways for these datasets to be discovered, they are announced at mailing lists, status update services, and at the Data Hub: OECD is at data/oecd-linked-data, BFS is at dataset/bfs-linked-data, FAO is at dataset/fao-linked-data, ECB is at dataset/ecb-linked-data.
Conclusions
With this work we provided an automated approach for transforming statistical SDMX-ML data to Linked Data in a single step. As a result, this effort helps to publish and consume large amounts of quality statistical Linked Data. Its goal is to shift focus from mundane development efforts to automating the generation of quality statistical data. Moreover, it facilitates the provision of RDF serializations alongside the existing formats used by high-profile statistical data owners. The approach of employing XSLT transformations does not require changes to well-established workflows at the statistical agencies.
One aspect of future work is to improve the SDMX-ML to RDF transformation quality and quantity. Regarding quality, the aim is to test the transformation with further datasets to identify shortcomings and special cases not yet covered by the implementation. Also, we plan the development of a coherent approach for (semi-)automatically interlinking different statistical dataspaces, which establishes links on all possible levels (e.g. classifications, observations). With regard to quantity, we plan to publish statistical dataspaces for Bank for International Settlements (BIS), International Monetary Fund (IMF), World Bank and Eurostat based on SDMX-ML data.
The current transformation is mostly based on the generic SDMX format. Since some of the publishers make their data available in compact SDMX format, the transformation toolkit has to be extended. Alternatively, the compact format can be transformed to the generic format first (for which tools exist) and then Linked SDMX transformations can be applied. Ultimately, we hope that Linked Data publishing will become a direct part of the original data owners' workflows and data publishing efforts. Therefore, further collaboration on this will expedite the provision of uniform access to statistical Linked Data.
There we have it. It is all very simple.