Linked SDMX Data
- Document ID
- CC BY-SA 4.0
- In Reply To
- Semantic Web Journal Call: 2nd Special Issue on Linked Dataset Descriptions
- Appeared In
- SWJ-LDD-2015 (Semantic Web, Linked Dataset Descriptions 2015), Tracking 454-1631, DOI 10.3233/SW-130123, ISSN 2210-4968, Volume 6, Issue 2, Pages 105–112
- Path to high fidelity Statistical Linked Data
As statistical data is inherently highly structured and comes with rich metadata (in the form of code lists, data cubes, etc.), it would be a missed opportunity not to tap into it from the Linked Data angle. At the time of this writing, there exists no simple way to transform statistical data into Linked Data, since the raw data comes in different shapes and forms. Given that SDMX (Statistical Data and Metadata eXchange) is arguably the most widely used standard for statistical data exchange, a great amount of statistical data about our societies is yet to be discoverable and identifiable in a uniform way. In this article, we present the design and implementation of SDMX-ML to RDF/XML XSL transformations, as well as the publication of the OECD, BFS, FAO, ECB, IMF, UIS, FRB, BIS, and ABS dataspaces with that tooling.
Introduction
While access to statistical data in the public sector has increased in recent years, a range of technical challenges makes it difficult for data consumers to tap into this data with ease. These challenges are particularly related to the following two areas:
- Automation of the transformation of data from high-profile statistical organizations.
- Minimization of third-party interpretation of the source data and metadata, and lossless transformations.
Development teams often face low-level, repetitive data management tasks when dealing with someone else's data. Within the context of Linked Data, one such task is to transform raw statistical data (e.g., SDMX-ML) into an RDF representation in order to be able to tap into what is out there in a uniform way.
The contributions of this article are two-fold. We present an approach for transforming SDMX-ML based on XSLT 2.0 templates and showcase our implementation, which transforms SDMX-ML data to RDF/XML. Following this, SDMX-ML data from the OECD (Organisation for Economic Co-operation and Development), BFS (Bundesamt für Statistik, the Swiss Federal Statistical Office), FAO (Food and Agriculture Organization of the United Nations), ECB (European Central Bank), IMF (International Monetary Fund), UIS (UNESCO Institute for Statistics), FRB (Federal Reserve Board), BIS (Bank for International Settlements), and ABS (Australian Bureau of Statistics) is retrieved, transformed, and published as Linked Data.
As pointed out in Statistical Linked Dataspaces (Capadisli, S., 2012), what linked statistics provide, and in fact enable, are queries across datasets: given that the dimension concepts are interlinked, one can follow an observation's dimension value to corresponding values in other datasets, enabling automated cross-dataset queries.
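To illustrate, the following is a minimal SPARQL sketch of such a cross-dataset lookup; the reference-area dimension property and the skos:exactMatch interlinks are assumptions for the example, and the actual dimension properties and link predicates vary per dataspace.

```sparql
# Minimal sketch (not the exact terms used by the published dataspaces):
# starting from an observation's reference-area code in one dataset, follow
# a skos:exactMatch interlink to the corresponding code in another dataset
# and retrieve observations that use it.
PREFIX qb:             <http://purl.org/linked-data/cube#>
PREFIX skos:           <http://www.w3.org/2004/02/skos/core#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>

SELECT ?obsA ?areaA ?areaB ?obsB
WHERE {
  ?obsA a qb:Observation ;
        sdmx-dimension:refArea ?areaA .
  ?areaA skos:exactMatch ?areaB .        # interlink between the two code lists
  ?obsB a qb:Observation ;
        sdmx-dimension:refArea ?areaB .
  FILTER (?obsA != ?obsB)
}
LIMIT 10
```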
Moreover, a number of approaches have been undertaken in the past to go from raw statistical data from the publisher to linked statistical data, as discussed in great detail in Official statistics and the Practice of Data Fidelity (Cyganiak, R., 2011). These approaches typically cover retrieval of the data, mostly in tabular formats (Microsoft Excel or CSV) or tree formats (XML with a custom schema, SDMX-ML, PC-Axis), followed by transformation into different RDF serialization formats. As far as graph formats go, the majority of datasets in those formats are not published by their owners. However, there are already a number of statistical linked dataspaces in the LOD Cloud.
A number of transformation efforts have been performed by the Linked Data community based on various formats. For example, the World Bank Linked Dataspace is based on custom XML that the World Bank provides through its APIs, with the application of XSL templates. The Transparency International Linked Dataspace's data is based on CSV files, with the transformation step carried out through Google Refine and its RDF Extension. That is, data sources provide different data formats for the public, with or without accompanying metadata (e.g., vocabularies, provenance). Linked Data teams are no exception to this repetitive work: they either have to carry out transformations by hand or, in the best case, semi-automatically. To the best of our knowledge, there is currently no automation of the transformation step. This is generally due to the difficulty of dealing with the quality and consistency of the statistical data that is published on the Web, as well as with data formats that are typically focused on consumption. Although SDMX-ML is the primary format of the high-profile statistical data organizations, it is yet to be taken advantage of.
The publication steps are described in this section.
Dataset Discovery and Statistics
Each dataspace contains Vocabulary of Interlinked Datasets (VoID) files accessible through its .well-known/void location. A VoID file is generally intended to give an overview of the dataset metadata, i.e., what it contains and ways to access or query it. Each of the OECD, BFS, FAO, ECB, IMF, UIS, FRB, BIS, and ABS VoID files contains the locations of RDF data dumps, the named graphs that are used in the SPARQL endpoint, the used vocabularies, the sizes of the datasets, interlinks to external datasets, as well as provenance data gathered through the retrieval and transformation process. The VoID files were generated automatically by first importing the LODStats information into the respective graph/void graph in the store, and then running a SPARQL CONSTRUCT query to include all of those triples as well as additional ones which could be derived from the available information in all graphs.
Dataset statistics are generated and included in the VoID file using LODStats (LODStats – An Extensible Framework for High-performance Dataset Analytics, Demter, J., 2012).
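As an illustration, a much simplified sketch of that CONSTRUCT step is shown below; the dataset and graph URIs are placeholders, and the actual query also copies data dump locations, named graphs, vocabularies, interlinks, and provenance into the VoID description.

```sparql
# Simplified sketch of the VoID-assembling CONSTRUCT query. The URIs below
# are placeholders for a given dataspace; the real query gathers considerably
# more (data dumps, named graphs, vocabularies, interlinks, provenance).
PREFIX void: <http://rdfs.org/ns/void#>

CONSTRUCT {
  <http://example.org/void#dataset> a void:Dataset ;
      void:sparqlEndpoint <http://example.org/query> ;
      void:triples ?triples ;
      void:distinctSubjects ?subjects .
}
WHERE {
  # Statistics previously imported from LODStats into the graph/void graph
  GRAPH <http://example.org/graph/void> {
    ?dataset void:triples ?triples ;
             void:distinctSubjects ?subjects .
  }
}
```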
The HTML pages are generated by the Linked Data Pages framework, where Moriarty, Paget, and ARC2 do the heavy lifting. Given the lessons learned over the years about Linked Data publishing, there is a consideration to either take Linked Data Pages (originally written in 2010) further, or to adopt one of the existing frameworks after careful analysis.
Apache Jena Fuseki is used to run the SPARQL server for the datasets. SPARQL endpoints are publicly accessible and read-only at the respective /query locations for OECD, BFS, FAO, ECB, IMF, UIS, FRB, BIS, and ABS. Currently, 12000 MB of memory is allocated to the single Fuseki server running all datasets.
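For example, a consumer could run a query like the following against one of the /query endpoints to list the data sets in a dataspace together with their observation counts; the query is only indicative and assumes the RDF Data Cube vocabulary used throughout the dataspaces.

```sparql
# Indicative consumer query against one of the public, read-only /query
# endpoints: list the qb:DataSets in the dataspace with observation counts.
PREFIX qb: <http://purl.org/linked-data/cube#>

SELECT ?dataset (COUNT(?obs) AS ?observations)
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet ?dataset .
}
GROUP BY ?dataset
ORDER BY DESC(?observations)
LIMIT 20
```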
The code for the transformations is at csarven/linked-sdmx. The code for retrieval and loading of data into the RDF store is at csarven/oecd-linked-data (OECD), csarven/bfs-linked-data (BFS), csarven/fao-linked-data (FAO), csarven/ecb-linked-data (ECB), csarven/imf-linked-data (IMF), csarven/uis-linked-data (UIS), csarven/frb-linked-data (FRB), and csarven/abs-linked-data (ABS). All code is released under the Apache License 2.0.
Announcing the Datasets
To provide other ways for these datasets to be discovered, they are announced on mailing lists, status update services, and at the Data Hub: oecd-linked-data (OECD), bfs-linked-data (BFS), fao-linked-data (FAO), ecb-linked-data (ECB), imf-linked-data (IMF), uis-linked-data (UIS), frb-linked-data (FRB), bis-linked-data (BIS), and abs-linked-data (ABS).
With this work we provided an automated approach for transforming statistical SDMX-ML data to Linked Data in a single step. As a result, this effort helps to publish and consume large amounts of quality statistical Linked Data. Its goal is to shift focus from mundane development efforts to automating the generation of quality statistical data. Moreover, it makes it possible to provide RDF serializations alongside the existing formats used by high-profile statistical data owners. Our approach of employing XSLT transformations does not require changes to well-established workflows at the statistical agencies.
One aspect of future work is to improve the SDMX-ML to RDF transformation in both quality and quantity. Regarding quality, we aim to test our transformation with further datasets to identify shortcomings and special cases not yet covered by the implementation. We also plan to develop a coherent approach for (semi-)automatically interlinking different statistical dataspaces, establishing links on all possible levels (e.g., classifications, observations). With regard to quantity, we plan to publish statistical dataspaces for the World Bank and Eurostat based on SDMX-ML data.
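On the interlinking aspect above, one plausible (and deliberately naive) heuristic at the code-list level is sketched below: proposing skos:exactMatch candidates between codes of two dataspaces whenever their skos:notation values coincide. The graph names are placeholders, and any candidates produced this way would still require validation before publication.

```sparql
# Naive code-list interlinking heuristic, for illustration only: propose
# skos:exactMatch candidates between codes from two dataspaces that share
# the same skos:notation. Graph names are placeholders; candidates need
# validation (notation clashes across unrelated code lists are common).
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
  ?codeA skos:exactMatch ?codeB .
}
WHERE {
  GRAPH <http://example.org/dataspaceA> {
    ?codeA a skos:Concept ;
           skos:notation ?notation .
  }
  GRAPH <http://example.org/dataspaceB> {
    ?codeB a skos:Concept ;
           skos:notation ?notation .
  }
  FILTER (?codeA != ?codeB)
}
```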
The current transformation is mostly based on the generic SDMX format. Since some of the publishers make their data available in the compact SDMX format, the transformation toolkit has to be extended. Alternatively, the compact format can be transformed to the generic format first (for which tools exist) and then the Linked SDMX transformations can be applied. Ultimately, we hope that Linked Data publishing will become a direct part of the original data owners' workflows and data publishing efforts. Further collaboration on this will therefore expedite the provision of uniform access to statistical Linked Data.
We thank Richard Cyganiak for his ongoing support, as well as for graciously offering to host the dataspaces on a server at the Digital Enterprise Research Institute. We also acknowledge the support of Bern University of Applied Sciences for partially funding the transformation effort for the pilot Swiss Statistics Linked Data project, and thank the Swiss Federal Statistical Office for the excellent collaboration from the very beginning.
Nicely done, Sarven! A TOC at the beginning of the article would be nice.
This paper presents a transformation from the ISO standard statistical data format SDMX-ML to RDF. URI patterns, interlinking, and publication are proposed for several statistical datasets. The paper is clearly written and the tools can be used for publishing other datasets, although some extensions or specific configuration might be necessary.
The only comments are that the paper is two pages too long according to the call. Also, it would be better if the datasets were published by the providers themselves.
This paper describes a practical workflow for transforming SDMX collections into Linked Data. The authors focus on four relevant statistical datasets:
- OECD: whose mission is to promote policies that will improve the economic and social well-being of people around the world.
- BFS Swiss Statistics: the Federal Statistical Office's web portal offers a wide range of statistical information including population, health, economy, employment and education.
- FAO: which works on achieving food security for all, to make sure people have regular access to enough high-quality food.
- ECB: whose main task is to maintain the euro's purchasing power and thus price stability in the euro area.
Nevertheless, the tool proposed in the paper can be easily used for transforming any other SDMX dataset into Linked Data.
On the one hand, statistical data are rich sources of knowledge that are currently underexploited. Any new approach is welcome, and this one describes an effective workflow for transforming SDMX collections into Linked Data. On the other hand, the approach is technically sound. It describes a simple but effective solution based on well-known tools, guaranteeing robustness and making its integration into existing environments easy. Thus, the workflow is a contribution by itself, and each stage describes how it impacts the final dataset configuration.
With respect to the obtained datasets, these are clearly described in Sections 7 and 8. They reuse well-known vocabularies and provide interesting interlinkage not only among themselves but also with DBpedia, World Bank, Transparency International and EUNIS. Apache Jena TDB is used to load the RDF and Apache Jena Fuseki is used to run the SPARQL endpoint. Datasets are also released as RDF dumps (referenced from the Data Hub).
Finally, it is relevant for me how scalability problems are addressed, because I think that 12 GB is an excessive amount of memory for processing these datasets (the largest one outputs fewer than 250 million triples). Do you have an alternative for processing larger datasets? Maybe you could partition the original dataset into fragments: is the tool flexible enough to support this? Please explain how scalability issues will be addressed to guarantee that big datasets can be effectively transformed.
This paper describes the Linked Data datasets, and the process used to generate them, for a set of SDMX-enabled datasets coming from relevant international and country-focused statistical organisations (some work has even been done during the review process on another IMF dataset).
The paper clearly fits the special issue call, the datasets are of good quality, and they are made available under open licenses in Linked Data form. I particularly like the extensive use of Data Cube and PROV-O, and the approach taken for versioning of code lists and concept schemes.
The comments provided in this review are mostly curiosities about some design decisions or requests to make things more clear:
- First of all, a large part of the paper is about the process used to generate the Linked Data datasets, but not so much about the datasets themselves or the problems in the original SDMX data, which certainly still exist in current SDMX implementations.
- One example is related to the handling of some natural-language descriptions of codes in code lists. These are simply ignored. Could they be handled differently? How many of the datasets present this situation?
- URI patterns. I find them nicely created and sensible. However, I would have some questions: How do you order dimensions in observations? Are you using the order of the dimensions in the data cube structures? Why, in the owl:Class part, do you only use the codelistID instead of also considering conceptIDs?
- The interlinkage of SDMX annotations should be better explained, probably with an example that illustrates how it works.
- The interlinking of datasets should be better explained as well. Are all found links correct? I have found many cases of compatible code lists that are produced as separate code lists by statistical offices and hence in principle are not linked, but could be. Have you found such cases?
Great work Sarven, I'm keen to do more work on this. I'm working for Stats NZ at the moment but won't be for too long as I've moved to Ireland. But with the emergence of the SDMX Global Registry soon you should be able to collect a lot of interlinking data from that using the SDMX StructureSet structures which can hold both a ConceptMap and a CodeMap; so you could theoretically create your skos:exactMatch directly from these structures. Once the global registry is in place it obviously needs the key players to populate their data collaboratively.
We are building an RDF based classification system in SNZ which should be able to enrich the experience of doing this sort of thing in the future.
I'll have a play at this, trying to get a few of our SDMX datasets into RDF, and will be in touch. We have an SDMX Registry in v2.1 with the data exposed using v2.0, but I'd like to get the transforms working with v2.1 for at least the structures (and in preparation for the consolidated messages introduced in v2.1).