Sparqlines: SPARQL to Sparkline

Authors
Sarven Capadisli
Identifier
http://csarven.ca/sparqlines-sparql-to-sparkline
Notifications Inbox
inbox/
Published
Modified
In Reply To
SemStats 2016 Call for Contributions
Call for Linked Research
Appeared In
CEUR (Central Europe workshop proceedings): Proceedings of the 4th International Workshop on Semantic Statistics, Volume 1654, urn:nbn:de:0074-1654-1
License
CC BY 4.0

Abstract

This article presents sparqlines: statistical observations fetched from SPARQL endpoints and displayed as inline charts. An inline chart, also known as a sparkline, is concise and located where it is discussed in the text, complementing the supporting text without breaking the reader’s flow. For example: GDP per capita growth (annual %) [Canada] (sparkline), as claimed by the World Bank Linked Dataspace. We demonstrate an implementation which allows scientists or authors to easily enhance their work with sparklines generated from their own or public statistical linked datasets. This article includes an active demonstration accessible at http://csarven.ca/sparqlines-sparql-to-sparkline.

Keywords

Introduction

In this article we introduce sparqlines, an integration of statistical data retrieval using SPARQL with displaying observations in the form of word-size graphics: sparklines. We describe an implementation which is part of a Web-based authoring tool (dokieli). We cover how the data is modelled and exposed in order to be suitable for embedding; demonstrate how to embed data as both static and dynamic sparklines and discuss the technical requirements of each; and walk through the user interactions to do so.

Our contribution is the generation of a well-established visual aid for reading statistical data (the sparkline) directly from the dataset itself, at the time of authoring the supporting text and as part of the writing workflow. This enables authors who already publish data to use it directly, encourages them to make their data available for others to use, and gives readers an easy way to better understand the information.

We conclude with a discussion, including design considerations. The code of our implementation is open source, and we invite you to try it out and make requests for more advanced features: https://github.com/linkeddata/dokieli.

Data Provision

In order to use sparqlines, data has to be both well-formed and available over a SPARQL endpoint. Here we briefly discuss both of these requirements.

The RDF Data Cube vocabulary is used to describe multidimensional statistical data. It makes it possible to represent significant amounts of heterogeneous statistical data as Linked Data which can be discovered and identified in a uniform way. To qualify for consumption as a sparqline, the data must conform to some of the integrity constraints of the RDF Data Cube model, e.g., IC-1 (Unique DataSet), IC-11 (All dimensions required), IC-12 (No duplicate observations), IC-14 (All measures present).
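To make the constraints concrete, here is a minimal sketch of what IC-11, IC-12 and IC-14 require, applied to observations already loaded into memory. The RDF Data Cube specification states these constraints as SPARQL ASK queries over the graph; the Python function below, its name, and the dictionary layout are assumptions made for illustration only, and the sample values are placeholders rather than World Bank figures.

```python
def check_observations(observations, dimensions, measures):
    """Return (constraint, index) pairs for observations that break a rule.

    observations: list of dicts mapping component names to values
    dimensions, measures: component names required by the structure definition
    """
    violations = []
    seen_keys = set()
    for index, obs in enumerate(observations):
        # IC-11 analogue: every dimension must have a value.
        if any(obs.get(d) is None for d in dimensions):
            violations.append(("IC-11", index))
            continue
        # IC-14 analogue: every measure must have a value.
        if any(obs.get(m) is None for m in measures):
            violations.append(("IC-14", index))
        # IC-12 analogue: no two observations share all dimension values.
        key = tuple(obs[d] for d in dimensions)
        if key in seen_keys:
            violations.append(("IC-12", index))
        seen_keys.add(key)
    return violations


# Placeholder observations (invented values, not real statistics).
sample = [
    {"refArea": "CA", "refPeriod": "2012", "obsValue": 0.5},
    {"refArea": "CA", "refPeriod": "2013", "obsValue": None},
    {"refArea": "CA", "refPeriod": "2012", "obsValue": 0.7},
    {"refArea": "CA", "obsValue": 1.1},
]
print(check_observations(sample, ["refArea", "refPeriod"], ["obsValue"]))
```

A dataset for which this returns an empty list has one value per dimension combination, which is what a time-series sparkline needs.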

Additional enrichments on the data cubes can improve their discovery and reuse. Examples include, but are not limited to, providing human-readable labels for the datasets (with language tags), classifications, and data structure definitions, as well as provenance-level data such as the license and the date of last update.

To support user interfaces that work with a group of observations in a dataset, slices should be made available in the data. This enables consuming applications to dissect datasets (through SPARQL queries) for arbitrary subsets of observations. For example, while it is possible to construct a general query to get all of the observations in a dataset which have a particular dimension, it may be preferable to only query for such subsets provided that their structures can be identified and externally referenced. In the case of sparklines, one common use case for slices is to present data as time series.

SPARQL queries are used to filter for graph patterns in the RDF Data Cube datasets. Depending on the user interface application, there may be multiple queries made to the SPARQL endpoints in order to filter the data based on user input. For example, an initial query may be a cursory inspection to discover suitable datasets with given parameters, e.g., what the dataset is about, the type of dimensions and their values, and subsequent queries may be to retrieve the matching datasets or slices with observations and their measure values.
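The two kinds of query described above could be sketched as follows: one to discover datasets by label and reference area, and one to retrieve the observations of a chosen dataset. This is an illustration, not the queries dokieli actually sends. It assumes the datasets use sdmx-dimension:refArea, sdmx-dimension:refPeriod and sdmx-measure:obsValue, and the function names and the example.org URIs are hypothetical.

```python
from urllib.parse import urlencode

PREFIXES = """\
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
"""


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def discovery_query(concept, ref_area_uri):
    """Datasets whose label mentions the concept and which have at least
    one observation for the reference area (URIs are assumed trusted)."""
    return PREFIXES + f"""\
SELECT DISTINCT ?dataset ?label WHERE {{
  ?dataset a qb:DataSet ; rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escape_literal(concept)}")))
  FILTER EXISTS {{
    ?obs qb:dataSet ?dataset ;
         sdmx-dimension:refArea <{ref_area_uri}> .
  }}
}}"""


def observations_query(dataset_uri, ref_area_uri):
    """Observation values of one dataset for one area, ordered by period."""
    return PREFIXES + f"""\
SELECT ?period ?value WHERE {{
  ?obs qb:dataSet <{dataset_uri}> ;
       sdmx-dimension:refArea <{ref_area_uri}> ;
       sdmx-dimension:refPeriod ?period ;
       sdmx-measure:obsValue ?value .
}} ORDER BY ?period"""


def query_url(endpoint, query):
    """SPARQL protocol GET request URL with the query as a parameter."""
    return endpoint + "?" + urlencode({"query": query})


print(query_url("http://example.org/sparql",
                discovery_query("GDP", "http://example.org/area/CA")))
```

Keeping discovery and retrieval separate mirrors the text: the first query is cheap and drives the user's choice, and only the second transfers observation values.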

Static and Dynamic Sparqlines

The data behind a sparqline can be static: a fixed historical set to which no new points are added; or dynamic: subject to change as new data is gathered. Both of these cases are accommodated by our implementation.

Static and Dynamic Sparklines

Static
  Use: Historical data or a fixed snapshot
  Methods:
  • Pre-rendered SVG
  • Embedded directly from datastore
Dynamic
  Use: Data which may be subject to updates
  Methods:
  • Re-fetches data on page load or polls in real time
  • Embed source as API endpoint which returns the sparkline
  Example: (reload article in browser)

Embedding Sparqlines

Our implementation allows authors to select text they have written which describes the data they want to visualise; it searches available datasets for those relevant to the text, and lets the user choose the most appropriate if there’s more than one. The sparqline is inserted along with a reference to the source.

A specific example workflow is demonstrated when this article is viewed in a Web browser (at its canonical URL: http://csarven.ca/sparqlines-sparql-to-sparkline). Enable the mode from the menu and highlight the text GDP of Canada. What occurs is as follows:

  1. User enters text in a sentence e.g., GDP of Canada.
  2. User selects text GDP of Canada with their mouse or keyboard.
  3. User selects the sparkline option from the presented authoring toolbar.
  4. The input text is split into two: 1) GDP and 2) Canada segments, whereby the first term is the concept, and the second is a reference area. Reference areas are disambiguated against an internal dictionary.
  5. System constructs a SPARQL query URL and sends it to the World Bank Linked Dataspace endpoint, looking for a graph pattern in which the dataset labels contain GDP and there is at least one observation for the reference area Canada.
  6. User is given a list of datasets to select from which match the above criteria, and the user selects desired dataset.
  7. System sends a SPARQL query to get the observations of the selected dataset for Canada.
  8. A sparkline is created and displayed for the user, also indicating the number of observations it has.
  9. If the user is happy with this visualisation they include it in the text. A hyperlink to the dataset and a sparkline SVG are inserted back into the sentence, replacing GDP of Canada with GDP per capita growth (annual %).
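Steps 4, 7 and 8 above can be sketched roughly as follows. This is not dokieli's code (dokieli is a browser-based tool). The splitting rule on " of ", the REFERENCE_AREAS dictionary with its example.org URI, and all function names are assumptions for illustration; the result structure follows the standard SPARQL JSON results format.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-in for the internal reference-area dictionary.
REFERENCE_AREAS = {"canada": "http://example.org/area/CA"}


def split_selection(text):
    """Split e.g. 'GDP of Canada' into a concept and a reference-area URI."""
    concept, separator, area = text.partition(" of ")
    if not separator:
        raise ValueError("expected '<concept> of <reference area>'")
    uri = REFERENCE_AREAS.get(area.strip().lower())
    if uri is None:
        raise KeyError(f"unknown reference area: {area!r}")
    return concept.strip(), uri


def values_from_results(results, variable="value"):
    """Extract numeric values from a SPARQL JSON results document."""
    return [float(b[variable]["value"]) for b in results["results"]["bindings"]]


def sparkline_svg(values, title, width=100.0, height=20.0):
    """Render values as a word-size SVG polyline scaled to the given box."""
    if len(values) < 2:
        raise ValueError("a sparkline needs at least two observations")
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero for a flat series
    step = width / (len(values) - 1)
    points = " ".join(
        # SVG y grows downwards, so the largest value maps to y = 0.
        f"{i * step:.1f},{height - (v - low) / span * height:.1f}"
        for i, v in enumerate(values)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width:g}" '
        f'height="{height:g}" viewBox="0 0 {width:g} {height:g}">'
        f"<title>{escape(title)} ({len(values)} observations)</title>"
        f'<polyline fill="none" stroke="currentColor" points="{points}"/></svg>'
    )


print(split_selection("GDP of Canada"))
print(sparkline_svg([0.0, 1.0, 0.5], "Placeholder series"))
```

The observation count in the SVG title corresponds to step 8, where the user is told how many observations the sparkline contains.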

Figure [2] is a video screencast of this interaction.

Video of Sparqlines interaction in dokieli

Semantic Publishing

Our implementation in dokieli automatically includes semantic annotations within the embedded sparqlines. The sparqline resource has its own URI that can be used for global referencing. The RDF statements are represented using the HTML+RDFa syntax, and they preserve the following information:

  • The part of the document to which the sparqline belongs (rel="schema:hasPart").
  • The human-readable name for the figure (based on the dataset used), where it was derived from (the qb:DataSet instance), and the generated SVG.
  • The SVG resource has statements to indicate:
    • linked statistical dataset which was used (rel="prov:wasDerivedFrom").
    • human-readable name of the dataset (property="schema:name").
    • license for the generated SVG (rel="schema:license").
    • further information for each qb:Observation (rel="rdfs:seeAlso").

This information can be discovered and parsed as RDF, making it easy for third-party applications to access and reuse. For example, another author can cite or include these sparqlines in their work.
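As an illustration of how such statements might be serialised, the sketch below builds an HTML+RDFa fragment that carries the properties listed above. The element structure, the function name, and the example URIs are assumptions; dokieli's actual markup may well be arranged differently, and the output should be checked with an RDFa processor before being relied on.

```python
from html import escape


def annotated_sparqline(svg_uri, dataset_uri, dataset_name, license_uri,
                        observation_uris, points):
    """Return an HTML+RDFa fragment describing one embedded sparqline."""
    def attr(value):
        return escape(value, quote=True)

    # One rdfs:seeAlso link per observation behind the chart.
    see_also = "".join(
        f'<a rel="rdfs:seeAlso" href="{attr(uri)}"></a>'
        for uri in observation_uris
    )
    return (
        # The enclosing document has the sparqline resource as a part.
        f'<span rel="schema:hasPart" resource="{attr(svg_uri)}">'
        # The sparqline was derived from the dataset, which carries its name.
        f'<a rel="prov:wasDerivedFrom" href="{attr(dataset_uri)}">'
        f'<span property="schema:name">{escape(dataset_name)}</span></a> '
        f'<svg xmlns="http://www.w3.org/2000/svg" width="100" height="20">'
        f'<polyline fill="none" stroke="currentColor" points="{attr(points)}"/>'
        f"</svg>"
        f'<a rel="schema:license" href="{attr(license_uri)}"></a>'
        f"{see_also}</span>"
    )


print(annotated_sparqline(
    "#sparqline-1",
    "http://example.org/dataset/gdp",
    "GDP per capita growth (annual %)",
    "https://creativecommons.org/licenses/by/4.0/",
    ["http://example.org/observation/1"],
    "0,20 50,0 100,10",
))
```

The license and rdfs:seeAlso links are placed as siblings of the svg element so that they take the sparqline resource, rather than a nested resource, as their subject.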

Discussion and Conclusions

We have presented a preliminary implementation of sparklines generated from SPARQL endpoints and embedded directly through an authoring tool. This allows authors to visualise their data in an optimal way without breaking their workflow. However, there is a lot of scope for future work in this area. We now discuss some areas for further development.

Design principles: Tufte makes recommendations on readability, and applies Cleveland’s analytical method of choosing aspect ratios, banking to 45° [2, 10, 11]. Cleveland’s method has been extended to generate banked sparklines in which the vertical dimension is fixed to fit a typographical line. These approaches help maximise the clarity of the line segments [12]. Applying these methods in dokieli is future work (issue 159).
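One simple banking heuristic, the median-absolute-slope variant, can be sketched as follows: choose the aspect ratio so that the median absolute slope of the line segments appears at 45° on screen, then derive the width from a height fixed to the text line. This is an illustration under that assumption, not the method planned for dokieli; the function name is hypothetical.

```python
from statistics import median


def banked_width(xs, ys, height):
    """Width at which the median absolute segment slope is drawn at 45 degrees.

    With aspect ratio a = width / height, a data slope s is drawn with slope
    s * (x_range / y_range) / a, so setting the median drawn slope to 1 gives
    a = median(|s|) * x_range / y_range.
    """
    points = list(zip(xs, ys))
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
        if x1 != x0
    ]
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    if not slopes or y_range == 0:
        raise ValueError("banking needs a non-flat series with distinct x values")
    aspect_ratio = median(slopes) * x_range / y_range  # width / height
    return aspect_ratio * height


# A zigzag over two unit steps, drawn 10 units high.
print(banked_width([0, 1, 2], [0, 1, 0], height=10))
```

For a sparkline the height is dictated by the typographical line, which is why the function solves for width rather than height.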

Dataset interaction: Building on existing work in faceted searching and browsing of RDF data, authors can explore suitable datasets with a combination of natural-language search and filtering through available datasets and dimensions of interest. This approach is convenient for datasets in RDF Data Cubes since they are highly structured and classified. Further work is needed to improve the disambiguation of the author’s natural-language input in order to discover appropriate URIs in the dataset.

Privacy considerations: Many researchers collect experimental data which has sensitive or identifiable information. This information should not be exposed through public SPARQL endpoints. Measures such as access control lists can allow researchers to generate sparqlines over sensitive data.

Data availability: SPARQL endpoints are notoriously unreliable and they may have high setup costs for new datasets. Applications which rely on endpoints to generate sparqlines with dynamic data may want to initially include in the article a locally cached copy of the data from the last successful access. The application can then asynchronously fetch or subscribe for new updates.
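The cached-copy idea can be sketched as below. For brevity the sketch is synchronous rather than asynchronous, and the function name, the JSON cache file, and the fetch callable are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_values(fetch, cache_path):
    """Return (values, is_fresh), falling back to the cache if fetch fails.

    fetch: callable returning a list of numbers, e.g. by querying an endpoint
    cache_path: file holding the values from the last successful access
    """
    cache = Path(cache_path)
    try:
        values = fetch()
    except OSError:
        # urllib's URLError and socket timeouts are OSError subclasses.
        if cache.exists():
            return json.loads(cache.read_text()), False
        raise
    # Remember the latest successful result for the next outage.
    cache.write_text(json.dumps(values))
    return values, True
```

A caller can use the second element of the result to mark a sparqline as showing cached rather than current data.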

Acknowledgements

The motivation and work on sparqlines was inspired by Edward Tufte’s education and popularisation of sparklines. Special thanks to Amy Guy and Ilaria Liccardi for their great support and tireless nagging to get this written up, as well as Jindřich Mynarz for help with SPARQL query optimisations.

References

  1. Zelchenko, P., Medved, M.: QuoteTracker, http://pete.zelchenko.com/portfolio/screen/2gk.htm
  2. Tufte, E.: Beautiful Evidence, Graphics Press, 2006, ISBN 9781930824164, http://www.worldcat.org/title/beautiful-evidence/oclc/70203994&referer=brief_results
  3. Meharia, P.: Use of Visualization in Digital Financial Reporting: The effect of Sparkline (2012). Theses and Dissertations--Business Administration. Paper 1, http://uknowledge.uky.edu/cgi/viewcontent.cgi?article=1000&context=busadmin_etds
  4. Google Docs Sparklines, https://support.google.com/docs/answer/3093289?hl=en
  5. Cyganiak, R., Reynolds, D.: The RDF Data Cube vocabulary, W3C Recommendation, 2014, https://www.w3.org/TR/vocab-data-cube/
  6. Skjæveland, M. G.: Sgvizler: A JavaScript Wrapper for Easy Visualization of SPARQL Result Sets, 2012, http://2012.eswc-conferences.org/sites/default/files/eswc2012_submission_303.pdf
  7. Hage, W. R. v., Erp, M. v., Malaisé, V.: Linked Open Piracy: A story about e-Science, Linked Data, and statistics (2012), http://www.few.vu.nl/~wrvhage/papers/LOP_JoDS_2012.pdf
  8. Salas, P. E. R., Mota, F. M. D., Martin, M., Auer, S., Breitman, K., Casanova, M. A.: Publishing Statistical Data on the Web, ISWC (2012), http://svn.aksw.org/papers/2012/ESWC_PublishingStatisticData/public.pdf
  9. Capadisli, S., Auer, S., Riedl, R.: Linked Statistical Data Analysis, ISWC SemStats (2013), http://csarven.ca/linked-statistical-data-analysis
  10. Tufte, E.: Sparkline theory and practice, Edward Tufte forum, http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR&topic_id=1
  11. Cleveland, W.: Visualizing Data, Hobart Press, 1993, ISBN 9780963488404, http://dl.acm.org/citation.cfm?id=529269
  12. Heer, J., Agrawala, M.: Multi-Scale Banking to 45°, IEEE Transactions on Visualization and Computer Graphics, Vol. 12, No. 5, 2006, http://vis.berkeley.edu/papers/banking/2006-Banking-InfoVis.pdf

Interactions

Ruben Verborgh replied on

I remember Denny showing me something similar a couple of years ago: Spark (http://km.aifb.kit.edu/sites/spark/). I'm not 100% sure it did sparklines, but my memory and the very name of the tool suggest so.

Sarven Capadisli replied on

Thanks for the heads-up Rube. I wasn't able to reproduce rdf-spark's data points example. It doesn't look like the project is maintained. Is there an article or a screenshot demonstrating a sparkline?

Please bear in mind that sparklines are not ordinary line-charts. There are a number of 'SPARQL to visualisation' tools out there capable of creating line-charts. I've also created them in the past for http://worldbank.270a.info/ . This is in and of itself not new to the (semantic) statistics community either. People have been doing similar work with or without SPARQL even. See R, Plotly etc.

AFAIK, Sgvizler is arguably one of the most diverse in producing SPARQL to visualisations that uses the Google Charts API. I've just added that to the related work since it actually exemplifies a sparkline. It escaped me earlier that it did.

The focus here is specifically on Tufte's description of sparklines. This work has particular focus to 1) sentence level sparklines 2) data from RDF Data Cube over 3) SPARQL endpoint, 4) generating SVG, that's 5) semantically annotated/related to the rest of the article in context of 6) an authoring tool. [Perhaps someone should make this more clear in the abstract or conclusion or something :P]

Ruben Verborgh replied on

Yes, it has been some years, not everything is still in working order. (Did I mention it relies on SPARQL endpoints ;-)

As I wrote, I'm not sure whether there were actual sparklines in there; I seem to vaguely remember there were, but I might be confusing it with another demo I saw at a later point. It has been 5 years since Denny showed me. If anybody knows, it would be him.

+1 for Tufte, of course, but—if I'm very critical—it seems that 3 is covered by Spark, 1 by various libraries, and 4 also. So the novelty would be in 2/5/6.

In other news, I'm curious how things like this can be ported to lightweight server-side interfaces, i.e., what features we would need to equip a server with in order to realize this at minimal cost.

Anonymous Reviewer replied on

The paper presents an implementation of sparklines, i.e. inline charts that allow a text to be read without unnecessary breaks in the reading flow. The implementation is named "sparqlines", and is based on fetching data from SPARQL endpoints. Data can be both static and dynamic.

The contribution of the paper is original and very interesting. The presentation can be slightly improved by providing some more detail on implementation aspects.

Anonymous Reviewer replied on

This is a demo contribution that introduces an implementation for creating sparkline graphics from RDF data cubes. For sure this implementation is of value for the community as it provides to users a new way for exploiting RDF cubes but also because it will contribute to the understanding of the requirements of exploiting RDF cubes.

Some suggestions for improvement:

  • This implementation is part of dokieli (a software tool for decentralised editing) and the link to the implementation's code leads to dokieli’s GitHub page. So, for the cohesion and readability of the paper, it would be good for the author to also describe dokieli in more detail.
  • Section 3 on Data Provision is important because it proves that the author has taken into account the reusability of the tool across distinct datasets. The tool expects to consume RDF Data Cubes that comply with some of the integrity constraints of the vocabulary and also encourages the creation of qb:Slice. I think that this discussion is very important.
  • I understand that the implementation is only a proof-of-concept and thus it makes many assumptions. But I think that these assumptions should be clearly stated. For example, how is a timeline selected from a multi-dimensional (or even multi-measure) cube? Even if slices have been defined, what is the process of choosing one of the many slices? How can the user evaluate whether the created graph is the proper one?

In general I think that dokieli and sparqlines is a contribution that will provide added value to the workshop.

Anonymous Reviewer replied on

The paper presents a system for generating sparklines (small “word-size” inline charts) from SPARQL endpoints that contain data represented in the Data Cube Vocabulary. There is an editing environment that helps generating an appropriate chart from a text input, and a facility for embedding the resulting graphic, both as a static image and as a dynamic component.

What I like about the paper: It shows a way of hiding a lot of complexity to get a simple and aesthetically pleasing result. It's pretty cool! The paper is well written and well presented.

What I don't like: First, sparklines have limited applicability and this is not sufficiently discussed. The example in the abstract already shows some of the problems: Is this the GDP over the last decade or the last century? What is the range of the Y axis? Sparklines show a trend, but don't show axis ranges. They are only applicable where the axis ranges don't matter, or are clear from the context (e.g., in a document that discusses GDP changes throughout the 20th century, or on a 24-hour dashboard, the X axis would be clear; and if a percentage is measured, the Y axis is more clear). They facilitate comparison (Tufte: “small multiples”), so work best when there are many. The paper shows little awareness of these strengths and weaknesses.

Second, the paper is an interesting systems demonstration but there is very little science here. What is the contribution? In what way does the paper show an improvement over the state of the art? Does it line up evidence to show that this improvement has indeed been achieved? (For example, is the intended contribution to make it easier for document authors to use sparklines? Or is the intended contribution to make use of the untapped potential in these SPARQL endpoints? Or something else?)

Third, the paper talks about “enhancing the work of scientists and authors”, so presumably the vision is to have sparklines show up in more and more scientific papers and other kinds of documents. This shakes up the notion of what a paper/document is. We now have a “front-end” that fetches and displays data from some data source. So, the paper becomes more like a dashboard? This raises questions around the purpose of a paper, around archiving, snapshots versus living documents, and so on. The discussion of static versus dynamic modes hints at some awareness of this, but I would have liked to see a vision “beyond the PDF for printing to paper” spelled out.

Fourth, more details on the text-to-chart component, its applicability and potential and limitations, and how data needs to be organised to enable and support such a component, would have been good. At the moment I am not confident that it is more than a gimmick that only works in very limited cases.

Spelling/grammar:

  • Abstract: readers flow -> reader's flow
  • Section 3: in dataset -> in a dataset
  • Section 5: Systems constructs
  • Section 5: User is is
  • Section 6: which be used
