Linked Statistical Data Analysis

ISWC SemStats, Sydney, 2013-10-22

#LinkedData #ISWC2013 #SemStats2013

Sarven Capadisli http://csarven.ca/#i @csarven

http://csarven.ca/linked-statistical-data-analysis http://stats.270a.info/ http://csarven.ca/presentations/lsd-analysis
What and Why?

Fun and profit
Statistical Data
- e.g., personal, government, health, .. macrodata about societies
- Uncovering insights, making predictions, decisions
- Understanding human societies
Statistical Data on the Web (Characteristics)
- Heterogeneous
- Decentralized
- Structured
- High volume
- Formats (e.g., CSV, Excel, PC-Axis, SDMX-ML, XML)
Clean? Synchronised? Comparable? Provenance? Trustable?
Interesting queries?
- Number of people who were born in Sydney before 1900
- List of project names in countries which are classified to have low to middle income that are situated above the equator
Interesting analysis?
- analysis which is statistically significant, and has to do with GDP and health subjects
- a list of indicator pairs with strong correlations
- using the line of best fit of a regression analysis to predict or forecast possible outcomes
- countries with a lower-than-average mortality rate and high corruption
Statistical Linked Dataspaces
A Linked Dataspace

from Statistical Linked Dataspaces
Statistical Linked Dataspaces (2010-2011)
- Central Statistics Office Ireland
- Eurostat
Statistical Linked Dataspaces (2012)
- World Bank (WB)
- Transparency International (TI)
Statistical Linked Dataspaces (2013-present)
- Organisation for Economic Co-operation and Development (OECD)
- Swiss Federal Statistics (BFS)
- Food and Agriculture Organization of the United Nations (FAO)
- European Central Bank (ECB)
- International Monetary Fund (IMF)
270a Cloud
Statistical Linked Data vocabularies
- RDF Data Cube: Data structure definitions, code lists, datasets, ..
- SKOS: Code lists, and concepts can be reused
- XKOS: Hierarchical concept schemes
- VoID: vocabulary for dataset metadata
- PROV-O: provenance
- British reference periods, DC Terms, FOAF, ..
Data modeling (towards RDF Data Cube)
- Custom XML to RDF/XML (WB)
- Excel/CSV to RDF Turtle (TI)
- Linked SDMX Data (XSLT) tooling to RDF/XML (OECD, FAO, BFS, ECB, IMF) <----- win
- Provenance, LODStats
Linked SDMX Transformation
Linked SDMX Concepts

Transformed data

Data	Number of triples	Number of observations	Ratio (triples:observations)
Total	~ 1 billion	82.5 million	-
WB datasets	221 million	21 million	10.5:1
TI datasets	50 thousand	4 thousand	12.5:1
OECD datasets	225 million	24 million	9.4:1
BFS metadata	1 million	N/A	N/A
FAO datasets	53 million	7.2 million	7.4:1
ECB datasets	427 million	23 million	18.5:1
IMF datasets	36 million	3.3 million	10.9:1

Linked Statistical artefacts
- Dataset: http://worldbank.270a.info/dataset/world-bank-finances
- Observation: http://ecb.270a.info/dataset/SEE/A/AT/WBR0/EXT/X/E/2011
- Dimension (property): http://oecd.270a.info/property/TIME
- Measure (property): http://ecb.270a.info/property/OBS_VALUE
- Attribute: http://transparency.270a.info/classification/attribute/matching-percentiles
- Concept: http://imf.270a.info/code/1.0/CL_AREA/CH
- Code list: http://fao.270a.info/code/0.1/CL_UN_COUNTRY
- Regression Analysis: http://stats.270a.info/analysis/worldbank:GC.DOD.TOTL.GD.ZS/transparency:CPI2009/2009
Cool URIs? 1, 5, 100, 10000 years?
Interlinking
- Identified resources with high value (i.e., "usefulness"), e.g., reference areas, currencies
- Some manual linking
- Used LIMES to semi-automatically link (several threshold tests and reviewing)
Provenance
- At retrieval, transformation, and publication phases
- Provenance activities are linked, e.g.:
  - a transformation activity is derived from a retrieval activity
  - new data (like analysis) is derived from other activities and sources
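
The linking of activities can be pictured as a short graph walk. This is a minimal sketch, not the deployed data model: it assumes activities are linked with prov:wasInformedBy, and every identifier below (the ex: names) is hypothetical, since the slides do not state which properties or URIs are used.

```python
# Hypothetical provenance links: (activity, property, earlier activity).
# The property choice and all identifiers are assumptions for illustration.
PROVENANCE = [
    ("ex:analysis-1", "prov:wasInformedBy", "ex:transformation-1"),
    ("ex:transformation-1", "prov:wasInformedBy", "ex:retrieval-1"),
]


def trace(activity, links):
    """Return the chain of activities leading up to `activity`, newest first."""
    earlier = {subject: obj for subject, _, obj in links}
    chain = [activity]
    # Follow the links backwards until an activity has no recorded predecessor.
    while chain[-1] in earlier:
        chain.append(earlier[chain[-1]])
    return chain


print(trace("ex:analysis-1", PROVENANCE))
```

Walking the chain from an analysis back to its retrieval step is what lets a reader check where a result came from.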
Provenance
- Data source
- License
- Related resource
- Creator, publisher
- Issued, modified dates
- ...
stats.270a.info

A human- and machine-friendly Web application that runs federated queries over statistical linked dataspaces and generates analyses and visualisations.

Intended to let non-developers discover statistical analyses.
Related work
- Performing Statistical Methods on Linked Data
- Defining and Executing Assessment Tests on Linked Data for Statistical Analysis
- Linked Open Piracy: A story about e-Science, Linked Data, and statistics
- Towards Next Generation Health Data Exploration
- Publishing Statistical Data on the Web
- Google Public Data Explorer / Gapminder
- Generating Possible Interpretations for Statistics from Linked Open Data
What's common above? Single data endpoint
stats.270a.info Toolkit
- Shiny server (node)
- R (Shiny, SPARQL packages)
- Jena Fuseki
- Apache
- Linked Data Pages
I/O
- [Input] User selects analysis (covariants, reference period):
  - Use Shiny server cache, else;
  - Check local RDF store, else;
  - Submit federated queries
- [Output] Federated query response:
  - Build (regression) analysis with R
  - Store analysis results, and provenance data to local RDF store
  - Generate a plot
  - Return analysis information and plot to UI
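
The lookup order above can be sketched as a fall-through from the cheapest source to the most expensive one. This is an illustration under stated assumptions, not the application's code: the real system uses Shiny, R and Jena Fuseki, while here the cache and the local RDF store are plain dicts, and `run_federated_query` and `build_analysis` are hypothetical stand-ins. Copying a local-store hit into the cache is also an assumption.

```python
def get_analysis(key, cache, local_store, run_federated_query, build_analysis):
    """Resolve an analysis request, trying the cheapest source first."""
    if key in cache:                     # 1. server cache
        return cache[key], "cache"
    if key in local_store:               # 2. local RDF store (a dict here)
        cache[key] = local_store[key]    # assumption: hits are cached
        return cache[key], "local store"
    rows = run_federated_query(key)      # 3. federated queries
    result = build_analysis(rows)        # regression analysis (R in the source)
    local_store[key] = result            # store results (provenance omitted)
    cache[key] = result
    return result, "federated query"


# Usage with stand-in functions and made-up rows.
key = ("worldbank:SP.DYN.IMRT.IN", "transparency:CPI2009", "year:2009")
cache, store = {}, {}
fake_query = lambda k: [(1.0, 2.0), (2.0, 4.0)]
fake_build = lambda rows: {"n": len(rows)}

print(get_analysis(key, cache, store, fake_query, fake_build))
print(get_analysis(key, cache, store, fake_query, fake_build))
```

The second call is served from the cache, so the federated query runs only once per analysis.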
Stats analysis vocabulary

@prefix stats: <http://stats.270a.info/vocab#>
Analysis user-interface
Analysis user-interface (Plot) 1/3

http://stats.270a.info/analysis/worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009
Analysis user-interface (Summary) 2/3

http://stats.270a.info/analysis/worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009
Oh yeah?

Provenance user-interface
Analysis user-interface (Provenance) 3/3

http://stats.270a.info/provenance/fa698e46868fe348865678884e89ef84b0be6c64
Federated SPARQL overview
- Intended to work with minimal endpoint knowledge (trade-off: it is dumb)
- Some queries are expensive (i.e., data transfer time from remote endpoints, local joins)

Federated SPARQL query (anatomy)

SELECT referenceArea measureX measureY
  SERVICE endpointX
     observation measure values for all referenceAreas
       from datasetX with reference period

  SERVICE endpointY
     observation measure values for all referenceAreas
       from datasetY with reference period

  FILTER referenceAreaX exactMatch referenceAreaY (in either direction)
           or they are the same

Federated SPARQL query (1/3)

SELECT ?refAreaY ?x ?y ?identityX ?identityY
WHERE {
  SERVICE <http://example.org/sparql> {
    SELECT DISTINCT ?identityX ?refAreaX ?refAreaXExactMatch ?x
    WHERE {
      ?observationX qb:dataSet <http://example.org/dataset/X> .
      ?observationX ?propertyRefPeriodX exampleRefPeriod:1234 .
      ?propertyRefAreaX rdfs:subPropertyOf* sdmx-dimension:refArea .
      ?observationX ?propertyRefAreaX ?refAreaX .
      ?propertyMeasureX rdfs:subPropertyOf* sdmx-measure:obsValue .
      ?observationX ?propertyMeasureX ?x .
      <http://example.org/dataset/X>
        qb:structure/stats:identityDimension ?propertyIdentityX .
      ?observationX ?propertyIdentityX ?identityX .
      OPTIONAL {
        ?refAreaX skos:exactMatch ?refAreaXExactMatch .
        FILTER (REGEX(STR(?refAreaXExactMatch), "^http://example.net/"))
      }
    }
  }
  ...

Federated SPARQL query (2/3)

  SERVICE <http://example.net/sparql> {
    SELECT DISTINCT ?identityY ?refAreaY ?refAreaYExactMatch ?y
    WHERE {
      ?observationY qb:dataSet <http://example.net/dataset/Y> .
      ?observationY ?propertyRefPeriodY exampleRefPeriod:1234 .
      ?propertyRefAreaY rdfs:subPropertyOf* sdmx-dimension:refArea .
      ?observationY ?propertyRefAreaY ?refAreaY .
      ?propertyMeasureY rdfs:subPropertyOf* sdmx-measure:obsValue .
      ?observationY ?propertyMeasureY ?y .
      <http://example.net/dataset/Y>
        qb:structure/stats:identityDimension ?propertyIdentityY .
      ?observationY ?propertyIdentityY ?identityY .
      OPTIONAL {
        ?refAreaY skos:exactMatch ?refAreaYExactMatch .
        FILTER (REGEX(STR(?refAreaYExactMatch), "^http://example.org/"))
      }
    }
  }
  ...

Federated SPARQL query (3/3)

  FILTER (?refAreaYExactMatch = ?refAreaX 
          || ?refAreaXExactMatch = ?refAreaY
          || ?refAreaY = ?refAreaX)
}
ORDER BY ?identityY ?identityX ?x ?y
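
The closing FILTER is the step that joins the two result sets locally. A rough Python equivalent, as a sketch only: the row layout (dicts with `area`, `exact_match`, `value`) and the sample URIs are hypothetical, and the nested loop is the simplest way to show the logic rather than how a SPARQL engine would evaluate it.

```python
def join_on_reference_area(rows_x, rows_y):
    """Pair rows whose reference areas are identical or linked by exactMatch.

    Each row is a dict with keys: area, exact_match (may be None), value.
    """
    pairs = []
    for x in rows_x:
        for y in rows_y:
            # Mirrors the three alternatives of the FILTER, in the same order.
            if (y["exact_match"] == x["area"]
                    or x["exact_match"] == y["area"]
                    or y["area"] == x["area"]):
                pairs.append((y["area"], x["value"], y["value"]))
    return pairs


# Made-up rows: one link stated on the X side, one on the Y side, one unmatched.
rows_x = [
    {"area": "http://example.org/area/CA",
     "exact_match": "http://example.net/area/CAN", "value": 1.0},
    {"area": "http://example.org/area/CH", "exact_match": None, "value": 2.0},
]
rows_y = [
    {"area": "http://example.net/area/CAN", "exact_match": None, "value": 10.0},
    {"area": "http://example.net/area/CHE",
     "exact_match": "http://example.org/area/CH", "value": 20.0},
    {"area": "http://example.net/area/AU", "exact_match": None, "value": 30.0},
]

print(join_on_reference_area(rows_x, rows_y))
```

Every X row is compared with every Y row, which gives a feel for why local joins over large remote result sets become expensive.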

Dereferenceable URIs (statistical artefacts)
- Analysis - http://stats.270a.info/analysis/
  - {independentVariable}/{dependentVariable}/{referencePeriod}
  - {prefix}:{dataset}/{prefix}:{dataset}/{prefix}:{refPeriod}
  - worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009
- PROV-O Activity - http://stats.270a.info/provenance/
  - {sha1(datasetX, datasetY, refPeriod)}
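
The two URI patterns can be sketched as small helper functions. The helper names are hypothetical, and so is the way the three values are combined before hashing: the slide only gives sha1(datasetX, datasetY, refPeriod), so the plain concatenation below is an assumption and the digest is not expected to match the identifiers the service issues.

```python
import hashlib

BASE = "http://stats.270a.info"


def analysis_uri(independent, dependent, ref_period):
    """Build an analysis URI from prefixed dataset and period identifiers."""
    return f"{BASE}/analysis/{independent}/{dependent}/{ref_period}"


def provenance_uri(dataset_x, dataset_y, ref_period):
    """Build a provenance activity URI from a SHA-1 digest of the inputs."""
    # Assumption: values are concatenated without a separator before hashing.
    combined = (dataset_x + dataset_y + ref_period).encode("utf-8")
    return f"{BASE}/provenance/{hashlib.sha1(combined).hexdigest()}"


print(analysis_uri("worldbank:SP.DYN.IMRT.IN", "transparency:CPI2009", "year:2009"))
print(provenance_uri("worldbank:SP.DYN.IMRT.IN", "transparency:CPI2009", "year:2009"))
```

Deriving the identifier from the inputs means the same analysis request always maps to the same provenance URI.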
Discovering analysis
- analysis which is statistically significant, and has to do with Gross Domestic Product (GDP) and health subjects
- a list of indicator pairs with strong correlations
- using the line of best fit of a regression analysis to predict or forecast possible outcomes
- countries with a lower-than-average mortality rate and high corruption
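
The correlation and line-of-best-fit items come down to a few sums. A minimal sketch in plain Python (the application itself builds its regressions in R); the data points are made up and chosen to lie exactly on a line so the result is easy to check by hand.

```python
from math import sqrt


def linear_fit(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, pearson_r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up observations for two indicators over four reference areas.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept, r = linear_fit(xs, ys)
forecast = slope * 5.0 + intercept  # predicted y at x = 5
print(slope, intercept, r, forecast)
```

An indicator pair counts as strongly correlated when the absolute value of r is close to 1; the fitted line then gives the forecast for an unseen x.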
Quick demo?

Let's break something live ...

Or come by the Semantic Web Challenge poster/demo session anyway!
Conclusions and future work
- Improve federated query performance
- Interlink more codelists and concept schemes
- Factor-in concept temporality
- Improve UI for variable selection
- Add other analyses and visualisations, e.g., linear charts, multivariate analysis
- Allow developer to enter VoID URI
http://csarven.ca/linked-statistical-data-analysis http://stats.270a.info/ http://csarven.ca/presentations/lsd-analysis (CC-BY-3.0 license)
Credits
- Slides using S5 and homecooked RDFa
