1. Linked Statistical Data Analysis

    ISWC SemStats, Sydney, 2013-10-22

    #LinkedData #ISWC2013 #SemStats2013

    Sarven Capadisli http://csarven.ca/#i @csarven

  2. What and Why?

    Fun and profit

  3. Statistical Data

    [Images: Data Cube; Life expectancy]
    • Macrodata about societies, e.g., personal, government, health, ..
    • Uncovering insights, making predictions, decisions
    • Understanding human societies
  4. Statistical Data on the Web (Characteristics)

    • Heterogeneous
    • Decentralized
    • Structured
    • High volume
    • Formats (e.g., CSV, Excel, PC-Axis, SDMX-ML, XML)

    Clean? Synchronised? Comparable? Provenance? Trustworthy?

  5. Interesting queries?

    • Number of people who were born in Sydney before 1900
    • List of project names in countries classified as low to middle income and situated above the equator
  6. Interesting analysis?

    • statistically significant analyses relating GDP and health subjects
    • a list of indicator pairs with strong correlations
    • using the line of best fit of a regression analysis to predict or forecast possible outcomes
    • countries with a lower-than-average mortality rate and high corruption
  7. Statistical Linked Dataspaces

  8. A Linked Dataspace

    from Statistical Linked Dataspaces

  9. Statistical Linked Dataspaces (2010-2011)

  10. Statistical Linked Dataspaces (2012)

  11. Statistical Linked Dataspaces (2013-present)

  12. 270a Cloud

  13. 270a Cloud

  14. Statistical Linked Data vocabularies

    • RDF Data Cube: Data structure definitions, code lists, datasets, ..
    • SKOS: Code lists, and concepts can be reused
    • XKOS: Hierarchical concept schemes
    • VoID: vocabulary for dataset metadata
    • PROV-O: provenance
    • British reference periods, DC Terms, FOAF, ..
  15. Data modeling (towards RDF Data Cube)

    • Custom XML to RDF/XML (WB)
    • Excel/CSV to RDF Turtle (TI)
    • Linked SDMX Data (XSLT) tooling to RDF/XML (OECD, FAO, BFS, ECB, IMF) <----- win
    • Provenance, LODStats
  16. Linked SDMX Transformation

  17. Transformed data

    Data            Number of triples   Number of observations   Ratio
    Total           ~ 1 billion         82.5 million             -
    WB datasets     221 million         21 million               10.5:1
    TI datasets     50 thousand         4 thousand               12.5:1
    OECD datasets   225 million         24 million               9.4:1
    BFS metadata    1 million           N/A                      N/A
    FAO datasets    53 million          7.2 million              7.4:1
    ECB datasets    427 million         23 million               18.5:1
    IMF datasets    36 million          3.3 million              10.9:1
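
    The Ratio column reads as triples per observation. A minimal sketch of that arithmetic follows, using the rounded figures from the table; the dictionary layout and function name are illustrative assumptions. Because the inputs are already rounded, a recomputed ratio can differ from the published one in the last digit (ECB is one such case).

```python
# Rounded (triples, observations) figures copied from the table above.
FIGURES = {
    "WB": (221e6, 21e6),
    "TI": (50e3, 4e3),
    "OECD": (225e6, 24e6),
    "FAO": (53e6, 7.2e6),
    "ECB": (427e6, 23e6),
    "IMF": (36e6, 3.3e6),
}


def triples_per_observation(triples: float, observations: float) -> float:
    """Return the number of triples per observation, rounded to one decimal."""
    return round(triples / observations, 1)


# Recompute the ratio for every dataspace that reports observations.
ratios = {name: triples_per_observation(t, o) for name, (t, o) in FIGURES.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}:1")
```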
  18. Linked Statistical artefacts

    • Dataset: http://worldbank.270a.info/dataset/world-bank-finances
    • Observation: http://ecb.270a.info/dataset/SEE/A/AT/WBR0/EXT/X/E/2011
    • Dimension (property): http://oecd.270a.info/property/TIME
    • Measure (property): http://ecb.270a.info/property/OBS_VALUE
    • Attribute: http://transparency.270a.info/classification/attribute/matching-percentiles
    • Concept: http://imf.270a.info/code/1.0/CL_AREA/CH
    • Code list: http://fao.270a.info/code/0.1/CL_UN_COUNTRY
    • Regression Analysis: http://stats.270a.info/analysis/worldbank:GC.DOD.TOTL.GD.ZS/transparency:CPI2009/2009

    Cool URIs? 1, 5, 100, 10000 years?

  19. Interlinking

    • Identified high-value (i.e., useful) resources, e.g., reference areas, currencies
    • Some manual linking
    • Used LIMES to semi-automatically link (several threshold tests and reviewing)
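
    To illustrate why threshold tests are paired with reviewing, here is a toy stand-in for threshold-based link discovery. It is not LIMES and does not use its configuration or metrics: it compares labels with Python's difflib, and the URIs, labels, and both thresholds are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_links(source, target, accept=0.95, review=0.80):
    """Split best-scoring candidate links into accepted and to-be-reviewed.

    source and target map resource URIs to labels (hypothetical data).
    """
    accepted, to_review = [], []
    for s_uri, s_label in source.items():
        # Keep only the best-scoring target for each source resource.
        t_uri, score = max(
            ((uri, similarity(s_label, label)) for uri, label in target.items()),
            key=lambda pair: pair[1],
        )
        if score >= accept:
            accepted.append((s_uri, t_uri, score))
        elif score >= review:
            to_review.append((s_uri, t_uri, score))
    return accepted, to_review


source = {
    "http://example.org/code/CH": "Switzerland",
    "http://example.org/code/AT": "Austria",
}
target = {
    "http://example.net/code/CHE": "Switzerland",
    "http://example.net/code/AUS": "Australia",
}

accepted, to_review = propose_links(source, target)
```

    Here "Austria" scores highly against "Australia" and lands in the review list, the kind of false positive that a manual pass is meant to catch.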
  20. Provenance

    • At retrieval, transformation, and publication phases
    • Provenance activities are linked, e.g.:
      • a transformation activity is derived from a retrieval activity
      • new data (like analysis) is derived from other activities and sources
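
    A minimal sketch of walking such a chain of linked activities back to its origin. The activity names are hypothetical, and using prov:wasInformedBy as the linking property is an assumption; the slides do not say which PROV-O property is used.

```python
# Assumption: activities are chained with prov:wasInformedBy.
INFORMED_BY = "prov:wasInformedBy"

# (subject, predicate, object) triples with hypothetical activity names.
triples = [
    ("ex:analysis", INFORMED_BY, "ex:publication"),
    ("ex:publication", INFORMED_BY, "ex:transformation"),
    ("ex:transformation", INFORMED_BY, "ex:retrieval"),
]


def lineage(activity, triples):
    """Return the activity followed by every activity it depends on."""
    chain = [activity]
    seen = {activity}
    frontier = [activity]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == INFORMED_BY and obj not in seen:
                seen.add(obj)
                chain.append(obj)
                frontier.append(obj)
    return chain


print(lineage("ex:analysis", triples))
```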
  21. Provenance

    • Data source
    • License
    • Related resource
    • Creator, publisher
    • Issued, modified dates
    • ...
  22. stats.270a.info

    A human- and machine-friendly Web-based application that uses statistical linked dataspaces for federated queries and generates analyses and visualisations.

    Intended to help non-developers discover statistical analyses.

  23. Related work

    What's common above? Single data endpoint

  24. stats.270a.info Toolkit

    • Shiny server (node)
    • R (Shiny, SPARQL packages)
    • Jena Fuseki
    • Apache
    • Linked Data Pages
  25. I/O

    • [Input] User selects an analysis (covariates, reference period):
      • Use Shiny server cache, else;
      • Check local RDF store, else;
      • Submit federated queries
    • [Output] Federated query response:
      • Build (regression) analysis with R
      • Store analysis results, and provenance data to local RDF store
      • Generate a plot
      • Return analysis information and plot to UI
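
    The lookup order above (cache, then local store, then federated query) can be sketched as follows. The application itself uses Shiny, R, and Fuseki; this Python version is only an illustration, with plain dictionaries standing in for the cache and RDF store, and with hypothetical function names and stubs.

```python
def get_analysis(key, cache, local_store, run_federated_query, build_analysis):
    """Return (analysis, origin), trying the cheapest source first."""
    if key in cache:
        return cache[key], "cache"
    if key in local_store:
        # Assumption: a store hit also warms the cache.
        cache[key] = local_store[key]
        return cache[key], "local store"
    rows = run_federated_query(key)   # expensive remote step
    analysis = build_analysis(rows)   # e.g. a regression
    local_store[key] = analysis       # persist results for later requests
    cache[key] = analysis
    return analysis, "federated query"


# Stubs standing in for the remote query and the analysis step.
calls = []


def fake_query(key):
    calls.append(key)
    return [(1.0, 2.0), (2.0, 4.0)]


def fake_build(rows):
    return {"n": len(rows)}


cache, store = {}, {}
key = ("worldbank:SP.DYN.IMRT.IN", "transparency:CPI2009", "year:2009")
result, origin = get_analysis(key, cache, store, fake_query, fake_build)
```

    A second request for the same key is answered from the cache, so the federated query runs only once.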
  26. Stats analysis vocabulary

    @prefix stats: <http://stats.270a.info/vocab#> .

  27. Analysis user-interface

  28. Analysis user-interface (Plot) 1/3

    http://stats.270a.info/analysis/worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009

  29. Analysis user-interface (Summary) 2/3

    http://stats.270a.info/analysis/worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009

  30. Oh yeah?

    Provenance user-interface

  31. Analysis user-interface (Provenance) 3/3

    http://stats.270a.info/provenance/fa698e46868fe348865678884e89ef84b0be6c64

  32. Federated SPARQL overview

    • Intended to work with minimal endpoint knowledge (trade-off: it is dumb)
    • Some queries are expensive (i.e., remote data transfer time, local joins)
  33. Federated SPARQL query (anatomy)

    SELECT referenceArea measureX measureY
      SERVICE endpointX
         observation measure values for all referenceAreas
           from datasetX with reference period
    
      SERVICE endpointY
         observation measure values for all referenceAreas
           from datasetY with reference period
    
      FILTER referenceArea of X exactMatches referenceArea of Y (in either direction),
               or they are the same
    
  34. Federated SPARQL query (1/3)

    SELECT ?refAreaY ?x ?y ?identityX ?identityY
    WHERE {
      SERVICE <http://example.org/sparql> {
        SELECT DISTINCT ?identityX ?refAreaX ?refAreaXExactMatch ?x
        WHERE {
          ?observationX qb:dataSet <http://example.org/dataset/X> .
          ?observationX ?propertyRefPeriodX exampleRefPeriod:1234 .
          ?propertyRefAreaX rdfs:subPropertyOf* sdmx-dimension:refArea .
          ?observationX ?propertyRefAreaX ?refAreaX .
          ?propertyMeasureX rdfs:subPropertyOf* sdmx-measure:obsValue .
          ?observationX ?propertyMeasureX ?x .
          <http://example.org/dataset/X>
            qb:structure/stats:identityDimension ?propertyIdentityX .
          ?observationX ?propertyIdentityX ?identityX .
          OPTIONAL {
            ?refAreaX skos:exactMatch ?refAreaXExactMatch .
            FILTER (REGEX(STR(?refAreaXExactMatch), "^http://example.net/"))
          }
        }
      }
      ...
    
  35. Federated SPARQL query (2/3)

      SERVICE <http://example.net/sparql> {
        SELECT DISTINCT ?identityY ?refAreaY ?refAreaYExactMatch ?y
        WHERE {
          ?observationY qb:dataSet <http://example.net/dataset/Y> .
          ?observationY ?propertyRefPeriodY exampleRefPeriod:1234 .
          ?propertyRefAreaY rdfs:subPropertyOf* sdmx-dimension:refArea .
          ?observationY ?propertyRefAreaY ?refAreaY .
          ?propertyMeasureY rdfs:subPropertyOf* sdmx-measure:obsValue .
          ?observationY ?propertyMeasureY ?y .
          <http://example.net/dataset/Y>
            qb:structure/stats:identityDimension ?propertyIdentityY .
          ?observationY ?propertyIdentityY ?identityY .
          OPTIONAL {
            ?refAreaY skos:exactMatch ?refAreaYExactMatch .
            FILTER (REGEX(STR(?refAreaYExactMatch), "^http://example.org/"))
          }
        }
      }
      ...
    
  36. Federated SPARQL query (3/3)

      FILTER (?refAreaYExactMatch = ?refAreaX 
              || ?refAreaXExactMatch = ?refAreaY
              || ?refAreaY = ?refAreaX)
    }
    ORDER BY ?identityY ?identityX ?x ?y
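
    The final FILTER and ORDER BY amount to a local join of the two SERVICE results on reference area. A small Python sketch of that logic follows, assuming each result row has already been parsed into a dictionary; the row layout, URIs, and values are invented for illustration.

```python
def join_on_reference_area(rows_x, rows_y):
    """Pair X and Y rows whose reference areas match directly or via exactMatch."""
    joined = []
    for rx in rows_x:
        for ry in rows_y:
            same_area = (
                ry["exactMatch"] == rx["refArea"]      # Y links to X
                or rx["exactMatch"] == ry["refArea"]   # X links to Y
                or ry["refArea"] == rx["refArea"]      # identical URI
            )
            if same_area:
                joined.append({
                    "refAreaY": ry["refArea"],
                    "x": rx["value"],
                    "y": ry["value"],
                    "identityX": rx["identity"],
                    "identityY": ry["identity"],
                })
    # Same ordering as ORDER BY ?identityY ?identityX ?x ?y
    joined.sort(key=lambda r: (r["identityY"], r["identityX"], r["x"], r["y"]))
    return joined


rows_x = [
    {"identity": "CH", "refArea": "http://example.org/code/CH",
     "exactMatch": "http://example.net/code/CHE", "value": 4.1},
    {"identity": "AT", "refArea": "http://example.org/code/AT",
     "exactMatch": None, "value": 3.9},
]
rows_y = [
    {"identity": "CHE", "refArea": "http://example.net/code/CHE",
     "exactMatch": None, "value": 9.0},
    {"identity": "AUT", "refArea": "http://example.net/code/AUT",
     "exactMatch": "http://example.org/code/AT", "value": 7.8},
    {"identity": "FRA", "refArea": "http://example.net/code/FRA",
     "exactMatch": None, "value": 7.0},
]

joined = join_on_reference_area(rows_x, rows_y)
```

    The nested loop also shows why local joins are listed as a cost: every X row is compared with every Y row.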
    
  37. Dereferenceable URIs (statistical artefacts)

    • Analysis - http://stats.270a.info/analysis/
      • {independentVariable}/{dependentVariable}/{referencePeriod}
      • {prefix}:{dataset}/{prefix}:{dataset}/{prefix}:{refPeriod}
      • worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009
    • PROV-O Activity - http://stats.270a.info/provenance/
      • {sha1(datasetX, datasetY, refPeriod)}
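
    A sketch of how these URIs could be assembled. The analysis URI follows the template above. The provenance part is an assumption: the slides do not specify how the three inputs are serialised before hashing, so plain concatenation is used here and the resulting digests will not match the published ones.

```python
import hashlib

ANALYSIS_BASE = "http://stats.270a.info/analysis/"
PROVENANCE_BASE = "http://stats.270a.info/provenance/"


def analysis_uri(independent: str, dependent: str, ref_period: str) -> str:
    """Build {independentVariable}/{dependentVariable}/{referencePeriod}."""
    return ANALYSIS_BASE + "/".join([independent, dependent, ref_period])


def provenance_uri(dataset_x: str, dataset_y: str, ref_period: str) -> str:
    """Build a provenance URI from a SHA-1 digest of the three inputs."""
    # Assumption: inputs are concatenated without a separator before hashing.
    payload = (dataset_x + dataset_y + ref_period).encode("utf-8")
    return PROVENANCE_BASE + hashlib.sha1(payload).hexdigest()


uri = analysis_uri("worldbank:SP.DYN.IMRT.IN", "transparency:CPI2009", "year:2009")
prov = provenance_uri("worldbank:SP.DYN.IMRT.IN", "transparency:CPI2009", "year:2009")
print(uri)
print(prov)
```

    Hashing the same three inputs always yields the same provenance URI, so a repeated analysis request can be resolved to the same activity.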
  38. Discovering analysis

    • statistically significant analyses relating Gross Domestic Product (GDP) and health subjects
    • a list of indicator pairs with strong correlations
    • using the line of best fit of a regression analysis to predict or forecast possible outcomes
    • countries with a lower-than-average mortality rate and high corruption
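
    For the correlation and line-of-best-fit items, a minimal least-squares sketch follows. The application builds its regressions in R; this Python version only illustrates the arithmetic, and the data points are toy values, not taken from any of the datasets.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (slope, intercept, r) for an ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


def predict(slope, intercept, x):
    """Read a value off the fitted line."""
    return slope * x + intercept


# Toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept, r = fit_line(xs, ys)
forecast = predict(slope, intercept, 5.0)
```

    A value of r close to 1 or -1 flags an indicator pair as strongly correlated; the fitted line is then what a forecast would be read from.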
  39. Quick demo?

    Let's break something live ...

    Or come by the Semantic Web Challenge poster/demo session anyway!

  40. Conclusions and future work

    • Improve federated query performance
    • Interlink more codelists and concept schemes
    • Factor-in concept temporality
    • Improve UI for variable selection
    • Add other analyses and visualisations, e.g., linear charts, multivariate analysis
    • Allow developer to enter VoID URI
  41. Linked Statistical Data Analysis

    Sarven Capadisli

    http://csarven.ca/#i

    @csarven

  42. Credits

    • Slides using S5 and homecooked RDFa