Enabling Accessible Knowledge

Sarven Capadisli12
Reinhard Riedl1
Sören Auer2
Document ID
CC BY 4.0
In Reply To
CeDEM15 Call for Papers
Appeared In
CeDEM15 (International Conference for E-Democracy and Open Government 2015), ISBN 9783902505699, Volume 2015, Pages 257-267
To inspire and enable researchers to discover and share their knowledge using the native Web stack for maximum openness, accessibility, and flexibility.


The purpose of this document is to enable Web researchers to discover and share their knowledge using the native Web technologies and standards for maximum openness, accessibility, and flexibility.



Web: a frontier. It is one of the voyages of the modern human. Its continuing mission: to explore strange new memes, to seek out new knowledge and new experiences, to boldly go where no one has gone before.

Tim Berners-Lee stated that the pursuit of the Semantic Web is what we will get if we perform the same globalisation process to Knowledge Representation that the Web initially did to Hypertext [1]. While there are notable efforts making research publications and data available for free with open access, there is a self-evident disconnect between research documents as far as machines are concerned. This is mostly because machine-friendly methods for scholarly articles are neither solicited by research institutions or journals nor exploited by researchers. This is especially concerning when the research knowledge pertaining to Web Science eventually inhabits the Web ecosystem. Thus, our motivation is to ensure that research documents are as human-friendly and machine-friendly as possible.

Let us take a look at significant efforts which contribute towards meeting the challenge of making research knowledge more accessible to people and societies.

Significant Efforts

The Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities is one of the milestones of the OA (Open Access) movement [2]. Today, the push for more OA publishing is widely embraced and celebrated by many, including researchers, governments, universities, libraries, archives, and funding agencies. The OA strategy is intended to address issues of standards, quality, stability, and transition of scholarly research. While there are political and social mandates with best practices, there are no explicit technical mandates for systems and technologies (e.g., Web semantics) to carry out the OA mission.

In order to further open up the market for services based on public-sector information [3], the European Commission PSI Directive (a digital agenda for Europe) recommends that high-demand datasets from libraries and archives be published in machine-readable and open formats (CSV, JSON, XML, RDF, etc.) to enhance accessibility, and described in rich metadata formats and classified according to standard vocabularies [4]. Similarly, to facilitate re-use, public sector bodies should [...] make documents available [...] at the best level of precision and granularity, in a format that ensures interoperability [5]. Here again we observe well-intended recommendations without concrete mandates on technology or systems that specify precisely how research documents should be shared.

There are notable initiatives by public sector bodies which take on holistic approaches to build a sustainable ecosystem around data that supports social, economic and political impact [6]. The recommendations lean towards making reusable and machine-friendly formats available, in parallel to their PDF equivalents.

The 1999 article Practical Knowledge Representation for the Web stated that the lack of semantic markup is a major barrier to the development of more intelligent document processing on the Web, and that meta-data annotation of Web sources is essential for applying AI techniques on a large and successful scale [7]. By and large, these remain open issues today.

The Web Science community continues to run research tracks, workshops, and challenges in such a manner that even the calls for papers of top-level Web conferences request research documents that are neither machine-friendly nor Web-friendly. There remain strong demands for submissions to be solely in Word or PDF formats, with strict adherence to the dimensions and limitations of a printable page. Unfortunately, research articles which discuss discoveries about the Semantic Web, semantic publications, or linked science are locked inside well-known data silos, Word and PDF, which makes it unnecessarily challenging to extract information from them or to conduct pattern recognition within them. The community has, however, become more aware of the public discussions around these issues in recent years, and is considering efforts towards improving the state of research publishing within its own field.

For academic reviews, the Semantic Web Journal adopts an open and transparent review process with publicly available responses on the site. Assigned editors and reviewers are known by name, and their names are published together with the accepted manuscript [8].

It should be noted that there are important achievements in Web Science that contribute to the future of scholarly communication, for example, vocabulary and ontology engineering (e.g., PROV-O, OPMW, Wf4Ever, SPAR, SIO, DDI-RDF, QB), tool-building, and conceptual and architectural designs. The application of these discoveries and efforts is, however, absent in research articles.

In the publishing and journal space, there are different business models in place: from traditional publishers charging readers and institutions per article, to open-access publishers charging per publishing researcher, to one-time membership fees. While these systems provide varying options for research publications, the commonality is that the research documents are disjoint from one another at the lower data level, e.g., variables, hypotheses (as opposed to basic metadata). By and large, research knowledge is packaged with the intention of being printed on paper. Hence, the requirement of publishers and journals from (Web) conferences is that submissions foremost comply with printability. There exist digital journals which have an open policy for the submission format, but those are few and far between.

Problem Statement
One societal issue is that the communication of research knowledge has severe limitations and lacks coordination. We observe that the majority of academic knowledge is presented as a package (i.e., a binary file merging structural, presentational, and behavioural layers), and is often disconnected from other knowledge due to policy, cost, or technical reasons. Relations between the information on a granular level do not exist, making it increasingly difficult to acquire knowledge which would otherwise be more accessible to both machines and humans.
Opinion diversity is good for the knowledge ecosystem. We propose a shift in scholarly communication: the declarative adoption of native Web technologies and standards from the ground up (authoring, publishing, and consuming), in order to improve the quality of knowledge representation and acquisition on the Web. Researchers should control their knowledge and make it accessible to the greatest extent possible. Our argument is antithetical to knowledge extraction from non-machine-processable documents, to post-publishing semantic uplifting, and to inadequate user experiences in the context of the Web. In our approach, we adopt Web technologies which are designed with the principles to evolve and to have information accessible to many. We suggest that researchers in Web Science re-examine their use cases.


Our strategy is as follows (adopted from StratML):


What do we aim to accomplish? We set out to socially and technically enable researchers to take full control, ownership, and responsibility of their own knowledge, and to have their contributions accessible to society at maximum capacity, by dismantling the use of archaic and artificial barriers. Our proposed solution is intended to influence a shift in community and publication process, universal accessibility and discovery, and the identification and planning of new challenges and funding opportunities.


Who is trying to do it and for the benefit of whom? Both publishers and consumers of knowledge have a stake and a role. Enabling authors to publish their knowledge at a granular level, as they deem useful, introduces the possibility of that knowledge being mined at that level by consumers. There is a healthy symbiotic relationship among these stakeholders, as they make their impact from both directions, e.g., availability and use of information.


How do we know whether we are making progress and when we have succeeded? The quality of sharing and knowledge acquisition is perpetual. The fundamental pitfall to avoid is settling on particular workflows and mediums as absolutes. All tangible (as opposed to hypothetical) efforts and changes which improve the state of the art of message carrying and receiving should be considered successes. The evolvability and flexibility of the methods to further separate the core content from the medium it is carried in at any point are important milestones.

The remainder of this document discusses the qualitative requirements for our conceptual design and how we envision the end result. The discussion continues with the top-level qualities that an implemented system will have based on those requirements. We then describe the specific components and challenges to be solved in order to implement the conceptual design. Finally, we discuss and demonstrate a partial implementation of our design.

Non-Functional Requirements

The non-functional requirements in this section are aimed at improving information discovery and knowledge acquisition, and at fostering trust and transparency, using the available native Web technologies. While all knowledge domains are equally significant for capturing information, we retain some focus on Web and Information Sciences. The system as a whole should inherit best accessibility and usability practices, devise ubiquitous user experience designs, and be subject to evolvability.


Accessibility

We define accessibility as the availability of knowledge to as many people as possible by applying universal design principles. Where applicable, the Web Content Accessibility Guidelines (WCAG) 2.0 [9] should be applied to make the content more accessible.

The second dimension of accessibility that we adopt is open access: publications and accompanying research data should be discoverable, accessible, and reusable by humans as well as machines as soon as the content becomes available, preferably without having to go through third-party proxies or services.


Usability

We use a definition of usability which incorporates the usefulness and learnability of available knowledge. Here we seek a system in which the consumer can use and interact with the tools efficiently and with satisfaction. For example, only the options that are applicable in the context of a task are visible, and the system properly communicates to the author what is going on. In the case of learnability, the document should aim to provide interactions that help the reader get a better understanding of the content. The typography of the article matters a great deal across different devices, and should therefore be handled accordingly for optimal readability and legibility.

User Experience

It is important to have a system that adopts user-centred design processes in order to allow users (both content producers and consumers) to interact with the system intuitively. The aim is that researchers do not have to focus on the user interface, but are seamlessly engaged in their work. While actual experiences may not be completely designed, they can be supported with affordances.

It also follows that building a ubiquitous user experience across different devices (i.e., customizing the user interface and interactions based on the characteristics of the user’s media device) would be preferable in order to accommodate different access points and learning styles.

Creating a responsive design and interactions between the user and the system can provide an opportunity for the user to learn by doing, e.g., changing the parameters of an algorithm’s input and observing the corresponding graph output.


Evolvability

Evolvability can be seen as an attribute or a quality of a system, such that the system can survive in the long run. This is especially important when we want different publishing systems to remain interoperable over time as new needs arise. To be more specific, we seek to employ technology stacks that are likely to handle extensions and accommodate diverse scenarios for knowledge acquisition in the future. While it is not always clear or easy to predict which technologies will persevere, we can pursue features like simplicity, flexibility, decentralization, interoperability, and tolerance. We are in agreement with Tim Berners-Lee's Web architecture commentary on the Evolution of the Web [10].

Acid Test

We propose an acid test in order to verify, approve, or test the openness, accessibility and flexibility of the approaches for research publication. This is along the lines of a test to distinguish gold from base metals. This test does not mandate a specific technology, therefore the challenge can be met by different solutions. It is intended to test the design philosophies so that different approaches may be closer to passing the independent invention test.

  1. Alice makes her research document available on the Web with the research objects available at fine granularity e.g., variables of a hypothesis.
  2. Bob wants to precisely refer to and discuss Alice’s research objects from his own research document.
  3. Carol is interested in discovering research documents in the wild, which contain a pattern of reusable research objects like the ones in Alice, Bob or others’ documents e.g., using the follow your nose type of exploration, searching, or querying against a service.
  4. Dan learns by interacting with the components inside the research document e.g., changing the parameters, rerunning an experiment.
  5. Eve visits Frank, and displays Alice’s article on one of his devices, and prints out a copy.

There are some assumptions (adapted from the Social Web Acid Test - Level 0):

  • The interactions have at least two different tool stacks.
  • The interactions are both human and machine-friendly.
  • All interactions are according to published open standards, with 1) no dependency on proprietary APIs, protocols, or formats, and 2) no commercial dependency or a priori relationship between the groups using the workflows and tools involved.
  • All interactions are possible without prior knowledge of the user’s environment or configuration (within reason).
  • There is no spoon.
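
As an illustration of step 3 of the acid test (Carol discovering a pattern of reusable research objects), the following is a minimal sketch that assumes the research objects published by Alice and Bob have already been collected as plain (subject, predicate, object) statements. The URIs, the `ex:` terms, and the `match` helper are hypothetical and serve only to show what querying at fine granularity could look like; the acid test itself does not mandate any particular technology.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical statements gathered from Alice's and Bob's documents.
# Every URI and every "ex:" term below is made up for illustration.
TRIPLES = [
    ("https://alice.example/article#hypothesis-1", "rdf:type", "ex:Hypothesis"),
    ("https://alice.example/article#hypothesis-1", "ex:hasVariable",
     "https://alice.example/article#variable-x"),
    ("https://bob.example/paper#discussion", "ex:discusses",
     "https://alice.example/article#hypothesis-1"),
]


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that fit the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


# Carol first looks for anything typed as a hypothesis ...
hypotheses = [s for s, _, _ in match(TRIPLES, p="rdf:type", o="ex:Hypothesis")]
# ... and then follows her nose to whatever refers to the first one found.
mentions = [s for s, _, _ in match(TRIPLES, o=hypotheses[0])]
```

Here `mentions` leads Carol from Alice's hypothesis to Bob's discussion of it, which is only possible because both authors exposed their research objects under their own URIs.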

Conceptual Design

We begin our investigation by taking a broad look at the fundamental technologies and design decisions that compose both the Web and the Internet.

Tim Berners-Lee writes that the less powerful the language, the more you can do with the data stored in that language. If you write it in a simple declarative form, anyone can write a program to analyze it in many ways. The Semantic Web is an attempt, largely, to map large quantities of existing data onto a common language so that the data can be analyzed in ways never dreamed of by its creators [11].

The Rule of Least Power document by the W3C Technical Architecture Group suggests when publishing on the Web, you should usually choose the least powerful or most easily analyzed language variant that’s suitable for the purpose [12].

RFC 1958 states that the principle of constant change is perhaps the only principle of the Internet that should survive indefinitely [13].

We base our design decisions for maximizing the potential for knowledge acquisition from scholarly articles on these design principles. By design, HTML fulfils the principle of least power. As far as computer languages and their environments go, HTML is the least common denominator: it can be used from virtually any device, from command-line terminals and desktop browsers to smartwatches and cars, whether online or offline. Together with HTTP, the undeniable penetration of HTML, on both a technological and a social level, makes it a clear winner for information distribution.

Using widely accepted and adopted Web standards, we can use the Web in its truest sense to close the gap between research efforts and people. If each piece of research is semantically described to a high standard, it becomes possible to connect it with other research. Thus, the solution must lie in applying ways to network concepts and ideas as closely as possible, and to have them accessible.

As research continuously makes way to new ideas, architectures, user experiences, and social changes, it should be possible to access and examine its evolution as knowledge becomes available. This can unleash greater awareness of the changes in research; what are the experiments? what do we not know? what to work on next?

Let us consider a way to examine our challenges.

Format War

Historically speaking, format wars occur between incompatible formats which compete for the same market, e.g., AC versus DC. In that sense, we consider the PDF and HTML formats to be competing in a system for knowledge discovery and acquisition. We consider PDF to be the predominant format by historical precedent, and HTML the challenger with disruptive potential. Let us have a look at some of their characteristics.

Donald E. Knuth, the inventor of TeX, considers it to be a program written using the literate programming approach. LaTeX, which is widely used in scholarly articles, is a collection of TeX macros. PDF is usually generated from LaTeX.

PDF is intended for displaying and storing, and generally self-contained. However, it is not intended to be editable as the formatting instructions are no longer available. PDF is inherently layout oriented, and it is an optimal format for printing. Publishers tend to require a fixed layout with typographical guidelines for research documents, hence research articles in PDF are normally destined and designed for printing.

In order to facilitate metadata interchange, an XMP (Extensible Metadata Platform) package in the form of XML (most commonly serialized as RDF/XML) may be embedded in PDF (ISO 16684-1:2012). Otherwise, semantically useful information is not preserved when PDFs are generated, which makes it difficult to go back to the source content format.

HTML is both a content format for encoding information, and a document format for storing information. HTML is the most used and open international standard (ISO 15445:2000). In our discussion, we consider XHTML as part of HTML (barring the differences in the W3C specifications).

HTML is easy to reuse, exploit, process, and extend. HTML does not have a hard page limit, so it can reflow to fit different displays. HTML has ubiquitous support across devices.

If HTML is used as the original format to record the information, a PDF copy can be generated on demand according to the specifications of the user. With an accompanying CSS for the HTML, the desired layout guidelines can be achieved. Converting from PDF to HTML+CSS is possible, but the precision rests on reproducing the rendered view, as opposed to creating a structurally and semantically meaningful document.

Offloading fundamental research knowledge via PDF may be preferable or convenient for some; however, it comes at the cost of losing access to granular information. Thus, PDF’s primary focus, the presentation layer, leads to legacy documents that are not reusable in the long run.

In the survey conducted by Elsevier, PDF versus HTML — which do researchers prefer? [14], authors report more than 65 percent said they thought there would be a shift towards HTML use in the future.

In order to preserve and make information more accessible from different devices, the storage format choice should be based on flexibility and evolvability.

The following table compares the TeX and HTML stacks.

Comparison of stacks based on TeX and HTML

We consider the TeX stack family to include DVI, XMP, LaTeX, PDF and ECMAScript, and the HTML stack family to include hypertext and semantic (W3C) technologies and JavaScript.

* Any media refers to CSS media types as used in media queries, e.g., braille, handheld, print, screen, speech.

Device readiness is an informal estimate on how likely a device will be able to open the stack, view, and offer interaction.

                       TeX            HTML
Programming paradigm   Literate       Declarative
Device readiness       Moderate       Good
Applicable media       Screen, Print  Any*
Reference granularity  Coarse         Fine

Linked Research

Linked Research [15] is a proposal for Web researchers to use the technologies in the native Web stack to access, share and discover knowledge. The Call for Linked Research [16] aims to encourage the “do it yourself” behaviour for sharing and reusing research knowledge. The workflow template for Linked Research is intended to enable better discovery of research objects, and improved user experience and access to research knowledge.

Linked Research Workflow
  1. URI Ownership: Publish your research and findings at a Web space that you control. This is a fundamental step to control and have responsibility over one's own research. It is in line with the act of self-archiving. This serves the purpose of maximizing its accessibility, usage and citation impact.
  2. Web Standards: Publish your progress and work following the Web standards as well as Linked Data design principles. Create a URI for everything that is of some value to you and may be to others e.g., hypothesis, workflow steps, variables, provenance, results. Ensure that the research is cost-free and available via open access. Use a liberal license to promote reuse.
  3. Knowledge Acquisition: Reuse and link to other researchers' URIs of value, so that nothing goes to waste or is reinvented without good reason. This also fosters a navigable network of information, and allows information to flow between research documents.
  4. User Experience: Build towards a strong user experience. Use screen and print stylesheets. Create a copy of a view for the research community to fulfil organisational requirements. Design interactive user-interfaces for improved communication and education.
  5. Announce: Announce your work publicly so that people and machines can discover it.
  6. Open Feedback: Have an open comment system policy for your document so that any person or machine can give feedback.
  7. Enable: Help, encourage, and motivate others to do the same.
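
Step 2 of the workflow above asks authors to create a URI for everything of value. The following minimal sketch shows one way this could be done, assuming that research objects live inside a document the author controls and are addressed with fragment identifiers. The `mint_uri` helper, the slug scheme, and the example document URI are our own illustrative assumptions; they are not part of Linked Research itself.

```python
import re
from urllib.parse import urldefrag


def mint_uri(document_uri: str, kind: str, label: str) -> str:
    """Return a fragment URI for a research object inside document_uri.

    The fragment is a lowercase slug built from the object's kind and label.
    Any fragment already present on document_uri is dropped first.
    """
    base, _ = urldefrag(document_uri)
    slug = re.sub(r"[^a-z0-9]+", "-", f"{kind} {label}".lower()).strip("-")
    return f"{base}#{slug}"


# Hypothetical document at a Web space the author controls.
DOC = "https://alice.example/research/article"

research_objects = [
    ("hypothesis", "Reuse increases"),
    ("variable", "Sample size"),
    ("result", "Table 1"),
]
uris = {label: mint_uri(DOC, kind, label) for kind, label in research_objects}
```

With such identifiers in place, another researcher can refer to `uris["Sample size"]` directly rather than to the article as a whole.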

We believe that our Linked Research approach (see Linked Researcher for partial implementation) qualifies under enhanced research communication, and invite the community to verify against the proposed framework for article evaluation in The Five Stars of Online Journal Articles [17] or other evaluations as appropriate.


We now discuss the cognizable qualities of our non-functional requirements given that they are met.

Universal Access

One of the virtues of the Web is that it enables people and societies to communicate and transfer knowledge with everyone. In the case of scholarly communication, anyone in the world who can access the Web can consume the available knowledge directly from the researcher, or from a Web space that they control and trust. This lowers the barrier for knowledge dissemination (Web for All), without limiting its acquisition to the privileged, and regardless of location or ability. This is in fact at the heart of W3C’s mission to lead the Web to its full potential.

Exploration of Ideas

By enabling researchers to explore a body of knowledge in different ways, we cultivate and accommodate diverse learning.

One way of accomplishing this would be to empower consumers to know which ideas influenced a scholarly article, and to what extent, i.e., to make it possible to conduct a follow your nose type of exploration to further traverse the contributing ideas. The effect of linking ideas is that we can trace their history, understand their evolution, predict their direction, and identify missing areas. This is a view of ideas, discoveries, and innovation in motion, in contrast to observing the ideas in isolation without sufficient context as to how they may have emerged. The exploration may range from navigating to related research objects, to interactive analysis of the corresponding ideas, to academic references.

This article was influenced by the Connections TV Series and book [18] by James Burke.


Here we discuss critical technical problems and challenges to be solved in order to implement the conceptual design. The references to some of the technologies in this section are intended to exemplify the core challenges.

Authoring Tool

A fundamental requirement for authors is to be able to create documents which work on different layers, i.e., content/structure, presentation, and behaviour. The tooling should pre-emptively enable the progressive enhancement strategy in order to emphasize the creation of research documents which are accessible, semantically rich, interactive, and accommodating of different views. This is in contrast to singular or binary documents which pre-package all the material predominantly into a single view layer.

The modular approach allows different layers to co-exist, be extended and evolve over time. For example, if a research fragment is globally identifiable on the Web and its description is looked up, it can be discussed in a future document with different presentation and behaviours than the original document.

Structure and Semantics

The POSH (plain old semantic HTML) approach reinforces the semantics and meaning in web pages such that the content can be more explicitly defined, described, and related to any hypermedia. In order to enable this, authors will require tooling with such features. There are two leading approaches to marking up the content, with different design principles behind them: microformats and RDFa. Microformats take the minimalist approach by focusing on lowering the barrier to entry for authors. In a nutshell, microformats are small patterns of HTML to represent commonly published things in web pages. RDFa, on the other hand, makes it possible to embed arbitrary statements using any vocabulary or ontology in HTML. It extends HTML with additional attributes in order to apply the RDF data model. RDFa is one of the ways to notate the RDF language such that the atomic components of a statement, e.g., (A, made, B), can be declared, discovered and reused on the Web.
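
To make the idea of statements embedded in HTML attributes more concrete, the following is a minimal sketch that reads a small subset of RDFa-style attributes (`about`, `property`, `resource`, `href`) and collects the resulting (subject, property, object) statements. It is not a conforming RDFa processor: it does not resolve prefixes or relative URIs, it ignores attributes such as `typeof`, `content`, and `datatype`, and it assumes that every element is explicitly closed. The `TinyRDFa` class, the markup, and all URIs are hypothetical.

```python
from html.parser import HTMLParser


class TinyRDFa(HTMLParser):
    """Collects statements from a small, simplified subset of RDFa attributes."""

    def __init__(self):
        super().__init__()
        self.triples = []
        # One frame per open element: [subject, pending property, text parts].
        self._stack = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        inherited = self._stack[-1][0] if self._stack else ""
        subject = a.get("about", inherited)  # "about" sets a new subject
        prop = a.get("property")
        target = a.get("resource") or a.get("href")
        if prop and target:
            # The object is another resource, so the statement is complete.
            self.triples.append((subject, prop, target))
            prop = None
        self._stack.append([subject, prop, []])

    def handle_data(self, data):
        # Text belongs to every open element still waiting for a literal.
        for frame in self._stack:
            if frame[1]:
                frame[2].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        subject, prop, parts = self._stack.pop()
        if prop:
            # The object is the human-visible text of the element.
            self.triples.append((subject, prop, "".join(parts).strip()))


HTML = """
<article about="https://alice.example/article">
  <h1 property="dcterms:title">A hypothetical article</h1>
  <p about="#hypothesis-1">
    <span property="rdfs:label">Linked documents are easier to reuse</span>
    <a property="prov:wasDerivedFrom" href="https://bob.example/paper#claim-2">Bob's claim</a>
  </p>
</article>
"""

parser = TinyRDFa()
parser.feed(HTML)
for triple in parser.triples:
    print(triple)
```

Note that the text a human reads (the title, the label) is the same text a machine receives as data, which is the property of inline markup that this section relies on.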

The application of the RDF language is the fundamental choice, and it can be notated in different ways. While transformations between different RDF notations are trivial, we chose RDFa in HTML over the alternative approaches, Turtle and JSON-LD, for the following reasons. RDFa can be applied to the human-visible (or accessible) portions of the document, and is typically marked inline, in the context where the ideas occur in the document. It treats the information available as first-class data as opposed to metadata (which may be hidden from view). While Turtle and JSON-LD may also be written inline, they tend to act as raw data islands, and are slightly disconnected from the human-readable context in an HTML document. Concerning data quality, information marked with RDFa is less likely to rot or go stale because it is visible (or accessible) to humans and is authored at its canonical location. Knowing and applying RDFa is a matter of extending HTML (via attributes). Therefore, RDFa is arguably a simple entry point for adding semantics to documents, in contrast to learning Turtle or JSON-LD, notations which are quite distinct from HTML. It should however be noted that Turtle and JSON-LD can be well suited to interactive documents. While our justification is aimed at using RDFa for semantic representations, we focus on a user interface that allows the author to work at the level of the RDF language, as opposed to the syntax level.

In this space, one well researched authoring tool which combines WYSIWYG and WYSIWYM (What You See Is What You Mean) text authoring with the creation of rich semantic annotations is the RDFaCE (RDFa Content Editor) [19]. Authors can use both bottom-up and top-down approaches to create semantic content using different views. The user interface provides a way to enrich the content using external services.


There needs to be a way to apply layout themes for papers (e.g., in LNCS, ACM), theses, dissertations, and journal articles, both to prepare the presentational layer for reading and to let authors focus on the content with the view automatically applied. For comparison, this is along the lines of programming in LaTeX and applying the pre-made templates that are recommended by research venues and journal publishers.


The editing system should provide minimal means for authors to create interactions, which can enhance the document’s use. For example, enabling and viewing permanent identifiers that are part of the document will facilitate their sharing and reuse. Other enhancements may be along the lines of navigating through the components, reordering section hierarchy, controls to adjust the parameters of a figure or table, changing the visibility of a section based on device preferences, auto-completing or inserting information. Such features progressively improve the user experience, both for authors and consumers, without conflating with the structural and presentation layers.

Vocabularies and Ontologies

There exists a diverse set of vocabularies and ontologies, which are well suited for use in research articles. Where applicable, exploiting their use can foster better discovery and reuse. Researchers would need to have the means to select and apply the terms, which they find appropriate to describe their work. The baseline may include: SKOS (Simple Knowledge Organization System), FOAF (Friend of a Friend), DC Terms (Dublin Core Terms), SIO (Semanticscience Integrated Ontology), SPAR (Semantic Publishing and Referencing), PROV-O (Provenance Ontology), OPMW (Open Provenance Model for Workflows), RO (Wf4Ever Research Object), Disco (DDI-RDF Discovery), QB (RDF Data Cube), SIOC (Semantically-Interlinked Online Communities), and OA (Open Annotation).
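
As a small illustration of how authors could select and apply terms from these vocabularies, the sketch below expands compact names such as `foaf:name` into full term URIs. The prefix labels are conventional rather than mandated, only some of the vocabularies listed above are included, and the `expand` helper is a hypothetical name of our own.

```python
# Commonly used namespace URIs for some of the vocabularies listed above.
# The prefix labels on the left are conventions, not requirements.
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "prov": "http://www.w3.org/ns/prov#",
    "qb": "http://purl.org/linked-data/cube#",
    "sioc": "http://rdfs.org/sioc/ns#",
    "oa": "http://www.w3.org/ns/oa#",
}


def expand(compact: str) -> str:
    """Expand 'prefix:term' into a full URI using the PREFIXES table."""
    prefix, sep, local = compact.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {compact!r}")
    return PREFIXES[prefix] + local


print(expand("foaf:name"))
print(expand("prov:wasDerivedFrom"))
```

An authoring tool could use such a table behind the scenes, so that researchers pick terms by name while the document records the full, globally unambiguous identifiers.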

Embedded Components

Generic external resources like images, graphics and browsing contexts, as well as video, audio, and mathematical expressions should be embeddable in documents to advance the reuse of independently composed objects.

The W3C Web Components are a reuse-based approach to defining, implementing and composing independent components. The component model allows for encapsulation and interoperability of individual services, resources or modules in HTML.

With the application of embedded content and components, indivisible research objects can be purposed to 1) exist on their own and 2) be referenced (or imported) from external resources. For example, a research article may want to show a regression analysis figure by simply referencing and including that foreign resource, as opposed to making a copy of it.

Interactive Education

The need to provide components for interactive education arises to fulfil the need for better communication and knowledge transfer between the producer and consumer. By providing interactive elements in relation to the discussion, the reader can get a deeper understanding of the work than the written form alone.

Up and Down the Ladder of Abstraction presents an excellent demonstration of using interactive controls to teach about an algorithm to keep the car on the road.

Implementing these types of interactions is essential to user experiences in which the consumer feels good using the system because they are learning and accomplishing their tasks.

References and Citations

Citations in scholarly articles play an important role: each one makes a statement that associates a document, or a fragment of it, with what it refers to. As citations are vital to creating awareness, interlinking, disseminating, exploring, and evaluating research, the mechanism for making them should work at a fine level of granularity. It follows that citations are not merely connections at the document level, but can refer to individual research fragments, from concepts, hypotheses, workflow steps, claims, evidence, and arguments to any annotation declared in other research articles.

We hypothesize that reusable, publicly accessible bits of information will naturally give rise to quality metrics that can be measured directly and counted towards a research impact factor. This is akin to the Web economics of hyperlinking documents, where the quality and quantity of backlinks tend to indicate the importance of the cited work. Linking has been the nuts and bolts of information for centuries, and this holds true for information shared on the Web.
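To make the backlink analogy concrete, here is a minimal sketch that counts inbound citations both per document and per cited fragment. It measures quantity only, not quality; the `count_backlinks` function and all `example.org` URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urldefrag


def count_backlinks(citations):
    """Count inbound citations per document and per fragment.

    `citations` is an iterable of (citing_url, cited_url) pairs. Every
    pair counts once; no de-duplication or weighting is attempted.
    """
    by_document = Counter()
    by_fragment = Counter()
    for _citing, cited in citations:
        by_fragment[cited] += 1
        # Strip the fragment to attribute the citation to the whole document.
        by_document[urldefrag(cited).url] += 1
    return by_document, by_fragment


if __name__ == "__main__":
    # Hypothetical citation pairs, for illustration only.
    pairs = [
        ("http://example.org/a", "http://example.org/paper#claim-1"),
        ("http://example.org/b", "http://example.org/paper#claim-1"),
        ("http://example.org/c", "http://example.org/paper#method"),
    ]
    documents, fragments = count_backlinks(pairs)
    print(documents.most_common())
    print(fragments.most_common())
```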

Feedback System

A feedback system in which individual responses to the research article can be referenced and archived is pivotal to strengthening the body of knowledge, its interpretation, and the questions it raises. In order to attract diverse readers, the functionality should welcome academic reviews, comments and annotations by anyone in the spirit of open science, and machine-generated responses. Depending on the context, it may be appropriate to integrate a consensus system (e.g., up or down votes) so that well-written or approved responses democratically become more visible. With the adoption of attributed peer reviews, responses can be as rigorous and objective as possible, while retaining transparency and the identity of the reviewers, e.g., using decentralized authentication. This additionally creates an opportunity for the research community to carry out an open dialogue that retains substantial knowledge for the future simply through micro-contributions. Thus, efforts should be made to capture information from readers in a machine-friendly way to allow data-mining operations. Being able to refer to responses also means that a response in one document may refer to a response in another document, irrespective of their location on the Web.
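One simple reading of such a consensus system is to order responses by net votes. The sketch below assumes that interpretation; the `Response` fields, the `rank` function, and the example URIs are hypothetical rather than part of any existing system.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """A referenceable response to an article (hypothetical shape)."""
    uri: str
    up: int = 0
    down: int = 0

    @property
    def score(self) -> int:
        # Net approval: up votes minus down votes.
        return self.up - self.down


def rank(responses):
    """Order responses by descending score, breaking ties by URI."""
    return sorted(responses, key=lambda r: (-r.score, r.uri))


if __name__ == "__main__":
    ranked = rank([
        Response("http://example.org/article#r1", up=2, down=1),
        Response("http://example.org/article#r2", up=5, down=0),
    ])
    for response in ranked:
        print(response.score, response.uri)
```

Because each response carries its own URI, the ranking affects only visibility; every response remains individually citable.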

Workflow and Provenance

Where applicable, the scientific process should be captured in a machine-readable form, encompassing workflow and provenance level data. This type of information plays an important role in reproducibility, and fosters trust in the steps detailing the whole process.

It is a given that publishing a Web document does not stop the authors from continuing with updates, even after its announcement or submission to conferences. One way of handling academic reviews (or general feedback) for a specific version of the paper is to retain each changed copy, annotate sufficiently what was changed, and link the copies to one another. This can also be complemented with RFC 7089 (HTTP Memento), reflecting a frozen prior state of the original resource. Submitting a digitally signed state of a document can fulfil submission requirements.
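In RFC 7089, a client asks a TimeGate for a prior state of a resource by sending an `Accept-Datetime` request header in HTTP-date format. The following minimal sketch only builds that header and performs no network request; the `memento_request_headers` name and the example date are illustrative assumptions.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(when: datetime) -> dict:
    """Build request headers asking a TimeGate for the state at `when`.

    The datetime must be timezone-aware; it is converted to UTC and
    rendered as an HTTP-date, e.g. 'Thu, 01 Jan 2015 12:00:00 GMT'.
    """
    if when.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


if __name__ == "__main__":
    # Hypothetical review date, for illustration only.
    reviewed_at = datetime(2015, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(memento_request_headers(reviewed_at))
```

A reviewer's feedback could then be tied to the version that was actually read by recording the URI of the Memento that the TimeGate selects.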


While complete reproducibility may not be possible, efforts should be made to indicate or demonstrate the essential components of the experiments. Where applicable, this can range from allowing a (human or machine) consumer to discover the resources that are integral to reproduction, to creating an executable paper environment in which code can be rerun, the effects of changing parameters can be observed, sample data can be uploaded, or other explorations can be conducted inside lightweight and portable VM environments.

If the community shifts its focus towards soundness, and completeness (where possible), in papers, the underlying elements that make up the workflows can have automated tests and verifications. This would enable machines to do some of the mundane checks on behalf of the reviewers.

Demonstrating reproducibility inside research articles also helps the reader to experiment and learn in the context of the material. For example, the Executable Papers initiative at Elsevier Connect [20] demonstrates a Web-based system that encourages reproduction of computational results in research articles, in order to help readers gain deeper insight.


In order to accommodate content whose availability depends on external services, internal and local caching methods can be used to retain the last retrieved version of the information in the research article or viewing interface. In the event that remote information cannot be embedded in the research article, the last cached response can be used as a fallback option. Otherwise, a new copy or reference can be fetched dynamically. Where applicable, caching can substantially improve the document’s load performance.
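The fallback behaviour described above can be sketched as follows. The `fetch_with_fallback` function, its injectable `fetch` callable, and the dictionary cache are hypothetical simplifications; a real viewing interface would more likely use the browser's own caching facilities.

```python
def fetch_with_fallback(url, fetch, cache):
    """Return fresh content when possible, else the last cached copy.

    `fetch` is any callable taking a URL and returning its content; it is
    expected to raise OSError (which urllib.error.URLError subclasses)
    when the remote service is unavailable. `cache` is a dict-like store.
    """
    try:
        body = fetch(url)
    except OSError:
        if url in cache:
            return cache[url]  # Fall back to the last retrieved version.
        raise  # Nothing cached: surface the failure to the caller.
    cache[url] = body  # Remember the latest successful response.
    return body


if __name__ == "__main__":
    cache = {}

    def online(url):
        return "<svg>fresh</svg>"

    def offline(url):
        raise OSError("service unavailable")

    # Hypothetical resource URL, for illustration only.
    url = "http://example.org/figure.svg"
    print(fetch_with_fallback(url, online, cache))
    print(fetch_with_fallback(url, offline, cache))
```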

Persistent Identifiers

It is evident that Cool URIs don’t change. There is a social expectation to ensure identifier persistence. For the integrity of the Web, it is important that information persists as identified well into the future. This is irrespective of the identifier scheme that is used to refer to a research publication or data. Thus, the individuals or groups who own and maintain the identifiers need to be committed to, and responsible for, their persistence. This includes adopting policies to ensure that the representations of the work can be preserved under different management, for example, if the institution dissolves, or if the individual(s) pass away. Existing identifier schemes and services can be employed for persistence, e.g., URI, DOI, PURL, w3id.


In order to ensure the permanent or long-term preservation of research articles, there should be a mechanism to commit and submit the work to a trusted location. If the archival copy is located somewhere other than the identifier that is used to reference the canonical copy (i.e., on the author’s site), then well-known archival institutions should be used, e.g., the Internet Archive, arXiv.org, or ZENODO (which is in the same cloud infrastructure as research data from CERN’s Large Hadron Collider).


We now discuss and demonstrate a partial implementation of our approaches.

Linked Researcher

Updated: Linked Researcher is now called dokieli and is available at github:linkeddata/dokieli.

Linked Researcher is a POSH editor. It contains a template and themes for well-known layouts for research documents in Computer and Information Science, e.g., LNCS and ACM. Documents which use the provided themes can be viewed on screen devices, as well as printed to paper, or output to PDF or PS files. For instance, the canonical reference to this document (i.e., http://csarven.ca/enabling-accessible-knowledge) while semantically annotated and accessible via screen devices, can also be printed.

The canonical copy is annotated using the RDFa notation, where all important concepts, claims, arguments, and evidence, among others, have a URI assigned to them. For example, the argument of our discussion (from earlier: adopting native Web technologies from the ground up) can be precisely cited using the following URI: http://csarven.ca/enabling-accessible-knowledge#argument.
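As a rough illustration of citing at this granularity, the sketch below splits the URI above into its document and fragment parts and builds a citation statement using the `cites` property from CiTO, one of the SPAR ontologies. The `cite` helper and the citing URI on `example.org` are hypothetical.

```python
from urllib.parse import urldefrag

# CiTO (Citation Typing Ontology) namespace, part of the SPAR ontologies.
CITO = "http://purl.org/spar/cito/"


def cite(citing: str, cited: str, relation: str = "cites") -> str:
    """Return a triple-like statement linking a citing URI to a cited URI."""
    return f"<{citing}> <{CITO}{relation}> <{cited}> ."


if __name__ == "__main__":
    cited = "http://csarven.ca/enabling-accessible-knowledge#argument"
    document, fragment = urldefrag(cited)
    print(document)  # the article as a whole
    print(fragment)  # the specific unit being cited
    # Hypothetical citing fragment, for illustration only.
    print(cite("http://example.org/my-article#discussion", cited))
```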

It is possible to include or exclude sections for specific media types. For example, researchers can decide to omit a paragraph in the camera-ready version for the conferences, while keeping it visible for screen devices.

Statistical Displays

Inline charts, or sparklines, are intended to be concise and located where they are discussed in the text. They are treated as datawords carrying dense information. Edward Tufte describes a sparkline as a small, intense, simple, word-sized graphic with typographic resolution [21]. For example, the GDP of Canada claimed by the World Bank Linked Dataspace is on the rise despite the market crash in 2008. This brief demonstration of a Linked Statistical Data Sparkline 1) complements the supporting text without breaking the reader’s flow, and 2) provides an opportunity for the reader to investigate further by clicking on the data-line to access the source.
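A word-sized chart of this kind can be produced with very little markup. The following minimal sketch generates an inline SVG sparkline and optionally wraps the line in a link to its data source. The `sparkline_svg` function, the sample values, and the source URL are made up for illustration (they are not World Bank figures), and the plain `href` attribute assumes an SVG 2 capable viewer.

```python
from html import escape


def sparkline_svg(values, width=100, height=20, href=None):
    """Render numeric values as a small SVG polyline.

    Values are scaled to fill the box; y is flipped because SVG's
    origin is the top-left corner.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # Avoid division by zero for flat series.
    step = width / (len(values) - 1)
    points = " ".join(
        f"{i * step:.1f},{height - (v - lo) / span * height:.1f}"
        for i, v in enumerate(values)
    )
    line = f'<polyline fill="none" stroke="currentColor" points="{points}"/>'
    if href is not None:
        # Link the data-line to its source so readers can investigate.
        line = f'<a href="{escape(href)}">{line}</a>'
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" viewBox="0 0 {width} {height}">{line}</svg>'
    )


if __name__ == "__main__":
    # Made-up numbers and a hypothetical source URL.
    print(sparkline_svg([3, 4, 2, 5, 6], href="http://example.org/dataset"))
```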

Linked Statistical Data Analysis

Statistical analysis from remote locations can be embedded in local documents, with the possibility to trace its source for further information to foster trust between publishers and consumers. Linked Statistical Data Analysis [22] offers a way to run analysis and explore distributed linked statistics, with accompanying provenance data. Figure [1] is an SVG scatter plot of mortality rate and CPI from the World Bank and Transparency International Linked Dataspaces respectively. The figure is embedded in this document by referencing its external object:

Figure 1: Linked Statistical Data Analysis.

The only requirement here is to use HTML’s generic external content element, and refer to an existing resource.
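Assuming the element meant here is HTML's `object`, the markup involved is small enough to generate in a few lines. This sketch builds such an element with fallback text for cases where the resource cannot be shown; the `embed_object` helper and the `example.org` URL are hypothetical and do not point at the actual analysis.

```python
from html import escape


def embed_object(data_url: str, media_type: str, fallback_text: str) -> str:
    """Return an HTML object element that references an external resource.

    The fallback text is shown when the resource cannot be rendered.
    """
    return (
        f'<object data="{escape(data_url)}" type="{escape(media_type)}">'
        f"{escape(fallback_text)}</object>"
    )


if __name__ == "__main__":
    # Hypothetical resource URL, for illustration only.
    print(embed_object(
        "http://example.org/analysis/plot.svg",
        "image/svg+xml",
        "Scatter plot of mortality rate and CPI",
    ))
```

Because the figure is referenced rather than copied, updates at the source are reflected wherever it is embedded, and the source stays traceable.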

Linked Statistical Data Cube Designer

LSD (Linked Statistical Data) Cube Designer is a Web service with a user interface for researchers to design their own statistical data cubes. The statistical objects that are part of the cube model are derived from and refer to Linked Statistical Dataspaces. It aims to lower the barrier to searching for and reusing existing linked statistical components. LSD Cube Designer offers a way for researchers to search for existing cube component concepts from existing statistical linked dataspaces, select the most suitable dimensions, measures, and attributes for their data structure and components, and then export the cube’s structural information (in RDF Turtle format). We can interact with this remote application in Figure [2]:

Figure 2: Linked Statistical Data Cube Designer.
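To give a sense of what exported structural information in Turtle might look like, the sketch below serializes a data structure definition using terms from the RDF Data Cube vocabulary (`qb:DataStructureDefinition`, `qb:component`, `qb:dimension`, `qb:measure`, `qb:attribute`). It is not the Cube Designer's actual output; the `cube_structure_turtle` function and all `example.org` IRIs are hypothetical.

```python
def cube_structure_turtle(dsd_iri, dimensions, measures, attributes=()):
    """Serialize a cube's structure as Turtle using RDF Data Cube terms."""
    components = (
        [("qb:dimension", iri) for iri in dimensions]
        + [("qb:measure", iri) for iri in measures]
        + [("qb:attribute", iri) for iri in attributes]
    )
    if not components:
        raise ValueError("a structure needs at least one component")
    # One blank-node component specification per selected property.
    specs = [f"    qb:component [ {kind} <{iri}> ]" for kind, iri in components]
    return "\n".join([
        "@prefix qb: <http://purl.org/linked-data/cube#> .",
        "",
        f"<{dsd_iri}> a qb:DataStructureDefinition ;",
        " ;\n".join(specs) + " .",
    ]) + "\n"


if __name__ == "__main__":
    # Hypothetical IRIs, for illustration only.
    print(cube_structure_turtle(
        "http://example.org/structure/mortality",
        dimensions=[
            "http://example.org/dimension/refArea",
            "http://example.org/dimension/refPeriod",
        ],
        measures=["http://example.org/measure/obsValue"],
    ))
```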


In this article, we have discussed a way to improve the state of publishing and consuming scholarly articles. That is, by employing the native Web stack, we are able to achieve what is desirable for human and machine consumption, while still supporting the current scholarly publishing methods. Let us take a further look:

What constitutes a scholarly article? While the jury may still be out, we have a few observations which may help us understand one aspect of the situation we are in. As publishers are the ones who set the guidelines (i.e., printability), it follows that researchers (and their institutions) merely follow them in order to fulfil their organisational requirements. Researchers adopt non-machine-friendly practices and requirements for papers as instructed, regardless of their technical ability to better share their knowledge with the rest of humanity. Therefore, the underlying issue is social rather than technical. If, for instance, steering committees, conferences, or journals mandated that researchers adopt human and machine-friendly methods for their research articles, i.e., using native Web technologies and standards, we could observe a far richer research ecosystem than the one we are used to.

Regardless of how the Web Science landscape evolves, there is one thing that each researcher can do now: taking initiative to have complete control of their own publications as discussed in Linked Research. This entails that researchers first share their knowledge with the rest of the Web that can be consumed by humans and machines. Second, researchers submit a copy of that work to relevant institutions using their submission requirements, in order to comply with organisational needs. If and when the existing publication workflow evolves, the second step will naturally dissolve.

We contemplate two key open challenges:

Adoption is an open issue. Needless to say, forcing people to change their way of working is not the most fruitful endeavour. Future work will need to focus on helping researchers (at the grass-roots level) to publish their own research within a workflow that suits them. This may be by supporting simple POSH authoring tools, or extensions and plugins for existing content management systems. This is fundamentally about documenting the new set of use cases and implementing them, preferably without software installations or creating accounts on third-party systems.

Identity is another open issue. As identity is usually at the centre of everything, adopting decentralized authentication methods remains a challenge. WebID is one promising approach, as it ties in well with the Web and Linked Data stack, provided that the user experience for generating certificates and making them accessible across devices is improved. In parallel, systems that recognize and use WebID need to become widely accepted in order for researchers to 1) assign themselves a Web identity, and 2) collaborate with other researchers.

TL;DR: Web Researchers, let us build the ecosystem that we want to use based on the standards and technologies that are native to the Web.


We would like to extend our thanks to those individuals and groups working to make the Web better. A special thanks to Jodi Schneider for their early review of this article. Our topic of discussion and our research and development efforts are inspired by the values that are inherent to the World Wide Web, as well as by Tim Berners-Lee’s design principles. Our goals are influenced by James Burke’s explanation of our world as it gets increasingly interconnected.


  1. What the Semantic Web can represent, http://www.w3.org/DesignIssues/RDFnot.html#Knowledge
  2. Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, http://openaccess.mpg.de/Berlin-Declaration
  3. Revision of the PSI Directive, Digital Agenda for Europe, http://ec.europa.eu/digital-agenda/en/news/revision-psi-directive
  4. European Union: Guidelines on recommended standard licences, datasets and charging for the reuse of documents, Official Journal of the European Union, 2014/C 240/01, 57, (2014), http://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:C:2014:240:FULL&from=EN
  5. European Union: Directive 2013/37/EU of the European Parliament and of the Council, Official Journal of the European Union, 56, 2013/L 175/1 (2013), http://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L:2013:175:FULL&from=EN
  6. Lee, D., Cyganiak, R., Decker, S.: Open Data Ireland. Best Practice Handbook, Insight Centre for Data Analytics, NUI Galway, Galway (2014), http://per.gov.ie/wp-content/uploads/Best-Practice-Handbook.pdf
  7. Harmelen, F. v., Fensel, D.: Practical Knowledge Representation for the Web, IJCAI, Workshop on Intelligent Information Integration (1999), http://www.cs.vu.nl/~frankh/postscript/IJCAI99-III.html
  8. For Reviewers, Semantic Web journal, http://www.semantic-web-journal.net/reviewers
  9. Web Content Accessibility Guidelines (WCAG) Overview, http://www.w3.org/WAI/intro/wcag
  10. Evolution of the Web, http://www.w3.org/DesignIssues/Evolution.html
  11. Principles of Design, http://www.w3.org/DesignIssues/Principles.html#PLP
  12. Berners-Lee, T., Mendelsohn, N.: The Rule of Least Power, http://www.w3.org/2001/tag/doc/leastPower.html
  13. Carpenter, B.: Architectural Principles of the Internet, http://tools.ietf.org/html/rfc1958#section-1
  14. PDF versus HTML — which do researchers prefer?, http://www.elsevier.com/connect/pdf-versus-html-which-do-researchers-prefer
  15. Capadisli, S.: Linked Research, Workshop on Semantic Publishing, ESWC (2013), urn:nbn:de:0074-994-6, http://csarven.ca/linked-research
  16. Capadisli, S.: Call for Linked Research, Developers Workshop, ISWC (2014), http://csarven.ca/call-for-linked-research
  17. Shotton, D.: The Five Stars of Online Journal Articles, D-Lib Magazine (2012), http://www.dlib.org/dlib/january12/shotton/01shotton.html
  18. Burke, J.: Connections. Simon & Schuster, New York (2007), http://www.worldcat.org/oclc/174040346
  19. Khalili, A., Auer, S., Hladky, D.: The RDFa Content Editor — From WYSIWYG to WYSIWYM, COMPSAC 2012:531-540 (2012), http://svn.aksw.org/papers/2012/COMPSAC2012_RDFaCE/public.pdf
  20. Executable Papers, http://www.elsevier.com/physical-sciences/computer-science/executable-papers
  21. Sparkline theory and practice, http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR
  22. Capadisli, S., Auer, S., Riedl, R.: Linked Statistical Data Analysis, ISWC SemStats (2013), http://csarven.ca/linked-statistical-data-analysis


Ali Khalili replied on

Hi Sarven,

I liked the idea a lot. It envisages a new paradigm in publishing research work which grounds in using existing Web standards and technologies instead of inventing new tools. The main pitfall of the article IMO is that it fails to thoroughly investigate the problem from the viewpoint of an end-user (i.e. a typical researcher without any knowledge of LOD who wants to adopt this new paradigm). When reading the article, what I'd have expected was showing what additional efforts users need to adopt in order to publish linked research and how much cognitive load it impels on users (for answering the question: is it worth it?).

Another missing fact IMO was a section discussing incentives and instant user gratification when publishing Linked Research. For example, having real-time semantic tagging and recommendations enable users to instantly see the value of semantic authoring.

I printed the article in LNCS format which was quite fine (only iframe content was not well presented). For me, reading the online version of your article was easier and more convenient than the printed one. I found Hashtags really useful.

I’d recommend to:

  • Add a table of content for the online version which is linked to the hashtags of each part.
  • Add something like a breadcrumb or a floating table of content which shows where we are now when reading the paper online. I lost the flow of paper at some sections!

I got confused when arrived at ‘Acid Test’ section. It was hard to find its relation with the previous parts and to understand that challenges are linked to that. I am not sure if ‘Challenge’ is a right term here. It looks like LOD stars.

The following two assumptions were unclear to me:

  • The interactions have at least two different tool stacks. (what is tool stack here?)
  • There is no spoon. (what does spoon mean here?)

Would be great if you added a figure which gives an overview of the proposed architecture (maybe something like SW layer cake).

I’d rather use the term ‘impact’ instead of ‘qualities’.

It would be good if 'Figure 1: Linked Statistical Data Analysis’ could be more interactive and was linked to the underlying data (i.e. users can export chart data).

Best, Ali

Sarven Capadisli replied on

Thanks for your comments Ali.

The document was intended for Web researchers (primarily for those in Web Science dealing with Semantic Web and Linked Data). I've mentioned this or at least implied this but probably should have been more clear. The approach wasn't meant for all researchers in all aspects of science. I think getting the Web Science community on board is a big enough battle for the time being ;) If we can get some of the ideas here, accepted and used, we (i.e., the Web Science community) would have more grounds to make the case for using the Web stack in scholarly publications in other areas.

I have some of your recommendations implemented but they are not enabled yet - I'd like to test further - but they are at https://github.com/csarven/linked-research .

The Challenge was intended to be more in line with use cases.

As for the having the Linked Statistical Data Analysis figure interactive, that'd be nice, but it wasn't something that I have already implemented on the origin. I did however try to demonstrate the interactive aspect with the Linked Statistical Data Cube Designer.

I will try to integrate your other suggestions.

Thanks again.

Anonymous Reviewer replied on

This paper proposes a new system for publishing and using scholarly articles using Javascript and http. While the paper is solid, the presentation and critical analysis of existing literature on the topic could be enhanced, so the added value is limited to the extent of a case study.

Anonymous Reviewer replied on

The paper evaluates the present-day problem of how to replace the PDF, the current omnipotent vehicle of research publication, with new means of scholarly communication.

The paper suggests using the (open) web (HTML) stack instead of PDFs. This suggestion is not new and not original.

The evaluation inside the paper is detailed but strongly biased towards web technology. Several important issues and requirements against research publication (e.g. the completeness, closeness, printability, identification, transferability, long time preservation, etc) are not sufficiently evaluated.

The new LD, LOD movements as well as the use of these linked technologies to support and/or to transfer scientific knowledge are not yet completely elaborated and used in the larger scientific public. The simple solution, the use of current web standards instead of PDFs could not solve all the problems of the scholarly communication. Both worlds (the PDF based as well as the web based) are strongly challenged.

Based on the real life issues paper raised, the acceptance of this paper is suggested.

Anonymous Reviewer replied on

The paper outlines an environment for open research. It supports the accessibility, citability, and re-use of scientific results including papers and raw data.

It perfectly fits the track, and points at real shortcomings of our current ways doing research.

It is true, that one master format should be enough to generate the various versions of the paper required for printing, online publishing, etc. TeX had some initial intentions in this directions, and the HTML digital publishing WG also supports this idea.

Furthermore, a workbench to collect, merge and analyze existing results would also be nice.

This part of the solution is too much based on a statistical approach, which uses a previously published result. In many cases, opinion mining or fact mining (think about nano-publications) is needed, not statistical methods, for which Linked Data browsers such as LODmilla or Linked Open Graph may be more suitable. The facts found using these tools can then be embedded and cited in the new paper.

A similar requirement would be to use/re-use bookmarks in such a workbench.

Overall, the paper conveys few concrete results but many interesting ideas.

It is well written, though the 'specification' part in the middle is a bit boring to me.

Anonymous Reviewer replied on

The aim of this investigation is to present a design and its partial implementation to discover and share knowledge and experience to receive maximal openness using Web technologies. The functional requirements like accessibility, usability, revocability are done. A conceptual design including Linked Research, HTML format, Web technologies, multilayer structure is proposed. The technical architecture is well composed and have to include: RTF extended HTML, vocabularies and ontologies, references and citations tools, embedded components, persistent identifiers, archival, interaction tool etc. The theoretical hypotheses and relationships are partially implemented through Linked Research editor and statistical displays analyzer.

In the conclusion is underlined that 1) it is possible to share the research knowledge with rest of the world and 2) to submit copies of the research works according to different requirements.

Anonymous Reviewer replied on

Yes, it does fit the track;

Yes, this is an interesting problem and the authors showed how important are and how we can evolute;

yes, the motivation is good and the subject is interesting;

conceptual development and literature is visited. I would suggest to the authors, for each topic they describe, put a citation, to improve and support the sentence/topic. For example, there are several examples but we don't know where they came from. It is possible to have a better text if they came from anything (literature review would be great and some case study, etc);

method: it is clear which is the method and what were the results expected;

Findings are interesting and well balanced between the conclusion and the body text;

clear paper;

Good English;

Paper is with the maximum, 12 pages. OK

I hope help!
