Updating Wikipedia via DBpedia Mappings and SPARQL

DBpedia crystallized most of the concepts of the Semantic Web using simple mappings to convert Wikipedia articles (i.e., infoboxes and tables) to RDF data. This “semantic view” of wiki content has rapidly become the focal point of the Linked Open Data cloud, but its impact on the original Wikipedia source is limited. In particular, little attention has been paid to the benefits that the semantic infrastructure can bring to maintaining the wiki content, for instance to ensure that the effects of a wiki edit are consistent across infoboxes. In this paper, we present an approach to allow ontology-based updates of wiki content. Starting from DBpedia-like mappings converting infoboxes to a fragment of an OWL 2 RL ontology, we discuss various issues associated with translating SPARQL updates over the semantic data into edits of the underlying wiki content. On the one hand, we provide a formalization of DBpedia as an Ontology-Based Data Management framework and study its computational properties. On the other hand, we provide a novel approach to the inherently intractable update translation problem, leveraging the pre-existing data to disambiguate updates.
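
To make the setting concrete, the following is a minimal sketch, in Python with rdflib, of the kind of SPARQL update whose translation back to wiki content the paper studies; the DBpedia IRIs and population values are purely illustrative.

```python
# A SPARQL DELETE/INSERT update run against a local rdflib graph; in the
# paper's setting such an update would have to be translated back into
# an edit of the source Wikipedia infobox. Values are illustrative.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbr: <http://dbpedia.org/resource/> .
dbr:Vienna dbo:populationTotal 1741246 .
""", format="turtle")

g.update("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
DELETE { dbr:Vienna dbo:populationTotal ?old }
INSERT { dbr:Vienna dbo:populationTotal 1897491 }
WHERE  { dbr:Vienna dbo:populationTotal ?old }
""")

for triple in g:
    print(triple)
```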

PDF | Bibtex

Self-Enforcing Access Control for Encrypted RDF

The amount of raw data exchanged via web protocols is steadily increasing. Although the Linked Data infrastructure could potentially be used to selectively share RDF data with different individuals or organisations, the primary focus remains on the unrestricted sharing of public data. In order to extend the Linked Data paradigm to cater for closed data, there is a need to augment the existing infrastructure with robust security mechanisms. At the most basic level both access control and encryption mechanisms are required. In this paper, we propose a flexible and dynamic mechanism for securely storing and efficiently querying RDF datasets. By employing an encryption strategy based on Functional Encryption (FE) in which controlled data access does not require a trusted mediator, but is instead enforced by the cryptographic approach itself, we allow for fine-grained access control over encrypted RDF data while at the same time reducing the administrative overhead associated with access control management.
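
For intuition only, here is a toy sketch of mediator-free, pattern-based access: each triple is encrypted under a key derived from a triple pattern it matches, so possessing the corresponding "query key" is itself the access grant. This is a deliberately simplified symmetric stand-in, not functional encryption proper, and all names are invented.

```python
# Toy stand-in: keys are derived per triple pattern ('?' = wildcard), so
# no mediator is consulted at decryption time. NOT real FE.
import base64, hashlib
from cryptography.fernet import Fernet, InvalidToken

def pattern_key(s, p, o):
    digest = hashlib.sha256(f"{s}|{p}|{o}".encode()).digest()
    return Fernet(base64.urlsafe_b64encode(digest))

def encrypt_triple(s, p, o):
    # Simplification: encrypt only under the subject-bound pattern.
    return pattern_key(s, "?", "?").encrypt(f"{s} {p} {o}".encode())

store = [
    encrypt_triple("ex:alice", "foaf:knows", "ex:bob"),
    encrypt_triple("ex:carol", "foaf:knows", "ex:dave"),
]

# A user holding the key for pattern (ex:alice, ?, ?) decrypts only
# Alice's triples; the other ciphertexts simply fail to open.
key = pattern_key("ex:alice", "?", "?")
for ct in store:
    try:
        print(key.decrypt(ct).decode())
    except InvalidToken:
        pass
```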

PDF | Bibtex

Report on the 2nd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2016)

Research on preserving evolving linked datasets is gaining increasing attention in the Semantic Web community. The 2nd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2016) aimed at addressing the numerous and diverse emerging challenges, from change discovery and scalable archive representations/indexes/infrastructures to time-based query languages and evolution analysis. In this report, we motivate our workshop and outline the keynote by Axel Polleres and the papers (three full papers and one industry paper) presented at MEPDaW 2016, co-located with the ESWC 2016 conference in Anissaras, Crete, Greece. Finally, we conclude with an outlook on future directions.

PDF | Bibtex

Characterizing RDF Datasets

The publication of semantic web data, commonly represented in RDF, has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterize RDF data. We specifically focus on revealing the redundancy of each dataset, as well as common structural patterns. We evaluate the proposed metrics on several datasets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
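
For flavour, the sketch below computes a few simple structural statistics of the sort such metrics build on; the paper's actual metric definitions are not reproduced, and `dataset.ttl` is a placeholder path.

```python
# Simple structural statistics over an RDF file with rdflib.
from collections import Counter
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")  # placeholder input file

subj_deg = Counter(s for s, _, _ in g)   # out-degree per subject
pred_use = Counter(p for _, p, _ in g)   # usage count per predicate

print("triples:", len(g))
print("distinct subjects:", len(subj_deg))
print("most used predicates:", pred_use.most_common(5))

# Few distinct predicate "signatures" relative to the number of
# subjects hints at repetitive, highly compressible structure.
sigs = {frozenset(p for _, p, _ in g.triples((s, None, None)))
        for s in subj_deg}
print("signatures / subjects:", len(sigs) / len(subj_deg))
```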

PDF | Bibtex

Evaluating Query and Storage Strategies for RDF Archives

There is an emerging demand for techniques addressing the problem of efficiently archiving and (temporal) querying different versions of evolving semantic Web data. While systems for archiving and/or temporal querying are still in their early days, we consider this a good time to discuss benchmarks for evaluating the storage space efficiency of archives, the retrieval functionality they serve, and the performance of various retrieval operations. To this end, we provide theoretical foundations on the design of data and queries to evaluate emerging RDF archiving systems. Next, we instantiate these foundations along a concrete set of queries on the basis of a real-world evolving dataset. Finally, we perform an empirical evaluation of various current archiving techniques and querying strategies on this data. Our work comprises, to the best of our knowledge, the first benchmark for querying evolving RDF data archives.
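
As a minimal illustration of two storage policies such an evaluation typically contrasts, the sketch below compares independent copies (every version stored in full) with a change-based policy (a base version plus deltas); the data is invented.

```python
# Independent copies (IC) vs change-based (CB) archiving, on toy data.
v0 = {("ex:a", "ex:p", "1")}
v1 = {("ex:a", "ex:p", "2"), ("ex:b", "ex:p", "1")}

# IC stores v0 and v1 as-is; CB stores v0 plus one delta.
delta = {"added": v1 - v0, "deleted": v0 - v1}

def materialize(base, deltas):
    """Rebuild a version by replaying deltas over the base version."""
    version = set(base)
    for d in deltas:
        version -= d["deleted"]
        version |= d["added"]
    return version

assert materialize(v0, [delta]) == v1  # CB reconstructs v1 exactly
```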

PDF | Bibtex

Towards Updating Wikipedia via DBpedia Mappings and SPARQL

DBpedia is a community effort that has created the most important cross-domain dataset in RDF and a focal point of the Linked Open Data (LOD) cloud. At its core is a set of declarative mappings extracting the data from Wikipedia infoboxes and tables into RDF. However, while DBpedia focuses on publishing knowledge in a machine-readable way, little attention has been paid to the benefits of supporting machine updates. This greatly restricts the possibilities of automatic curation of the DBpedia data that could be semi-automatically propagated to Wikipedia, and also prevents maintainers from evaluating the impact of their edits on the consistency of the knowledge. Excluding the DBpedia taxonomy from the editing cycle is a major drawback which we aim to address. This paper starts a discussion of DBpedia, making the case for it as a benchmark for Ontology-Based Data Management (OBDM). As we show, although based on fairly restricted mappings (which we cast as a variant of nested tgds here) and a minimalistic TBox language, accommodating DBpedia updates is intricate from different perspectives, ranging from the conceptual (what is an adequate semantics for DBpedia SPARQL updates?) to challenges related to user interface design.
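
To give a feel for such declarative mappings, here is a hypothetical sketch translating infobox key/value pairs into RDF triples; the mapping table, attribute names and values are invented for illustration.

```python
# A hypothetical infobox-to-RDF mapping in the spirit of DBpedia's
# declarative mappings; attribute names and values are invented.
from rdflib import Graph, Literal, Namespace

DBO = Namespace("http://dbpedia.org/ontology/")
DBR = Namespace("http://dbpedia.org/resource/")

# Mapping table: infobox attribute -> ontology property.
MAPPING = {"population": DBO.populationTotal, "area_km2": DBO.areaTotal}

def infobox_to_rdf(article, infobox):
    g = Graph()
    subject = DBR[article.replace(" ", "_")]
    for attr, value in infobox.items():
        prop = MAPPING.get(attr)
        if prop is not None:         # unmapped attributes are dropped
            g.add((subject, prop, Literal(value)))
    return g

g = infobox_to_rdf("Vienna", {"population": 1897491, "area_km2": 414.87})
print(g.serialize(format="turtle"))
```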

PDF | Bibtex

Self-Indexing RDF Archives

Although Big RDF management is an emerging topic in the so-called Web of Data, existing techniques disregard the dynamic nature of RDF data. RDF archives evolve over time and need to be preserved and queried across time. This paper presents v-RDFCSA, an RDF archiving solution that extends RDFCSA (an RDF self-index) to provide version-based queries on top of compressed RDF archives. Our experiments show that v-RDFCSA reduces space requirements by 35-60 times over a state-of-the-art baseline and outperforms it by more than one order of magnitude in query resolution.
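
To show only the query semantics (v-RDFCSA answers these over a compressed self-index, not over plain dictionaries), here is a naive sketch of version-based queries on a toy archive.

```python
# Naive version-based queries: index each triple by the versions in
# which it holds. Purely to illustrate the query types, not the index.
from collections import defaultdict

versions = [
    {("ex:a", "ex:p", "1")},                         # version 0
    {("ex:a", "ex:p", "1"), ("ex:b", "ex:p", "2")},  # version 1
]

index = defaultdict(set)
for i, v in enumerate(versions):
    for t in v:
        index[t].add(i)

def ver(triple):     # in which versions does the triple hold?
    return index[triple]

def mat(triple, i):  # does the triple hold in version i?
    return i in index[triple]

print(ver(("ex:b", "ex:p", "2")))      # {1}
print(mat(("ex:a", "ex:p", "1"), 0))   # True
```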

PDF | Bibtex

Ontology-Based Search of Genomic Metadata

The Encyclopedia of DNA Elements (ENCODE) is a huge and still expanding public repository of more than 4,000 experiments and 25,000 data files, assembled by a large international consortium since 2007; unknown biological knowledge can be extracted from these vast and largely unexplored data, leading to data-driven genomic, transcriptomic and epigenomic discoveries. Yet, the search for relevant datasets for knowledge discovery is only partially supported: the metadata describing ENCODE datasets are quite simple and incomplete, and are not described by a coherent underlying ontology. Here, we show how to overcome this limitation by adopting an ENCODE metadata searching approach which uses high-quality ontological knowledge and state-of-the-art indexing technologies. Specifically, we developed S.O.S. GeM (http://www.bioinformatics.deib.polimi.it/SOSGeM/), a system supporting effective semantic search and retrieval of ENCODE datasets. First, we constructed a Semantic Knowledge Base by starting with concepts extracted from ENCODE metadata, matched to and expanded on biomedical ontologies integrated in the well-established Unified Medical Language System; we prove that this inference method is sound and complete. Then, we leveraged the Semantic Knowledge Base to semantically search ENCODE data from arbitrary biologists’ queries; this allows correctly finding more datasets than a purely syntactic search, which is all that other available systems support. We empirically show the relevance of the found datasets to the biologists’ queries.
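
As a toy sketch of ontology-driven query expansion in this spirit, the snippet below expands a query term through a synonym table before matching metadata; the table is a hypothetical stand-in for concepts matched against UMLS-integrated ontologies, and the dataset descriptions are invented.

```python
# Toy ontology-driven query expansion; synonym table and metadata are
# invented stand-ins for UMLS-backed concept matching.
SYNONYMS = {
    "myocardium": {"myocardium", "heart muscle", "cardiac muscle"},
}

METADATA = {
    "ENC001": "ChIP-seq of cardiac muscle tissue",
    "ENC002": "RNA-seq of liver tissue",
}

def semantic_search(term):
    expanded = SYNONYMS.get(term.lower(), {term.lower()})
    return [ds for ds, desc in METADATA.items()
            if any(t in desc.lower() for t in expanded)]

# A purely syntactic search for "myocardium" finds nothing here, while
# the expanded search retrieves ENC001.
print(semantic_search("myocardium"))  # ['ENC001']
```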

PDF | Bibtex

Serializing RDF in compressed space

The amount of generated RDF data has grown impressively over the last decade, promoting compression as an essential tool for storage and exchange. RDF compression techniques leverage syntactic and semantic redundancies, but structural repetitions are not always addressed effectively. This paper first identifies two schema-based sources of redundancy underlying the schema-relaxed nature of RDF. Then, we revisit the W3C HDT binary format to further compact its graph structure encoding. Our HDT++ approach reduces the original HDT Triples requirements by up to 2 times for more structured datasets, and reports significant improvements even for highly semi-structured datasets like DBpedia. In general, HDT++ competes with the current state of the art in structural RDF compression, leading the comparison for three of the four analyzed datasets.
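
The sketch below illustrates the kind of structural redundancy HDT++ exploits: many subjects share the same set of predicates (a "predicate family"), so the predicate list can be stored once per family rather than once per subject. The data is invented and the encoding itself is not shown.

```python
# Grouping subjects into predicate families, the structural redundancy
# that HDT++ exploits; toy data, encoding not shown.
from collections import defaultdict

triples = [
    ("ex:s1", "foaf:name", "A"), ("ex:s1", "foaf:age", "30"),
    ("ex:s2", "foaf:name", "B"), ("ex:s2", "foaf:age", "25"),
    ("ex:s3", "foaf:name", "C"),
]

preds_of = defaultdict(set)
for s, p, _ in triples:
    preds_of[s].add(p)

families = defaultdict(list)
for s, preds in preds_of.items():
    families[frozenset(preds)].append(s)

for fam, subjects in families.items():
    print(sorted(fam), "->", subjects)
# Two families: {age, name} shared by s1/s2, and {name} for s3.
```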

PDF | Bibtex

HDT-MR: A scalable solution for RDF compression with HDT and MapReduce

HDT is a binary RDF serialization aimed at minimizing the space overheads of traditional RDF formats, while providing retrieval features in compressed space. Several HDT-based applications, such as the recent Linked Data Fragments proposal, leverage these features for diverse publication, interchange and consumption purposes. However, scalability issues emerge in HDT construction because the whole RDF dataset must be processed in a memory-consuming task. This is hindering the evolution of novel applications and techniques at Web scale. This paper introduces HDT-MR, a MapReduce-based technique to process huge RDF datasets and build the HDT serialization. HDT-MR runs in time linear in the dataset size and has proven able to serialize datasets of up to several billion triples, preserving HDT compression and retrieval features.
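
For intuition, here is a single-machine sketch of the dictionary encoding at the heart of HDT: terms are mapped to integer IDs once, and triples become compact ID tuples. HDT-MR distributes essentially these passes across a MapReduce cluster; none of that distribution is shown here.

```python
# Dictionary encoding: map RDF terms to IDs, then triples to ID tuples.
def dictionary_encode(triples):
    terms = sorted({t for triple in triples for t in triple})
    ids = {term: i + 1 for i, term in enumerate(terms)}
    return ids, [(ids[s], ids[p], ids[o]) for s, p, o in triples]

ids, id_triples = dictionary_encode([
    ("ex:a", "ex:p", "ex:b"),
    ("ex:b", "ex:p", "ex:c"),
])
print(id_triples)  # [(1, 4, 2), (2, 4, 3)]
```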

PDF | Bibtex

Towards Efficient Archiving of Dynamic Linked Open Data

The Linked Data paradigm has enabled a huge shared infrastructure for connecting data from different domains, which can be browsed and queried together as a vast knowledge base. However, the structured, interlinked datasets in this Web of Data are not static but continuously evolving, which calls for approaches to preserve Linked Data across time. In this article, we survey and analyse current techniques addressing the problem of archiving different versions of semantic Web data, with a focus on their space efficiency, the retrieval functionality they serve, and the performance of such operations.

PDF | Bibtex

On the Road to the Evaluation of RDF Stream Compression Techniques

The popularization of data streaming applications, such as those related to social networks and the Internet of Things, has fostered the interest of the Semantic Web community in this kind of data. As a result of this interest, the W3C RDF Stream Processing (RSP) community group has recently been started with the goal of defining a common model “for producing, transmitting and continuously querying RDF Streams”. In this expression of interest (EOI), we focus on the transmission model.

PDF | Bibtex

The DBpedia Wayback Machine

DBpedia is one of the biggest and most important focal points of the Linked Open Data movement. However, in spite of its multiple services, it lacks a wayback mechanism to retrieve historical versions of resources at a given timestamp in the past, thus preventing systems from working on the full history of RDF documents. In this paper, we present a framework that provides this mechanism and is publicly offered through a Web UI and a RESTful API, following the Linked Open Data principles.
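
A hedged sketch of what a client of such a wayback service might look like; the endpoint URL and parameter names below are hypothetical and do not describe the service's actual API.

```python
# Hypothetical client for a wayback-style service: fetch the version of
# a resource as of a given timestamp. Endpoint and params are invented.
import requests

def get_version(resource, timestamp):
    resp = requests.get(
        "http://example.org/wayback/api",  # hypothetical endpoint
        params={"resource": resource, "timestamp": timestamp},
        headers={"Accept": "text/turtle"},
    )
    resp.raise_for_status()
    return resp.text

# get_version("http://dbpedia.org/resource/Vienna", "2015-06-01T00:00:00Z")
```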

PDF | Bibtex

BEAR: Benchmarking the Efficiency of RDF Archiving

There is an emerging demand for techniques addressing the problem of efficiently archiving and (temporal) querying different versions of evolving semantic Web data. While systems for archiving and/or temporal querying are still in their early days, we consider this a good time to discuss benchmarks for evaluating the storage space efficiency of archives, the retrieval functionality they serve, and the performance of various retrieval operations. To this end, we provide a blueprint on benchmarking archives of semantic data by defining a concise set of operators that cover the major aspects of querying and interacting with such archives. Next, we introduce BEAR, which instantiates this blueprint to serve a concrete set of queries on the basis of real-world evolving data. Finally, we perform an empirical evaluation of current archiving techniques that is meant to serve as a first baseline for future developments in querying archives of evolving RDF data.
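
As a minimal sketch of the three operator types such a blueprint covers, the snippet below implements version materialisation (mat), delta materialisation (diff) and version queries (ver) over a toy archive of per-version triple sets; the operator names follow common usage in this line of work, and the data is invented.

```python
# mat / diff / ver over a toy archive of per-version triple sets.
archive = [
    {("ex:a", "ex:p", "1")},                         # version 0
    {("ex:a", "ex:p", "2"), ("ex:b", "ex:p", "1")},  # version 1
]

def mat(q, i):
    """All triples of version i matching the pattern predicate q."""
    return {t for t in archive[i] if q(t)}

def diff(q, i, j):
    """Matching triples (added, deleted) between versions i and j."""
    return mat(q, j) - mat(q, i), mat(q, i) - mat(q, j)

def ver(q):
    """For each matching triple, the versions in which it holds."""
    return {(t, i) for i in range(len(archive)) for t in mat(q, i)}

q = lambda t: t[0] == "ex:a"  # a simple subject-bound pattern
print(diff(q, 0, 1))
```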

PDF | Bibtex