The emerging Web of data is not a static structure of linked datasets but a dynamic, continuously evolving framework. In a distributed fashion and without notice, new datasets are added, while others are modified, abandoned to obsolescence, or removed from the Web. All of this happens without centralized monitoring or any predefined policy, following the scale-free nature of the Web.

Applications and businesses that rely on the availability of certain data over time, or that seek to track data or study its evolution, therefore need to build their own infrastructures to preserve and query data over time.

Preservation policies for Linked Data collections therefore emerge as a novel topic, with the goal of assuring the quality and traceability of datasets over time. However, previous experience with traditional Web archives, such as the Internet Archive with its petabytes of archived information, already highlights scalability problems when managing evolving volumes of information at Web scale, making longitudinal querying across time a formidable challenge with current tools.

It should be stressed that querying Web archives deals mainly with text, whereas archiving structured interlinked data must focus on structured queries across time. In particular, several research challenges arise when representing and querying evolving structured interlinked data:


How can we represent archives of continuously evolving linked datasets? How can huge archives remain processable?


How can we minimize the redundant information of archives?


How can we capture the expressiveness of emerging retrieval demands in archiving (e.g. time-traversing, traceability, evolution) and design a query language for evolving interlinked data?


How can we index these archives at large scale to still process the demanded queries efficiently?

The proposed project tackles the problem of archiving and querying evolving semantic Web data. To that end, we aim to provide a novel representation leading to compressed, queryable linked data archives. In this scenario, we will investigate the expressiveness required to query archives across time, and we will propose a structured query language matching the specific needs of consuming local and federated archives.

Thus, the project involves several research areas, from optimized representations for archiving evolving linked data up to indexing archives at large scale, time-based query languages, federation and performance optimization. Finally, we plan to validate all our steps on real data, on the specific use case of archiving governmental Open Data. The resulting project objectives are summarized below.

Providing an optimal representation for archiving evolving linked data, minimizing inherent redundancy of different federated archives while preserving their original metadata from their source.
Designing specific query languages aimed at capturing the required expressiveness for these archives, enabling time-traversing patterns and querying evolution patterns.
Providing compressed indexes of evolving linked data archives scaling up to large volumes of data, based on succinct data structures.
Optimizing query resolution plans for archives, enabling the integration of other sources of archiving.
Building a prototype platform focused on the use case of archiving evolving governmental Open Data, verifying the feasibility and sustainability of the proposal.
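
To make the first two objectives more concrete, the following is a minimal sketch of one possible change-based archive, written in Python. It is an illustration only and not the representation proposed by the project: the class name DeltaArchive, its methods, and the example triples are hypothetical. Each version is stored as the sets of triples added and removed with respect to the previous version, so triples that do not change between versions are not stored again, and a simple time-traversing query (the state of the data at a given version) is answered by replaying the changes.

```python
from typing import FrozenSet, List, Set, Tuple

# A triple is modelled as a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


class DeltaArchive:
    """Hypothetical change-based archive: each version keeps only its delta."""

    def __init__(self) -> None:
        # One (added, removed) pair per committed version.
        self._deltas: List[Tuple[FrozenSet[Triple], FrozenSet[Triple]]] = []
        # Latest full state, kept only to compute the next delta.
        self._latest: Set[Triple] = set()

    def commit(self, snapshot: Set[Triple]) -> int:
        """Archive a new snapshot and return its version number."""
        added = frozenset(snapshot - self._latest)
        removed = frozenset(self._latest - snapshot)
        self._deltas.append((added, removed))
        self._latest = set(snapshot)
        return len(self._deltas) - 1

    def materialize(self, version: int) -> Set[Triple]:
        """Rebuild the full set of triples valid at the given version."""
        if not 0 <= version < len(self._deltas):
            raise IndexError(f"unknown version {version}")
        state: Set[Triple] = set()
        for added, removed in self._deltas[: version + 1]:
            state -= removed
            state |= added
        return state

    def diff(self, old: int, new: int) -> Tuple[Set[Triple], Set[Triple]]:
        """Return (added, removed) triples between two versions."""
        before, after = self.materialize(old), self.materialize(new)
        return after - before, before - after

    def stored_triples(self) -> int:
        """Number of triples physically stored across all deltas."""
        return sum(len(added) + len(removed) for added, removed in self._deltas)
```

In this sketch, materializing a version costs time proportional to the number of changes archived up to that version, which is the kind of trade-off between space and query time that the planned compressed indexes are meant to address.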

The project started on 1.01.2015 and has a duration of 24 months. It is funded by the Austrian Science Fund (FWF): M1720-G11, and hosted by the Institute for Information Business of the WU Vienna University of Economics and Business.