The emerging Web of data is not a static structure of linked datasets, but
a dynamic, continuously evolving framework. In a distributed fashion and without notice, novel
datasets are added, while others are modified, abandoned to obsolescence, or removed from
the Web. All of this happens without centralized monitoring or any predefined policy, following the
scale-free nature of the Web.
Applications and businesses that rely on the availability of
certain data over time, or that seek to track data or study its evolution,
therefore need to build their own infrastructures to preserve and query data over time.
As a result, preservation policies for Linked Data collections emerge as a novel topic with
the goal of assuring the quality and traceability of datasets over time. However, previous
experience with traditional Web archives, such as the Internet Archive, with petabytes
of archived information, already highlights scalability problems when managing evolving
volumes of information at Web scale, making longitudinal querying a
formidable challenge with current tools.
It should be stressed that querying Web
archives deals mainly with text, whereas archiving structured interlinked data
must focus on structured queries across time. In particular, several research challenges
arise when representing and querying evolving structured interlinked data:
How can we represent archives of continuously evolving linked datasets?
How can huge archives remain processable?
How can we minimize the redundant information of archives?
How can we capture the expressiveness of emerging retrieval demands in archiving (e.g. time-traversing, traceability, evolution) and design a query language for evolving interlinked data?
How can we index these archives at large scale so that the required queries can still be processed efficiently?
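To make the questions above more concrete, the following is a minimal sketch of one possible archiving strategy, assuming a purely change-based representation in which each version stores only the triples added and removed since the previous one. It is an illustration only and not the representation proposed by the project; the class `DeltaArchive`, its methods, and the `ex:` triples are hypothetical names and toy data introduced for this example.

```python
from dataclasses import dataclass, field

# A triple is modelled as a plain (subject, predicate, object) tuple of strings.
Triple = tuple[str, str, str]


@dataclass
class DeltaArchive:
    """Hypothetical change-based archive: one (added, removed) pair per version."""

    deltas: list = field(default_factory=list)
    _latest: set = field(default_factory=set)

    def commit(self, snapshot: set) -> int:
        """Store only the difference to the previous version; return the version id."""
        added = frozenset(snapshot - self._latest)
        removed = frozenset(self._latest - snapshot)
        self.deltas.append((added, removed))
        self._latest = set(snapshot)
        return len(self.deltas) - 1

    def materialize(self, version: int) -> set:
        """Time-traversing query: rebuild the dataset as it was at `version`."""
        state: set = set()
        for added, removed in self.deltas[: version + 1]:
            state -= removed
            state |= added
        return state

    def diff(self, v1: int, v2: int) -> tuple:
        """Evolution query: triples added and removed between two versions."""
        before, after = self.materialize(v1), self.materialize(v2)
        return after - before, before - after


# Toy data (assumed for illustration): one value changes between two versions.
archive = DeltaArchive()
v0 = archive.commit({("ex:city", "ex:population", "1000"),
                     ("ex:city", "ex:country", "ex:land")})
v1 = archive.commit({("ex:city", "ex:population", "1100"),
                     ("ex:city", "ex:country", "ex:land")})
added, removed = archive.diff(v0, v1)
```

In this sketch the unchanged triple is stored once rather than once per version, which reduces redundancy, but materializing a late version requires replaying every earlier delta. That trade-off between space and query time is one reason why representation and indexing appear as separate questions above.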
The proposed project tackles the problem of archiving and querying evolving semantic Web data. To that end, we aim to provide a novel representation leading to compressed, queryable linked data archives. Under this scenario, we will investigate the expressiveness required to query archives across time, and we will propose a structured
query language matching the specific needs of consuming local and federated archives.
Thus, the project involves several research areas, from optimized representations for
archiving evolving linked data up to indexing archives at large scale, time-based query
languages, federation and performance optimization. Finally, we plan to validate all our
steps on real data, on the specific use case of archiving governmental Open Data. The resulting project objectives are summarized below.
The project started on 1 January 2015 and has a duration of 24 months. It is funded by the Austrian Science Fund (FWF): M1720-G11, and hosted by the Institute for Information Business of the WU Vienna University of Economics and Business.