1. Overview

There is an emerging demand for efficient archiving and (temporal) querying of different versions of evolving semantic Web data. As novel archiving systems begin to address this challenge, foundations and standards for benchmarking RDF archives are needed to evaluate their storage space efficiency and the performance of different retrieval operations.

To this end, we have developed a BEnchmark of RDF ARchives (BEAR), a test suite composed of three real-world datasets together with queries of varying complexity, covering a broad range of archiving use cases.

The DATA:

BEAR comprises three main datasets, namely BEAR-A, BEAR-B, and BEAR-C, each having different characteristics.

Dynamic Linked Data

BEAR-A is composed of 58 weekly snapshots from the Dynamic Linked Data Observatory. It provides triple pattern queries to test the atomic operations described below, such as materialisation, diff and version queries.

DBpedia Live

The BEAR-B dataset has been compiled from DBpedia Live changesets over the course of three months and contains the 100 most volatile resources along with their updates and real-world triple pattern queries from user logs.

Open Data portals

BEAR-C builds on the Open Data Portal Watch project: it takes the dataset descriptions of the European Open Data portal over 32 weeks. With the help of Open Data experts, we created 10 complex queries that retrieve different information from datasets and files.

The QUERIES:

The evaluation consists of defining SPARQL queries for each dataset and computing three different operations for each query:
  • Version Materialisation, Mat(Q, Vi): it provides the SPARQL resolution of the query Q at the given version Vi. In order to test the applicability in all scenarios, one should report times for each version in the dataset, that is, V0 … Vn.
  • Delta Materialisation, Diff(Q, Vi, Vj): it provides the difference between the results of the query Q in the given versions Vi and Vj. A minimum test consists of performing diffs between the initial version and increasing intervals of 5 versions, i.e., Diff(Q, V0, Vi) for i in {5, 10, 15, …, 55, n}.
  • Version Queries, Ver(Q): it provides the results of the query Q annotated with the version(s) in which each result holds.

Note: additional (non-mandatory) operations can be found in the BEAR-A related article.
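To make the semantics of the three operations concrete, the following minimal Python sketch resolves them naively over an IC (one file per version) archive with rdflib. The file layout and all names are our own illustration, not part of the benchmark definition:

from rdflib import Graph

def load_version(i):
    # hypothetical layout: one N-Triples file per version, e.g. ic/0.nt
    g = Graph()
    g.parse(f"ic/{i}.nt", format="nt")
    return g

def mat(query, i):
    # Mat(Q, Vi): resolve Q against the materialised version Vi
    return set(load_version(i).query(query))

def diff(query, i, j):
    # Diff(Q, Vi, Vj): results gained and lost between Vi and Vj
    ri, rj = mat(query, i), mat(query, j)
    return rj - ri, ri - rj  # (added results, removed results)

def ver(query, n_versions):
    # Ver(Q): each result annotated with the versions in which it holds
    annotated = {}
    for i in range(n_versions):
        for row in mat(query, i):
            annotated.setdefault(row, []).append(i)
    return annotated

Dedicated archiving systems avoid this naive full materialisation; the point of the benchmark is precisely to measure how much better they do.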


2. BEAR-A

Description of the dataset

We build our RDF archive on the data hosted by the Dynamic Linked Data Observatory. BEAR-A is composed of the first 58 weekly snapshots, i.e. 58 versions, from this corpus. We removed the context information and managed the resulting set of triples, disregarding duplicates. We also replaced blank nodes with Skolem IRIs (with the prefix http://example.org/bnode/) in order to simplify the computation of diffs.
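A minimal sketch of this kind of skolemization, assuming rdflib; the prefix matches the one above, while the function name is ours:

from rdflib import Graph, URIRef, BNode

SKOLEM_PREFIX = "http://example.org/bnode/"

def skolemize(g):
    # Blank node labels are not stable across serialisations, so two
    # snapshots may label the same node differently; minting a fixed IRI
    # per label turns diff computation into plain set difference.
    out = Graph()
    for s, p, o in g:
        if isinstance(s, BNode):
            s = URIRef(SKOLEM_PREFIX + str(s))
        if isinstance(o, BNode):
            o = URIRef(SKOLEM_PREFIX + str(o))
        out.add((s, p, o))
    return out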

We report the data configuration features that are relevant for the benchmark. The following table lists basic statistics of the dataset.

Versions | Triples in version 0 | Triples in version 57 | Growth | Change ratio | Change ratio (adds) | Change ratio (deletes) | Static core | Version-oblivious triples
58 | 30M | 66M | 101% | 31% | 33% | 27% | 3.5M | 376M

[Figure: number of statements per version]

[Figure: data growth]

As can be seen, although the number of statements in the last version doubles the initial size, the mean data growth between consecutive versions is almost marginal (101%). A closer look at the figures above shows that the latest versions contribute most of this increase. Similarly, the version change ratios point to the concrete add and delete operations: a mean of 31% of the data changes between two versions.

The number of version-oblivious triples (376M) points to a relatively low number of distinct triples across the whole history, if we compare it against the number of versions and the size of each version. Finally, note the remarkably small static core (3.5M).
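For reference, these statistics can be computed directly from the version sets. A sketch under the definitions used in the BEAR article (change ratio as adds plus deletes over the union of two consecutive versions, static core as the intersection of all versions, version-oblivious triples as their union); loading the versions is assumed to happen elsewhere:

def change_ratio(vi, vj):
    # vi, vj: sets of triples of two consecutive versions
    adds, dels = vj - vi, vi - vj
    return (len(adds) + len(dels)) / len(vi | vj)

def static_core(versions):
    # triples that appear in every version
    core = set(versions[0])
    for v in versions[1:]:
        core &= v
    return core

def version_oblivious(versions):
    # distinct triples that appear in at least one version
    return set().union(*versions)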

We present below the RDF vocabulary (distinct subjects, predicates and objects) per version and per delta (adds and deletes). As can be seen, the number of distinct subjects and predicates remains stable except for the noticeable increase in the latest versions, as already identified in the number of statements per version. However, the number of added and deleted subjects and objects fluctuates greatly and remains high (within one order of magnitude of the total number of elements). In turn, the number of predicates is proportionally smaller, but it shows a similar behaviour.

[Figures: subjects, predicates and objects per version]
Policy | Description | Size (tar.gz) | Download
IC | One N-Triples file per version | 22 GB | alldata.IC.nt.tar.gz
CB | Two N-Triples files (added and deleted triples) per version | 13 GB | alldata.CB.nt.tar.gz
TB | One N-Quads file where the named graph annotates the version(s) of the triple | 4 GB | alldata.TB.nq.gz
CBTB | One N-Quads file where the named graph annotates the version(s) where the triple has been added/removed | 14 GB | alldata.CBTB.nq.gz

Description of the queries

BEAR-A provides SPARQL triple pattern queries (see the SPARQL specification for further information on triple patterns) and their results in all 58 versions. The evaluation then consists of computing the defined materialisation, diff and version operations over them.

Triple pattern queries are split into queries with a low and a high number of results (cardinality). 50 triple pattern queries of the forms S??, ?P? and ??O are carefully selected to provide a small deviation between versions, such that differences in performance can only be explained by the efficiency of the underlying archiving system; a possible selection filter is sketched below. Random S?O, SP? and ?PO queries are selected from the previous ones. We additionally sample 50 SPO queries from the static core (present in all versions).
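The selection criterion can be thought of as a stability filter over the per-version result counts. A hypothetical sketch; the threshold and names are ours, not the benchmark's:

from statistics import mean, pstdev

def is_stable(cardinalities, max_rel_dev=0.1):
    # cardinalities: result counts of one triple pattern in each of the
    # 58 versions; keep the pattern only if its cardinality barely varies,
    # so performance differences are attributable to the archive itself
    m = mean(cardinalities)
    return m > 0 and pstdev(cardinalities) / m <= max_rel_dev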


3. BEAR-B

Description of the dataset

The BEAR-B dataset has been compiled from DBpedia Live changesets over the course of three months (August to October 2015). DBpedia Live records all updates to Wikipedia articles and hence re-extracts and instantly updates the respective DBpedia Live resource descriptions.

BEAR-B contains the resource descriptions of the 100 most volatile resources along with their updates. The most volatile resource (dbr:Deaths_in_2015) changes 1,305 times, the least volatile resource contained in the dataset (dbr:Once_Upon_a_Time_(season_5)) changes 263 times.

As dataset updates in DBpedia Live occur instantly, the dataset shifts to a new version with every single update. In practice, one would possibly aggregate such updates in order to have fewer dataset modifications. Therefore, we also aggregated these updates at an hourly and a daily level, as sketched below. Hence, we get three time granularities from the changesets for the very same dataset: instant (21,046 versions), hour (1,299 versions), and day (89 versions).
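Aggregation here means replaying the instant changesets and collapsing them per bucket, so that an add followed by a delete of the same triple within one hour or day cancels out. A sketch; the changeset representation and the bucket function are assumptions:

def aggregate(changesets, bucket):
    # changesets: ordered (timestamp, added, deleted) tuples of triple sets
    # bucket: maps a timestamp to its hour or day key
    grouped = {}
    for ts, added, deleted in changesets:
        adds, dels = grouped.setdefault(bucket(ts), (set(), set()))
        for t in deleted:
            if t in adds:
                adds.discard(t)  # added earlier in the same bucket: cancels out
            else:
                dels.add(t)
        for t in added:
            if t in dels:
                dels.discard(t)  # deleted earlier in the same bucket: reinsertion cancels out
            else:
                adds.add(t)
    return grouped  # one aggregated (adds, deletes) changeset per bucket

This cancellation is why the coarser granularities report fewer version-oblivious triples and a larger static core in the table below.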

We report the data configuration features that are relevant for the benchmark. The following table lists basic statistics of the dataset.

Granularity | Versions | Triples in version 0 | Triples in last version | Growth | Change ratio | Change ratio (adds) | Change ratio (deletes) | Static core | Version-oblivious triples
instant | 21,046 | 33,502 | 43,907 | 100.001% | 0.011% | 0.007% | 0.004% | 32,094 | 234,588
hour | 1,299 | 33,502 | 43,907 | 100.090% | 0.304% | 0.197% | 0.107% | 32,303 | 178,618
day | 89 | 33,502 | 43,907 | 100.744% | 1.778% | 1.252% | 0.526% | 32,448 | 83,134

The dataset grows almost continuously from 33,502 to 43,907 triples. Since the time granularities differ in the number of intermediate versions, they show different change characteristics: a longer update cycle results in more extensive updates between versions, and the average version change ratio increases from a very small 0.011% for instant updates to 1.8% at the daily level.

It can also be seen that the aggregation of updates leads to the omission of changes: whereas the instant updates handle 234,588 version-oblivious triples, the daily aggregates only have 83,134 (hourly: 178,618), i.e. a considerable number of triples exist only for a short period of time before they are deleted again. Likewise, from the different sizes of the static core, we see that triples which were deleted at some point are reinserted after a short period of time (in the case of DBpedia Live this may happen when changes made to a Wikipedia article are reverted shortly after).

Granularity | Policy | Description | Size (tar.gz) | Download
instant | IC | One N-Triples file per version | 12 GB | alldata.IC.nt.tar.gz
instant | CB | Two N-Triples files (added and deleted triples) per version | 16 MB | alldata.CB.nt.tar.gz
instant | CBTB | One N-Quads file where the named graph annotates the version(s) where the triple has been added/removed | 5.2 MB | alldata.CBTB.nq.gz
hour | IC | One N-Triples file per version | 467 MB | alldata.IC.nt.tar.gz
hour | CB | Two N-Triples files (added and deleted triples) per version | 4.1 MB | alldata.CB.nt.tar.gz
hour | TB | One N-Quads file where the named graph annotates the version(s) of the triple | 189 MB | alldata.TB.nq.gz
hour | CBTB | One N-Quads file where the named graph annotates the version(s) where the triple has been added/removed | 3.1 MB | alldata.CBTB.nq.gz
day | IC | One N-Triples file per version | 32 MB | alldata.IC.nt.tar.gz
day | CB | Two N-Triples files (added and deleted triples) per version | 1.1 MB | alldata.CB.nt.tar.gz
day | TB | One N-Quads file where the named graph annotates the version(s) of the triple | 1.1 MB | alldata.TB.nq.gz
day | CBTB | One N-Quads file where the named graph annotates the version(s) where the triple has been added/removed | 1.4 MB | alldata.CBTB.nq.gz

Description of the queries

BEAR-B exploits the real-world usage of DBpedia to provide realistic queries. Thus, we extracted the 200 most frequent triple patterns from the DBpedia query set of the Linked SPARQL Queries (LSQ) dataset and filtered those that produce results in our BEAR-B corpus. We thus obtained a batch of 62 lookup queries, mixing ?P? and ?PO queries. The evaluation then consists of computing the defined materialisation, diff and version operations over them. Finally, we built 20 join cases using the selected triple patterns.

Triple pattern | Number | Queries | Results (hour) | Results (day)
?P? | 49 | Get queries | Get mat / diff / ver results | Get mat / diff / ver results
?PO | 13 | Get queries | Get mat / diff / ver results | Get mat / diff / ver results

Join queries | Number | Queries | Results (hour) | Results (day)
Subject-Object and Subject-Subject joins | 20 | Get queries | Get join results | Get join results

4. BEAR-C

Description of the dataset

The BEAR-C dataset is taken from the Open Data Portal Watch project. For this version of BEAR, we took the dataset descriptions of the European Open Data portal over 32 weeks, i.e. 32 snapshots.

Note that, as in BEAR-A, we also replaced Blank Nodes with Skolem IRIs (with a prefix http://example.org/bnode/) in order to simplify the computation of diffs.

We report the data configuration features that are relevant for the benchmark. The following table lists basic statistics of the dataset.

Versions | Triples in version 0 | Triples in last version | Growth | Change ratio | Change ratio (adds) | Change ratio (deletes) | Static core | Version-oblivious triples
32 | 485,179 | 563,738 | 100.478% | 67.617% | 33.671% | 33.946% | 178,484 | 9,403,540

[Figure: number of statements per version]

[Figure: data growth]

Each snapshot consists of roughly 500K triples, with very limited growth, as most of the updates are modifications of the metadata, i.e. adds and deletes report similar figures.

We present below the RDF vocabulary (distinct subjects, predicates and objects) per version and per delta (adds and deletes). As can be seen, the dynamicity of the updates is also reflected in the subject and object vocabulary, whereas the metadata is always described with the same predicate vocabulary, apart from a minor modification in versions 24 and 25.

[Figures: subjects, predicates and objects per version]
Policy | Description | Size (tar.gz) | Download
IC | One N-Triples file per version | 241 MB | alldata.IC.nt.tar.gz
CB | Two N-Triples files (added and deleted triples) per version | 201 MB | alldata.CB.nt.tar.gz
TB | One N-Quads file where the named graph annotates the version(s) of the triple | 133 MB | alldata.TB.nq.gz
CBTB | One N-Quads file where the named graph annotates the version(s) where the triple has been added/removed | 233 MB | alldata.CBTB.nq.gz

Description of the queries

BEAR-C provides complex queries that, although they cannot be resolved by current archiving strategies in a straightforward and optimised way, can help to foster the development and benchmarking of novel strategies and query resolution optimisations in archiving scenarios.

BEAR-C offers 10 queries that retrieve different information from datasets and files (referred to as distributions, where each dataset refers to one or more distributions) in the European Open Data portal. The evaluation then consists of computing the defined materialisation, diff and version operations over them.

Q1
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:accessURL ?URL .
}
 
 
Q2
PREFIX dcat: <http://www.w3.org/ns/dcat#> 
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:modified ?modified_date .
}
 
 
Q3
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dcat:contactPoint ?contact .
    ?contact vcard:fn ?name. 
    OPTIONAL{
        ?contact vcard:hasEmail ?email .
  }
}
 
 
Q4
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX eu: <http://ec.europa.eu/geninfo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:accessURL ?URL .
    ?distribution dc:license eu:legal_notices_en.htm .
    FILTER regex(?title, "region")
}
 
 
Q5
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:accessURL ?URL .
    ?distribution dc:description "Austria" .
  }
  UNION
  {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:accessURL ?URL .
    ?distribution dc:description "Germany" .
  }
}
 
 
Q6
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dc:issued ?date .
    ?dataset dc:modified ?date .
}
 
 
Q7
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dc:issued ?date .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:accessURL ?URL .
    FILTER (?date>"2014-12-31T23:59:59"^^xsd:dateTime)
}
 
 
Q8
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:accessURL ?URL .
    ?distribution dcat:mediaType "text/csv" .
    ?distribution dc:title ?filetitle .
    ?distribution dc:description ?description .
}
 
 
Q9
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dcat:distribution ?distr1 .
    ?distr1 dcat:accessURL ?URL1 .
    ?distr1 dcat:mediaType "text/csv" .
    ?distr1 dc:title ?titleFile1 .
    ?distr1 dc:description ?description1 .
    ?dataset dcat:distribution ?distr2 .
    ?distr2 dcat:accessURL ?URL2 .
    ?distr2 dcat:mediaType "text/tab-separated-values" .
    ?distr2 dc:title ?titleFile2 .
    ?distr2 dc:description ?description2 .
}
 
 
Q10
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
    ?dataset rdf:type dcat:Dataset .
    ?dataset dc:title ?title .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:accessURL ?URL .
    ?distribution dcat:mediaType ?mediaType .
    ?distribution dc:title ?filetitle .
    ?distribution dc:description ?description .
}
ORDER BY ?filetitle
LIMIT 100 OFFSET 100
 
 

5. Benchmark results

We report the results as of March 2017, in which BEAR-A and BEAR-B were used to test the performance of the following RDF archiving systems (the source code is available in our GitHub repository):

  • Jena TDB: We used Jena's TDB store (v3.2.0) to implement different archiving policies (see [1] for a deep discussion on archiving policies):
    • Jena-IC (Independent Copies): we index each version in an independent TDB instance.
    • Jena-CB (Change-Based): we create an index for the added and the deleted statements of each version, again using independent TDB stores.
    • Jena-TB (Timestamp-Based): we index all triples in one single TDB instance, where named graphs annotate the version(s) of each triple.

    We also implemented and tested hybrid approaches (a materialisation sketch follows this list):

    • Jena-HBTB/CB: we indexed all deltas using two named graphs per version (adds and deletes) in one single TDB instance.
    • Jena-HBIC/CB: the system manages a set of IC and CB stores for the same dataset. Our current evaluation uses three different archives, with a small (S), medium (M) and large (L) gap between ICs: Jena-HBS^IC/CB, Jena-HBM^IC/CB and Jena-HBL^IC/CB, in which an IC version is stored after 4, 8 and 16 CB versions respectively. For BEAR-B-hour we use gaps of 32, 64 and 128 versions, and for BEAR-B-instant gaps of 64, 512 and 2048 versions respectively.
  • HDT: We used HDT to implement the same strategies: HDT-IC, HDT-CB and HDT-HBIC/CB. The TB and HBTB/CB policies cannot be implemented, as current HDT implementations (we use the HDT C++ libraries at http://www.rdfhdt.org/) do not support quads, hence triples cannot be annotated with the version.
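The trade-off behind the hybrid IC/CB policies is that materialising a version only requires rewinding from the nearest snapshot. A sketch of that reconstruction; the storage layout and names are illustrative, not any system's actual API:

def materialise(i, gap, snapshots, deltas):
    # snapshots[k]: set of triples of version k*gap (the ICs)
    # deltas[v]: (added, deleted) sets leading from version v-1 to v (the CBs)
    base = i - (i % gap)
    triples = set(snapshots[base // gap])
    for v in range(base + 1, i + 1):
        added, deleted = deltas[v]
        triples -= deleted
        triples |= added
    return triples

A larger gap lowers storage (fewer ICs) at the cost of applying more deltas per query, which matches the space/time pattern in the tables below.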

Tests were performed on a computer with 2 x Intel Xeon E5-2650v2 @ 2.6 GHz (16 cores), 171 GB RAM and 4 HDDs in a RAID 5 configuration (2.7 TB net storage), running Ubuntu 14.04.5 LTS on a VM with a QEMU/KVM hypervisor. We report elapsed times for Jena in a warm scenario, given that it is disk-based, whereas HDT operates in memory.

We report the required on-disk space for each dataset and system, as well as the query performance in BEAR-A and BEAR-B. See [1] for a detailed discussion of the results.

RDF Storage Space Results

Dataset | Raw (gzip) | Diff (gzip) | Jena-IC | Jena-CB | Jena-TB | Jena-HBS^IC/CB | Jena-HBM^IC/CB | Jena-HBL^IC/CB | Jena-HBTB/CB | HDT-IC | HDT-CB | HDT-HBS^IC/CB | HDT-HBM^IC/CB | HDT-HBL^IC/CB
BEAR-A | 23 GB | 14 GB | 230 GB | 138 GB | 83 GB | 163 GB | 152 GB | 143 GB | 353 GB | 48 GB | 28 GB | 34 GB | 31 GB | 29 GB
BEAR-B-instant | 12 GB | 0.16 GB | 158 GB | 7.4 GB | - | 9.7 GB | 7.7 GB | 7.4 GB | 0.1 GB | 63 GB | 0.33 GB | 1.4 GB | 0.46 GB | 0.36 GB
BEAR-B-hour | 475 MB | 10 MB | 6238 MB | 479 MB | 3679 MB | 662 MB | 563 MB | 529 MB | 58 MB | 2229 MB | 35 MB | 103 MB | 69 MB | 52 MB
BEAR-B-day | 37 MB | 1 MB | 421 MB | 44 MB | 24 MB | 137 MB | 90 MB | 65 MB | 23 MB | 149 MB | 7 MB | 43 MB | 25 MB | 15 MB
BEAR-C | 243 MB | 205 MB | 2151 MB | 2271 MB | 2012 MB | 2356 MB | 2286 MB | 2310 MB | 3735 MB | 421 MB | 439 MB | 458 MB | 444 MB | 448 MB

Query performance in BEAR-A


[Figures: query times per version for simple lookups (S??, ?P? and ??O) and other triple patterns (SP?, ?PO, S?O and SPO), each split into low and high cardinality, for the IC/CB/TB approaches and the HDT and Jena hybrid approaches]

BEAR-A: Average query time (in ms) for ver(Q) queries in lookups (S??, ?P? and ??O)

Query | Jena-IC | Jena-CB | Jena-TB | Jena-HBS^IC/CB | Jena-HBM^IC/CB | Jena-HBL^IC/CB | Jena-HBTB/CB | HDT-IC | HDT-CB | HDT-HBS^IC/CB | HDT-HBM^IC/CB | HDT-HBL^IC/CB
(S??) low cardinality | 66 | 24 | 42732 | 24 | 31 | 32 | 7 | 1.54 | 1.91 | 2.30 | 2.47 | 2.64
(S??) high cardinality | 101 | 72 | 56693 | 76 | 75 | 89 | 44 | 4.98 | 7.98 | 10.94 | 13.59 | 18.32
(?P?) low cardinality | 437 | 77 | 49411 | 115 | 130 | 115 | 79 | 42.88 | 13.79 | 30.97 | 29.03 | 31.71
(?P?) high cardinality | 809 | 205 | 54246 | 383 | 411 | 302 | 210 | 116.96 | 46.62 | 101.05 | 88.41 | 101.55
(??O) low cardinality | 67 | 23 | 49424 | 23 | 23 | 45 | 6 | 1.44 | 2.36 | 2.96 | 2.85 | 2.91
(??O) high cardinality | 99 | 67 | 58114 | 80 | 74 | 74 | 54 | 7.12 | 11.73 | 14.28 | 16.40 | 20.18

BEAR-A: Average query time (in ms) for ver(Q) queries in the other triple patterns (SP?, ?PO, S?O and SPO)

Query | Jena-IC | Jena-CB | Jena-TB | Jena-HBS^IC/CB | Jena-HBM^IC/CB | Jena-HBL^IC/CB | Jena-HBTB/CB | HDT-IC | HDT-CB | HDT-HBS^IC/CB | HDT-HBM^IC/CB | HDT-HBL^IC/CB
(SP?) low cardinality | 53 | 15 | 55283 | 15 | 15 | 16 | 1 | 0.83 | 1.02 | 1.07 | 1.15 | 1.13
(SP?) high cardinality | 67 | 50 | 57720 | 45 | 50 | 50 | 17 | 1.75 | 3.88 | 4.48 | 4.69 | 5.59
(?PO) low cardinality | 57 | 21 | 56151 | 20 | 20 | 21 | 3 | 1.32 | 2.04 | 2.01 | 2.03 | 2.15
(?PO) high cardinality | 136 | 116 | 59831 | 113 | 107 | 121 | 92 | 11.58 | 20.91 | 24.82 | 29.74 | 38.39
(S?O) low cardinality | 55 | 16 | 45193 | 19 | 17 | 18 | 1 | 1.36 | 1.78 | 2.02 | 1.73 | 1.73
(SPO) | 54 | 17 | 50393 | 16 | 17 | 17 | 1 | 18.35 | 3.37 | 11.14 | 9.17 | 8.00

Query performance in BEAR-B


[Figures: query times per version on BEAR-B-day and BEAR-B-hour for the IC/CB/TB approaches and the HDT and Jena hybrid approaches]

BEAR-B: Average query time (in ms) for ver(Q) queries

Dataset | Jena-IC | Jena-CB | Jena-TB | Jena-HBS^IC/CB | Jena-HBM^IC/CB | Jena-HBL^IC/CB | Jena-HBTB/CB | HDT-IC | HDT-CB | HDT-HBS^IC/CB | HDT-HBM^IC/CB | HDT-HBL^IC/CB
BEAR-B-day | 83 | 19 | 1775 | 32 | 25 | 23 | 6 | 6.57 | 0.43 | 3.64 | 2.43 | 1.78
BEAR-B-hour | 1189 | 120 | 6473 | 147 | 138 | 132 | 24 | 111.61 | 2.49 | 18.60 | 17.26 | 20.45

Publications

[1] Fernández, J. D., Umbrich, J., Polleres, A., & Knuth, M. Evaluating Query and Storage Strategies for RDF Archives. Semantic Web Journal (under review).

[2] Fernández, J. D., Umbrich, J., Polleres, A., & Knuth, M. (2016). Evaluating Query and Storage Strategies for RDF Archives. In Proceedings of the 12th International Conference on Semantic Systems (SEMANTiCS 2016), pp. 41-48. ACM.

Support

Funded by the Austrian Science Fund (FWF): M1720-G11; the European Union's Horizon 2020 research and innovation programme under grant 731601 (SPECIAL); MINECO-AEI/FEDER-UE ETOME-RDFD3: TIN2015-69951-R; the Austrian Research Promotion Agency (FFG): grants 849982 (ADEQUATe) and 861213 (CitySpin); and the German Government, Federal Ministry of Education and Research, under project number 03WKCJ4D. Javier D. Fernández was funded by WU post-doc research contracts, and Axel Polleres was supported by the "Distinguished Visiting Austrian Chair" program as a visiting professor hosted at The Europe Center and the Center for Biomedical Research (BMIR) at Stanford University. Special thanks to Sebastian Neumaier for his support with the Open Data Portal Watch.