HDTQ: Managing RDF Datasets in Compressed Space

Javier D. Fernández¹^,², Miguel A. Martínez-Prieto³, Axel Polleres¹^,², and Julain Reindorf¹^{1. Vienna University of Economics and Business (Austria)

2. Complexity Science Hub Vienna (Austria)

3. Dept. of Computer Science, Universidad de Valladolid, Spain

firstname.lastname@wu.ac.at, migumar2@infor.uva.es, julian.reindorf@gmail.com}

This work has been submitted to ESWC 2018. In the following, we provide a brief overview of the proposal and further information regarding the empirical evaluation. Please see the “RDF/HDT” project website for a further details on HDT.

1. Introduction
2. Empirical evaluation
--- 2.1 Space requirements
--- 2.2 Indexing
--- 2.3 Performance for Quad Pattern Resolution

1. Introduction

HDT (Header-Dictionary-Triples) is a well-known compressed representation of RDF data that supports retrieval features without prior decompression. Yet, RDF datasets often contain additional graph information, such as the origin, version or validity time of a triple. Traditional HDT is not capable of handling this additional parameter(s).

This work introduces HDTQ (HDT Quads), an extension of HDT, which is able to represent quadruples (or quads) while still being highly compact and queryable. Two approaches of this extension, Annotated Triples and Annotated Graphs, are introduced and their performance is compared to the leading open-source RDF stores on the market. Results show that HDTQ achieves the best compression rates and is a competitive alternative to well-established systems.

2. Empirical evaluation

We evaluate the performance of HDTQ in terms of space and efficiency on quad pattern resolution. The HDTQ prototype (Get source code), built on top of the existing HDT-Java library, implements both HDT-AG and HDT-AT approaches using existing compressed bitmaps (called Roaring Bitmaps), which are optimal for sparse bitsequences.

Experiments are carried out on the following heterogeneous configuration of RDF datasets (download links are provided to get the N-Quads version of the dataset):

Dataset	Download	Subjects	Predicates	Objects	Graphs	Triples	Quads
BEAR-A	Get N-Quads	74,908,887	41,209	64,215,355	58	378,476,570	2,071,287,964
BEAR-B day	Get N-Quads	100	1,725	69,650	89	82,401	3,460,896
BEAR-B hour	Get N-Quads	100	1,744	148,866	1,299	167,281	51,632,164
LUBM-500 G1	Get N-Quads	10,847,183	17	8,072,358	1	66,731,200	66,731,200
LUBM-500 G10	Get N-Quads	10,847,183	17	8,072,358	10	66,731,200	66,740,200
LUBM-500 G20	Get N-Quads	10,847,183	17	8,072,358	20	66,731,200	66,750,200
LUBM-500 G30	Get N-Quads	10,847,183	17	8,072,358	30	66,731,200	66,760,200
LUBM-500 G40	Get N-Quads	10,847,183	17	8,072,358	40	66,731,200	66,770,200
LUBM-500 G50	Get N-Quads	10,847,183	17	8,072,358	50	66,731,200	66,780,200
LUBM-500 G60	Get N-Quads	10,847,183	17	8,072,358	60	66,731,200	66,790,200
LUBM-500 G70	Get N-Quads	10,847,183	17	8,072,358	70	66,731,200	66,800,200
LUBM-500 G80	Get N-Quads	10,847,183	17	8,072,358	80	66,731,200	66,810,200
LUBM-500 G90	Get N-Quads	10,847,183	17	8,072,358	90	66,731,200	66,820,200
LUBM-500 G100	Get N-Quads	10,847,183	17	8,072,358	100	66,731,200	66,830,200
LUBM-500 G1000	Get N-Quads	10,847,183	17	8,072,358	1000	66,731,200	67,634,864
LUBM-500 G2000	Get N-Quads	10,847,183	17	8,072,358	2000	66,731,200	68,112,009
LUBM-500 G3000	Get N-Quads	10,847,183	17	8,072,358	3000	66,731,200	68,351,144
LUBM-500 G4000	Get N-Quads	10,847,183	17	8,072,358	4000	66,731,200	68,491,797
LUBM-500 G5000	Get N-Quads	10,847,183	17	8,072,358	5000	66,731,200	68,604,538
LUBM-500 G6000	Get N-Quads	10,847,183	17	8,072,358	6000	66,731,200	68,647,386
LUBM-500 G7000	Get N-Quads	10,847,183	17	8,072,358	7000	66,731,200	68,692,484
LUBM-500 G8000	Get N-Quads	10,847,183	17	8,072,358	8000	66,731,200	68,736,452
LUBM-500 G9000	Get N-Quads	10,847,183	17	8,072,358	9000	66,731,200	68,780,614
LUBM-500 G9998	Get N-Quads	10,847,183	17	8,072,358	9998	66,731,200	68,823,803
LDBC	Get N-Quads	668,711	16	2,743,645	190,961	5,000,197	5,000,197
LIDDI	Get N-Quads	392,344	23	981,928	392,340	1,952,822	2,051,959

Note that triples are mostly present in 1 single graph in LUBM500. Repetitions are mainly due to triples specifying the type of a university. The distribution of the repetitions (number of triples ocurring with X graph) is shown below:

2.1 Space requirements

This section provides the space results (already reflected in the paper) and links to download the corresponding HDTQ files.

Dataset	Size (GB)	gzip	HDT-AG	HDT-AT	Jena	Virtuoso	Virtuoso+
BEAR-A	396.9	5.8%	2.3% Download (.index)	2.8% Download(.index)	96.8%	NA	NA
BEAR-B day	0.6	4.8%	0.7% Download(.index)	0.8% Download(.index)	97.7%	13.7%	33.7%
BEAR-B hour	9.7	4.8%	0.3% Download(.index)	0.1% Download(.index)	96.4%	4.3%	25.6%
LUBM500 G1	11.4	3.0%	6.6% Download(.index)	17.0% Download(.index)	118.8%	17.2%	21.0%
LUBM500 G9998	11.6	3.0%	6.6% Download(.index)	16.7% Download(.index)	120.1%	17.5%	27.5%
LDBC	0.9	9.7%	15.9% Download(.index)	25.1% Download(.index)	126.3%	71.2%	80.8%
LIDDI	0.7	3.7%	11.8% Download(.index)	15.6% Download(.index)	78.1%	49.9%	53.4%

(*) All LUBM500 in HDTQ (in AG and AT format) are available for download.

In addition, we provide the size of the structures in HDT-AG and HDT-AT:

Dataset	HDT-AG Size (MB)					HDT-AT (MB)
Dataset	Total	HDT	HDT indexes	Graph Dictionary	Quad Information	Total	HDT	HDT indexes	Graph Dictionary	Quad Information
BEAR-A	9,471	4,410	2,444	<1KB	2,617	11,179	4,410	2,444	<1KB	4,325
BEAR-B day	4.3	3.2	150KB	<1KB	1.1	4.9	3.2	150KB	<1KB	1.7
BEAR-B hour	32	7	300KB	6KB	25	14	7	300KB	6.2KB	7
LUBM500 G1	778	392	386	<1KB	15K	1987	392	386	<1KB	1209
LUBM500 G9998	778.6	392.6	386	50K	4.4	1991.6	392.6	386	50K	1213
LDBC	150	121	23	1.5	4.5	236.5	121	23	1.5	91
LIDDI	81.5	50	9.5	12	10	107.5	50	9.5	12	36

2.2 Indexing

Measures on RAM consumption for indexing will be provided soon...

2.3 Performance for Quad Pattern Resolution

In this section we provide additional information to test the performance of HDTQ and other systems (Jena and Virtuoso).

2.3.1 Get the queries

We select, for each dataset, 100 random queries for each combination of quad patterns (expect for the pattern ???? and those patterns such as ?P?? where the data distribution prevents from having 100 different queries).

Download all quad patterns.

2.3.2 Get the source code and prepare the setup

We provide a first alpha version of the HDTQ source code. While this build has been extensively tested, the current alpha state is still subject to bugs and optimizations. HDTQ content is licensed by Lesser General Public License.

We compare HDTQ against Apache Jena TDB 2.10 store (Download) and Virtuoso 7.1 Open Source (Download). For this latter, following Virtuoso instructions, we also consider a variant, named as Virtuoso+, which includes an additional index (GPOS) that may speed up quad patterns where the subject is not given. Please download and follow the Virtuoso instructions to create this additional index.

In order to alleviate the burden of executing all three systems, we have created a single jar file and a bash script to configure and run all experiments. The final lines also check that the results are the same for all systems. Please make sure transactions are disabled (by default) in all systems.

2.3.3 Raw Results

We include two JSON files that describe the results for the systems in cold and warm.

The JSON file contains the following structure:

{  
   "Dataset":{  
      "System":{  
         "Pattern":{  
            "averageTime":0,
            "Concrete measures":"[query1,query2,query100]",
            "minTime":0,
            "medianTime":0,
            "maxTime":0,
            "numResultsforEachQuery":"[query1,query2,query100]",
            "NumOfqueries":100,
            "repetitions":3
         }
      }
   }
}

(*) We performed 3 repetitions of the patterns except for minor exceptions of stable and time-consuming patterns were we used 2 repetitions.

2.3.4 Additional plots

In addition to the plots included in our ESWC paper, we describe here additional plots completing the experimentation. In particular, we provide the query times for LIDDI in HDT-AG: