HDTQ: Managing RDF Datasets in Compressed Space

Javier D. Fernández1, 2, Miguel A. Martínez-Prieto3, Axel Polleres1, 2, and Julain Reindorf1


1. Vienna University of Economics and Business (Austria)
2. Complexity Science Hub Vienna (Austria)
3. Dept. of Computer Science, Universidad de Valladolid, Spain
firstname.lastname@wu.ac.at, migumar2@infor.uva.es, julian.reindorf@gmail.com


Go to Evaluation

This work has been submitted to ESWC 2018. In the following, we provide a brief overview of the proposal and further information regarding the empirical evaluation. Please see the “RDF/HDT” project website for a further details on HDT.

1. Introduction
2. Empirical evaluation
--- 2.1 Space requirements
--- 2.2 Indexing
--- 2.3 Performance for Quad Pattern Resolution


1. Introduction

HDT (Header-Dictionary-Triples) is a well-known compressed representation of RDF data that supports retrieval features without prior decompression. Yet, RDF datasets often contain additional graph information, such as the origin, version or validity time of a triple. Traditional HDT is not capable of handling this additional parameter(s).

This work introduces HDTQ (HDT Quads), an extension of HDT, which is able to represent quadruples (or quads) while still being highly compact and queryable. Two approaches of this extension, Annotated Triples and Annotated Graphs, are introduced and their performance is compared to the leading open-source RDF stores on the market. Results show that HDTQ achieves the best compression rates and is a competitive alternative to well-established systems.


2. Empirical evaluation

We evaluate the performance of HDTQ in terms of space and efficiency on quad pattern resolution. The HDTQ prototype (Get source code), built on top of the existing HDT-Java library, implements both HDT-AG and HDT-AT approaches using existing compressed bitmaps (called Roaring Bitmaps), which are optimal for sparse bitsequences.

Experiments are carried out on the following heterogeneous configuration of RDF datasets (download links are provided to get the N-Quads version of the dataset):

Dataset
Download
Subjects
Predicates
Objects
Graphs
Triples
Quads
BEAR-A Get N-Quads 74,908,887 41,209 64,215,355 58 378,476,570 2,071,287,964
BEAR-B day Get N-Quads 100 1,725 69,650 89 82,401 3,460,896
BEAR-B hour Get N-Quads 100 1,744 148,866 1,299 167,281 51,632,164
LUBM-500 G1 Get N-Quads 10,847,183 17 8,072,358 1 66,731,200 66,731,200
LUBM-500 G10 Get N-Quads 10,847,183 17 8,072,358 10 66,731,200 66,740,200
LUBM-500 G20 Get N-Quads 10,847,183 17 8,072,358 20 66,731,200 66,750,200
LUBM-500 G30 Get N-Quads 10,847,183 17 8,072,358 30 66,731,200 66,760,200
LUBM-500 G40 Get N-Quads 10,847,183 17 8,072,358 40 66,731,200 66,770,200
LUBM-500 G50 Get N-Quads 10,847,183 17 8,072,358 50 66,731,200 66,780,200
LUBM-500 G60 Get N-Quads 10,847,183 17 8,072,358 60 66,731,200 66,790,200
LUBM-500 G70 Get N-Quads 10,847,183 17 8,072,358 70 66,731,200 66,800,200
LUBM-500 G80 Get N-Quads 10,847,183 17 8,072,358 80 66,731,200 66,810,200
LUBM-500 G90 Get N-Quads 10,847,183 17 8,072,358 90 66,731,200 66,820,200
LUBM-500 G100 Get N-Quads 10,847,183 17 8,072,358 100 66,731,200 66,830,200
LUBM-500 G1000 Get N-Quads 10,847,183 17 8,072,358 1000 66,731,200 67,634,864
LUBM-500 G2000 Get N-Quads 10,847,183 17 8,072,358 2000 66,731,200 68,112,009
LUBM-500 G3000 Get N-Quads 10,847,183 17 8,072,358 3000 66,731,200 68,351,144
LUBM-500 G4000 Get N-Quads 10,847,183 17 8,072,358 4000 66,731,200 68,491,797
LUBM-500 G5000 Get N-Quads 10,847,183 17 8,072,358 5000 66,731,200 68,604,538
LUBM-500 G6000 Get N-Quads 10,847,183 17 8,072,358 6000 66,731,200 68,647,386
LUBM-500 G7000 Get N-Quads 10,847,183 17 8,072,358 7000 66,731,200 68,692,484
LUBM-500 G8000 Get N-Quads 10,847,183 17 8,072,358 8000 66,731,200 68,736,452
LUBM-500 G9000 Get N-Quads 10,847,183 17 8,072,358 9000 66,731,200 68,780,614
LUBM-500 G9998 Get N-Quads 10,847,183 17 8,072,358 9998 66,731,200 68,823,803
LDBC Get N-Quads 668,711 16 2,743,645 190,961 5,000,197 5,000,197
LIDDI Get N-Quads 392,344 23 981,928 392,340 1,952,822 2,051,959

Note that triples are mostly present in 1 single graph in LUBM500. Repetitions are mainly due to triples specifying the type of a university. The distribution of the repetitions (number of triples ocurring with X graph) is shown below:

2.1 Space requirements

This section provides the space results (already reflected in the paper) and links to download the corresponding HDTQ files.

Dataset
Size (GB)
gzip
HDT-AG
HDT-AT
Jena
Virtuoso
Virtuoso+
BEAR-A 396.9 5.8% 2.3% Download (.index) 2.8% Download(.index) 96.8% NA NA
BEAR-B day 0.6 4.8% 0.7% Download(.index) 0.8% Download(.index) 97.7% 13.7% 33.7%
BEAR-B hour 9.7 4.8% 0.3% Download(.index) 0.1% Download(.index) 96.4%4.3% 25.6%
LUBM500 G1 11.4 3.0% 6.6% Download(.index) 17.0% Download(.index) 118.8%17.2% 21.0%
LUBM500 G9998 11.6 3.0% 6.6% Download(.index) 16.7% Download(.index) 120.1%17.5% 27.5%
LDBC 0.9 9.7% 15.9% Download(.index) 25.1% Download(.index) 126.3%71.2% 80.8%
LIDDI 0.7 3.7% 11.8% Download(.index) 15.6% Download(.index) 78.1%49.9% 53.4%

(*) All LUBM500 in HDTQ (in AG and AT format) are available for download.

In addition, we provide the size of the structures in HDT-AG and HDT-AT:

Dataset
HDT-AG Size (MB)
HDT-AT (MB)
Total
HDT
HDT indexes
Graph Dictionary
Quad Information
Total
HDT
HDT indexes
Graph Dictionary
Quad Information
BEAR-A 9,4714,4102,444<1KB2,617 11,1794,4102,444<1KB4,325
BEAR-B day 4.33.2150KB<1KB1.14.93.2150KB<1KB1.7
BEAR-B hour 327300KB6KB25147300KB6.2KB7
LUBM500 G1 778392386<1KB15K1987392386<1KB1209
LUBM500 G9998 778.6 392.6 38650K4.41991.6392.6 38650K1213
LDBC 150121231.54.5236.5121231.591
LIDDI 81.5509.51210107.5509.51236

2.2 Indexing

Measures on RAM consumption for indexing will be provided soon...

2.3 Performance for Quad Pattern Resolution

In this section we provide additional information to test the performance of HDTQ and other systems (Jena and Virtuoso).

2.3.1 Get the queries

We select, for each dataset, 100 random queries for each combination of quad patterns (expect for the pattern ???? and those patterns such as ?P?? where the data distribution prevents from having 100 different queries).

Download all quad patterns.

2.3.2 Get the source code and prepare the setup

We provide a first alpha version of the HDTQ source code. While this build has been extensively tested, the current alpha state is still subject to bugs and optimizations. HDTQ content is licensed by Lesser General Public License.

We compare HDTQ against Apache Jena TDB 2.10 store (Download) and Virtuoso 7.1 Open Source (Download). For this latter, following Virtuoso instructions, we also consider a variant, named as Virtuoso+, which includes an additional index (GPOS) that may speed up quad patterns where the subject is not given. Please download and follow the Virtuoso instructions to create this additional index.

In order to alleviate the burden of executing all three systems, we have created a single jar file and a bash script to configure and run all experiments. The final lines also check that the results are the same for all systems. Please make sure transactions are disabled (by default) in all systems.

2.3.3 Raw Results

We include two JSON files that describe the results for the systems in cold and warm.

The JSON file contains the following structure:

{  
   "Dataset":{  
      "System":{  
         "Pattern":{  
            "averageTime":0,
            "Concrete measures":"[query1,query2,query100]",
            "minTime":0,
            "medianTime":0,
            "maxTime":0,
            "numResultsforEachQuery":"[query1,query2,query100]",
            "NumOfqueries":100,
            "repetitions":3
         }
      }
   }
}
(*) We performed 3 repetitions of the patterns except for minor exceptions of stable and time-consuming patterns were we used 2 repetitions.

2.3.4 Additional plots

In addition to the plots included in our ESWC paper, we describe here additional plots completing the experimentation. In particular, we provide the query times for LIDDI in HDT-AG: