HDTQ: Managing RDF Datasets in Compressed Space
Javier D. Fernández1, 2, Miguel A. Martínez-Prieto3, Axel Polleres1, 2, and Julain Reindorf1
1. Vienna University of Economics and Business (Austria)
2. Complexity Science Hub Vienna (Austria)
3. Dept. of Computer Science, Universidad de Valladolid, Spain
firstname.lastname@wu.ac.at, migumar2@infor.uva.es, julian.reindorf@gmail.com
1. Vienna University of Economics and Business (Austria)
2. Complexity Science Hub Vienna (Austria)
3. Dept. of Computer Science, Universidad de Valladolid, Spain
firstname.lastname@wu.ac.at, migumar2@infor.uva.es, julian.reindorf@gmail.com
This work has been submitted to ESWC 2018. In the following, we provide a brief overview of the proposal and further information regarding the empirical evaluation. Please see the “RDF/HDT” project website for a further details on HDT.
1. Introduction 2. Empirical evaluation --- 2.1 Space requirements --- 2.2 Indexing --- 2.3 Performance for Quad Pattern Resolution |
1. Introduction
HDT (Header-Dictionary-Triples) is a well-known compressed representation of RDF data that supports retrieval features without prior decompression. Yet, RDF datasets often contain additional graph information, such as the origin, version or validity time of a triple. Traditional HDT is not capable of handling this additional parameter(s).
This work introduces HDTQ (HDT Quads), an extension of HDT, which is able to represent quadruples (or quads) while still being highly compact and queryable. Two approaches of this extension, Annotated Triples and Annotated Graphs, are introduced and their performance is compared to the leading open-source RDF stores on the market. Results show that HDTQ achieves the best compression rates and is a competitive alternative to well-established systems.
2. Empirical evaluation
We evaluate the performance of HDTQ in terms of space and efficiency on quad pattern resolution. The HDTQ prototype (Get source code), built on top of the existing HDT-Java library, implements both HDT-AG and HDT-AT approaches using existing compressed bitmaps (called Roaring Bitmaps), which are optimal for sparse bitsequences.
Experiments are carried out on the following heterogeneous configuration of RDF datasets (download links are provided to get the N-Quads version of the dataset):
Dataset |
Download |
Subjects |
Predicates |
Objects |
Graphs |
Triples |
Quads |
---|---|---|---|---|---|---|---|
BEAR-A | Get N-Quads | 74,908,887 | 41,209 | 64,215,355 | 58 | 378,476,570 | 2,071,287,964 |
BEAR-B day | Get N-Quads | 100 | 1,725 | 69,650 | 89 | 82,401 | 3,460,896 |
BEAR-B hour | Get N-Quads | 100 | 1,744 | 148,866 | 1,299 | 167,281 | 51,632,164 |
LUBM-500 G1 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 1 | 66,731,200 | 66,731,200 |
LUBM-500 G10 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 10 | 66,731,200 | 66,740,200 |
LUBM-500 G20 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 20 | 66,731,200 | 66,750,200 |
LUBM-500 G30 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 30 | 66,731,200 | 66,760,200 |
LUBM-500 G40 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 40 | 66,731,200 | 66,770,200 |
LUBM-500 G50 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 50 | 66,731,200 | 66,780,200 |
LUBM-500 G60 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 60 | 66,731,200 | 66,790,200 |
LUBM-500 G70 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 70 | 66,731,200 | 66,800,200 |
LUBM-500 G80 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 80 | 66,731,200 | 66,810,200 |
LUBM-500 G90 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 90 | 66,731,200 | 66,820,200 |
LUBM-500 G100 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 100 | 66,731,200 | 66,830,200 |
LUBM-500 G1000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 1000 | 66,731,200 | 67,634,864 |
LUBM-500 G2000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 2000 | 66,731,200 | 68,112,009 |
LUBM-500 G3000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 3000 | 66,731,200 | 68,351,144 |
LUBM-500 G4000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 4000 | 66,731,200 | 68,491,797 |
LUBM-500 G5000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 5000 | 66,731,200 | 68,604,538 |
LUBM-500 G6000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 6000 | 66,731,200 | 68,647,386 |
LUBM-500 G7000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 7000 | 66,731,200 | 68,692,484 |
LUBM-500 G8000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 8000 | 66,731,200 | 68,736,452 |
LUBM-500 G9000 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 9000 | 66,731,200 | 68,780,614 |
LUBM-500 G9998 | Get N-Quads | 10,847,183 | 17 | 8,072,358 | 9998 | 66,731,200 | 68,823,803 |
LDBC | Get N-Quads | 668,711 | 16 | 2,743,645 | 190,961 | 5,000,197 | 5,000,197 |
LIDDI | Get N-Quads | 392,344 | 23 | 981,928 | 392,340 | 1,952,822 | 2,051,959 |
Note that triples are mostly present in 1 single graph in LUBM500. Repetitions are mainly due to triples specifying the type of a university. The distribution of the repetitions (number of triples ocurring with X graph) is shown below:

2.1 Space requirements
This section provides the space results (already reflected in the paper) and links to download the corresponding HDTQ files.
Dataset |
Size (GB) |
gzip |
HDT-AG |
HDT-AT |
Jena |
Virtuoso |
Virtuoso+ |
---|---|---|---|---|---|---|---|
BEAR-A | 396.9 | 5.8% | 2.3% Download (.index) | 2.8% Download(.index) | 96.8% | NA | NA |
BEAR-B day | 0.6 | 4.8% | 0.7% Download(.index) | 0.8% Download(.index) | 97.7% | 13.7% | 33.7% |
BEAR-B hour | 9.7 | 4.8% | 0.3% Download(.index) | 0.1% Download(.index) | 96.4% | 4.3% | 25.6% |
LUBM500 G1 | 11.4 | 3.0% | 6.6% Download(.index) | 17.0% Download(.index) | 118.8% | 17.2% | 21.0% |
LUBM500 G9998 | 11.6 | 3.0% | 6.6% Download(.index) | 16.7% Download(.index) | 120.1% | 17.5% | 27.5% |
LDBC | 0.9 | 9.7% | 15.9% Download(.index) | 25.1% Download(.index) | 126.3% | 71.2% | 80.8% |
LIDDI | 0.7 | 3.7% | 11.8% Download(.index) | 15.6% Download(.index) | 78.1% | 49.9% | 53.4% |
(*) All LUBM500 in HDTQ (in AG and AT format) are available for download.
In addition, we provide the size of the structures in HDT-AG and HDT-AT:
Dataset |
HDT-AG Size (MB) |
HDT-AT (MB) |
||||||||
---|---|---|---|---|---|---|---|---|---|---|
Total |
HDT |
HDT indexes |
Graph Dictionary |
Quad Information |
Total |
HDT |
HDT indexes |
Graph Dictionary |
Quad Information |
|
BEAR-A | 9,471 | 4,410 | 2,444 | <1KB | 2,617 | 11,179 | 4,410 | 2,444 | <1KB | 4,325 |
BEAR-B day | 4.3 | 3.2 | 150KB | <1KB | 1.1 | 4.9 | 3.2 | 150KB | <1KB | 1.7 |
BEAR-B hour | 32 | 7 | 300KB | 6KB | 25 | 14 | 7 | 300KB | 6.2KB | 7 |
LUBM500 G1 | 778 | 392 | 386 | <1KB | 15K | 1987 | 392 | 386 | <1KB | 1209 |
LUBM500 G9998 | 778.6 | 392.6 | 386 | 50K | 4.4 | 1991.6 | 392.6 | 386 | 50K | 1213 |
LDBC | 150 | 121 | 23 | 1.5 | 4.5 | 236.5 | 121 | 23 | 1.5 | 91 |
LIDDI | 81.5 | 50 | 9.5 | 12 | 10 | 107.5 | 50 | 9.5 | 12 | 36 |
2.2 Indexing
Measures on RAM consumption for indexing will be provided soon...2.3 Performance for Quad Pattern Resolution
In this section we provide additional information to test the performance of HDTQ and other systems (Jena and Virtuoso).
2.3.1 Get the queries
We select, for each dataset, 100 random queries for each combination of quad patterns (expect for the pattern ???? and those patterns such as ?P?? where the data distribution prevents from having 100 different queries).
2.3.2 Get the source code and prepare the setup
We provide a first alpha version of the HDTQ source code. While this build has been extensively tested, the current alpha state is still subject to bugs and optimizations. HDTQ content is licensed by Lesser General Public License.
We compare HDTQ against Apache Jena TDB 2.10 store (Download) and Virtuoso 7.1 Open Source (Download). For this latter, following Virtuoso instructions, we also consider a variant, named as Virtuoso+, which includes an additional index (GPOS) that may speed up quad patterns where the subject is not given. Please download and follow the Virtuoso instructions to create this additional index.
In order to alleviate the burden of executing all three systems, we have created a single jar file and a bash script to configure and run all experiments. The final lines also check that the results are the same for all systems. Please make sure transactions are disabled (by default) in all systems.
2.3.3 Raw Results
We include two JSON files that describe the results for the systems in cold and warm.
The JSON file contains the following structure:
{
"Dataset":{
"System":{
"Pattern":{
"averageTime":0,
"Concrete measures":"[query1,query2,query100]",
"minTime":0,
"medianTime":0,
"maxTime":0,
"numResultsforEachQuery":"[query1,query2,query100]",
"NumOfqueries":100,
"repetitions":3
}
}
}
}
(*) We performed 3 repetitions of the patterns except for minor exceptions of stable and time-consuming patterns were we used 2 repetitions.
2.3.4 Additional plots
In addition to the plots included in our ESWC paper, we describe here additional plots completing the experimentation. In particular, we provide the query times for LIDDI in HDT-AG:
