Benchmark & Results

This section benchmarks qikkDB against competing databases. It covers the test queries, the hardware used, and the steps to reproduce the test dataset.

Test Queries

This benchmark was inspired by Mark Litwintschik's benchmark, in which four SQL queries of varying complexity were used for testing. The queries were defined as follows:

--Query #1:
SELECT cab_type, COUNT(*) FROM trips GROUP BY cab_type;
--Query #2:
SELECT passenger_count, AVG(total_amount) FROM trips GROUP BY passenger_count;
--Query #3:
SELECT passenger_count, YEAR(pickup_datetime) AS pickup_year,
COUNT(*) FROM trips GROUP BY passenger_count, pickup_year;
--Query #4:
SELECT passenger_count, YEAR(pickup_datetime) AS pickup_year,
CAST(trip_distance AS INT) AS distance,
COUNT(*) AS the_count
FROM trips GROUP BY passenger_count, pickup_year, distance
ORDER BY pickup_year, the_count DESC;

Execution Times Results (Single Server Instance)

We compared our database qikkDB with three leading databases aimed at analytical workloads, each of a different type:

  • a GPU database (referred to as "GiraffeDB"),

  • a columnar database (referred to as "CatDB"),

  • a relational database (referred to as "RaccoonDB").

The results in the table show execution times in milliseconds. For each of the four queries, the average execution time was computed from 200 runs, with the first (cold) run not counted towards the average. The results were obtained using the latest versions of the databases available on the 2nd of October 2019. The benchmarking dataset contained 1.2B data rows in 7 columns of various data types in a single table. We did not use indexing in qikkDB. The qikkDB results were measured on version 1.4.1.
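
For illustration, the timing procedure can be sketched as follows. Here 'run_query' is a hypothetical helper (not part of qikkDB or any tested database) that executes a query once and prints its elapsed time in whole milliseconds:

#!/bin/bash
# Methodology sketch: 200 timed runs; the first (cold) run is discarded
# and the remaining runs are averaged.
# 'run_query' is a hypothetical helper that runs one query and prints
# its elapsed time in whole milliseconds.
runs=()
for i in $(seq 1 200); do
  runs+=("$(run_query "SELECT cab_type, COUNT(*) FROM trips GROUP BY cab_type;")")
done
unset 'runs[0]'    # cold run: not counted towards the average
total=0
for t in "${runs[@]}"; do
  total=$((total + t))
done
echo "average: $((total / ${#runs[@]})) ms"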

Speed comparison of qikkDB against the competition, based on 4 SQL queries.

Query   | qikkDB @ p3.8xlarge | qikkDB @ g4dn.12xlarge | GiraffeDB @ p3.8xlarge | CatDB @ c5d.9xlarge | RaccoonDB @ c5d.9xlarge
--------|---------------------|------------------------|------------------------|---------------------|------------------------
#1      | 22                  | 37                     | 25                     | 435                 | 22
#2      | 37                  | 82                     | 235                    | 1061                | 964
#3      | 228                 | 925                    | 231                    | 1630                | 3491
#4      | 283                 | 1105                   | 417                    | 2174                | 3996
Average | 143                 | 537                    | 227                    | 1325                | 2118

Price per Query

Because the databases require different hardware, we also provide a comparison from a cost-effectiveness perspective, which gives a normalized view of performance.

We estimated the price per 10,000 query runs ($/10kQ) based on Amazon's on-demand EC2 price list (region: EU Frankfurt) for GPU-optimized and compute-optimized (CPU) instances. The prices were calculated as follows: we summed the execution times of ten thousand query runs and priced the total time according to Amazon's on-demand EC2 price list. The prices cover hardware running costs only and exclude any licence costs.
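
As a sanity check of the metric, the arithmetic for one cell can be reproduced as below. The hourly rate shown is inferred by us from the published numbers, not quoted from AWS, so treat it as an assumption:

#!/bin/bash
# Illustrative $/10kQ arithmetic for qikkDB, query #1, on p3.8xlarge.
# The hourly rate below is inferred from the table itself; it is an
# assumption, not an AWS price quote.
awk 'BEGIN {
  avg_ms = 22                           # average execution time (ms), from the table
  rate   = 15.30                        # implied on-demand $/h (assumed)
  hours  = avg_ms * 10000 / 3600000     # total time for 10,000 runs, in hours
  printf "%.3f $/10kQ\n", hours * rate  # prints 0.935, matching the table
}'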

All values are in $/10kQ (price per 10,000 query runs).

Query   | qikkDB @ p3.8xlarge | qikkDB @ g4dn.12xlarge | GiraffeDB @ p3.8xlarge | CatDB @ c5d.9xlarge | RaccoonDB @ c5d.9xlarge
--------|---------------------|------------------------|------------------------|---------------------|------------------------
#1      | 0.935               | 0.503                  | 1.062                  | 2.414               | 0.122
#2      | 1.572               | 1.114                  | 9.982                  | 5.889               | 5.350
#3      | 9.685               | 12.565                 | 9.812                  | 9.047               | 19.375
#4      | 12.021              | 15.010                 | 17.713                 | 12.066              | 22.178
Average | 6.053               | 7.298                  | 9.642                  | 7.354               | 11.756

Effectiveness comparison of qikkDB against the competition, based on 4 SQL queries.

Hardware & Software Used for Testing

We used Amazon Web Services to test qikkDB as well as our competitors' databases. We ran the same queries on the same dataset: the GPU-based databases on the same GPU-optimized instance, and the CPU-based databases on the same CPU-optimized instance. A single instance was used for testing.

GPU Optimized Instance

Instance            | p3.8xlarge                          | g4dn.12xlarge
--------------------|-------------------------------------|------------------------------------------
vCPU                | 32x Intel® Xeon® Scalable (Skylake) | 48x Intel® Xeon® Scalable (Cascade Lake)
GPU                 | 4x NVIDIA Tesla V100 16GB           | 4x NVIDIA Tesla T4 16GB
Memory (RAM)        | 244 GB                              | 192 GB
OS                  | Ubuntu 18.04 LTS                    | Ubuntu 18.04 LTS
NVIDIA CUDA Toolkit | 10.1.168                            | 10.1.243
NVIDIA Driver       | 430.26                              | 418.87

CPU Optimized Instance

Instance     | c5d.9xlarge
-------------|--------------------------------------------------------------------
vCPU         | 36x 3.0 GHz Intel® Xeon® Platinum 8000 (3.5 GHz Intel Turbo Boost)
Memory (RAM) | 72 GB
OS           | Ubuntu 18.04 LTS

Similar Hardware Prices

We were not able to find the exact prices of the above-mentioned GPU- and CPU-optimized instances, but prices for comparable servers according to Dell's PowerEdge R840 Rack Server configurator would be as follows (updated on the 8th of September 2019):

Similar GPU Optimized Instance:

  • GPU: 4x NVIDIA Tesla™ V100 16G Passive GPU

  • CPU: 2x Intel® Xeon® Platinum 8253 2.2G, 16C/32T, 10.4GT/s, 22M Cache, Turbo, HT (125W) DDR4-2933

  • Memory: 30x 8GB (total: 240GB) RDIMM, 2666MT/s, Single Rank

  • Total cost: $62,850.16

Similar CPU Optimized Instance:

  • CPU: 2x Intel® Xeon® Gold 6254 3.1G, 18C/36T, 10.4GT/s, 24.75M Cache, Turbo, HT (200W) DDR4-2933

  • Memory: 9x 8GB (total: 72GB) RDIMM, 2666MT/s, Single Rank

  • Total cost: $14,445.47

Script for Reproducing the Dataset

We used a subset of the TLC Trip Record Data containing 1.2B rows. You can follow these steps to download and preprocess the data yourself, or you can download the preprocessed dataset here:

  1. Create or download a file named 'urls_data.txt' containing the following URLs:

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-11.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-12.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-11.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-12.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-11.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-12.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-11.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-12.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-11.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-12.csv
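
Alternatively, the same list can be generated with a short loop instead of being typed by hand (a sketch; verify its output against the list above before use):

#!/bin/bash
# Generate urls_data.txt for yellow taxi trip files, 2010-01 through 2014-12.
for y in 2010 2011 2012 2013 2014; do
  for m in $(seq -w 1 12); do
    echo "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_${y}-${m}.csv"
  done
done > urls_data.txt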

  2. Create or download a bash script named 'download_data.sh' containing the following commands:

#!/bin/bash
readonly YELLOW_RESULT_PATH=aggregated_yellow_tripdata.csv
readonly GREEN_RESULT_PATH=aggregated_green_tripdata.csv
readonly PART_ONE=trips-part1.csv
readonly PART_TWO=trips-part2.csv
readonly PART_ONE_RESULT=preprocessed-trips-part1.csv
readonly PART_TWO_RESULT=preprocessed-trips-part2.csv
echo "The process has started."
echo "Downloading data. Please wait. Step [1/4]"
wget -i urls_data.txt
rm urls_data.txt
echo "Downloading data has finished. Removing bad rows. Step [2/4]"
# write the 7-column header to both result files
head -1 yellow_tripdata_2010-01.csv | cut -d, -f1,2,3,4,5,18,19 >> $YELLOW_RESULT_PATH
head -1 yellow_tripdata_2010-01.csv | cut -d, -f1,2,3,4,5,18,19 >> $GREEN_RESULT_PATH
# concatenate the yellow taxi files into the result files
for f in yellow_tripdata_2010*.csv
do
cut -d, -f1,2,3,4,5,18,19 $f | tail -n +2 -q >> $YELLOW_RESULT_PATH
rm $f
echo "File $f has been concated and removed."
done
# remove the second line, because it is broken
sed -i '2d' $YELLOW_RESULT_PATH
for f in yellow_tripdata_2011*.csv
do
cut -d, -f1,2,3,4,5,18,19 $f | tail -n +2 -q >> $YELLOW_RESULT_PATH
rm $f
echo "File $f has been concated and removed."
done
for f in yellow_tripdata_2012*.csv
do
cut -d, -f1,2,3,4,5,18,19 $f | tail -n +2 -q >> $YELLOW_RESULT_PATH
rm $f
echo "File $f has been concated and removed."
done
for f in yellow_tripdata_2013*.csv
do
cut -d, -f1,2,3,4,5,18,19 $f | tail -n +2 -q >> $YELLOW_RESULT_PATH
cut -d, -f1,2,3,4,5,18,19 $f | tail -n +2 -q >> $GREEN_RESULT_PATH
echo "File $f has been concated."
done
# remove the second line, because it is broken
sed -i '2d' $GREEN_RESULT_PATH
for f in yellow_tripdata_2014*.csv
do
cut -d, -f1,2,3,4,5,18,19 $f | tail -n +2 -q >> $YELLOW_RESULT_PATH
cut -d, -f1,2,3,4,5,18,19 $f | tail -n +2 -q >> $GREEN_RESULT_PATH
echo "File $f has been concated."
done
echo "Concating has finished. Adding column 'cab_type' with value '1' to data rows. Step [3/4]"
# remove Windows end lines
tr -d '\r' < $YELLOW_RESULT_PATH >> $PART_ONE
# add new column with data '1'
gawk -i inplace '{ printf("%s,1\n", $0); }' $PART_ONE
# rename column name in header from '1' to 'cab_type'
sed -i '1s/1/cab_type/' $PART_ONE
echo "Concating has finished. Adding column 'cab_type' with value '2' to data rows. Step [4/4]"
# remove Windows end lines
tr -d '\r' < $GREEN_RESULT_PATH >> $PART_TWO
# add new column with data '2'
gawk -i inplace '{ printf("%s,2\n", $0); }' $PART_TWO
# rename column name in header from '2' to 'cab_type'
sed -i '1s/2/cab_type/' $PART_TWO
rm yellow_tripdata_20*
rm $YELLOW_RESULT_PATH
rm $GREEN_RESULT_PATH
# remove bad rows
awk -F',' 'NF==7' $PART_ONE >> $PART_ONE_RESULT
awk -F',' 'NF==7' $PART_TWO >> $PART_TWO_RESULT
echo "Finished. Result is saved in 'preprocessed-trips-part1.csv' and 'preprocessed-trips-part2.csv'."
rm $PART_ONE
rm $PART_TWO
rm $0

  3. Run the script. We recommend running it in the background, because it will take a long time to finish.
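
One standard way to run it in the background and keep a log (the log file name is our choice):

# run the script in the background, detached from the terminal,
# and capture its progress messages in download_data.log
nohup bash download_data.sh > download_data.log 2>&1 &
# follow the progress as it runs
tail -f download_data.log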

List of Competitors

We consider our greatest competitors to be:

  • Blazing: a distributed and ACID-compliant GPU-accelerated SQL engine with data lake integration.

  • Brytlyt: a scalable GPU-accelerated RDBMS for fast analytic and streaming workloads, leveraging PostgreSQL.

  • ClickHouse: a column-oriented RDBMS powering Yandex.

  • OmniSci: a high-performance, in-memory, column-oriented RDBMS designed to run on GPUs.

  • Kinetica: a GPU-accelerated database for real-time analysis of large and streaming datasets.

  • Microsoft SQL Server: a well-known RDBMS developed by Microsoft.

  • SQream: a GPU-accelerated SQL data warehouse designed for analytics on very large datasets.

  • Vertica: a columnar RDBMS designed to handle modern analytic workloads, enabling fast query performance.

We encourage you to test our database and our competitors' products with your own queries, simulating your own use cases.