Introduction

“It is not science to build fast software on over-scaled hardware, but to build fast software on available resources.” (Peter Čapkovič, CTO of Instarea)

What is qikkDB?

QikkDB is an ultra-fast database that uses graphics cards (GPUs) to accelerate computations over big data. At Instarea we work with big telco data, which challenges us every day to find optimal approaches to manipulating massive datasets. This project started as a response to our need for fast querying over long tables filled with event records.

The main focus of qikkDB is querying a single huge flat table with real-time response. This approach finds its place in use cases where speed is crucial, e.g. fraud detection in financial data, web log analysis and real-time decisions, or real-time population analytics on telco data.

Already suitable for

  • Filtering and aggregations over a single huge flat table

  • Complex polygon operations (contains, intersect, union)

  • Numeric and datetime data

  • Incremental data that grow over time

Not yet suitable for

  • Joining normalized data in multiple tables

  • Complex string manipulation

  • Modification of historical data

  • Advanced scripting (procedures)

Engine for Advanced Visualization

QikkDB is an essential part of the visualization tool TellStory, which enables creating visual stories over big data with real-time responses. To learn more about TellStory, visit this page.

Basic Concepts of Superb Performance

Along with the enormous increase in data volume, IT companies seek new possibilities for effective data processing. In particular, queries with real-time or near real-time responses over big data are extremely expensive, or hardly achievable at all, with standard database systems and sometimes even with specialized NoSQL solutions. QikkDB takes advantage of new hardware: it integrates modern graphics cards that enable fast and efficient processing of vast amounts of data, so even billions of records can be queried in milliseconds.

Utilization of Modern Hardware

Since 2007, when NVIDIA launched the CUDA Toolkit, using graphics processing units (GPUs) for high-performance and scientific computing has become increasingly popular. This moved the focus of graphics card vendors from visualization towards more general computing. General-purpose graphics processing units (GPGPUs) enable highly parallel data processing thanks to hundreds or even thousands of cores working in a single GPU. Although these cores handle simpler logic than those of CPUs, their sheer number in a single unit creates a powerful army that can process large data vectors much faster than any CPU.

Figure 1. GPUs have thousands of arithmetic logic units (ALUs) in one piece of hardware.
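
To make this concrete, here is a minimal CUDA example (our own illustration, not qikkDB code): a million-element vector is processed by launching one lightweight thread per element, which is exactly the shape of work a columnar table scan produces.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each of the (n + 255) / 256 blocks of 256 threads handles its own slice:
    // one element per thread, all running in parallel on the GPU's ALUs.
    __global__ void multiplyByTwo(const int* in, int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            out[i] = in[i] * 2;
    }

    int main()
    {
        const int n = 1 << 20;  // a one-million-element column
        int *in = nullptr, *out = nullptr;
        cudaMallocManaged(&in, n * sizeof(int));
        cudaMallocManaged(&out, n * sizeof(int));
        for (int i = 0; i < n; ++i)
            in[i] = i;

        multiplyByTwo<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();

        std::printf("out[42] = %d\n", out[42]);  // prints 84
        cudaFree(in);
        cudaFree(out);
        return 0;
    }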

Rebirth of Columnar Data Storage

Utilizing GPU computational power requires a different approach to storing data. The database architecture most suitable for parallel processing is columnar storage. In contrast to conventional relational databases, which store data in a row-based format, columnar databases store data in separate columns. This concept is decades old, but it did not find its place in the heavily used relational databases until the era of big data began. Columnar databases have become popular mostly for analytical workloads on big datasets.

In the context of parallel processing, GPUs love long vectors of the same data type, with all values next to each other. In addition, data transfers are significantly reduced because only the relevant columns are loaded and used for processing. Column-oriented databases are also more suitable for horizontal scaling, enabling low-cost hardware to work with terabytes of data.
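
The difference between the two layouts can be sketched in a few lines of C++ (the types are illustrative only): scanning one attribute in the columnar form touches a single contiguous array, which is precisely the long uniform vector a GPU wants.

    #include <cstdint>
    #include <vector>

    // Row-based: one struct per record, attributes interleaved in memory.
    struct EventRow {
        int64_t timestamp;
        int32_t userId;
        double  amount;
    };
    using RowStore = std::vector<EventRow>;

    // Columnar: one contiguous vector per attribute.
    struct ColumnStore {
        std::vector<int64_t> timestamps;
        std::vector<int32_t> userIds;
        std::vector<double>  amounts;
    };

    // Summing `amounts` reads a single dense array of one type; a row store
    // would drag timestamps and user IDs through the caches and buses too.
    double sumAmounts(const ColumnStore& store)
    {
        double sum = 0.0;
        for (double amount : store.amounts)
            sum += amount;
        return sum;
    }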

Architectural Insight

A query is processed by the parser, and the listener then generates individual processing instructions. The dispatcher executes these instructions and is in fact the control unit of the entire database engine. It remembers execution state (e.g. whether a filter mask is used) and incrementally processes the entire query on one block of data. The blocks are processed in parallel so that each graphics card works with just one block of data at a time; one dispatcher is assigned to each graphics card. After all blocks are processed, the results are combined and returned to the client.
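
The following C++ sketch shows the shape of this design (the class and function names are illustrative, not the actual qikkDB sources): one dispatcher per card, each walking its share of the blocks and carrying the execution state along.

    #include <thread>
    #include <vector>

    struct Instruction { /* one processing step emitted by the listener */ };

    struct Dispatcher {
        int gpuId;                     // the graphics card this dispatcher owns
        std::vector<char> filterMask;  // execution state remembered between steps

        void processBlock(const std::vector<Instruction>& plan, int blockIndex)
        {
            // cudaSetDevice(gpuId);   // pin all work to this dispatcher's GPU
            for (const Instruction& step : plan)
            {
                (void)step; (void)blockIndex;  // execute step on this block
            }
        }
    };

    void executeQuery(const std::vector<Instruction>& plan, int blockCount, int gpuCount)
    {
        std::vector<std::thread> workers;
        for (int gpu = 0; gpu < gpuCount; ++gpu)
        {
            workers.emplace_back([&plan, blockCount, gpuCount, gpu] {
                Dispatcher dispatcher{gpu, {}};
                // Each card works on one block at a time, in parallel with the others.
                for (int block = gpu; block < blockCount; block += gpuCount)
                    dispatcher.processBlock(plan, block);
            });
        }
        for (std::thread& worker : workers)
            worker.join();
        // ...combine the per-GPU partial results and return them to the client.
    }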

How We Gained Our Speed

Our main advantage is that we use the latest technology currently available: the newest features of C++17 and the NVIDIA CUDA 10 Toolkit. Thanks to these technologies, some mathematical operations are optimized and the resulting query time is shorter. In addition, we have saved time by using C++17's compile-time if statements (if constexpr) together with function templates: some conditions are evaluated during compilation, which results in even shorter query execution time. This kind of programming is difficult, but the result is truly rewarding.
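
A minimal sketch of the compile-time branching technique (our own illustration, not qikkDB source code): the condition in if constexpr is evaluated by the compiler, so each template instantiation keeps only its own branch and no runtime check remains in the hot query path.

    #include <cmath>
    #include <type_traits>

    // The compiler keeps only the branch matching T; the others are discarded
    // at instantiation time, so the generated code contains no runtime test.
    template <typename T>
    constexpr T absoluteValue(T value)
    {
        if constexpr (std::is_unsigned_v<T>)
            return value;                          // unsigned: nothing to do
        else if constexpr (std::is_floating_point_v<T>)
            return std::fabs(value);               // floating-point path
        else
            return value < 0 ? -value : value;     // signed integer path
    }

    static_assert(absoluteValue(-3) == 3);         // evaluated during compilation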

In order to achieve the highest processing speed, we have implemented the latest algorithms described in scientific papers of recent years, the oldest dating from 2015 (mostly used for the GROUP BY and JOIN clauses). Some algorithms (such as the key aggregation within GROUP BY) have been extended with accelerating data structures that bypass the slowdown caused by atomic instructions.
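
As a simplified illustration of where atomic instructions hurt (the real hash-based GROUP BY kernels are considerably more involved), consider a naive CUDA aggregation in which every thread folds one row into a shared result table:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Every thread adds one row's value into the slot of its group key.
    // Threads that collide on the same key serialize on the atomic.
    __global__ void groupBySum(const int* keys, const float* values,
                               float* sums, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&sums[keys[i]], values[i]);
    }

    int main()
    {
        const int n = 8, groups = 2;
        int   hostKeys[n]   = {0, 1, 0, 1, 0, 1, 0, 1};
        float hostValues[n] = {1, 2, 3, 4, 5, 6, 7, 8};

        int* keys; float *values, *sums;
        cudaMalloc(&keys, n * sizeof(int));
        cudaMalloc(&values, n * sizeof(float));
        cudaMalloc(&sums, groups * sizeof(float));
        cudaMemcpy(keys, hostKeys, sizeof(hostKeys), cudaMemcpyHostToDevice);
        cudaMemcpy(values, hostValues, sizeof(hostValues), cudaMemcpyHostToDevice);
        cudaMemset(sums, 0, groups * sizeof(float));

        groupBySum<<<1, n>>>(keys, values, sums, n);

        float result[groups];
        cudaMemcpy(result, sums, sizeof(result), cudaMemcpyDeviceToHost);
        std::printf("group 0: %.0f, group 1: %.0f\n", result[0], result[1]);  // 16 and 20
        cudaFree(keys); cudaFree(values); cudaFree(sums);
        return 0;
    }

All threads that hit the same key serialize on the atomic, which is exactly the slowdown the accelerating data structures are designed to bypass.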

All algorithms implemented in our database engine must take parallelism into account, and therefore data processed in parallel must not depend on each other. However, many computations do have dependencies and must be pre-prepared, or divided into independent parts and reassembled after processing. Fortunately, this overhead is not so large, so parallel processing remains faster than sequential processing on the CPU. The overhead is mainly due to reconstructing the query result (compacting the data after filtering with a WHERE condition: we compute a prefix sum of the filter mask and then select the rows that match the condition in parallel, as sketched below).
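
Here is a sketch of that compaction step using the Thrust library, which ships with the CUDA Toolkit (the actual qikkDB kernels may differ): an exclusive prefix sum over the 0/1 filter mask gives every matching row its output position, and a parallel scatter then writes the matches next to each other.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <thrust/scatter.h>
    #include <iostream>
    #include <vector>

    int main()
    {
        std::vector<int> hostData = {5, 12, 7, 30, 2, 18};
        std::vector<int> hostMask = {0, 1,  0, 1,  0, 1};  // e.g. WHERE value > 10

        thrust::device_vector<int> data(hostData);
        thrust::device_vector<int> mask(hostMask);

        // Exclusive prefix sum of the mask: positions = {0, 0, 1, 1, 2, 2},
        // i.e. the output slot of every row that passed the filter.
        thrust::device_vector<int> positions(mask.size());
        thrust::exclusive_scan(mask.begin(), mask.end(), positions.begin());

        int matches = static_cast<int>(positions.back()) + static_cast<int>(mask.back());
        thrust::device_vector<int> compacted(matches);

        // Scatter every row whose mask bit is set to its computed slot, in parallel.
        thrust::scatter_if(data.begin(), data.end(),
                           positions.begin(), mask.begin(), compacted.begin());

        for (int i = 0; i < matches; ++i)
            std::cout << compacted[i] << " ";  // prints: 12 30 18
        std::cout << "\n";
        return 0;
    }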

We try to minimize copying data from RAM to GPU memory (VRAM). Intermediate results remain in VRAM and we only copy the final output back (which is usually very small thanks to GROUP BY or LIMIT clauses). We also use so-called memory pinning, which ensures that the data stored in RAM is not pageable. As a result, the subsequent copy of data to the GPU takes approximately half the time (this depends on the graphics card; the differences between cards are huge). We have also implemented our own graphics memory allocator, which is approximately 3x faster than the CUDA allocator, and our own cache, which makes it unnecessary to copy the input data again when you run a similar query.
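
For illustration, memory pinning uses the standard CUDA runtime calls shown below (the buffer size and names are made up for the example): cudaMallocHost allocates page-locked host memory that the GPU's DMA engine can read directly.

    #include <cuda_runtime.h>
    #include <cstring>

    int main()
    {
        const size_t bytes = 256 * 1024 * 1024;  // e.g. a 256 MB column buffer

        float* hostColumn = nullptr;
        cudaMallocHost(&hostColumn, bytes);      // pinned (non-pageable) RAM
        std::memset(hostColumn, 0, bytes);

        float* deviceColumn = nullptr;
        cudaMalloc(&deviceColumn, bytes);

        // DMA can read pinned memory directly; the copy is roughly twice as
        // fast as from pageable memory and can overlap kernels via streams.
        cudaMemcpyAsync(deviceColumn, hostColumn, bytes, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        cudaFree(deviceColumn);
        cudaFreeHost(hostColumn);
        return 0;
    }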

Data Flow Within System

Datasets are compressed and persisted on disk storage, but also cached in both CPU and GPU memory to minimize latency. GPU processing needs the data to be present in the card's local memory (GPU VRAM), which leads to the main limitation of GPU computing: moving data between CPU and GPU memory.

QikkDB reduces this transfer time using three strategies. First, when a query arrives, qikkDB analyzes which columns are relevant, and only those are copied to GPU VRAM. Second, thanks to data compression, a much lower volume needs to be transferred. Finally, data are kept in GPU VRAM as long as possible to avoid duplicate transfers.

Once the data are transferred, the GPU decompresses them and runs functions on top of them, including various filters, aggregations, etc. At this point the GPU can finally show what a big beast it is 😊.

Figure 2. Data flow from disk storage through main memory (RAM) to GPU VRAM.

GPU Caching

In analytics workloads especially, it is very common to use one dataset more than once within a short period of time. Once data are transferred to GPU VRAM, they should remain there for repeated use; this is caching on the GPU. Of course, GPU VRAM does not have unlimited capacity (usually ~16 GB), so we need to manage memory wisely when working with various datasets. Thanks to caching, results can be expected in 10x shorter time when the same dataset is queried again.
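
A much simplified illustration of the idea (not the actual qikkDB cache, which must also handle eviction and multiple devices): a map from column name to a resident device buffer, consulted before any host-to-device copy.

    #include <cuda_runtime.h>
    #include <string>
    #include <unordered_map>

    struct CachedBuffer {
        void*  devicePtr;  // column data resident in GPU VRAM
        size_t bytes;
    };

    class GpuCache {
        std::unordered_map<std::string, CachedBuffer> cache_;
    public:
        // Returns the resident buffer, copying the column in only on a miss.
        void* getColumn(const std::string& name, const void* hostData, size_t bytes)
        {
            auto it = cache_.find(name);
            if (it != cache_.end())
                return it->second.devicePtr;  // hit: no transfer at all

            void* devicePtr = nullptr;        // a real cache must also evict
            cudaMalloc(&devicePtr, bytes);    // old entries when VRAM runs out
            cudaMemcpy(devicePtr, hostData, bytes, cudaMemcpyHostToDevice);
            cache_[name] = CachedBuffer{devicePtr, bytes};
            return devicePtr;
        }
    };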

Data Compression [Under Development]

As mentioned above, data are internally organized in a column-wise format. One great advantage of columnar storage is that columns can be compressed very efficiently. QikkDB uses compression to reduce data volume on disk, but also across the whole memory hierarchy. The compression scheme used is rather lightweight in order to provide maximum decompression speed. Data can therefore be kept compressed even in GPU memory, sparing space in VRAM, which is naturally much smaller, and columns are decompressed only when necessary.
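
QikkDB's exact scheme is not documented here, but a typical lightweight columnar codec in this spirit is frame-of-reference encoding, sketched below: store the column minimum once and keep only small offsets, which decompress with a single addition per value, cheap enough to do on the GPU itself.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Frame-of-reference encoding (illustrative): values are stored as small
    // offsets from the block minimum. Assumes a non-empty column whose value
    // range fits into 16 bits.
    struct ForBlock {
        int64_t reference;              // the block minimum, stored once
        std::vector<uint16_t> offsets;  // value - reference, in fewer bits
    };

    ForBlock compress(const std::vector<int64_t>& column)
    {
        ForBlock block;
        block.reference = *std::min_element(column.begin(), column.end());
        block.offsets.reserve(column.size());
        for (int64_t value : column)
            block.offsets.push_back(static_cast<uint16_t>(value - block.reference));
        return block;
    }

    int64_t decompressAt(const ForBlock& block, size_t i)
    {
        return block.reference + block.offsets[i];  // one addition per value
    }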

Software Toolkit Used

The database layer is implemented in C++17, extended with CUDA kernels for computation on NVIDIA graphics cards. The console is written in C# and therefore requires the .NET runtime environment. We use CMake for building and testing the code base. For parsing queries we use the ANTLR v4 library, and we use Google Protocol Buffers for networking. The YAML and Boost libraries are also used. Unit and integration tests are written using the GTest library.

Key Capabilities

SQL Language

In the world of databases, SQL is a common interface that almost every data developer or analyst knows, and many tools use it to communicate with the persistence layer. QikkDB provides basic SQL statements to select and manipulate data.

Geospatial Functions

Based on our experience with telco data, we added geospatial functionality for working with points and polygons. Currently, qikkDB supports inputs in WKT (well-known text) format and provides three main functions: CONTAINS (checks whether a point is inside a polygon), INTERSECTION (creates a new polygon as the intersection of two polygons) and UNION (creates a new polygon as the union of two polygons).

In-Memory

When speed is the most important thing, the whole dataset can be preloaded into RAM. This way latency is decreased to a minimum.

Indexing

QikkDB currently provides a single clustered index. Similarly to traditional databases, data are sorted according to the chosen columns. Slower inserts in turn result in an order of magnitude faster querying.

Multiplatform

The big data world happens on Unix systems. We know that, but we also do not want to forget about companies that run Windows.

Connectors to Other Tools

In order to integrate the database into a processing pipeline, we currently provide C# and C++ connectors; other connectors (including ODBC) are planned for the future.

Scalability

Multiple GPUs within a single server node are detected automatically and the load can be balanced among them. Adding new nodes, which can be easily configured, is a must when working with terabytes or petabytes of data. It is possible to use up to 8 Tesla GPUs in a single machine; such a machine is comparable to a cluster of 20 high-end CPU nodes.
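
For illustration, enumerating the cards in a node uses the standard CUDA runtime calls; distributing blocks among them is then a matter of selecting a device before launching its share of the work.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int gpuCount = 0;
        cudaGetDeviceCount(&gpuCount);  // all GPUs in this server node

        for (int gpu = 0; gpu < gpuCount; ++gpu)
        {
            cudaDeviceProp properties;
            cudaGetDeviceProperties(&properties, gpu);
            std::printf("GPU %d: %s, %zu MB VRAM\n", gpu, properties.name,
                        properties.totalGlobalMem / (1024 * 1024));
            // cudaSetDevice(gpu) would be called before launching this
            // card's share of the data blocks.
        }
        return 0;
    }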