The Spark + AI Summit 2020 is happening from June 22nd to June 26th. There have been talks by big names in the industry like Matei Zaharia and even Nate Silver of fivethirtyeight.com. With multiple talks running simultaneously, it was difficult to pick just one to listen to, as all of them seemed interesting! I took some time to listen to Tomer Shiran, co-founder of Dremio, speak about the origins, the need for, and the use cases of Apache Arrow, and how it is used in Dremio alongside Gandiva.
The data querying conundrum in the Data Lake paradigm
Data Lakes are the talk of the town today! In an age where data is the chief asset that companies prize, they don't want to lose any of it and would rather store it in a Data Lake, even if it does not make much sense to them right now, on the prospect of it being useful at a later point in time. That very tendency is what makes the data unwieldy.
In such a scenario, when your data store holds vast amounts of data, querying becomes slow: it is not just that the data is big, but that transferring it over the network at query time also becomes a non-trivial cost. To get around this problem, companies again copy part of their data from the Data Lake into Data Warehousing solutions like AWS Redshift, and then construct OLAP cubes and aggregations on top of it for fast querying. The drawback is that once the data is copied into the Data Warehouse, you lose the flexibility to do much else with it. This is the problem the folks at Dremio / Apache Arrow are trying to solve - how do you query your data directly from the Data Lake without having to copy it anywhere else?
Dremio
The folks at Dremio claim they have built technology to query Data Lakes with 4 to 100 times better performance than existing solutions. They were able to do it by defining and implementing a new standard, Apache Arrow, for storing data in memory in a columnar format that allows efficient operations on today's hardware. I will talk more about Apache Arrow later in this article.
Following are some features that power Dremio -
Data Reflections
Dremio creates a view of your data: an optimized data structure that accelerates different query patterns. This data structure can also refresh itself automatically.
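To make the idea concrete, here is a minimal, purely illustrative sketch in Python with pyarrow (not Dremio's actual mechanism or API): a small aggregated table is precomputed once, and later queries are answered from it instead of re-scanning the raw data. The table contents and column names are made up for the example, and pa.Table.group_by needs a reasonably recent pyarrow.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Raw "data lake" table: one row per sale event (made-up sample data).
raw = pa.table({
    "region": ["EU", "EU", "US", "US", "US"],
    "amount": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Precompute an aggregated representation, analogous in spirit to an
# aggregation reflection: totals per region.
reflection = raw.group_by("region").aggregate([("amount", "sum")])

# A query like "total sales in the US" can now be answered from the much
# smaller precomputed table instead of scanning all the raw rows.
us_total = reflection.filter(pc.equal(reflection["region"], "US"))
print(us_total.to_pydict())
```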
Columnar Cloud Cache (C3)
The Columnar Cloud Cache feature takes advantage of local storage (generally NVMe - non-volatile memory express) to provide a distributed, real-time caching layer that increases read throughput and reduces the amount of data transferred over the network. It is automatic, needs no user involvement, and uses SQL query patterns, workload management and file directory structures to decide what to store and what to evict.
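For intuition only, here is a tiny read-through cache sketch in Python using pyarrow's Parquet reader. It is not Dremio's C3 implementation; the cache directory, function name and lack of eviction are all assumptions made up for the example.

```python
import os
import pyarrow.parquet as pq

CACHE_DIR = "/tmp/column_cache"  # hypothetical local (e.g. NVMe) cache directory

def read_columns_cached(remote_path, cache_name, columns):
    """Read-through cache: serve column data from local storage when present,
    otherwise fetch only the needed columns and keep a local copy."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    local_copy = os.path.join(CACHE_DIR, cache_name)
    if not os.path.exists(local_copy):
        # Cache miss: pull just the requested columns over the network
        # and persist them locally for subsequent queries.
        table = pq.read_table(remote_path, columns=columns)
        pq.write_table(table, local_copy)
    # Cache hit (or freshly populated): read from fast local storage.
    return pq.read_table(local_copy)
```

A real implementation would also track access patterns and evict cold entries, which this sketch deliberately leaves out.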
Predictive Pipelining
Dremio can predict data access patterns, which reduces query response times. Based on the usage patterns of analytical workloads and their understanding of columnar data formats, the Dremio folks built this feature to fetch data just before the execution engine needs it.
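The general prefetching idea can be sketched with nothing but the Python standard library; this is an illustration of pipelined reads, not Dremio's implementation, and load_chunk / process are placeholder functions invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(i):
    # Placeholder for fetching chunk i of a column from remote storage.
    return list(range(i * 1000, (i + 1) * 1000))

def process(chunk):
    # Placeholder for the execution engine consuming a chunk.
    return sum(chunk)

def pipelined_scan(num_chunks):
    """Prefetch the next chunk in the background while the current one is
    being processed, so the engine rarely sits idle waiting on I/O."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_chunk, 0)              # prefetch the first chunk
        for i in range(num_chunks):
            chunk = pending.result()                      # wait only if the prefetch is behind
            if i + 1 < num_chunks:
                pending = pool.submit(load_chunk, i + 1)  # start fetching the next chunk
            results.append(process(chunk))
    return results

print(pipelined_scan(4))
```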
Difference between Dremio and Facebook’s Presto
Presto by Facebook is a distributed SQL engine over multiple data sources. Dremio offers more features on top of that. It sits between your data, spread across disparate sources, and you, the person who wants to analyze that data - removing the need for data warehousing solutions, OLAP cubes, extracts and aggregations.
- Dremio claims to offer interactive query speeds on data of any volume
- It has deep integrations with some data stores and is thus able to push down some compute when queries are made over those data sources
- It can show how different datasets are related to each other by tracking data lineage
Apache Arrow
Apache Arrow is a specification for storing data in memory in a columnar format, for serializing the metadata, and for transferring data over the network. It uses Google FlatBuffers for metadata serialization. It was born out of Dremio's internal memory format.
It provides constant-time random access, data adjacency for sequential scans, and lends itself easily to the SIMD (single instruction, multiple data) class of instructions in modern processors.
Data is laid out as Arrow buffers / vectors, which are essentially arrays of the same length, each with its own data type. The names and types of these vectors together constitute the schema.
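Here is a minimal sketch with pyarrow, the Python bindings for Arrow, putting those pieces together: same-length columnar vectors with different types, the schema that describes them, constant-time random access, and serialization with the Arrow IPC stream format (whose metadata is FlatBuffers-encoded). The column names and values are made up for the example.

```python
import pyarrow as pa

# Each column lives in its own contiguous Arrow array ("vector"); all vectors
# in a batch have the same length but can have different data types.
ids = pa.array([1, 2, 3, 4], type=pa.int64())
names = pa.array(["a", "b", "c", "d"], type=pa.string())

batch = pa.record_batch([ids, names], names=["id", "name"])
print(batch.schema)   # the schema lists the vectors' names and types
print(ids[2])         # constant-time random access into a column

# Serialize the batch using the Arrow IPC stream format so it can be sent
# over the network; the schema metadata is encoded with FlatBuffers.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Read it back without any per-value deserialization.
with pa.ipc.open_stream(buf) as reader:
    print(reader.read_all())
```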
Gandiva
Gandiva is a project started at Dremio (and since donated to Apache Arrow) that provides a high-performance execution engine over Apache Arrow data buffers. Gandiva takes a SQL expression, compiles it into LLVM intermediate representation and translates that into machine code. To get this high performance, it is written in C++.
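Some pyarrow builds have shipped an optional gandiva module exposing these bindings from Python; availability varies by version and build, so treat this as a hedged sketch rather than a guaranteed API. It builds the expression price * qty over a small record batch (the columns are made up for the example), lets Gandiva compile it via LLVM, and evaluates it against the batch.

```python
import pyarrow as pa
import pyarrow.gandiva as gandiva  # optional module; not present in every pyarrow build

# A small record batch to project over (made-up sample columns).
batch = pa.record_batch(
    [pa.array([1.0, 2.0, 3.0]), pa.array([10.0, 20.0, 30.0])],
    names=["price", "qty"],
)

# Build the expression tree for "price * qty"; Gandiva compiles it with LLVM.
builder = gandiva.TreeExprBuilder()
price = builder.make_field(batch.schema.field("price"))
qty = builder.make_field(batch.schema.field("qty"))
product = builder.make_function("multiply", [price, qty], pa.float64())
expr = builder.make_expression(product, pa.field("total", pa.float64()))

projector = gandiva.make_projector(batch.schema, [expr], pa.default_memory_pool())
result, = projector.evaluate(batch)  # one Arrow array per expression
print(result)  # expected: [10, 40, 90]
```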
Tomer wrapped up the talk on Dremio with a demo of the platform as well.
Go give Dremio a try!