The team behind Pivotal’s GemFire in-memory transactional data store recently unveiled SnappyData, a new database solution powered by GemFire and Apache Spark.
SnappyData is another recent example of Spark employed as a component in a larger database solution, with or without other pieces from Apache Hadoop.
Snap and spark
SnappyData (the name of both the new database and the organization producing it) was built to span two worlds. It uses the Apache Spark in-memory data-analytics engine to perform live SQL analytics on both static data sets and streams. Queries against SnappyData can be written as conventional SQL or as Spark abstractions, so existing work done in either paradigm can be reused, alone or together, on the same data.
To store and retrieve the data, SnappyData has a distributed data store called Snappy-Store, derived from a variant of Pivotal’s GemFire technology. It works as either its own data store or as a sort of asynchronous write-back cache to other data sources, such as Hadoop/HDFS. This implies that existing data sets could be accessed through SnappyData without having to be formally migrated.
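To make the write-back idea concrete, here is a minimal sketch in plain Python. It is not Snappy-Store's code or API: the `WriteBackCache` class and the dictionary standing in for an external source such as HDFS are hypothetical, chosen only to show writes landing in memory first and reaching the slower store later.

```python
import queue
import threading


class WriteBackCache:
    """Toy write-back cache: writes land in memory first and are copied
    to a slower backing store by a background thread."""

    def __init__(self, backing_store):
        self._memory = {}               # fast in-memory copy
        self._backing = backing_store   # hypothetical stand-in for HDFS or similar
        self._pending = queue.Queue()   # writes not yet persisted
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        self._memory[key] = value        # visible to readers immediately
        self._pending.put((key, value))  # persisted later, asynchronously

    def get(self, key):
        if key in self._memory:
            return self._memory[key]
        value = self._backing[key]       # fall through to the backing store
        self._memory[key] = value        # keep a copy for the next read
        return value

    def flush(self):
        self._pending.join()             # wait until queued writes are persisted

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._backing[key] = value
            self._pending.task_done()


# Existing data stays where it is; new writes reach it in the background.
existing_store = {"old": 1}
cache = WriteBackCache(existing_store)
cache.put("new", 2)
old_value = cache.get("old")   # read through to data that was never migrated
cache.flush()
```

The read path falls through to the backing store on a miss, which is why data already sitting in another system could be queried without a formal migration step.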
SnappyData also tries to offer novel solutions to problems that can arise when using streaming data. For instance, if too much data is coming through for a query to be answered in real time, SnappyData uses approximate query processing (AQP), a method of sampling the streaming data, to generate an answer.
The results are less exact than operating on the entire data set, and AQP isn’t available for every kind of query. That said, AQP queries are intended to be faster to run and are less demanding of CPU and memory than working on the full data set.
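The trade-off can be illustrated with a generic sampling technique. The sketch below uses reservoir sampling to estimate an average from a fixed-size sample of a stream. It is an assumption-laden illustration, not SnappyData's AQP implementation; the function names and the sample size are made up for the example.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample


def approximate_mean(stream, k=1000, seed=0):
    """Estimate the mean of a stream from a fixed-size sample."""
    sample = reservoir_sample(stream, k, random.Random(seed))
    if not sample:
        raise ValueError("cannot estimate the mean of an empty stream")
    return sum(sample) / len(sample)


# Hypothetical stream of readings whose exact mean is 49.5.
readings = (x % 100 for x in range(100_000))
estimate = approximate_mean(readings)
```

Memory use is bounded by the sample size rather than the stream length, which is where the savings come from. The price is an estimate that lands near the exact mean instead of on it.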
One among many
This isn’t the first time Spark has been used at the heart of a data analysis solution that covers both OLTP and OLAP workloads. The in-memory database system Splice Machine was originally built on top of Hadoop components, which it used to scale out and run both OLTP and OLAP jobs under one roof. Version 2.0 of that product added Spark as an OLAP processing engine.
Where SnappyData diverges from Splice Machine, though, is in how it uses Spark. SnappyData says it is extending Spark Streaming in several ways, such as allowing streams to be queried and manipulated as though they were tables, including with operations like joins.
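What treating a stream as a table might look like can be sketched without Spark at all. In the plain-Python example below, each micro-batch of events is handled as a small table and joined against a static one. The helper names and the sample data are hypothetical, and nothing here reflects the Spark Streaming or SnappyData APIs.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared key column."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)      # build a lookup on the join key
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**right, **left})
    return joined


def join_stream_with_table(micro_batches, table, key):
    """Treat each micro-batch of a stream as a small table and join it."""
    for batch in micro_batches:
        yield hash_join(batch, table, key)


# Static reference table and a stream that arrives in two micro-batches.
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "lin"}]
clicks = [
    [{"user_id": 1, "page": "/home"}],
    [{"user_id": 2, "page": "/docs"}, {"user_id": 3, "page": "/faq"}],
]
results = list(join_stream_with_table(clicks, users, "user_id"))
```

Because this is an inner join, the click from user 3 drops out of the second batch: no matching row exists in the static table.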
SnappyData also seems like a good environment to leverage changes that are slated for Apache Spark in the near term. For instance, Spark 2.0, scheduled to come out later this year, will heavily rework how Spark handles memory management and introduce changes to its streaming system that make it easier to pull down streaming data.