Apache Spark, the extremely popular data analytics execution engine, was initially released in 2012. Support did not really pick up until 2015, but by November of that year Spark saw 50 percent more activity than the core Apache Hadoop project itself, with more than 750 contributors from hundreds of companies participating in its development.
Spark is a hot commodity for a reason. Its performance, general-purpose applicability, and programming flexibility combine to make it a versatile execution engine. Yet that versatility also leads to varying levels of support for the product and to different ways solutions are delivered.
While evaluating analytic software products that support Spark, customers should look closely under the hood and examine four key facets of how the support for Spark is implemented:
- How Spark is utilized inside the platform
- What you get in a packaged product that includes Spark
- How Spark is exposed to you and your team
- How you perform analytics with the different Spark libraries
Spark can be used as a developer tool via its APIs, or it can be used by BI tools via its SQL interface. Alternatively, Spark can be embedded in an application, giving business users access without requiring programming skills and without limiting Spark’s utility to what a SQL interface allows. I examine each of these options below and explain why not all Spark support is the same.
Programming on Spark
If you want the full power of Spark, you can program directly to its processing engine. Its APIs are exposed in Java, Python, Scala, and R. In addition to stream and graph processing components, Spark offers a machine-learning library (MLlib) as well as Spark SQL, which lets data tools connect to a Spark engine and query structured data, and lets programmers access data via SQL queries they write themselves.
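As a rough illustration of what programming directly against the engine involves, here is a minimal word-count sketch using the Python API. It assumes a local PySpark installation and the `SparkSession` entry point found in more recent Spark releases; the function names `tokenize`, `word_count_local`, and `word_count_spark` are hypothetical, and the plain-Python version is included only as a reference for what the Spark job computes.

```python
from collections import Counter


def tokenize(line):
    """Split a line into lowercase, whitespace-delimited words."""
    return line.lower().split()


def word_count_local(lines):
    """Plain-Python reference: the result the Spark job should reproduce."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def word_count_spark(lines):
    """Same computation on Spark's RDD API (assumes pyspark is installed)."""
    from pyspark.sql import SparkSession  # imported lazily: optional dependency

    # "local[*]" runs Spark inside this process using all local cores.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("wordcount-sketch")
        .getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(lines)
        pairs = rdd.flatMap(tokenize).map(lambda word: (word, 1))
        counts = pairs.reduceByKey(lambda a, b: a + b)
        return dict(counts.collect())
    finally:
        spark.stop()  # always release the session, even on failure
```

Even for a task this small, the developer has to manage a session, express the job as a chain of transformations, and collect the results explicitly.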
A number of vendors offer standalone Spark implementations; the major Hadoop distribution suppliers also offer Spark within their platforms. Access is exposed through either a command-line or a notebook interface.
But performing analytics on core Spark with its APIs is a time-consuming, programming-intensive process. While Spark offers an easier programming model than, say, native Hadoop, it still requires developers. Even for organizations with developer resources, deploying them to work on lengthy data analytics projects may amount to an intolerable hidden cost. For many organizations, programming on Spark is not a practical course for this reason.
BI on Spark
Spark SQL is a standards-based way to access data in Spark. It has been relatively easy for BI products to add support for Spark SQL to query tabular data in Spark. The dialect of SQL used by Spark is similar to that of Apache Hive, making Spark SQL akin to earlier SQL-on-Hadoop technologies.
Although Spark SQL uses the Spark engine behind the scenes, it suffers from the same disadvantages as Hive and Impala: Data must be in a structured, tabular format to be queried. This forces Spark to be treated as if it were a relational database, which gives up many of the advantages of a big data engine. In short, putting BI on top of Spark requires first transforming the data into a tabular format that the BI tools can consume.
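To make that transformation step concrete, the sketch below flattens nested records into fixed-column rows of the kind a BI tool can query. The record layout, the column names, and the `flatten_orders` helper are all hypothetical and chosen only for illustration; it is plain Python rather than Spark code.

```python
def flatten_orders(records):
    """Turn nested customer records into one flat row per order line item."""
    rows = []
    for record in records:
        for order in record.get("orders", []):
            for item in order.get("items", []):
                rows.append({
                    "customer": record["customer"],
                    "order_id": order["id"],
                    "sku": item["sku"],
                    "quantity": item["qty"],
                })
    return rows


# Hypothetical nested input, as it might arrive from a document store or log.
nested = [
    {
        "customer": "acme",
        "orders": [
            {"id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
            {"id": 2, "items": [{"sku": "A", "qty": 5}]},
        ],
    },
    {"customer": "globex", "orders": []},  # no orders, so it yields no rows
]

table = flatten_orders(nested)
```

Note that the customer with no orders disappears from the flat table, a small example of the information a tabular reshaping can lose.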
Embedding Spark
Another way to leverage Spark is to abstract away its complexity by embedding it deep into a product and taking full advantage of its power behind the scenes. This gives users the speed and power of Spark without needing developers.
This architecture brings up three key questions. First, does the platform truly hide all of the technical complexities of Spark? As a customer, you need to examine how you would carry out each step of the analytic cycle: integration, preparation, analysis, visualization, and operationalization. A number of products offer self-service capabilities that abstract away Spark’s complexities, but others force the analyst to dig down and code, for example when performing integration and preparation. These products may also require you to first ingest all your data into the Hadoop file system for processing. This lengthens your analytic cycles, creates fragile and fragmented analytic processes, and requires specialized skills.
Second, how does the platform take advantage of Spark? It’s critical to understand how Spark is used in the execution framework. Spark is sometimes embedded in a fashion that does not have the full scalability of a true cluster. This can limit overall performance as the volume of analytic jobs increases.
Third, how are you protected for the future? The strength of being tightly coupled with the Spark engine is also a weakness. The big data industry moves quickly. MapReduce was the predominant engine in Hadoop for six years. Apache Tez became mainstream in 2013, and now Spark has become a major engine. Assuming the technology curve continues to produce new engines at the same rate, Spark could be supplanted by a new engine within 18 months, forcing products tightly coupled to Spark to be reengineered, a far from trivial undertaking. Even with that effort put aside, you must consider whether the redesigned product will be fully compatible with what you’ve built in the older version.
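One general way to reduce this exposure, offered here as a design sketch rather than a description of any particular product, is to write analytic steps against a narrow engine interface so that a new engine only requires a new adapter. The `ExecutionEngine` interface, the `LocalEngine` adapter, and the `average_by_key` job below are all hypothetical names.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ExecutionEngine(ABC):
    """Hypothetical minimal contract that analytic jobs depend on."""

    @abstractmethod
    def map(self, func, data):
        ...

    @abstractmethod
    def reduce_by_key(self, func, pairs):
        ...


class LocalEngine(ExecutionEngine):
    """In-process adapter; a Spark adapter would implement the same methods."""

    def map(self, func, data):
        return [func(x) for x in data]

    def reduce_by_key(self, func, pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        result = {}
        for key, values in grouped.items():
            acc = values[0]
            for value in values[1:]:
                acc = func(acc, value)
            result[key] = acc
        return result


def average_by_key(engine, pairs):
    """Job logic sees only the interface, never a concrete engine."""
    with_counts = engine.map(lambda kv: (kv[0], (kv[1], 1)), pairs)
    totals = engine.reduce_by_key(
        lambda a, b: (a[0] + b[0], a[1] + b[1]), with_counts
    )
    return {key: total / count for key, (total, count) in totals.items()}


result = average_by_key(LocalEngine(), [("a", 2), ("a", 4), ("b", 10)])
```

Under this arrangement, moving to a different engine means writing another `ExecutionEngine` subclass; `average_by_key` itself would not change.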
The first step to uncovering the full power of Spark is to understand that not all Spark support is created equal. It’s crucial that organizations grasp the differences in Spark implementations and what each approach means for their overall analytic workflow. Only then can they make a strategic buying decision that will meet their needs over the long haul.