Apache Beam, a unified programming model for both batch and streaming data, has graduated from the Apache Incubator to become a top-level Apache project.
Aside from becoming another full-fledged widget in the ever-expanding Apache tool belt of big-data processing software, Beam addresses ease of use and dev-friendly abstraction, rather than simply offering raw speed or a wider array of included processing algorithms.
Beam us up!
Beam provides a single programming model for creating batch and stream processing jobs (the name is a hybrid of “batch” and “stream”), and it offers a layer of abstraction for dispatching to various engines used to run the jobs. The project originated at Google, where it’s currently a service called GCD (Google Cloud Dataflow). Beam uses the same API as GCD, and it can use GCD as an execution engine, along with Apache Spark, Apache Flink (a stream processing engine with a highly memory-efficient design), and now Apache Apex (another stream engine for working closely with Hadoop deployments).
The Beam model involves five components: the pipeline (the pathway for data through the program); the “PCollections,” or data streams themselves; the transforms, for processing data; the sources and sinks, where data is fetched and eventually sent; and the “runners,” or components that allow the whole thing to be executed on an engine.
Apache says it separated concerns in this fashion so that Beam can “easily and intuitively express data processing pipelines for everything from simple batch-based data ingestion to complex event-time-based stream processing.” This is in line with reworking tools like Apache Spark to support stream and batch processing within the same product and with similar programming models. In theory, it’s one fewer concept for prospective developers to wrap their head around, but that presumes Beam is used in lieu of Spark or other frameworks, when it’s more likely it’ll be used — at first — to augment them.
One possible drawback to Beam’s approach is that while the layers of abstraction in the product make operations easier, they also put the developer at a distance from the underlying layers. A good case in point: Beam’s current level of integration with Apache Spark; the Spark runner doesn’t yet use Spark’s more recent DataFrames system, and thus may not take advantage of the optimizations those can provide. But this isn’t a conceptual flaw, it’s an issue with the implementation, which can be addressed in time.
The big payoff of using Beam, as noted by Ian Pointer in his discussion of Beam in early 2016, is that it makes migrations between processing systems less of a headache. Likewise, Apache says Beam “cleanly [separates] the user’s processing logic from details of the underlying engine.”
Separation of concern and ease of migration will be good to have if the ongoing rivalries, and competitions between the various big data processing engines continues. Granted, Apache Spark has emerged as one of the undisputed champs of the field and become a de facto standard choice. But there’s always room for improvement or an entirely new streaming or processing paradigm. Beam is less about offering a specific alternative than about providing developers and data-wranglers with more breadth of choice between them.