Building a Basic Apache Beam Pipeline in 4 Steps with Java- 5 mins
Getting started with building data pipelines using Apache Beam.
- Step 1: Define Pipeline Options
- Step 2: Create the Pipeline
- Step 3: Apply Transformations
- Step 4: Run it!
In this post, I would like to show you how you can get started with Apache Beam and build the first, simple data pipeline in 4 steps.
Step 1: Define Pipeline Options
Let’s start with creating a helper object to configure our pipelines. This is not an absolute necessity, however defining the pipeline options might save you some time later, especially if your pipeline is dependent on a few arguments, that might have pre-defined, default values that you don’t want to provide at every run.
Step 2: Create the Pipeline
Now that you have created the pipeline options object, you will need to create the pipeline object itself and provide the options to it:
(Check out the documentation for the PipelineOptionsFactory class for the description of the methods used above.)
Step 3: Apply Transformations
After defining the pipeline and providing the options class, we can start by applying the transformations using
.apply(...). Those can be chained after each other by applying yet another
.apply(...), for instance:
Note that a
PCollection<T> is an immutable collection of values of type
T and that you can provide names for the transformations as the first string argument in the
apply(), like in the first two and the last
Here we can also specify custom transformations that can be done in parallel. In Beam, they are being referred to as
ParDo methods. They are similar to the
Reducer class of a MapReduce-style algorithm. In this post, we will not be focusing on the contents of such pipeline (i.e. what it is doing), but a simple example of a
ParDo can be looking like the second
apply in the code above (look for the link in the conclusion for the entire running example).
Step 4: Run it!
After defining the pipeline, its options, and how they are connected, we can finally run the pipeline. The great thing about running the pipelines in Apache Beam is that it is very easy to switch between various runners. Beam provides a portable API layer for building sophisticated pipelines that may be executed across various execution engines or runners. In our example, we can switch from running the pipeline locally (with
direct-runner), to running the same pipeline in the Cloud as a managed service (with
dataflow-runner) by simply adjusting the values we provide when running the code.
Here is an example of running the pipeline with
And here is the example of running the same pipeline in the Cloud as a managed service, using Google Cloud Dataflow. Note that most of the parameters provided are still the same, with a few additional parameters needed for this specific runner.
In case you would like to be using various runners or interested in switching between them, it might be a good idea to check the capability matrix in the documentation, as the core concepts of Beam Model can sometimes be implemented to varying degrees in each of the Beam runners.
We have now seen the basic steps needed to create a simple data-parallel processing pipeline and how that can be run and deployed both in the local and managed Cloud environments. We are were also able to run the same pipeline with just a few adjustments to the command line parameters and, in our case, without any changes to the pipeline code.