<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://mehmandarov.com/tag/pipelines/feed.xml" rel="self" type="application/atom+xml"/><link href="https://mehmandarov.com/tag/pipelines/" rel="alternate" type="text/html"/><updated>2020-02-21T07:35:00+01:00</updated><id>https://mehmandarov.com/tag/pipelines/feed.xml</id><title type="html">Rustam Mehmandarov - tag: pipelines</title><subtitle type="text">Posts tagged &quot;pipelines&quot; on Rustam Mehmandarov.</subtitle><author><name>Rustam Mehmandarov</name></author><entry><title type="html">Building a Basic Apache Beam Pipeline in 4 Steps with Java</title><link href="https://mehmandarov.com/beam-pipeline-in-four-steps/" rel="alternate" type="text/html" title="Building a Basic Apache Beam Pipeline in 4 Steps with Java"/><published>2020-02-21T07:35:00+01:00</published><updated>2020-02-21T07:35:00+01:00</updated><id>https://mehmandarov.com/beam-pipeline-in-four-steps</id><content type="html" xml:base="https://mehmandarov.com/beam-pipeline-in-four-steps/"><![CDATA[<p><em>Getting started with building data pipelines using Apache Beam.</em></p>

<ul>
  <li><a href="#step-1-define-pipeline-options">Step 1: Define Pipeline Options</a></li>
  <li><a href="#step-2-create-the-pipeline">Step 2: Create the Pipeline</a></li>
  <li><a href="#step-3-apply-transformations">Step 3: Apply Transformations</a></li>
  <li><a href="#step-4-run-it">Step 4: Run it!</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ul>

<hr />

<p>In this post, I would like to show you how you can get started with Apache Beam and build your first simple data pipeline in 4 steps.</p>

<h2 id="step-1-define-pipeline-options">Step 1: Define Pipeline Options</h2>

<p>Let&#8217;s start by creating a helper object to configure our pipeline. This is not strictly necessary, but defining the pipeline options can save you time later, especially if your pipeline depends on several arguments with pre-defined default values that you don&#8217;t want to provide on every run.</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">OsloCityBikeOptions</span> <span class="kd">extends</span> <span class="nc">PipelineOptions</span> <span class="o">{</span>

    <span class="cm">/**
     * By default, the code reads from a public dataset containing a subset of
     * bike station metadata for city bikes. Set this option to choose a different input file or glob
     * (i.e. partial names with *, like "*-stations.txt").
     */</span>
    <span class="nd">@Description</span><span class="o">(</span><span class="s">"Path of the file with the availability data"</span><span class="o">)</span>
    <span class="nd">@Default</span><span class="o">.</span><span class="na">String</span><span class="o">(</span><span class="s">"src/main/resources/bikedata-stations-example.txt"</span><span class="o">)</span>
    <span class="nc">String</span> <span class="nf">getStationMetadataInputFile</span><span class="o">();</span>
    <span class="kt">void</span> <span class="nf">setStationMetadataInputFile</span><span class="o">(</span><span class="nc">String</span> <span class="n">value</span><span class="o">);</span>

    <span class="c1">// some other options here...</span>
<span class="o">}</span></code></pre></figure>

<h2 id="step-2-create-the-pipeline">Step 2: Create the Pipeline</h2>

<p>Now that you have created the pipeline options object, you will need to create the pipeline object itself and provide the options to it:</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="nc">OsloCityBikeOptions</span> <span class="n">options</span> <span class="o">=</span> 
        <span class="nc">PipelineOptionsFactory</span><span class="o">.</span><span class="na">fromArgs</span><span class="o">(</span><span class="n">args</span><span class="o">)</span>
                                <span class="o">.</span><span class="na">withValidation</span><span class="o">()</span>
                                <span class="o">.</span><span class="na">as</span><span class="o">(</span><span class="nc">OsloCityBikeOptions</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>

<span class="nc">Pipeline</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="nc">Pipeline</span><span class="o">.</span><span class="na">create</span><span class="o">(</span><span class="n">options</span><span class="o">);</span></code></pre></figure>

<p>(<em>Check out the documentation for the <a href="https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/options/PipelineOptionsFactory.html">PipelineOptionsFactory</a> class for the description of the methods used above.</em>)</p>

<h2 id="step-3-apply-transformations">Step 3: Apply Transformations</h2>

<p>After defining the pipeline and providing the options class, we can start by applying the transformations using <code class="language-plaintext highlighter-rouge">.apply(...)</code>. Those can be chained after each other by applying yet another <code class="language-plaintext highlighter-rouge">.apply(...)</code>, for instance:</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="nc">PCollection</span> <span class="o">&lt;</span><span class="no">KV</span><span class="o">&lt;</span><span class="nc">Integer</span><span class="o">,</span> <span class="nc">LinkedHashMap</span><span class="o">&gt;&gt;</span> <span class="n">stationMetadata</span> <span class="o">=</span> <span class="n">pipeline</span>
                <span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="s">"ReadLines: StationMetadataInputFiles"</span><span class="o">,</span> <span class="nc">TextIO</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">from</span><span class="o">(</span><span class="n">options</span><span class="o">.</span><span class="na">getStationMetadataInputFile</span><span class="o">()))</span>
                <span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="s">"Station Metadata"</span><span class="o">,</span> <span class="nc">ParDo</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">fnExtractStationMetaDataFromJSON</span><span class="o">()));</span>

<span class="n">stationMetadata</span>
                <span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="nc">MapElements</span><span class="o">.</span><span class="na">into</span><span class="o">(</span><span class="nc">TypeDescriptor</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="nc">String</span><span class="o">.</span><span class="na">class</span><span class="o">)).</span><span class="na">via</span><span class="o">(</span><span class="n">o</span> <span class="o">-&gt;</span> <span class="n">o</span><span class="o">.</span><span class="na">toString</span><span class="o">()))</span>
                <span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="s">"WriteStationMetaData"</span><span class="o">,</span> <span class="nc">TextIO</span><span class="o">.</span><span class="na">write</span><span class="o">().</span><span class="na">to</span><span class="o">(</span><span class="n">options</span><span class="o">.</span><span class="na">getMetadataOutput</span><span class="o">()));</span></code></pre></figure>

<p>Note that a <a href="https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/values/PCollection.html"><code class="language-plaintext highlighter-rouge">PCollection&lt;T&gt;</code></a> is an immutable collection of values of type <code class="language-plaintext highlighter-rouge">T</code>, and that you can name a transformation by passing a string as the first argument to <code class="language-plaintext highlighter-rouge">apply()</code>, as in the first two and the last <code class="language-plaintext highlighter-rouge">apply</code> calls above.</p>

<p>Here we can also specify custom transformations that can be run in parallel. In Beam, these are called <a href="https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/transforms/ParDo.html"><code class="language-plaintext highlighter-rouge">ParDo</code></a> transforms. They are similar to the <code class="language-plaintext highlighter-rouge">Mapper</code> or <code class="language-plaintext highlighter-rouge">Reducer</code> class of a MapReduce-style algorithm. In this post, we will not focus on what such a pipeline actually does, but a simple example of a <code class="language-plaintext highlighter-rouge">ParDo</code> looks like the second <code class="language-plaintext highlighter-rouge">apply</code> in the code above (see the link in the <a href="#conclusion">conclusion</a> for the entire running example):</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">pipeline</span><span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="s">"Station Metadata"</span><span class="o">,</span> <span class="nc">ParDo</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">fnExtractStationMetaDataFromJSON</span><span class="o">()));</span></code></pre></figure>
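
<p>The body of <code class="language-plaintext highlighter-rouge">fnExtractStationMetaDataFromJSON()</code> is not shown in this post. As a rough sketch of what a function passed to <code class="language-plaintext highlighter-rouge">ParDo.of(...)</code> can look like, here is a hypothetical <code class="language-plaintext highlighter-rouge">DoFn</code>. The class name <code class="language-plaintext highlighter-rouge">ExtractStationFn</code> and the simple <code class="language-plaintext highlighter-rouge">id,name</code> line format are assumptions made for illustration only (the real example parses JSON), and the code needs the Beam Java SDK on the classpath:</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java">import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical DoFn (not from the original repository): turns a text line
// of the form "id,name" into a key-value pair of station id and station name.
public class ExtractStationFn extends DoFn&lt;String, KV&lt;Integer, String&gt;&gt; {

    // Beam calls this method once per input element, possibly in parallel.
    @ProcessElement
    public void processElement(@Element String line,
                               OutputReceiver&lt;KV&lt;Integer, String&gt;&gt; out) {
        String[] parts = line.split(",", 2);
        if (parts.length != 2) {
            return; // skip lines without a comma
        }
        try {
            int id = Integer.parseInt(parts[0].trim());
            out.output(KV.of(id, parts[1].trim()));
        } catch (NumberFormatException e) {
            // skip lines whose id is not a number
        }
    }
}</code></pre></figure>

<p>Such a function would be applied to a <code class="language-plaintext highlighter-rouge">PCollection&lt;String&gt;</code> of lines with <code class="language-plaintext highlighter-rouge">ParDo.of(new ExtractStationFn())</code>.</p>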

<h2 id="step-4-run-it">Step 4: Run it!</h2>

<p>After defining the pipeline, its options, and how they are connected, we can finally run the pipeline. The great thing about running the pipelines in Apache Beam is that it is very easy to switch between various runners. Beam provides a portable API layer for building sophisticated pipelines that may be executed across various execution engines or <em>runners</em>. In our example, we can switch from running the pipeline locally (with <a href="https://beam.apache.org/documentation/runners/direct/"><code class="language-plaintext highlighter-rouge">direct-runner</code></a>), to running the same pipeline in the Cloud as a managed service (with <a href="https://beam.apache.org/documentation/runners/dataflow/"><code class="language-plaintext highlighter-rouge">dataflow-runner</code></a>) by simply adjusting the values we provide when running the code.</p>

<h3 id="local-runner">Local runner</h3>

<p>Here is an example of running the pipeline with <code class="language-plaintext highlighter-rouge">direct-runner</code>:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">mvn compile <span class="nb">exec</span>:java <span class="se">\</span>
      <span class="nt">-Pdirect-runner</span> <span class="se">\</span>
      <span class="nt">-Dexec</span>.mainClass<span class="o">=</span>com.mehmandarov.beam.OsloCityBike <span class="se">\</span>
      <span class="nt">-Dexec</span>.args<span class="o">=</span><span class="s2">"--inputFile=src/data-example.txt </span><span class="se">\</span><span class="s2">
      --output=bikedatalocal"</span></code></pre></figure>

<h3 id="dataflow-runner">Dataflow runner</h3>

<p>And here is an example of running the same pipeline in the Cloud as a managed service, using Google Cloud Dataflow. Note that most of the parameters provided are still the same, with a few additional parameters needed for this specific runner.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">mvn compile <span class="nb">exec</span>:java <span class="se">\</span>
      <span class="nt">-Pdataflow-runner</span> <span class="se">\</span>
      <span class="nt">-Dexec</span>.mainClass<span class="o">=</span>com.mehmandarov.beam.OsloCityBike <span class="se">\</span>
      <span class="nt">-Dexec</span>.args<span class="o">=</span><span class="s2">"--project=rm-cx-211107 </span><span class="se">\</span><span class="s2">
      --inputFile=gs://my_oslo_bike_data/data-2018-*.txt </span><span class="se">\</span><span class="s2">
      --stagingLocation=gs://my_oslo_bike_data/testing </span><span class="se">\</span><span class="s2">
      --output=gs://my_oslo_bike_data/testing/output </span><span class="se">\</span><span class="s2">
      --tempLocation=gs://my_oslo_bike_data/testing/ </span><span class="se">\</span><span class="s2">
      --runner=DataflowRunner </span><span class="se">\</span><span class="s2">
      --region=europe-west1"</span></code></pre></figure>

<h3 id="other-runners">Other runners</h3>
<p>If you plan to use several runners, or are interested in switching between them, it is a good idea to check the <a href="https://beam.apache.org/documentation/runners/capability-matrix/">capability matrix</a> in the documentation, as each Beam runner implements the core concepts of the Beam Model to a varying degree.</p>

<h2 id="conclusion">Conclusion</h2>
<p>We have now seen the basic steps needed to create a simple data-parallel processing pipeline and how it can be run and deployed in both local and managed Cloud environments. We were also able to run the same pipeline with just a few adjustments to the command line parameters and, in our case, without any changes to the pipeline code.</p>

<p>The entire working example that we have been using here can be found in <a href="https://github.com/mehmandarov/oslocitybike-basic-beam">my GitHub repository</a>, as well as a more advanced example in <a href="https://github.com/mehmandarov/oslocitybike-beam">another repository</a>.</p>]]></content><author><name>Rustam Mehmandarov</name></author><summary type="html">Getting started with building data pipelines using Apache Beam Java SDK</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://mehmandarov.com/assets/images/posts-images/pipes.jpg"/><category term="blog"/><category term="java"/><category term="apache beam"/><category term="data"/><category term="pipelines"/><category term="english"/></entry><entry><title type="html">Getting a Graph Representation of a Pipeline in Apache Beam</title><link href="https://mehmandarov.com/apache-beam-pipeline-graph/" rel="alternate" type="text/html" title="Getting a Graph Representation of a Pipeline in Apache Beam"/><published>2019-11-27T08:15:00+01:00</published><updated>2019-11-27T08:15:00+01:00</updated><id>https://mehmandarov.com/apache-beam-pipeline-graph</id><content type="html" xml:base="https://mehmandarov.com/apache-beam-pipeline-graph/"><![CDATA[<p><em>Getting a pipeline representation in Apache Beam explained step-by-step.</em></p>

<ul>
  <li><a href="#intro">Intro</a></li>
  <li><a href="#tldr-getting-graph-representation">TL;DR: Getting Graph Representation</a></li>
  <li><a href="#a-full-example">A Full Example</a></li>
  <li><a href="#what-now">What Now?</a></li>
</ul>

<hr />

<h2 id="intro">Intro</h2>
<p>Constructing advanced pipelines, or trying to wrap your head around the existing pipelines, in <a href="https://beam.apache.org/">Apache Beam</a> can sometimes be challenging. We have seen some nice visual representations of the pipelines in the managed Cloud versions of this software, but figuring out how to get a graph representation of the pipeline required a little bit of research. Here is how it is done in a few steps using Beam&#8217;s Java SDK.</p>

<h2 id="tldr-getting-graph-representation">TL;DR: Getting Graph Representation</h2>

<p>If you just want to see a few lines that let you generate the <a href="https://en.wikipedia.org/wiki/DOT_(graph_description_language)">DOT</a> representation of the graph, here it is:</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.beam.runners.core.construction.renderer.PipelineDotRenderer</span><span class="o">;</span>

<span class="nc">Pipeline</span> <span class="n">p</span> <span class="o">=</span> <span class="nc">Pipeline</span><span class="o">.</span><span class="na">create</span><span class="o">(</span><span class="n">options</span><span class="o">);</span>
<span class="c1">// do stuff with your pipeline</span>
<span class="nc">String</span> <span class="n">dotString</span> <span class="o">=</span> <span class="nc">PipelineDotRenderer</span><span class="o">.</span><span class="na">toDotString</span><span class="o">(</span><span class="n">p</span><span class="o">);</span></code></pre></figure>

<p>Now, if you want a slightly more comprehensive example, keep on reading.</p>

<h2 id="a-full-example">A Full Example</h2>
<p>Here we will be using the <a href="https://beam.apache.org/get-started/quickstart-java/#get-the-wordcount-code">word count example</a>, particularly the <a href="https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java"><code class="language-plaintext highlighter-rouge">MinimalWordCount</code></a> class.</p>

<h4 id="adding-maven-dependency">Adding Maven Dependency</h4>
<p>First, we need to add a dependency to the Maven file under the <code class="language-plaintext highlighter-rouge">&lt;dependencies&gt;</code> section:</p>

<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt">&lt;dependencies&gt;</span>
    <span class="c">&lt;!-- ... all the other dependencies you may have --&gt;</span>
    <span class="nt">&lt;dependency&gt;</span>
        <span class="nt">&lt;groupId&gt;</span>org.apache.beam<span class="nt">&lt;/groupId&gt;</span>
        <span class="nt">&lt;artifactId&gt;</span>beam-runners-core-construction-java<span class="nt">&lt;/artifactId&gt;</span>
        <span class="nt">&lt;version&gt;</span>${beam.version}<span class="nt">&lt;/version&gt;</span>
    <span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;/dependencies&gt;</span></code></pre></figure>

<h4 id="the-code">The Code</h4>
<p>Now, we will need to add a few imports (assuming you already added the Maven dependency mentioned earlier):</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.beam.runners.core.construction.renderer.PipelineDotRenderer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.slf4j.Logger</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.slf4j.LoggerFactory</span><span class="o">;</span></code></pre></figure>

<p>To get the <a href="https://en.wikipedia.org/wiki/DOT_(graph_description_language)">DOT</a> representation of the pipeline graph, we pass the pipeline object to the <code class="language-plaintext highlighter-rouge">PipelineDotRenderer</code> class. In this example, we only log the output to the console (hence the SLF4J imports).</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Create the Pipeline object with the options we defined above</span>
<span class="nc">Pipeline</span> <span class="n">p</span> <span class="o">=</span> <span class="nc">Pipeline</span><span class="o">.</span><span class="na">create</span><span class="o">(</span><span class="n">options</span><span class="o">);</span>

<span class="c1">// ... do stuff with your pipeline ...</span>

<span class="c1">// Add this piece of code just before running the pipeline:</span>
<span class="kd">final</span> <span class="nc">Logger</span> <span class="n">log</span> <span class="o">=</span> <span class="nc">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="nc">MinimalWordCount</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="nc">String</span> <span class="n">dotString</span> <span class="o">=</span> <span class="nc">PipelineDotRenderer</span><span class="o">.</span><span class="na">toDotString</span><span class="o">(</span><span class="n">p</span><span class="o">);</span>
<span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"MY GRAPH REPR: "</span> <span class="o">+</span> <span class="n">dotString</span><span class="o">);</span>

<span class="n">p</span><span class="o">.</span><span class="na">run</span><span class="o">().</span><span class="na">waitUntilFinish</span><span class="o">();</span></code></pre></figure>

<p>That&#8217;s it. To see the code in action, run it from the command line:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>mvn compile <span class="nb">exec</span>:java <span class="se">\</span>
        <span class="nt">-Dexec</span>.mainClass<span class="o">=</span>org.apache.beam.examples.MinimalWordCount <span class="se">\</span>
        <span class="nt">-Pdirect-runner</span></code></pre></figure>

<p>This code will produce a DOT representation of the pipeline and log it to the console.</p>

<h4 id="a-complete-example">A Complete Example</h4>

<p>A fully working example can be found in <a href="https://github.com/mehmandarov/word-count-mini-beam">my repository</a>, based on <a href="https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java"><code class="language-plaintext highlighter-rouge">MinimalWordCount</code></a>
code. There, in addition to logging to the console, we will be storing the DOT representation to a file.</p>
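
<p>As a minimal sketch of how writing the DOT text to a file could look, using only <code class="language-plaintext highlighter-rouge">java.nio</code>: the class name <code class="language-plaintext highlighter-rouge">DotFileWriter</code>, the method name, and the placeholder graph are assumptions for illustration and are not taken from the repository.</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java">import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not from the repository) that stores a DOT string in a file.
public class DotFileWriter {

    // Writes the DOT text to fileName as UTF-8, replacing any existing content.
    public static Path writeDot(String dotString, String fileName) throws IOException {
        return Files.write(Paths.get(fileName), dotString.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the value returned by PipelineDotRenderer.toDotString(p),
        // so that this sketch runs without Beam on the classpath.
        String dotString = "digraph { read -&gt; count }";
        Path written = writeDot(dotString, "pipeline_graph.dot");
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}</code></pre></figure>

<p>In the pipeline code, the same call would take the <code class="language-plaintext highlighter-rouge">dotString</code> variable from the snippet above instead of the placeholder.</p>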

<p>In the next section, we will have a brief look at what can be done with the DOT representations.</p>

<h2 id="what-now">What Now?</h2>
<p>Now that we have a DOT representation of the pipeline graph, we can use it to get a better understanding of the pipeline. For instance, you can generate an SVG or a PNG image from the data. Note that the generated graph might be a bit verbose, but it gives a good overview of the pipeline.</p>

<p>Here, I have also included examples of the <a href="https://github.com/mehmandarov/word-count-mini-beam/blob/master/pipeline_graph.dot">DOT graph</a> and the <a href="https://github.com/mehmandarov/word-count-mini-beam/blob/master/pipeline_graph.png">PNG file</a> generated for that particular pipeline.</p>

<p>Assuming that you have Graphviz <a href="https://www.graphviz.org/download/">tools</a> installed, you can convert a DOT file to a PNG image using this command:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>dot <span class="nt">-Tpng</span> <span class="nt">-o</span> pipeline_graph.png pipeline_graph.dot</code></pre></figure>

<p>In addition to <a href="https://www.graphviz.org/">Graphviz</a> (Wikipedia <a href="https://en.wikipedia.org/wiki/Graphviz">link</a>), there are also online services for converting DOT graphs to graphical representations, like <a href="https://dreampuf.github.io/GraphvizOnline">this</a> one.</p>

<p><img src="https://raw.githubusercontent.com/mehmandarov/word-count-mini-beam/master/pipeline_graph_partial.png" alt="Partial pipeline graph for the MinimalWordCount example" class="bigger-image" /></p>
<figcaption class="caption">A part of a graphical representation for the pipeline in the MinimalWordCount example. </figcaption>

<hr />]]></content><author><name>Rustam Mehmandarov</name></author><summary type="html">How to get a graph representation of your data pipeline in Apache Beam, step by step.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://mehmandarov.com/assets/images/posts-images/golden-gate.jpg"/><category term="blog"/><category term="java"/><category term="apache beam"/><category term="data"/><category term="pipelines"/><category term="english"/></entry></feed>
