CI/CD DAGs Pipeline Deployment On Kubernetes

Lidor Ettinger · NI Tech Blog · Feb 2, 2020

In the previous blog we talked about how to deploy Airflow DAGs and how to handle both task processing and DAGs at scale.

Today, I wish to share with you two additional important lessons we have learned along the way:

  • DAG templating and why you should generate DAGs
  • How to build a trustworthy Airflow production environment

Natural Intelligence is growing rapidly. We already run 4 different Airflow environments with over 100 DAGs that together process over 100 terabytes of data, and 4 different teams support them, each with its own requirements and needs. As complexity grows, stability becomes the top priority.

Why you should generate DAGs

Our typical DAG flow consists of orchestrating ETL tasks on EMR.

Our DAGs follow a standard structure, and the variation comes from environment properties such as the version of the JAR to run on EMR, the EMR cluster name, the number of threads, and so on.
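
To make that variation concrete, here is a minimal sketch of the kind of EMR step task these DAGs contain; the DAG id, cluster id, JAR path and arguments are illustrative, not our actual values:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator

with DAG(dag_id="example_etl", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_etl_step",
        job_flow_id="j-XXXXXXXXXXXX",   # EMR cluster id, differs per environment
        aws_conn_id="aws_default",
        steps=[{
            "Name": "run-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # The JAR version and thread count are exactly the kind of
                # properties that vary between environments.
                "Args": ["spark-submit", "s3://bucket/jars/etl-1.2.3.jar", "--threads", "8"],
            },
        }],
    )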

Airflow provides the ability to manage this variation through its Variables, and at first it made sense to use this mechanism to configure our common DAGs externally.

But this approach comes with several limitations:

  • Context — all the variables are global and the changes impact all DAGs
  • Versioning — configuration updates through variables aren’t recorded or versioned
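
For context, this is roughly how a DAG reads such a variable (the variable name here is made up). The value lives in Airflow's metadata store, so every DAG that reads it sees the same value, and changes to it leave no history:

from airflow.models import Variable

# Fetched from the Airflow metadata DB when the DAG file is parsed.
# It is global: changing it in the UI immediately affects every DAG
# that reads it, and previous values are not recorded anywhere.
emr_jar_version = Variable.get("emr_jar_version", default_var="1.0.0")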

That’s why we decided to take a different approach: instead of relying on the variable store to manage flow-specific configuration, we set the values at build time and generate versioned DAGs from a config file.

There are several ways to generate DAGs from templates and configuration:

  1. A home-grown generator based on a template and a config file.
  2. A Maven-based generator with a Mustache template and a config file.

These two approaches are quite different. A home-grown generator can involve more logic and allows the creation of multiple, complex DAG files. However, it requires constant maintenance and knowledge preservation. That’s why we chose to generate our DAGs with a template language like Mustache and leverage the fact that it is widely used and well documented, which allowed us to get results quickly.

Why do we use Mustache templates?

We chose Mustache because our Java builds are automated by Maven. Mustache is a popular template language in the Java ecosystem, and there is a Maven plugin one can run to generate code during the build.

We have a generic DAG template in a common repo:

# !! AUTO-GENERATED !!
# application_id: {{{ ni_application_id }}}
# version: {{{ version }}}
import airflow
from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator

You can see the example here

Each code repository provides a config file that defines how a DAG is generated from the template:
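
A hypothetical sketch of what such a config and the rendering step can look like; here we use the chevron Mustache library in Python purely for illustration, whereas our build uses the Maven plugin. The property names match the template above; the values are made up:

import chevron  # any Mustache implementation works; our build uses the Maven plugin

# Hypothetical per-repository configuration: the values that vary per flow.
config = {
    "ni_application_id": "billing-etl",
    "version": "1.2.3",
}

with open("dag_template.py.mustache") as template_file:
    rendered = chevron.render(template_file.read(), config)

with open("billing_etl_dag.py", "w") as dag_file:
    dag_file.write(rendered)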

The lesson: generating DAGs gives us a mechanism for keeping track of version changes.

How to build a trustworthy production environment

As we build and verify our code it passes through three distinct maturity stages:

  • Testing
  • Staging
  • Production

Testing is the first quality gateway on the way to production. It is more or less well known how to test software services, but data pipelines are another story. There are no established best practices for testing Airflow DAGs, so we needed to experiment and find an approach that works for us.

All CI/CD at Natural Intelligence is automated by Jenkins, so we use it for Airflow as well.

To illustrate a typical Airflow CI/CD workflow, consider the following scenario:

Checkout: Clone the project and download dependencies such as dumped tables from our production DB.

Unit Tests: Test individual DAG tasks, variables, connections and custom operators.
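
As an illustration, a minimal sketch of a DAG integrity test for this stage; the DagBag-based checks are common Airflow practice, and the DAG id asserted on is a placeholder:

import unittest
from airflow.models import DagBag


class TestDagIntegrity(unittest.TestCase):
    def setUp(self):
        # Parse every DAG file in the dags folder, exactly as the scheduler would.
        self.dagbag = DagBag()

    def test_no_import_errors(self):
        # A DAG that fails to import would silently disappear in production.
        self.assertEqual(len(self.dagbag.import_errors), 0,
                         "DAG import failures: {}".format(self.dagbag.import_errors))

    def test_expected_dag_is_loaded(self):
        # Placeholder DAG id; in practice we assert on the generated DAG ids.
        dag = self.dagbag.get_dag("billing_etl")
        self.assertIsNotNone(dag)
        self.assertGreater(len(dag.tasks), 0)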

Test Env Setup: Launch the Airflow Helm chart, including a DB that will be used for the integration tests, and load the dumped tables.

Integration Tests: Test the DAGs in Airflow against a clone of the production DB. This verifies each DAG at an early stage, before it runs in production.
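
A sketch of what exercising a single task against the cloned database can look like with the Airflow 1.10 API; the DAG id, task id and execution date are placeholders:

from datetime import datetime
from airflow.models import DagBag, TaskInstance

dagbag = DagBag()
dag = dagbag.get_dag("billing_etl")        # placeholder DAG id
task = dag.get_task("add_etl_step")        # placeholder task id

# Run one task instance against the metadata DB loaded with the production
# dump, ignoring scheduling dependencies so it can run in isolation.
ti = TaskInstance(task=task, execution_date=datetime(2020, 1, 1))
ti.run(ignore_all_deps=True, test_mode=True)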

Artifacts: Pack, version, and store the DAGs in S3.
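
A minimal sketch of this step, assuming the generated DAGs sit in a local directory; the bucket name and key layout are hypothetical:

import shutil
import boto3

VERSION = "1.2.3"   # supplied by the build, e.g. from the git tag

# Pack the generated DAG files into a single versioned archive.
archive = shutil.make_archive("dags-{}".format(VERSION), "zip", root_dir="generated_dags")

# Store the artifact so any environment can pull an exact, versioned DAG set.
s3 = boto3.client("s3")
s3.upload_file(archive, "ni-airflow-artifacts", "dags/{}/dags.zip".format(VERSION))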

Here is how we define our pipeline:

This approach aims to provide a clear path when writing a pipeline. It relies on automation to streamline manual processes and to enforce consistency and repeatability in the software delivery pipeline.

The lesson: CI/CD for DAGs is the foundation of a stable Airflow.

To summarise

As we have covered in this post, there are two key lessons we have learned on the journey to scale DAG development:

  • It is important to create a versioning mechanism to properly manage your generated DAGs.
  • DAGs are software and, as such, they require the same software development life cycle, including testing, artifact management and deployment pipelines.
