
Understanding Airflow for Data Engineering

Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It was initially developed at Airbnb by Maxime Beauchemin in 2014, entered the Apache Incubator in 2016, and became a top-level Apache project in 2019. Airflow is mainly used for data processing pipelines, ETL, and machine learning workflows. It is written in Python and is highly extensible.

In this blog post, we will discuss Airflow's fundamental concepts, architecture, and features. We will also cover how to install, configure and run Airflow. Finally, we will discuss some use cases for Airflow.

Airflow Architecture

Airflow has a distributed architecture, where each component plays a specific role. The main components of the Airflow architecture are:

  • Scheduler: The scheduler monitors DAGs, triggers tasks once their dependencies are met, and hands them to the executor to run.
  • Webserver: The webserver serves Airflow's web-based user interface, where DAGs and task runs can be monitored, triggered, and managed.
  • Database: Airflow requires a metadata database to store its state (DAG runs, task instances, connections, and so on). Supported backends include SQLite (suitable for development only), MySQL, and PostgreSQL.
  • Executor: The executor determines how and where tasks are run. Airflow ships with several executors, such as LocalExecutor, CeleryExecutor, and KubernetesExecutor; a minimal configuration sketch follows this list.
  • Workers: Workers are the processes that actually execute tasks. With the CeleryExecutor, for example, tasks are distributed across a pool of worker processes, each running in its own environment.
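
As a rough illustration of how the executor and metadata database fit together, here is a minimal configuration sketch. The values below are placeholders; settings live in airflow.cfg or can be supplied as AIRFLOW__<SECTION>__<KEY> environment variables:

# airflow.cfg (excerpt), placeholder values
[core]
executor = LocalExecutor

[database]
# on Airflow versions before 2.3 this key lives under [core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

The same settings can be exported as the environment variables AIRFLOW__CORE__EXECUTOR and AIRFLOW__DATABASE__SQL_ALCHEMY_CONN.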

The figure below illustrates how these components work together:

[Figure: Airflow architecture, showing the scheduler, webserver, metadata database, executor, and workers]

Airflow Concepts

Before we dive into Airflow's features, let's first understand some of its core concepts:

  • DAGs: Directed Acyclic Graphs (DAGs) are collections of tasks together with the dependencies between them. A DAG defines the order in which its tasks run. In Airflow, each DAG is defined in a Python file.
  • Tasks: Tasks are the smallest units of work in a DAG. A task is an instance of an operator, for example a call to a Python function or a Bash command.
  • Operators: Operators are classes that define how a task is executed. Airflow has many built-in operators for running Python functions, Bash commands, SQL queries, and more.
  • Sensors: Sensors are a special kind of operator that waits for a condition to be met before downstream tasks run. For example, a file sensor waits for a specific file to appear before continuing; a short sketch after this list shows a sensor and an operator together.
  • Hooks: Hooks are used to connect Airflow to external systems. For example, there are hooks for connecting to databases, cloud services, and message queues.
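
To make these concepts concrete, here is a minimal sketch of a DAG that combines a sensor and an operator. It assumes Airflow 2.x import paths; the file path and connection ID are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def process_file():
    # placeholder for real processing logic
    print("processing /tmp/incoming/data.csv")

with DAG('concepts_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    # sensor: wait until the file appears before moving on
    wait_for_file = FileSensor(
        task_id='wait_for_file',
        filepath='/tmp/incoming/data.csv',
        fs_conn_id='fs_default',  # assumes a filesystem connection is configured
        poke_interval=60,
    )

    # operator: run a Python function once the sensor succeeds
    process = PythonOperator(task_id='process_file', python_callable=process_file)

    wait_for_file >> process

A hook would typically be used inside process_file, for example to load the file into a database.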

Airflow Features

Airflow has several features that make it a popular choice for data engineering pipelines:

  • Dynamic: Pipelines are defined as Python code, so DAGs and tasks can be generated programmatically, for example in a loop over a list of tables or parameters (see the sketch after this list).
  • Scalable: Airflow's architecture scales from a single machine to a distributed setup; with the CeleryExecutor or KubernetesExecutor, tasks can be spread across many workers.
  • Extensible: Airflow is highly extensible, allowing users to develop custom operators, sensors, hooks, and executors to interface with virtually any system.
  • Easy to Use: Airflow's web-based UI provides an easy-to-use interface for triggering and monitoring workflows, and a command-line interface is available for more advanced use.
  • Portable: Airflow is written in Python, which makes it easy to run on any platform with a Python interpreter.
  • Testable: Because DAGs are plain Python, pipelines can be unit tested with standard Python tooling, and individual tasks can be exercised with the airflow tasks test command.
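
As a quick illustration of the dynamic point, the sketch below generates one task per table in a loop at DAG-parse time. The table names and the echo command are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

TABLES = ['users', 'orders', 'payments']  # placeholder table names

with DAG('dynamic_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    for table in TABLES:
        # one export task is created per table when the DAG file is parsed
        BashOperator(
            task_id=f'export_{table}',
            bash_command=f'echo "exporting {table}"',
        )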

Installing and Running Airflow

Now that we have a good understanding of Airflow's concepts and features, let's install and run Airflow.

Prerequisites

Before we can install and run Airflow, the following prerequisites must be met:

  • Python 3.6+ installed
  • pip installed
  • One of the supported databases (SQLite, MySQL, or PostgreSQL)

Install Airflow

To install Airflow, we can use pip:

pip install apache-airflow

If we want to install optional extras (provider packages and other integrations), we can use pip's extras syntax:

pip install "apache-airflow[extra-name]"
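
For example, assuming we want the PostgreSQL and Celery integrations:

pip install "apache-airflow[postgres,celery]"

The official installation guide also recommends passing a constraints file (pinned to your Airflow and Python versions) so that pip resolves a tested set of dependencies.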

Initialize Airflow

After installing Airflow, we need to initialize the database by running the following command:

airflow db init

This command will create the necessary tables in the database.
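
On Airflow 2.x the web UI requires a login, so we also need to create an admin user. A typical example (the name and email are placeholders; the command prompts for a password if one is not supplied):

airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com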

Start Airflow

To start Airflow's webserver and scheduler, we run the following commands (typically each in its own terminal):

airflow webserver --port 8080
airflow scheduler

These commands will start the web-based user interface and the scheduler, respectively.
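
For quick local experimentation, Airflow 2.2 and later also ship a convenience command that initializes the database, creates a user, and starts the webserver and scheduler in one go:

airflow standalone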

Creating a DAG

Now that Airflow is up and running, let's create a DAG. We can create a DAG by defining a Python file in the DAGs folder. Here's an example DAG:

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path
from datetime import datetime, timedelta
 
# arguments applied to every task in this DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2021, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
 
# the DAG runs once a day
with DAG('my_dag', default_args=default_args, schedule_interval=timedelta(days=1)) as dag:
    task_1 = BashOperator(task_id='task_1', bash_command='echo "Hello World"')
    task_2 = BashOperator(task_id='task_2', bash_command='echo "Goodbye World"')
 
    # task_2 runs only after task_1 succeeds
    task_1 >> task_2

This DAG runs two Bash commands in sequence: task_1 echoes "Hello World" and then task_2 echoes "Goodbye World". Airflow picks the DAG up automatically once the file is placed in the DAGs folder (by default $AIRFLOW_HOME/dags, i.e. ~/airflow/dags).
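
Before relying on the scheduler, we can exercise a single task in isolation from the command line; the date argument is just an example logical date:

airflow tasks test my_dag task_1 2021-01-01

This runs the task locally without recording state in the metadata database, which makes it handy for debugging.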

Running a DAG

To run a DAG, we can trigger it manually from the Airflow UI or wait for its next scheduled run. Note that newly added DAGs are paused by default, so they have to be unpaused before the scheduler will run them. Once a run is triggered, its progress can be monitored from the UI.
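
The same can be done from the command line; for example, using the DAG defined above:

airflow dags unpause my_dag
airflow dags trigger my_dag
airflow dags list-runs -d my_dag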

Conclusion

In this blog post, we discussed Airflow's fundamental concepts, its architecture, and its core features. We also went through the steps to install, configure, and run Airflow. Finally, we created a simple DAG and ran it. With its dynamic, code-based pipeline definition, scalability, and extensibility, Airflow is one of the most popular open-source tools for data engineering pipelines.

Category: DataOps