
Airflow 101

Posted on: June 22, 2025 at 07:20 AM

A bit of background before jumping into Airflow.

If you are here, you probably already know that Airflow is an orchestration tool.

But what is orchestration?
Data orchestration is the coordination and automation of data flow across various tools and systems to deliver quality data products and analytics.

What does modern data orchestration bring to the table?
  • The benefits of pipeline-as-code in Python.
  • The ability to integrate with hundreds of external systems.
  • Time- and event-based scheduling.
  • Feature-rich software with built-in observability and a centralized UI.

Now let's dive into Apache Airflow.

Keywords
  • DAG: A directed acyclic graph that represents a single data pipeline
  • Task: An individual unit of work in a DAG
  • Operator: The specific work that a Task performs

Types of operators
  • Assume you are building a DAG and want to run a Python script - Action operator

  • Assume you are building a DAG and want to wait for a file to land before executing the next task - Sensor operator

  • Assume you are building a DAG and want to move data from one system or data source to another - Transfer operator
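To make the three categories concrete, here is a plain-Python sketch of the behavior each one represents. These are illustrations only, not Airflow's actual operator APIs:

```python
import time

# Illustrative sketches only -- NOT real Airflow operators.

def action(work):
    """Action operator: actively performs work (e.g. runs a Python callable)."""
    return work()

def sensor(condition, poke_interval=0.01, timeout=1.0):
    """Sensor operator: repeatedly 'pokes' until a condition becomes true."""
    waited = 0.0
    while not condition():
        time.sleep(poke_interval)
        waited += poke_interval
        if waited >= timeout:
            raise TimeoutError("condition never became true")
    return True

def transfer(source, destination):
    """Transfer operator: moves data from one system to another."""
    destination.extend(source)
    return len(source)

result = action(lambda: 2 + 2)     # does the work -> 4
ready = sensor(lambda: True)       # condition already met -> True
dest = []
moved = transfer([1, 2, 3], dest)  # dest now holds the rows
```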


Airflow architecture
  • API server - A FastAPI server that serves the UI and handles task execution requests
  • Scheduler - Schedules tasks once their dependencies are fulfilled
  • DAG File Processor - A dedicated process for parsing DAG files
  • Metadata Database - A database where all Airflow metadata is stored
  • Executor - Determines how and where tasks are executed. It doesn't execute tasks itself but defines where they run, e.g. on Kubernetes
  • Queue - Holds tasks waiting to run, defining the order in which they are executed
  • Worker - A process that executes the tasks, as defined by the executor. It is in charge of executing the logic of a DAG's tasks
  • Triggerer - A process running an asyncio event loop to support deferrable operators
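As a toy model of how the scheduler, queue, and worker relate (this is not Airflow's actual internals, just the idea): the scheduler enqueues tasks whose upstream dependencies are all done, and a worker pops tasks off the queue and runs them:

```python
from collections import deque

# Toy model: scheduler enqueues ready tasks, worker executes them.
def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of upstream names}."""
    done, queue, order = set(), deque(), []
    while len(done) < len(tasks):
        # "Scheduler": enqueue tasks whose dependencies are fulfilled.
        for name in tasks:
            if name not in done and name not in queue and deps.get(name, set()) <= done:
                queue.append(name)
        # "Worker": pop the next queued task and execute it.
        name = queue.popleft()
        tasks[name]()
        done.add(name)
        order.append(name)
    return order

log = []
order = run_dag(
    tasks={"extract": lambda: log.append("E"),
           "transform": lambda: log.append("T"),
           "load": lambda: log.append("L")},
    deps={"transform": {"extract"}, "load": {"transform"}},
)
# order respects the dependencies: extract, then transform, then load
```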

How does Airflow run a DAG?

Core components of a DAG file
  • The import statements, where the specific operators or classes that are needed for the DAG are defined.
  • The DAG definition, where the DAG object is called and where specific properties, such as its name or how often you want to run it, are defined.
  • The DAG body, where tasks are defined with the specific operators they will run.
  • The dependencies, where the order of execution of tasks is defined using the right bitshift operator (>>) and the left bitshift operator (<<).

A DAG’s task can be either upstream or downstream of another task.
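The bitshift syntax works because task objects overload Python's `>>` and `<<` operators. A minimal sketch of the idea (not Airflow's real classes, which do more bookkeeping):

```python
# Minimal sketch of how >> and << can express task dependencies.
class Task:
    def __init__(self, name):
        self.name = name
        self.downstream = []   # tasks that run after this one

    def __rshift__(self, other):       # a >> b : b runs after a
        self.downstream.append(other)
        return other                   # returning `other` allows chaining a >> b >> c

    def __lshift__(self, other):       # a << b : a runs after b
        other.downstream.append(self)
        return other

extract, transform, load = Task("extract"), Task("transform"), Task("load")
extract >> transform >> load           # equivalent to: load << transform << extract
```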


Time to get our hands dirty and install Airflow through Docker (my favorite), with the Astro CLI sitting on top. This assumes you already have Docker installed and set up.

Run the following (Linux):

curl -sSL install.astronomer.io | sudo bash -s

And voilà! Airflow is installed.

Verify using the command below:

astro version

Make a new folder in your desired location and initialize a new Airflow project:

astro dev init

To run the Airflow instance:

astro dev start

To stop the Airflow instance:

astro dev stop

To force-stop the Airflow instance and remove its containers (including the metadata database), so the next start is from scratch:

astro dev kill

Astro creates the following files and directories upon initialization.

Folder structure
  • The dags directory: Contains Python files corresponding to data pipelines, including examples such as example_dag_advanced and example_dag_basic.
  • The include directory: Used for storing files like SQL queries, bash scripts, or Python functions needed in data pipelines to keep them clean and organized.
  • The plugins directory: Allows for the customization of the Airflow instance by adding new operators or modifying the UI.
  • The tests directory: Contains files for running tests on data pipelines, utilizing tools like the pytest library.
  • The .dockerignore file: Describes files and folders to exclude from the Docker image during the build.
  • The .env file: Used for configuring Airflow instances via environment variables.
  • The .gitignore file: Specifies files and folders to ignore when pushing the project to a git repository, useful for excluding sensitive information like credentials.
  • The airflow_settings.yaml file: Stores configurations such as connections and variables to prevent loss when recreating the local development environment.
  • The Dockerfile: Used for building the Docker image to run Airflow, with specifications for the Astro runtime Docker image and the corresponding Airflow version.
  • The packages.txt file: Lists additional operating system packages to install in the Airflow environment.
  • The README file: Provides instructions and information about the project.
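For reference, the generated Dockerfile typically contains a single FROM line pinning an Astro Runtime image; the exact version depends on your CLI release (the tag below is just an example):

```dockerfile
FROM quay.io/astronomer/astro-runtime:11.3.0
```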

What is a DAG? In Airflow, a directed acyclic graph (DAG) is a data pipeline defined in Python code. Each DAG represents a collection of tasks you want to run and is organized to show relationships between tasks in the Airflow UI. The mathematical properties of DAGs make them useful for building data pipelines:
  • Directed: If multiple tasks exist, each must have at least one defined upstream or downstream task.
  • Acyclic: Tasks cannot have a dependency on themselves. This avoids infinite loops.
  • Graph: All tasks can be visualized in a graph structure, with relationships between tasks defined by nodes and edges.
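The acyclic property can be checked with a topological sort (Kahn's algorithm); Airflow itself rejects DAGs that contain cycles. A small sketch of the check:

```python
# Kahn's algorithm: a graph is a DAG iff every task can eventually be resolved.
def is_acyclic(deps):
    """deps: {task: set of upstream tasks}. Returns True if there are no cycles."""
    resolved = set()
    while True:
        # A task is ready once all of its upstream tasks are resolved.
        ready = [t for t, up in deps.items() if t not in resolved and up <= resolved]
        if not ready:
            break
        resolved.update(ready)
    return len(resolved) == len(deps)

is_acyclic({"extract": set(), "transform": {"extract"}, "load": {"transform"}})  # True
is_acyclic({"a": {"b"}, "b": {"a"}})  # False: a and b depend on each other
```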

Best practice for defining a DAG:

from airflow.sdk import dag
from pendulum import datetime

@dag(
    schedule="@daily",                  # run once per day
    start_date=datetime(2025, 7, 1),    # first date the DAG is eligible to run
    description="A DAG that runs daily starting from July 1, 2025.",
    tags=["team_a", "source_a"],        # labels for filtering DAGs in the UI
    max_consecutive_failed_dag_runs=3,  # pause the DAG after 3 failed runs in a row
)
def my_dag():
    pass

my_dag()