Integrating dbt with Airflow: A Comprehensive Guide
Chapter 1: The Importance of Data Integration
In today's rapidly evolving business environment, companies depend on data-driven decisions to thrive. To achieve this, a dependable data platform and trustworthy data are essential. This involves not only effective data collection and storage but also ensuring the information's accuracy and reliability.
Managing a large number of data models is challenging: data teams often struggle to build, test, and maintain their data assets efficiently. At Plum, we use the data build tool (dbt), a command-line tool that helps teams manage their data models effectively.
Being a command-line tool, however, dbt does not schedule or orchestrate itself, which raises the question: how can teams run dbt effectively in production? While dbt Cloud might appear to be the obvious answer, not every organization can justify the expense. Consequently, an alternative approach is needed to orchestrate dbt model executions while preserving their interdependencies.
Why We Developed Our Own Integration
Beyond the financial implications of dbt Cloud, our team sought support for specific features that were not fully available at the time. Here are the essential requirements we identified for our internal use:
- dbt Project Scheduling: We need to schedule dbt projects much like cron jobs, at specified time intervals.
- Task Granularity: Each dbt entity (model, test, seed, or snapshot) should be treated as an individual task that can be triggered on its own.
- Maintain dbt Dependencies: It's crucial to preserve the dependencies between dbt entities. For example, if Model A has tests that serve as an upstream dependency for Model B, the execution sequence must be respected.
- Example: [dbt run A] -> [dbt test A] -> [dbt run B]
- Triggering Other Workflows: Certain workflows should kick off after specific dbt entities complete, without waiting for the entire dbt project to finish.
- Example: [dbt run A] -> [execute a non-dbt related workflow] -> [dbt run B]
- Alerting: Notifications via Slack are necessary when issues arise.
- Running Containerized dbt Projects: Each dbt project should have its own Docker image to accommodate different dbt versions as needed.
- Triggering a Subset of Models: dbt entities can be filtered using tags, facilitating the execution of specific model groups.
- Creating Multiple Schedules: Running the same projects on different schedules, such as hourly, daily, or weekly, is essential.
We explored various alternatives, including the Cosmos package from Astronomer, but none met our specific needs. We therefore decided to build a custom Python package that integrates dbt with Airflow, enabling our data team to manage and streamline its data models effectively.
Chapter 2: Choosing Airflow for Orchestration
In crafting our data pipeline orchestration strategy, we prioritized the capability to schedule and manage jobs at defined intervals—hourly, daily, or weekly. Given our team's familiarity with Airflow and the existing infrastructure on Google Cloud's Cloud Composer service, Airflow was a logical choice.
Both Airflow and dbt are built around Directed Acyclic Graphs (DAGs), allowing us to convert dbt DAGs into Airflow DAGs for orchestration. DAGs represent tasks or data models in a directional graph, ensuring that dependencies are respected.
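To illustrate the mapping with the requirement example from earlier ([dbt run A] -> [dbt test A] -> [dbt run B]), the same chain can be expressed as an Airflow DAG in which every dbt entity is its own task. The sketch below uses the BashOperator and placeholder model names for illustration; it is not the final implementation of our package:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal sketch: model A, its tests, and the downstream model B are separate
# Airflow tasks, so a failure in `dbt test` for A stops B from running without
# forcing a re-run of the whole project. Model names are placeholders.
with DAG(
    dag_id="dbt_project_hourly",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    run_a = BashOperator(task_id="run_model_a", bash_command="dbt run --select model_a")
    test_a = BashOperator(task_id="test_model_a", bash_command="dbt test --select model_a")
    run_b = BashOperator(task_id="run_model_b", bash_command="dbt run --select model_b")

    run_a >> test_a >> run_b  # [dbt run A] -> [dbt test A] -> [dbt run B]
```

A non-dbt task can be slotted anywhere in this chain with the same `>>` syntax, which is what makes the "triggering other workflows" requirement straightforward once every dbt entity is a task of its own.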
From Concept to Execution: Using Airflow with dbt
Our initial task was straightforward: to run and test dbt projects using Airflow without the need to create separate tasks for each model, test, snapshot, or seed. We established a CI/CD pipeline with GitHub Actions to copy the relevant dbt project to the Cloud Composer bucket and create a DAG within Airflow.
This basic DAG comprised two tasks: one for running models and another for testing them. However, this method had significant limitations, including the need to rerun all models if one failed, leading to increased execution time and costs.
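For reference, that basic DAG looked roughly like the sketch below; the DAG name and project path are placeholders (on Cloud Composer, files synced to the environment bucket are mounted under /home/airflow/gcs/):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two coarse-grained tasks: run every model, then test everything.
# If a single model fails, the whole `dbt run` has to be repeated.
DBT_PROJECT_DIR = "/home/airflow/gcs/data/my_dbt_project"  # placeholder path

with DAG(
    dag_id="dbt_project_poc",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR}",
    )

    dbt_run >> dbt_test
```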
Video: An open source project that integrates dbt and Airflow - YouTube
Our initial attempt served as a Proof-of-Concept, confirming that we could develop a solution that met the team's expectations.
Chapter 3: Transforming dbt Projects into Airflow DAGs
To convert dbt DAGs into Airflow DAGs, we began by extracting dependencies from dbt entities. This involved parsing the relationships between data models and mapping them to Airflow tasks within DAGs. By representing dbt dependencies as Airflow tasks, we could orchestrate our data workflows seamlessly.
The first iteration involved developing a parser for the manifest.json file that dbt generates, which contains metadata about the project's resources and their dependencies. This step was crucial in shaping our dbt-airflow integration.
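A simplified sketch of that parsing step is shown below: it reads dbt's manifest (written to the project's target/ directory after a compile or run) and builds an adjacency list that can later be mapped onto Airflow task dependencies. The file path and resource filtering are illustrative rather than the package's exact implementation:

```python
import json
from pathlib import Path

# Load the manifest that dbt writes after compiling/running the project.
manifest = json.loads(Path("target/manifest.json").read_text())  # placeholder path

# Build an adjacency list: every model/test/seed/snapshot node mapped to the
# dbt resources it depends on. These edges later become Airflow task
# dependencies (upstream >> downstream).
dependencies: dict[str, list[str]] = {}
for node_id, node in manifest["nodes"].items():
    if node["resource_type"] in ("model", "test", "seed", "snapshot"):
        dependencies[node_id] = [
            parent
            for parent in node.get("depends_on", {}).get("nodes", [])
            if parent.startswith(("model.", "seed.", "snapshot."))
        ]

# Example of a resulting edge:
# {"test.my_project.not_null_model_a_id": ["model.my_project.model_a"], ...}
```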
Video: Airflow with DBT tutorial - The best way! - YouTube
In this iteration, we focused on ensuring that when a task fails, its downstream tasks are held back until the underlying issue is resolved, avoiding the need to rerun everything.
Chapter 4: Enhancing Functionality and Flexibility
As we moved to subsequent iterations, we introduced features such as the ability to filter tags and create multiple schedules for executing models. Additionally, we containerized our dbt projects to run them on Kubernetes, allowing us to manage various versions of dbt easily.
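As an illustrative sketch of what a containerized, tag-filtered execution can look like (not the package's actual internals), Airflow's KubernetesPodOperator can run a project-specific dbt image and restrict the run to tagged models; the image name, registry, and import path below are placeholders that depend on your setup and provider version:

```python
from datetime import datetime

from airflow import DAG
# Import path depends on the apache-airflow-providers-cncf-kubernetes version;
# older releases expose it under operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="dbt_project_daily_marketing",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each dbt project ships its own image, so projects can pin different
    # dbt versions; the tag selector limits the run to a subset of models.
    run_tagged_models = KubernetesPodOperator(
        task_id="run_marketing_models",
        name="run-marketing-models",
        image="europe-docker.pkg.dev/my-project/dbt/my_dbt_project:latest",  # placeholder image
        cmds=["dbt"],
        arguments=["run", "--select", "tag:marketing"],
        get_logs=True,
    )
```

Running the same project on several schedules then boils down to instantiating additional DAGs with different schedule values (for example @hourly or @weekly) and, where needed, different tag selections.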
This work culminated in a project that met our initial requirements and proved both stable and scalable.
Getting Started with dbt-airflow
If you're looking for efficient ways to deploy and schedule your dbt projects, dbt-airflow can simplify the process. The Plum data team has successfully used it in production for over a year, demonstrating its stability and scalability. This platform-agnostic package can work with any destination supported by dbt.
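To give a flavour of how a DAG built with dbt-airflow can look, here is a hedged sketch. Assuming the PyPI distribution shares the project name, installation is a matter of pip installing dbt-airflow alongside Airflow; the import paths, class names, and arguments below are written from memory of the project's README and should be treated as assumptions to verify against the official documentation:

```python
from datetime import datetime
from pathlib import Path

from airflow import DAG

# NOTE: the dbt_airflow imports, class names, and arguments below are
# assumptions based on the project's README; check the official docs
# for the exact, up-to-date API.
from dbt_airflow.core.config import DbtAirflowConfig, DbtProfileConfig, DbtProjectConfig
from dbt_airflow.core.task_group import DbtTaskGroup
from dbt_airflow.operators.execution import ExecutionOperator

with DAG(
    dag_id="dbt_project_daily",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Parses the dbt manifest and expands the project into one Airflow task
    # per model, test, seed, and snapshot, preserving the dbt dependency graph.
    dbt_tasks = DbtTaskGroup(
        group_id="my_dbt_project",
        dbt_project_config=DbtProjectConfig(
            project_path=Path("/opt/airflow/my_dbt_project"),  # placeholder paths
            manifest_path=Path("/opt/airflow/my_dbt_project/target/manifest.json"),
        ),
        dbt_profile_config=DbtProfileConfig(
            profiles_path=Path("/opt/airflow/my_dbt_project/profiles"),
            target="prod",
        ),
        dbt_airflow_config=DbtAirflowConfig(
            execution_operator=ExecutionOperator.BASH,
        ),
    )
```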
For more information, visit our GitHub repository or check out the official documentation. We invite you to contribute to the project if you identify areas for improvement.
Happy Coding,
Plum Data Engineering Team ❤