Containerizing the Jaffle Shop dbt Project with Docker
This guide is aimed at readers who are new to the data build tool (dbt) and have come across the Jaffle Shop project, a widely used example for testing and experimentation.
The Jaffle Shop is a fictional e-commerce store. Its dbt project transforms raw data from the application's database into a customers and orders model that is ready for analysis.
A Note on Jaffle Shop's Configuration Needs
One significant challenge with the Jaffle Shop project is its expectation that users, particularly those unfamiliar with dbt, will set up and manage a local database for the dbt models to materialize.
In this guide, I will walk you through how to establish a containerized version of the project using Docker. This approach will enable us to deploy a PostgreSQL instance and configure the dbt project to interact with that database. I will also provide a link to a GitHub repository that simplifies getting all necessary services operational quickly.
Creating the Dockerfile and docker-compose.yml
Let's start by outlining the services to run in Docker. First, we'll create a docker-compose.yml file where we will define two services: the PostgreSQL database and a custom service that we will define next with a Dockerfile.
# docker-compose.yml
version: "3.9"

services:
  postgres:
    container_name: postgres
    image: postgres:15.2-alpine
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  dbt:
    container_name: dbt
    build: .
    image: dbt-jaffle-shop
    volumes:
      - ./:/usr/src/dbt
    depends_on:
      postgres:
        condition: service_healthy
This file uses Compose file format version 3.9 and defines two services, postgres and dbt, each with its own configuration.
The postgres service uses the official PostgreSQL Docker image, version 15.2-alpine. It names the container postgres, publishes port 5432 (PostgreSQL's default) to the host, and sets environment variables for the PostgreSQL user and password. The healthcheck section runs pg_isready to verify that the database is accepting connections, with the specified interval, timeout, and retry count.
The dbt service is built from a Dockerfile in the current directory. It mounts the current directory as a volume inside the container and declares a dependency on the postgres service, which must report healthy before dbt starts.
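The depends_on/healthcheck pairing boils down to a poll-and-retry loop: Compose runs pg_isready every interval seconds and gives up after the configured number of failed retries. Here is a minimal Python sketch of those semantics; the function names and the fake probe are mine, not part of Compose or dbt:

```python
import time


def wait_until_healthy(check, interval=5.0, retries=5):
    """Poll `check` up to `retries` times, sleeping `interval`
    seconds between attempts, mirroring the Compose healthcheck."""
    for _ in range(retries):
        if check():
            return True
        time.sleep(interval)
    return False


# Simulate a database that only becomes ready on the third probe.
attempts = {"n": 0}

def fake_pg_isready():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(wait_until_healthy(fake_pg_isready, interval=0))  # True, on the third attempt
```

Only once this loop succeeds does Compose consider postgres healthy and start the dbt container.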
To containerize the Jaffle Shop project, we need to create a Dockerfile that installs the necessary dependencies for both Python and dbt and ensures that the container remains active once the environment is established.
# Dockerfile
FROM --platform=linux/amd64 python:3.10-slim-buster

RUN apt-get update \
    && apt-get install -y --no-install-recommends

WORKDIR /usr/src/dbt

# Install the dbt Postgres adapter and dbt-core
RUN pip install --upgrade pip
RUN pip install dbt-postgres==1.2.0
RUN pip install pytz

# Install dbt dependencies, build the models, then keep the container alive
CMD dbt deps && dbt build --profiles-dir ./profiles && sleep infinity
Configuring Postgres with dbt
To work with dbt, we’ll utilize the dbt Command Line Interface (CLI). A directory containing a dbt_project.yml file is recognized as a dbt project by the CLI.
We will set up this file to include basic configurations, such as the project name and the profile we will create in the next step. Additionally, we will define the paths for various dbt entities and provide information on their materialization.
# dbt_project.yml
name: 'jaffle_shop'
config-version: 2
version: '0.1'
profile: 'jaffle_shop'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"
  - "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
  jaffle_shop:
    materialized: table
    staging:
      materialized: view
Next, the profiles.yml file will house the dbt profiles. A profile contains targets that outline the connection details and credentials for the database or data warehouse.
# profiles.yml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: public
      threads: 1
This file defines a profile named jaffle_shop that details connection attributes for a PostgreSQL database operating within a Docker container called postgres.
Profile structure:
- jaffle_shop: The profile name, chosen by the user.
- target: dev: The default target for the profile, named dev.
- outputs: Lists the output configurations for the profile; the default output is dev.
- dev: Specifies connection details for the dev target, which uses a PostgreSQL database.
- type: postgres: The output type, a PostgreSQL database.
- host: postgres: The address of the PostgreSQL server (here, the name of the postgres Docker service).
- user: postgres: The username for connecting to the database.
- password: postgres: The password for authenticating to the database.
- port: 5432: The port number for PostgreSQL connections.
- dbname: postgres: The name of the PostgreSQL database to connect to.
- schema: public: The schema in which dbt will build its models.
- threads: 1: The number of threads dbt uses when executing tasks.
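Taken together, these fields are just the pieces of a standard libpq-style connection URL. As a quick sanity check, here is a small Python sketch that assembles one from the dev target's values; the dict and the to_connection_url helper are mine, not part of dbt:

```python
# The `dev` target from profiles.yml, written out as a plain dict.
dev_target = {
    "type": "postgres",
    "host": "postgres",
    "user": "postgres",
    "password": "postgres",
    "port": 5432,
    "dbname": "postgres",
    "schema": "public",
    "threads": 1,
}


def to_connection_url(target):
    """Build a libpq-style connection URL from a dbt target dict."""
    return (
        f"postgresql://{target['user']}:{target['password']}"
        f"@{target['host']}:{target['port']}/{target['dbname']}"
    )


print(to_connection_url(dev_target))
# postgresql://postgres:postgres@postgres:5432/postgres
```

Note that the host is postgres, the Docker service name, rather than localhost: containers on the same Compose network reach each other by service name.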
Jaffle Shop dbt Models and Seeds
The source data for the Jaffle Shop project comprises CSV files for customers, payments, and orders. In dbt, we can import this data into our database through seeds, which we then utilize to create dbt models.
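Conceptually, `dbt seed` just reads each CSV and loads its rows into a database table. The sketch below shows what that amounts to, using Python's csv module on a few illustrative rows in the shape of a raw_customers seed (the values are made up for the example):

```python
import csv
import io

# A few illustrative rows in the shape of seeds/raw_customers.csv
# (the values here are invented for the example).
raw_customers_csv = """\
id,first_name,last_name
1,Michael,P.
2,Shawn,M.
3,Kathleen,P.
"""

# `dbt seed` effectively parses each CSV and inserts its rows into
# a table; csv.DictReader yields the same rows as dictionaries.
rows = list(csv.DictReader(io.StringIO(raw_customers_csv)))

print(len(rows))               # 3
print(rows[0]["first_name"])   # Michael
```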
Here’s an example of a model that computes metrics for our customers:
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

payments as (
    select * from {{ ref('stg_payments') }}
),

customer_orders as (
    select
        customer_id,
        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders
    group by customer_id
),

customer_payments as (
    select
        orders.customer_id,
        sum(amount) as total_amount
    from payments
    left join orders on payments.order_id = orders.order_id
    group by orders.customer_id
),

final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value
    from customers
    left join customer_orders on customers.customer_id = customer_orders.customer_id
    left join customer_payments on customers.customer_id = customer_payments.customer_id
)

select * from final
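To make the joins and aggregations in this model concrete, here is a Python re-implementation of the same logic over a few in-memory rows. The sample data is invented for illustration, and build_customer_metrics is my own name for the sketch, not anything dbt generates:

```python
from collections import defaultdict
from datetime import date

# Tiny, made-up stand-ins for the staging models.
customers = [{"customer_id": 1, "first_name": "Michael", "last_name": "P."}]
orders = [
    {"order_id": 10, "customer_id": 1, "order_date": date(2018, 1, 1)},
    {"order_id": 11, "customer_id": 1, "order_date": date(2018, 2, 10)},
]
payments = [
    {"order_id": 10, "amount": 20},
    {"order_id": 11, "amount": 30},
]


def build_customer_metrics(customers, orders, payments):
    """Replicate the model's CTEs: aggregate orders and payments
    per customer, then left-join the results onto customers."""
    # customer_orders CTE: group orders by customer
    orders_by_customer = defaultdict(list)
    for o in orders:
        orders_by_customer[o["customer_id"]].append(o)

    # customer_payments CTE: payments joined to orders on order_id
    order_to_customer = {o["order_id"]: o["customer_id"] for o in orders}
    totals = defaultdict(float)
    for p in payments:
        totals[order_to_customer[p["order_id"]]] += p["amount"]

    # final CTE: left joins, so missing aggregates become None
    result = []
    for c in customers:
        cust_orders = orders_by_customer.get(c["customer_id"], [])
        dates = [o["order_date"] for o in cust_orders]
        result.append({
            "customer_id": c["customer_id"],
            "first_name": c["first_name"],
            "last_name": c["last_name"],
            "first_order": min(dates) if dates else None,
            "most_recent_order": max(dates) if dates else None,
            "number_of_orders": len(cust_orders),
            "customer_lifetime_value": totals.get(c["customer_id"]),
        })
    return result


metrics = build_customer_metrics(customers, orders, payments)
print(metrics[0]["number_of_orders"])         # 2
print(metrics[0]["customer_lifetime_value"])  # 50.0
```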
Running the Services via Docker
Now we will build and launch our Docker services. To do this, simply execute the following commands:
$ docker-compose build
$ docker-compose up
The above commands will initiate a PostgreSQL instance and subsequently build the dbt resources for the Jaffle Shop as specified in the repository. These containers will remain active, allowing you to:
- Query the PostgreSQL database and the tables created from dbt models
- Execute further dbt commands via the dbt CLI
Executing dbt Commands via CLI
The dbt container has already built the specified models. However, you can still access the container to execute dbt commands for new or modified models. Begin by accessing the container.
Use the command below to list all active containers:
$ docker ps
Copy the ID of the dbt container and use it in the following command:
$ docker exec -it <container-id> /bin/bash
This command provides access to the container's bash shell, allowing you to run dbt commands.
# Install dbt dependencies (not required if packages.yml is empty)
dbt deps
# Load seeds
dbt seed --profiles-dir profiles
# Build data models
dbt run --profiles-dir profiles
# Build snapshots
dbt snapshot --profiles-dir profiles
# Run tests
dbt test --profiles-dir profiles
Keep in mind that since we’ve mounted the local directory to the running container, any changes made locally will be reflected within the container immediately. This means you can create new models or edit existing ones, then access the running container to build models, run tests, and so on.
Querying the dbt Models on the PostgreSQL Database
You can also query the PostgreSQL database and the dbt models or snapshots created within it. To do this, access the running PostgreSQL container.
# Get the container ID for the postgres service
$ docker ps
# Then, copy the container ID and enter the running container
$ docker exec -it <container-id> /bin/bash
Next, use psql, a terminal interface for PostgreSQL, to query the database:
$ psql -U postgres
The \dt and \dv meta-commands list tables and views respectively:
postgres=# \dt
             List of relations
 Schema |     Name      | Type  |  Owner
--------+---------------+-------+----------
 public | customers     | table | postgres
 public | orders        | table | postgres
 public | raw_customers | table | postgres
 public | raw_orders    | table | postgres
 public | raw_payments  | table | postgres
(5 rows)
postgres=# \dv
             List of relations
 Schema |     Name      | Type |  Owner
--------+---------------+------+----------
 public | stg_customers | view | postgres
 public | stg_orders    | view | postgres
 public | stg_payments  | view | postgres
(3 rows)
You can also query dbt models with a SELECT statement:
SELECT * FROM <table_or_view_name>;
Getting the Full Code
I’ve created a GitHub repository that you can clone to your local machine to swiftly run the containerized version of the Jaffle Shop dbt project. You can find the project and the code from this tutorial through the following link.
GitHub - gmyrianthous/jaffle_shop: A containerized version of the Jaffle Shop dbt project
Final Thoughts
The data build tool (dbt) is increasingly becoming an essential technology in contemporary data stacks. If you're just starting to learn dbt, I strongly encourage you to experiment with the Jaffle Shop project. It's a self-contained initiative developed by dbt Labs for testing and experimentation.
dbt is a tool frequently employed by data analysts and analytics engineers (alongside data engineers), necessitating a connection to a database or data warehouse. However, many analysts might feel uneasy about configuring and initializing a local database.
In this guide, we showcased how to get started with dbt and operate all the necessary services for materializing dbt models on a local PostgreSQL database. I hope this tutorial assists you in getting your dbt project and database operational as smoothly as possible. Should you encounter any challenges while running the project, feel free to reach out in the comments, and I’ll do my utmost to help you troubleshoot your code and configuration.