Intriguing SQL Riddles to Sharpen Your Skills
Written on
Chapter 1: Introduction to SQL Riddles
SQL, while appearing straightforward, can often lead to unexpected outcomes. Its various dialects allow users to interact with databases using syntax that resembles English. However, things are not always as they seem. Occasionally, I stumble upon queries that yield surprising results, revealing subtle aspects of the language. In this article, I've gathered three recent puzzles, presenting them as riddles for added intrigue. Challenge yourself to find the solution before checking the end of each section!
Additionally, I've included brief common table expressions (CTEs) to create the necessary tables for each example, so you won’t have to query live production tables. For a deeper understanding of SQL, I highly recommend setting up your own database and tables for experimentation. You can read more about that in this post.
Keep in mind that all examples are in Postgres, and results may differ in other SQL dialects. The data and topics in these queries are merely illustrative.
Section 1.1: Riddle 1 - Timestamp Precision
Consider a table named purchases that holds purchase IDs, amounts, and timestamps of transactions. Here’s a glimpse of how the data might appear:
In CTE format, this would look like:
WITH purchases(id, amount, dt) AS (
VALUES
(1::bigint, 0.99::float, '2023-02-15 00:00:00 GMT'::timestamp),
(2, 9.99, '2023-02-15 07:15:00 GMT'),
(3, 15.99, '2023-02-15 23:01:15 GMT'),
(4, 7.99, '2023-02-16 14:22:09 GMT')
)
Now, let's determine the total of purchases made on February 15. We can execute a query as follows:
SELECT
SUM(amount) AS sum
FROM purchases
WHERE
dt = '2023-02-15'
Surprisingly, we receive the following output. What occurred?
There were three transactions on February 15: IDs 1, 2, and 3, which should sum to $26.97. Instead, only the initial purchase was counted.
Hint: If you adjust the filter to 2023-02-16, no rows will be returned.
Answer: The dt column is a timestamp that includes both date and time. When we only specify the date in our WHERE clause, Postgres reformats the date string to 2023-02-15 00:00:00, matching only the first transaction. To select all transactions from February 15, we should cast the timestamp to a date:
SELECT
SUM(amount) AS sum
FROM purchases
WHERE
DATE(dt) = '2023-02-15'
With this adjustment, we now obtain the expected result.
Video Description: In this tutorial, we dive into a SQL take-home challenge, exploring essential concepts and practical solutions.
Section 1.2: Riddle 2 - Filters: Dependent vs. Independent
Next, we explore a table named users. Our goal is to filter out any rows meeting one of three conditions. For instance, we want to return only tenured and active users—those who have logged in during the last 28 days, have made prior posts, and are not new accounts.
As shown in the table below, our query should ideally return only user 8, who has False values for no_login_l28, has_never_posted, and is_new_account.
WITH users(id, no_login_l28, has_never_posted, is_new_account) AS (
VALUES
(1, True, True, True),
(2, True, True, False),
(3, True, False, True),
(4, True, False, False),
(5, False, True, True),
(6, False, True, False),
(7, False, False, True),
(8, False, False, False)
)
How should we structure our WHERE clause?
Options:
- Multiple AND NOT
- Multiple OR NOT
- NOT + grouped OR
- NOT + grouped AND
Hint: Consider how conditions in a filter are evaluated.
Answer: Option 1 is the correct choice. Although it initially confused me, the independent evaluation of conditions means that this filter correctly removes users with False values across all three columns.
Video Description: Test your logic with a series of challenging riddles that will put your critical thinking skills to the test!
Section 1.3: Riddle 3 - Left Joins Mimicking Inner Joins
In our final riddle, we examine two tables: customers and reviews. The customers table includes customer IDs along with their total spending.
WITH customers(id, total_spend) AS (
VALUES
(100, 1583.49),
(200, 8739.03),
(300, 431.00),
(400, 1.00),
(500, 22.27)
),
reviews(id, customer_id, reported_as_spam) AS (
VALUES
(1, 100, False),
(2, 100, False),
(3, 400, True),
(4, 400, True),
(5, 500, False)
)
Let’s say we want to investigate the correlation between a customer’s total spending and the number of non-spam reviews they provide. A left join between reviews and customers would typically look like this:
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
WHERE
NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1
But wait—why have users 200, 300, and 400 vanished?
Hint: If we create a CTE for reviews with spam reviews filtered out and join on that CTE, will we see the same result?
Answer: The issue arises because WHERE clauses are processed after joins. The left join should maintain null values for users without reviews, but the subsequent WHERE filter excludes them. To correct this, we can add the filter within the LEFT JOIN:
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
AND NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1
With this adjustment, we achieve the expected results.
Chapter 2: Conclusion
In this article, we explored three SQL challenges that can produce unexpected results: timestamp specificity, the difference between dependent and independent filters, and the nuances of left joins acting like inner joins. While I provided simple examples for clarity, these types of SQL intricacies can often arise in larger, more complex queries.
Identifying these issues can be daunting, particularly in intricate queries. When faced with confusion, I recommend breaking down the query into its individual components to verify each result. If you're still uncertain, creating simple CTEs with test data is a great way to validate your expectations.
Happy querying!
Matt