Subqueries & CTEs
Subqueries and CTEs are interview staples because they help you break complex problems into clean, testable steps, especially when a question involves multiple stages of filtering, aggregation, and joining.
This lesson teaches when to use a CTE vs. an inline subquery, how to structure multi-CTE solutions, and common patterns like “filter on an aggregate,” “join to an aggregated table,” and “build an intermediate cohort.” You’ll also cover correlated subqueries and when they’re elegant vs. when they’re slow or confusing.
Example prompts include:
“Find users whose spend is above the overall average.”
and
“Compute retention by signup cohort.”
All examples below use the employees users, orders, and returns tables in "Example data" section (at the bottom). Refer back to that section if you want to confirm keys or relationships.
When to choose subquery vs CTE
- Subquery / EXISTS / scalar subquery → best for one-off presence checks and per-entity scalar picks (e.g., latest id). Keeps the query tight and lets the optimizer push filters down.
- CTE (
WITH …) → best for multi-step logic, pre-aggregation, window functions, or when you’ll reuse the result. Improves readability and testing.
Heuristic:
- Simple filter or single lookup → subquery.
- Multi-step (pre-agg → rank → select) or reused → CTE.
Existence checks
EXISTS and NOT EXISTS let you filter rows based on whether a matching row exists in another table. They’re a clean fit for interview questions that read like “include X only if condition A is true, and condition B is never true,” and they’re robust to duplicates on the right side.
Unlike the earlier “has / has-not” patterns, this is a compound existence filter (multiple conditions at once) and EXISTS keeps the output at the left table’s grain without needing DISTINCT.
Interview prompt: Find users who have a paid order and no returns.
SQLSELECT u.user_id, u.country
FROM users u
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = u.user_id AND o.status = 'paid'
)
AND NOT EXISTS (
SELECT 1
FROM orders o
JOIN returns r ON r.order_id = o.order_id
WHERE o.user_id = u.user_id
);
Output
Explanation:
EXISTSshort-circuits on first match and ignores duplicates;NOT EXISTSis the cleanest “no matching rows” test.
Clarifying questions
- Define the scope: paid only? any date window? returns of which orders?
- Keep entities with no activity at all?
Common pitfalls
- Replacing
NOT EXISTSwith LEFT JOIN +IS NULLbut putting filters inWHERE(turns LEFT → INNER). - Using
INwith nullable subqueries.
Edge cases
- Multiple returns still excluded (correct).
- For multi-column tests (tuples),
EXISTSis safer thanIN.
Scalar pick per entity
When to use: you need one value per entity (id, timestamp, amount).
Example: Using subquery
Interview prompt: For each user, return their latest paid order id.
SQLSELECT u.user_id,
(
SELECT o.order_id
FROM orders o
WHERE o.user_id = u.user_id AND o.status = 'paid'
ORDER BY o.created_at DESC, o.order_id DESC
LIMIT 1
) AS latest_paid_order
FROM users u
ORDER BY u.user_id;
Output
Explanation
- Correlated scalar subquery returns exactly one value per
usersrow (orNULL).
Clarifying questions
- Latest by which timestamp? Tie-break rule?
Common pitfalls
MAX(created_at)then join back without tie-breaker → duplicates.
Edge cases
- Different SQL dialects:
TOP 1/FETCH FIRST 1 ROW. - If you’ll extend to top-k later, prefer the CTE + window version below.
Example: Using CTE + window
Interview prompt: Return latest paid order per user (keep users with none).
SQLWITH paid AS (
SELECT order_id, user_id, created_at,
ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY created_at DESC, order_id DESC
) AS rn
FROM orders
WHERE status = 'paid'
)
SELECT u.user_id, p.order_id AS latest_paid_order
FROM users u
LEFT JOIN paid p
ON p.user_id = u.user_id AND p.rn = 1
ORDER BY u.user_id;
Pre-aggregate before downstream logic
- Goal: Avoid row inflation
- When to use: You’ll join many-to-many tables later or apply window ranking per group.
- Idea: CTE (pre-agg spend per user), then window rank per group (country)
Example
Interview prompt: Find the top spender in each country (paid orders only).
SQLWITH user_spend AS (
SELECT u.user_id, u.country, SUM(o.amount) AS spend
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE o.status = 'paid'
GROUP BY u.user_id, u.country
),
ranked AS (
SELECT user_id, country, spend,
ROW_NUMBER() OVER (
PARTITION BY country ORDER BY spend DESC, user_id DESC
) AS rn
FROM user_spend
)
SELECT country, user_id, spend
FROM ranked
WHERE rn = 1
ORDER BY country;
Output
Explanation
- Reduce to one row per (user,country) first; then the window step is correct and stable.
Clarifying questions
- Rank ties: return all or pick one (
RANKvsROW_NUMBER)? - Include groups with no spend (need LEFT joins +
COALESCE)?
Common pitfalls
- Ranking raw many-to-many joins → duplicates inflate metrics.
- Missing a deterministic tie key.
Edge cases
- If you only need group totals (no ranking), stop after
user_spendor sum it again.
Set containment (“has all required Xs”)
- When to use: “entity has every member in a required set” (classic interview task).
- Idea: CTE counting distinct required members per entity
Example
Interview prompt: Find users who bought both SKU A and SKU B.
SQLWITH per_user AS (
SELECT o.user_id,
COUNT(DISTINCT CASE WHEN oi.sku = 'A' THEN 'A' END) AS has_A,
COUNT(DISTINCT CASE WHEN oi.sku = 'B' THEN 'B' END) AS has_B
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'paid'
GROUP BY o.user_id
)
SELECT user_id
FROM per_user
WHERE has_A = 1 AND has_B = 1
ORDER BY user_id;
Output
Explanation
- Turn “must have A and B” into two boolean counts; filter to both = 1.
Clarifying questions
- Paid only? One appearance enough or frequency matters?
- “At least these SKUs” vs “exactly these SKUs” (if exact:
COUNT(DISTINCT sku)=2guard).
Common pitfalls
- Using
COUNT(*)withoutDISTINCT→ overcounts. - Relying on
IN ('A','B')+COUNT(*)=2without deduplication.
Edge cases
- Larger required sets → drive from a small table of required items and pivot/aggregate.
Compare to a group statistic
- Idea: correlated vs CTE join
- When to use: “above/below group metric” (avg, max, percentile).
Example: Using correlated subquery
Interview prompt: Find employees paid above their department average.
SQLSELECT e.emp_id, e.name, e.dept, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(salary) FROM employees x WHERE x.dept = e.dept
)
ORDER BY e.dept, e.salary DESC;
Output
Example: Using CTE + join
SQLWITH dept_avg AS (
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept
)
SELECT e.emp_id, e.name, e.dept, e.salary
FROM employees e
JOIN dept_avg d ON d.dept = e.dept
WHERE e.salary > d.avg_salary
ORDER BY e.dept, e.salary DESC;
Explanation
- Compute the group metric once, then compare—clear and composable.
Clarifying questions
- Strictly greater or ≥ ? Which salary version (snapshot vs current)?
Common pitfalls
- Floating precision and borderline comparisons; be explicit if rounding is required.
Edge cases
- Groups with one row → nobody above average (expected).
AVGignores NULLs—decide if that’s intended.
Example data
We used these tables in the examples above.
employees
users
orders
returns
Questions
Warm up
Practice
- Employee hierarchy (Hard)
- Post success by age group (Medium)
- Initial contact attribution (Hard)
- Post success after failure (Hard)
Go deeper
Want to dive deeper into subqueries and CTEs? Check out our more detailed lesson here on Subqueries and Common Table Expressions (CTEs)