Skip to main content

Subqueries & CTEs

Subqueries and CTEs are interview staples because they help you break complex problems into clean, testable steps, especially when a question involves multiple stages of filtering, aggregation, and joining.

This lesson teaches when to use a CTE vs. an inline subquery, how to structure multi-CTE solutions, and common patterns like “filter on an aggregate,” “join to an aggregated table,” and “build an intermediate cohort.” You’ll also cover correlated subqueries and when they’re elegant vs. when they’re slow or confusing.

Example prompts include:

“Find users whose spend is above the overall average.”

and

“Compute retention by signup cohort.”

All examples below use the employees users, orders, and returns tables in "Example data" section (at the bottom). Refer back to that section if you want to confirm keys or relationships.

When to choose subquery vs CTE

  • Subquery / EXISTS / scalar subquery → best for one-off presence checks and per-entity scalar picks (e.g., latest id). Keeps the query tight and lets the optimizer push filters down.
  • CTE (WITH …) → best for multi-step logic, pre-aggregation, window functions, or when you’ll reuse the result. Improves readability and testing.

Heuristic:

  • Simple filter or single lookup → subquery.
  • Multi-step (pre-agg → rank → select) or reused → CTE.

Existence checks

EXISTS and NOT EXISTS let you filter rows based on whether a matching row exists in another table. They’re a clean fit for interview questions that read like “include X only if condition A is true, and condition B is never true,” and they’re robust to duplicates on the right side.

Unlike the earlier “has / has-not” patterns, this is a compound existence filter (multiple conditions at once) and EXISTS keeps the output at the left table’s grain without needing DISTINCT.

Interview prompt: Find users who have a paid order and no returns.

SQL
SELECT u.user_id, u.country FROM users u WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.user_id = u.user_id AND o.status = 'paid' ) AND NOT EXISTS ( SELECT 1 FROM orders o JOIN returns r ON r.order_id = o.order_id WHERE o.user_id = u.user_id );

Output

user_idcountry
1US

Explanation:

  • EXISTS short-circuits on first match and ignores duplicates; NOT EXISTS is the cleanest “no matching rows” test.

Clarifying questions

  • Define the scope: paid only? any date window? returns of which orders?
  • Keep entities with no activity at all?

Common pitfalls

  • Replacing NOT EXISTS with LEFT JOIN + IS NULL but putting filters in WHERE (turns LEFT → INNER).
  • Using IN with nullable subqueries.

Edge cases

  • Multiple returns still excluded (correct).
  • For multi-column tests (tuples), EXISTS is safer than IN.

Scalar pick per entity

When to use: you need one value per entity (id, timestamp, amount).

Example: Using subquery

Interview prompt: For each user, return their latest paid order id.

SQL
SELECT u.user_id, ( SELECT o.order_id FROM orders o WHERE o.user_id = u.user_id AND o.status = 'paid' ORDER BY o.created_at DESC, o.order_id DESC LIMIT 1 ) AS latest_paid_order FROM users u ORDER BY u.user_id;

Output

user_idlatest_paid_order
111
2null
3null

Explanation

  • Correlated scalar subquery returns exactly one value per users row (or NULL).

Clarifying questions

  • Latest by which timestamp? Tie-break rule?

Common pitfalls

  • MAX(created_at) then join back without tie-breaker → duplicates.

Edge cases

  • Different SQL dialects: TOP 1 / FETCH FIRST 1 ROW.
  • If you’ll extend to top-k later, prefer the CTE + window version below.

Example: Using CTE + window

Interview prompt: Return latest paid order per user (keep users with none).

SQL
WITH paid AS ( SELECT order_id, user_id, created_at, ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY created_at DESC, order_id DESC ) AS rn FROM orders WHERE status = 'paid' ) SELECT u.user_id, p.order_id AS latest_paid_order FROM users u LEFT JOIN paid p ON p.user_id = u.user_id AND p.rn = 1 ORDER BY u.user_id;

Pre-aggregate before downstream logic

  • Goal: Avoid row inflation
  • When to use: You’ll join many-to-many tables later or apply window ranking per group.
  • Idea: CTE (pre-agg spend per user), then window rank per group (country)

Example

Interview prompt: Find the top spender in each country (paid orders only).

SQL
WITH user_spend AS ( SELECT u.user_id, u.country, SUM(o.amount) AS spend FROM users u JOIN orders o ON o.user_id = u.user_id WHERE o.status = 'paid' GROUP BY u.user_id, u.country ), ranked AS ( SELECT user_id, country, spend, ROW_NUMBER() OVER ( PARTITION BY country ORDER BY spend DESC, user_id DESC ) AS rn FROM user_spend ) SELECT country, user_id, spend FROM ranked WHERE rn = 1 ORDER BY country;

Output

countryuser_idspend
US1120

Explanation

  • Reduce to one row per (user,country) first; then the window step is correct and stable.

Clarifying questions

  • Rank ties: return all or pick one (RANK vs ROW_NUMBER)?
  • Include groups with no spend (need LEFT joins + COALESCE)?

Common pitfalls

  • Ranking raw many-to-many joins → duplicates inflate metrics.
  • Missing a deterministic tie key.

Edge cases

  • If you only need group totals (no ranking), stop after user_spend or sum it again.

Set containment (“has all required Xs”)

  • When to use: “entity has every member in a required set” (classic interview task).
  • Idea: CTE counting distinct required members per entity

Example

Interview prompt: Find users who bought both SKU A and SKU B.

SQL
WITH per_user AS ( SELECT o.user_id, COUNT(DISTINCT CASE WHEN oi.sku = 'A' THEN 'A' END) AS has_A, COUNT(DISTINCT CASE WHEN oi.sku = 'B' THEN 'B' END) AS has_B FROM orders o JOIN order_items oi ON oi.order_id = o.order_id WHERE o.status = 'paid' GROUP BY o.user_id ) SELECT user_id FROM per_user WHERE has_A = 1 AND has_B = 1 ORDER BY user_id;

Output

user_id
1

Explanation

  • Turn “must have A and B” into two boolean counts; filter to both = 1.

Clarifying questions

  • Paid only? One appearance enough or frequency matters?
  • “At least these SKUs” vs “exactly these SKUs” (if exact: COUNT(DISTINCT sku)=2 guard).

Common pitfalls

  • Using COUNT(*) without DISTINCT → overcounts.
  • Relying on IN ('A','B') + COUNT(*)=2 without deduplication.

Edge cases

  • Larger required sets → drive from a small table of required items and pivot/aggregate.

Compare to a group statistic

  • Idea: correlated vs CTE join
  • When to use: “above/below group metric” (avg, max, percentile).

Example: Using correlated subquery

Interview prompt: Find employees paid above their department average.

SQL
SELECT e.emp_id, e.name, e.dept, e.salary FROM employees e WHERE e.salary > ( SELECT AVG(salary) FROM employees x WHERE x.dept = e.dept ) ORDER BY e.dept, e.salary DESC;

Output

emp_idnamedeptsalary
1AnaEng120
3CamEng120
5EveSales150

Example: Using CTE + join

SQL
WITH dept_avg AS ( SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept ) SELECT e.emp_id, e.name, e.dept, e.salary FROM employees e JOIN dept_avg d ON d.dept = e.dept WHERE e.salary > d.avg_salary ORDER BY e.dept, e.salary DESC;

Explanation

  • Compute the group metric once, then compare—clear and composable.

Clarifying questions

  • Strictly greater or ≥ ? Which salary version (snapshot vs current)?

Common pitfalls

  • Floating precision and borderline comparisons; be explicit if rounding is required.

Edge cases

  • Groups with one row → nobody above average (expected).
  • AVG ignores NULLs—decide if that’s intended.

Example data

We used these tables in the examples above.

employees

emp_idnamedeptsalary
1AnaEng120
2BenEng90
3CamEng120
4DanSales80
5EveSales150

users

user_idcountry
1US
2US
3SG

orders

order_iduser_idcreated_atamountstatus
1012025-01-0250paid
1112025-01-0570paid
1222025-01-0315draft

returns

order_idreturned_at
112025-01-08

Questions

Warm up

Practice

Go deeper

Want to dive deeper into subqueries and CTEs? Check out our more detailed lesson here on Subqueries and Common Table Expressions (CTEs)