Remove Duplicate Emails
Your company has a users table that has accumulated many duplicate email addresses. Some of these duplicates are due to differences in case sensitivity, while others are due to unintentional whitespaces before or after the email. For instance, [email protected] and [email protected] are considered duplicates.
Consider the users table schema:
SQLCREATE TABLE users ( id INTEGER PRIMARY KEY, email TEXT NOT NULL );
Write a SQL query that returns records from the users table while excluding any duplicate email entries. Your solution should:
- Treat email addresses as duplicates if they match after accounting for case and whitespace
- Include only one record for each email, keeping the record with the smallest
id. - Return the cleaned table, ordered by
id.
Example Output:
We can solve this problem in several ways—including subqueries, self-joins, and window functions. Let’s explore all three solutions.
Subquery Method
In this solution, we group the records by the cleaned email and then select the minimum id for each group. The outer query then fetches only the records whose id matches these minimum id values, thereby excluding duplicates.
SELECT id, email FROM users WHERE id IN ( SELECT MIN(id) FROM users GROUP BY TRIM(LOWER(email)) ) ORDER BY id;
Self-Join Method
In this method, we perform a left self-join on the table based on the cleaned email. The condition u1.id > u2.id ensures that for each pair of duplicate records, only one of them (the one with the smaller id) gets included in the result. The WHERE u2.id IS NULL condition ensures that records without duplicates are also included.
SELECT DISTINCT u1.id, u1.email FROM users u1 LEFT JOIN users u2 ON TRIM(LOWER(u1.email)) = TRIM(LOWER(u2.email)) AND u1.id > u2.id WHERE u2.id IS NULL ORDER BY u1.id;
Window Function Method
This solution leverages the ROW_NUMBER() window function, which assigns a unique sequential integer to rows within a partition of a result set. By partitioning based on the cleaned email, we can effectively group duplicates together. With the use of ORDER BY id, we ensure that the smallest id for each group gets the number 1. This enables us to filter out the other duplicates by selecting only the rows where the row number (rn) is 1.
SELECT id, email
FROM (
SELECT id, email,
ROW_NUMBER() OVER (PARTITION BY TRIM(LOWER(email)) ORDER BY id) as rn
FROM users
) AS subquery
WHERE rn = 1
ORDER BY id;
SELECT MIN(id) AS id, TRIM(LOWER(email)) AS cleaned_email FROM users GROUP BY cleaned_email ORDER BY id