Handle Missing Data
You are given a dataset containing some missing values. Your task is to handle these missing values according to the instructions provided. The dataset is as follows:
Practice all the options below to get a feel for how you may handle missing data during your coding interviews.
After completing each option, click the "Reset" button on the code editor to clear your changes and start fresh before working on the next option.
Option 1: Drop all rows with missing data
The interviewer determines that all rows with missing data are useless. Drop all rows with missing data. Your result should look like this:
Option 2: Drop all rows with missing GPA
The interviewer decides that only the rows with missing gpa values are worthless. Drop such rows. Your result should look like this:
Option 3: Replace missing values
Instead of removing rows, the interviewer decides to handle missing values as follows:
- For missing gpa values, fill them with the mean GPA.
- For missing credits_completed values, replace them with the mean credits_completed value based on the student's year. For example, if the student is a sophomore, use the average credits_completed of all sophomores.
- For missing year values, assign the most frequently occurring year.
Your result should look like this:
- Mean gpa, calculated as: (3.5 + 3.7 + 3.2 + 3.9 + 3.8 + 3.4 + 3.1 + 3.6) / 8 = 3.525
- Mean credits_completed for sophomores: (35 + 25) / 2 = 30
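The two figures above can be double-checked with a few lines of pandas. This is only a quick sanity check: the numbers are the ones quoted in the notes, and the variable names are made up for illustration.

```python
import pandas as pd

# Values copied from the notes above (the non-missing gpa entries and
# the sophomores' known credits_completed entries).
known_gpas = pd.Series([3.5, 3.7, 3.2, 3.9, 3.8, 3.4, 3.1, 3.6])
sophomore_credits = pd.Series([35, 25])

# Round to three decimals to hide floating-point noise in the sum.
mean_gpa = float(round(known_gpas.mean(), 3))
mean_sophomore_credits = float(sophomore_credits.mean())

print(mean_gpa)                # 3.525
print(mean_sophomore_credits)  # 30.0
```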
Option 4: Interpolate
Unfortunately, we won't be interpolating on this dataset. Check out the coding lesson on interpolation to learn more!
Option 1: Drop all rows with missing data
This solution removes any row in the dataset that contains at least one missing value.
    import pandas as pd

    def handle_missing_data(data: pd.DataFrame) -> pd.DataFrame:
        # Drop all rows with missing data
        return data.dropna()
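To see what this does in practice, here is a minimal sketch on a small made-up DataFrame. The rows and the names students and cleaned are hypothetical and are not the exercise's dataset; only the column names are taken from the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the dataset from the exercise.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "year": ["Freshman", None, "Sophomore", "Sophomore"],
    "gpa": [3.5, 3.7, np.nan, 3.9],
    "credits_completed": [12.0, 20.0, 35.0, np.nan],
})

# dropna() with no arguments removes every row that has at least one
# missing value in any column, and returns a new DataFrame.
cleaned = students.dropna()

print(len(students))  # 4
print(len(cleaned))   # 1 (only the first row has no missing values)
```

Note that dropna() keeps the original index labels of the surviving rows. If the expected output uses a fresh 0-based index, chaining .reset_index(drop=True) is one way to get it.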
Option 2: Drop all rows with missing GPA
This approach removes rows where the gpa value is missing, while retaining rows with missing values in other columns.
    import pandas as pd

    def handle_missing_data(data: pd.DataFrame) -> pd.DataFrame:
        # Drop all rows with missing GPA
        return data.dropna(subset=['gpa'])
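The effect of subset can be illustrated on a small made-up DataFrame. The data and variable names below are hypothetical. The sketch also shows an equivalent boolean-mask form, which some people find easier to read.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the dataset from the exercise.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "year": ["Freshman", None, "Sophomore", "Sophomore"],
    "gpa": [3.5, 3.7, np.nan, 3.9],
    "credits_completed": [12.0, 20.0, 35.0, np.nan],
})

# Only the gpa column is inspected; missing values elsewhere are ignored.
kept = students.dropna(subset=["gpa"])

# Equivalent filter written as a boolean mask.
masked = students[students["gpa"].notna()]

print(len(kept))             # 3 (only the row with a missing gpa is dropped)
print(kept.equals(masked))   # True
```

The rows for Ben and Di survive even though they still contain a missing year and a missing credits_completed respectively, which is the difference from Option 1.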
Option 3: Replace missing values
In this solution:
- GPA Replacement: Missing gpa values are filled with the mean GPA of the dataset.
- Credits Replacement: Missing credits_completed values are filled with the mean credits for students in the same year.
- Year Replacement: Missing year values are filled with the most common year in the dataset.
    import pandas as pd

    def handle_missing_data(data: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so the caller's DataFrame is left unchanged
        data = data.copy()
        # Replace missing GPA with mean GPA
        mean_gpa = data['gpa'].mean()
        data['gpa'] = data['gpa'].fillna(mean_gpa)
        # Replace missing credits_completed with mean credits_completed for the same year.
        # Rows whose year is also missing match no year here, so their credits_completed stays missing.
        for year in data['year'].dropna().unique():
            mean_credits = data[data['year'] == year]['credits_completed'].mean()
            data.loc[(data['year'] == year) & (data['credits_completed'].isna()), 'credits_completed'] = mean_credits
        # Replace missing year with the most common year
        most_common_year = data['year'].mode()[0]
        data['year'] = data['year'].fillna(most_common_year)
        return data
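The per-year loop above can also be written without an explicit loop, using groupby with transform. The following is a sketch of that alternative, not the exercise's reference solution. The function name handle_missing_data_vectorized and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def handle_missing_data_vectorized(data: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the input DataFrame is not modified.
    result = data.copy()

    # Fill missing gpa with the overall mean (mean() skips NaN by default).
    result["gpa"] = result["gpa"].fillna(result["gpa"].mean())

    # transform("mean") returns a Series aligned with the original rows,
    # holding each row's per-year mean of credits_completed.
    year_means = result.groupby("year")["credits_completed"].transform("mean")
    result["credits_completed"] = result["credits_completed"].fillna(year_means)

    # Fill missing year with the most frequent year (mode() ignores NaN).
    result["year"] = result["year"].fillna(result["year"].mode()[0])
    return result

# Hypothetical sample data, not the dataset from the exercise.
students = pd.DataFrame({
    "year": ["Sophomore", "Sophomore", "Sophomore", "Freshman", None],
    "gpa": [3.0, np.nan, 3.4, 3.8, 3.6],
    "credits_completed": [35.0, 25.0, np.nan, 10.0, 15.0],
})

filled = handle_missing_data_vectorized(students)
print(filled.loc[2, "credits_completed"])  # 30.0, the sophomore mean of 35 and 25
print(filled.loc[4, "year"])               # Sophomore, the most common year
```

As with the loop version, the order of the steps matters: credits_completed is filled before year, so the per-year means are computed from the years actually recorded rather than from imputed ones.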