Find Statistical Evidence for Conversion Rate
During your interview loop, you may receive coding questions related to statistics, data manipulation, machine learning, or software engineering.
In this video, we'll walk through a statistics-based coding question that you could receive in a technical screen. This question comprises several parts, each intended to assess your understanding of statistical concepts and your ability to apply them using programming skills.
Practice this interview question in your preferred .ipynb environment.
Dataset
You have been provided with a dataset containing information on user interactions, categorized into two columns: geo and convert.
- The geo column indicates the user's state with state abbreviations (e.g., TX for Texas, CA for California).
- The convert column is a boolean value (True or False) that denotes whether the user converted (took a desired action) or not.
You can download the dataset from the provided link.
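If the linked file isn't handy while you practice, you can generate a toy dataset with the same two-column schema. The state list and conversion probabilities below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical mix of states; 'MI', 'TX', 'WA' will form the experiment group
states = ['MI', 'TX', 'WA', 'CA', 'NY', 'FL']
df = pd.DataFrame({'geo': rng.choice(states, size=1000)})

# Give the experiment states a slightly higher (made-up) conversion probability
in_experiment = df['geo'].isin(['MI', 'TX', 'WA'])
df['convert'] = rng.random(1000) < np.where(in_experiment, 0.12, 0.08)
df.head()
```

The rest of the walkthrough works the same whether df comes from this stand-in or from the real CSV.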
Task
Calculate the conversion rate for both the experiment groups (users from "Michigan" (MI), "Texas" (TX), or "Washington" (WA)) and the control group (users from all other states). Then, assess whether the difference in conversion rates between these groups is statistically significant.
For our solution, we opted for a simulation-based approach over formulaic calculations for a practical reason: if you are not allowed to look up formulas during an interview, remembering them can be challenging. Simulations offer an alternative that doesn't require memorizing specific formulas or their underlying assumptions.
What is the primary assumption you are making when you run a simulation on sample data?
Steps
First, we load the dataset from a CSV file and display the first few rows. This is a crucial first step for understanding the structure of the data, including the available columns and the types of values they contain.
```python
import pandas as pd

# Load the dataset; replace the path with your own directory
df = pd.read_csv('/my_dir/geo_convert.csv')
df.head()
```
Then, we define which states belong to the experiment group and create a new column in the dataframe that labels each row as either experiment or control based on the user's state. This step simplifies the subsequent analysis by clearly distinguishing between the two groups.
```python
# Label each row by group membership
experiments = ['MI', 'TX', 'WA']
df['experiment_group'] = ['experiment' if val in experiments else 'control'
                          for val in df['geo']]
df.head()
```
Subsequently, we can aggregate the data by the experiment group to calculate the total number of conversions (sum) and the total number of users (count) in each group. This aggregation is essential for computing the conversion rates.
```python
# Total conversions (sum) and total users (count) per group
agg_df = df.groupby('experiment_group')['convert'].agg(['sum', 'count'])
agg_df
```
After calculating the aggregate conversion data, we can compute the actual conversion rate for each group by dividing the number of conversions by the total count.
```python
# Conversion rate per group, and the observed difference between the two
rates = agg_df['sum'] / agg_df['count']
observed_diff = rates['experiment'] - rates['control']
observed_diff
```
Next, we can perform a bootstrapping simulation by resampling the dataset 10,000 times with replacement. For each sample, we calculate the difference in conversion rates between the experiment and control groups. This process generates a distribution of differences, allowing for a more robust estimate of the true difference and its variability.
Bootstrapping is a statistical technique that involves resampling data with replacement to estimate the distribution of a statistic. It allows for robust estimation of confidence intervals and significance testing without relying on specific parametric assumptions.
```python
# Bootstrap: resample the full dataset with replacement 10,000 times
diffs = []
for _ in range(10_000):
    sample_df = df.sample(df.shape[0], replace=True)
    agg = sample_df.groupby('experiment_group')['convert'].agg(['sum', 'count'])
    rates = agg['sum'] / agg['count']
    diffs.append(rates['experiment'] - rates['control'])
```
Using the distribution of differences from the bootstrap, we calculate the 95% and 90% confidence intervals for the difference in conversion rates.
```python
import numpy as np

# 95% CI: the experiment group's conversion is between 0.2% and 12.6%
# higher than the control group's, with 95% confidence
np.percentile(diffs, 2.5), np.percentile(diffs, 97.5)

# 90% CI: with 90% confidence, the increase is between 1.2% and 11.6%
np.percentile(diffs, 5), np.percentile(diffs, 95)
```
Based on the results from the bootstrapping method, we observe that the conversion rate in the experiment group is between 0.2% and 12.6% higher than the control group with a 95% confidence interval. Similarly, when we consider a 90% confidence interval, the increase in conversion rate is observed to be between 1.2% and 11.6%.
Given that neither confidence interval includes 0, there is statistically significant evidence of a difference in conversion rates between the experiment and control groups at both the 95% and 90% confidence levels. The positive range in both intervals suggests that the experiment group consistently outperforms the control group in terms of conversion rate, reinforcing the conclusion that the observed difference is not due to random chance.
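As a complement to the interval check, the same bootstrap distribution can yield an approximate one-sided p-value: the fraction of resampled differences at or below zero. The sketch below assumes a list of bootstrap differences like the one computed earlier; here a stand-in distribution is drawn just to make the snippet self-contained:

```python
import numpy as np

# Stand-in for the bootstrap differences computed earlier (illustrative only)
rng = np.random.default_rng(1)
diffs = rng.normal(loc=0.06, scale=0.03, size=10_000)

# Approximate one-sided p-value: share of resamples where the
# experiment group did NOT outperform the control group
p_value = np.mean(np.asarray(diffs) <= 0)
p_value
```

A small p-value here tells the same story as the intervals excluding zero: a positive difference this consistent is unlikely to arise from random chance alone.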
This suggests that it would be advisable to roll out the changes tested in the experiment group across a wider audience to improve conversion rate.