Session Data Analysis
You are given a dataset with one row per user session. The table contains the following columns:
country: The country of the user.session_time: The length of the session in seconds.
Download the dataset here!
Practice this interview question in your preferred .ipynb environment. Psst... Google Colab is a free, online .ipynb environment that is easy to get up and running!
Part 1: Calculate avg. session time
Calculate the average session time for sessions that are longer than 3 minutes (180 seconds).
Part 2: Visualize data
Create a histogram of session lengths using bins of 5 minutes (300 seconds). How would you prepare the data for this histogram?
Finally, plot the histogram!
Part 3: Find similar countries
“Similar countries” are countries that have a number of sessions within 10% of each other. How would you identify such pairs of countries? The output should be a table with two columns:
country_a: The first country in the pair.country_b: The second country in the pair.
Let's start with importing the dataset into the notebook.
Pythonimport pandas as pd # Load dataset df = pd.read_csv('sessions.csv')
The path to your dataset might vary according to where you dataset is with respect to your notebook e.g.
Pythondf = pd.read_csv('/content/drive/MyDrive/sessions.csv')
Part 1: Calculate avg. session time
Python# Filter sessions longer than 180 seconds long_sessions = df[df['session_time'] > 180] # Calculate average session time average_session_time = long_sessions['session_time'].mean() print("Average session time (longer than 3 minutes):", average_session_time)
Part 2: Visualize data
First, we must prepare the data for plotting.
Pythonimport matplotlib.pyplot as plt # Create bins for 5-minute intervals (300 seconds) bins = pd.interval_range(start=0, end=df['session_time'].max() + 300, freq=300) df['bin'] = pd.cut(df['session_time'], bins=bins) # Count sessions in each bin hist_data = df['bin'].value_counts().sort_index() # Prepare data for plotting hist_df = pd.DataFrame({ 'bin': hist_data.index.astype(str), 'count': hist_data.values }) hist_df
Using this dataframe, we can then plot the histogram.
Python# Plot histogram plt.figure(figsize=(10, 6)) plt.bar(hist_df['bin'], hist_df['count']) plt.xlabel('Session Length (seconds)') plt.ylabel('Number of Sessions') plt.title('Histogram of Session Lengths') plt.xticks(rotation=45) plt.show()
The histogram should look like this:

Part 3: Find similar countries
Next, we can do some pandas magic to aggregate the count of sessions for each country. Then, we find all the possible combinations of countries
Python# Count sessions per country session_counts = df['country'].value_counts().reset_index() session_counts.columns = ['country', 'count'] # Find possible combinations session_counts['key'] = 1 merged = session_counts.merge(session_counts, on='key', suffixes=('_a', '_b')).drop('key', axis=1)
The code above may not work well for larger datasets as it involves creating the cartesian product of the list of countries. This will scale poorly if there is a large number of countries to pair up.
We could consider trading space complexity for time complexity and use a nested for-loop with a built-in conditional check.
Can you come up with a better solution?
Finally, we filter out those pairs that are not within 10% of each other.
Python# Compute the similarity condition condition = abs(merged['count_a'] - merged['count_b']) / merged[['count_a', 'count_b']].min(axis=1) <= 0.1 # Filter out self-pairs and non-similar pairs similar_countries_df = merged[condition & (merged['country_a'] != merged['country_b'])][['country_a', 'country_b']] # Drop duplicate pairs (A, B) and (B, A) similar_countries_df = similar_countries_df[similar_countries_df.apply(lambda x: frozenset([x['country_a'], x['country_b']]), axis=1).duplicated(keep='first') == False] similar_countries_df
1) select avg(session) from table where session> 1802) select round(session_time/300)*300 as session_bin, count(*) as session_count from table group by round(session_time/300)*300 order by session_bin3) SELECT t1.country AS country_a,t2.country AS country_bFROM (SELECT country, COUNT(*) AS session_countFROM your_table_nameGROUP BY country) AS t1JOIN (SELECT country, COUNT(*) AS session_countFROM your_table_nameGROUP BY country) AS t2 ON t1.session_count >= t2.session_count * 0.9AND t1.session_count <= t2.session_count * 1.1AND t1.country != t2.country;