Data Scientist Interview Questions

Review this list of 278 data scientist interview questions and answers verified by hiring managers and candidates.


Asked at Meta (Facebook) • 3 years ago
What metrics would you use to increase posts with comments in a group?
Data Scientist
Analytical
What if you're tight on time and want to run your A/B test faster, or you don't have a large enough sample size for statistical significance?
Data Scientist
Statistics & Experimentation
Lucas G. - "To speed up A/B test results with limited sample sizes, we can apply advanced techniques such as CUPED, which reduces variance so that statistical significance is reached sooner; interleaving, which gathers more comparative data per user (e.g., for ranking); multi-armed bandits (MAB), which dynamically allocate traffic to winning variations for quicker optimization (e.g., for campaigns); and Bayesian A/B testing, which offers probabilistic conclusions that can be reached earlier. Each method, when appropriately applied, allows you to gain m..."
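As a rough illustration of the CUPED idea mentioned in the answer, the sketch below adjusts an experiment metric using a pre-experiment covariate. The function name `cuped_adjust` and the synthetic spend data are assumptions made for illustration; they are not part of the original answer. In practice the coefficient is usually estimated on data pooled across both test arms.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted values of metric y using pre-experiment covariate x."""
    x_bar, y_bar = mean(x), mean(y)
    # theta = cov(x, y) / var(x), i.e. the regression slope of y on x
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    # Remove the part of y explained by x; the mean of y is left unchanged
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Hypothetical data: in-experiment spend correlates with pre-experiment spend
rng = random.Random(0)
pre = [rng.gauss(100, 20) for _ in range(1000)]
post = [0.8 * p + rng.gauss(0, 5) for p in pre]
adjusted = cuped_adjust(post, pre)
```

The adjusted metric keeps the same mean as the raw metric but has lower variance when the covariate is correlated with it, which is why fewer samples are needed to detect the same effect.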
Asked at Microsoft • 8 months ago
Given a list of numbers, find the median without sorting the entire list. Hint: Use quick sort algorithm.
Data Scientist
Coding
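One possible reading of the hint is quickselect, which reuses quicksort's partitioning step but only recurses into the side that contains the target rank. The sketch below is a minimal version under that assumption; the function names `quickselect` and `median` are illustrative and not taken from the interview question.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-indexed) without fully sorting."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [v for v in items if v < pivot]
        pivots = [v for v in items if v == pivot]
        if k < len(lows):
            # Target rank lies among the elements smaller than the pivot
            items = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            # Skip the smaller and equal elements and search the larger ones
            k -= len(lows) + len(pivots)
            items = [v for v in items if v > pivot]


def median(values):
    """Median via quickselect; averages the two middle values for even lengths."""
    n = len(values)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2
```

With a random pivot the expected running time is linear in the list length, compared with O(n log n) for a full sort.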
Asked at Adobe, Oracle • 7 months ago
Fibonacci Numbers
Easy
Data Scientist
Data Structures & Algorithms
Rishi G. - "Problem statement: The Fibonacci sequence is defined as F(n) = F(n-1) + F(n-2) with F(0) = 1 and F(1) = 1. The solution is given in the problem statement itself: if n = 0, return 1; if n = 1, return 1; otherwise, return the sum of the values at (n - 1) and (n - 2). Explanation: The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, typically starting with 0 and 1. Java solution: public static int fib(int n..."
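Since the quoted Java solution is cut off, here is a small Python sketch of the same recurrence. It assumes the base cases stated in the quoted problem statement, F(0) = 1 and F(1) = 1, and uses an iterative loop rather than whatever approach the original answer took.

```python
def fib(n):
    """Return F(n) with F(0) = F(1) = 1, as in the quoted problem statement."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 1, 1
    for _ in range(n):
        # Slide the window forward: (F(i), F(i+1)) -> (F(i+1), F(i+2))
        a, b = b, a + b
    return a
```

The loop runs in O(n) time and constant space, avoiding the exponential blow-up of the naive two-branch recursion.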
Find Average Purchase Value.
Easy
Data Scientist
Coding
Anonymous Roadrunner - "-- Write your query here
select marketing_channel, avg(purchase_value) as avg_purchase_value
from attribution
group by 1
order by 2 desc;"
Amazon Order Status
Hard
Data Scientist
Coding
Alexey T. - "-- The text of the task is a bit confusing. If the status is repeated several
-- times, then in the end you should show as start_date the date of the first
-- occurrence, and in end_date the date of the last occurrence of this status,
-- and not the date of the beginning of the next status
with t1 as (
  select order_id, status, order_date as start_date,
    lead(order_date) over (partition by order_id order by order_date) as end_date,
    ifnull(lag(status) over (partition by order_id order by or..."
Asked at Nvidia, OpenAI • 7 months ago
What is overfitting or underfitting? Which models are most likely to experience this, and why?
Data Scientist
Concept
Jyoti V. - "Overfitting occurs when a model fails to generalize to new data and has high variance with respect to the training data, whereas in underfitting the model isn't able to uncover the underlying pattern in the training data and has high bias. Tree-based models such as decision trees and random forests are likely to overfit, whereas linear models such as linear regression and logistic regression tend to underfit. There are many reasons why a random forest can overfit easily: 1. The model has grown to its full depth a..."
Given multiple engagement metrics for a single experiment, how would you adjust the p-value thresholds for your experiment and why?
Data Scientist
Statistics & Experimentation
Lucas G. - "Because testing many engagement metrics at once increases the risk of finding effects that aren't real (the 'multiple comparisons problem'), you must adjust your criteria for statistical significance. For social media data, the Benjamini-Hochberg procedure is often a practical choice as it controls the rate of false discoveries (FDR) while still allowing you to detect genuine changes; however, the ideal adjustment method will vary depending on your specific number of metrics (e.g., use Bonferron..."
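To make the two corrections named in the answer concrete, the sketch below applies Benjamini-Hochberg and Bonferroni to the same set of p-values. The function names and the example p-values are hypothetical and chosen only for illustration; they do not come from the original answer.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-indexed) whose p-value satisfies p_(k) <= k / m * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below max_k
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Return booleans marking rejections under the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values for five engagement metrics
p_vals = [0.001, 0.012, 0.025, 0.2, 0.5]
bh_result = benjamini_hochberg(p_vals)
bonf_result = bonferroni(p_vals)
```

On these example values Benjamini-Hochberg rejects the first three hypotheses while Bonferroni rejects only the first, which reflects the trade-off the answer describes: FDR control keeps more power, while Bonferroni controls the stricter family-wise error rate.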
Find Second Highest Order
Medium
Data Scientist
Coding
Jacky T. - "SELECT order_amount
FROM (
  SELECT *, rank() OVER (ORDER BY order_amount desc) as ranking
  FROM departments d
  LEFT JOIN orders o ON d.department_id = o.department_id
  LEFT JOIN customers c ON o.customer_id = c.customer_id
  WHERE department_name = 'Fashion'
)
where ranking = 2"
Reddit Users
Easy
Data Scientist
Coding
Lucas G. - "select sub.name subreddit_name, count(distinct us.user_id) total_users
from user_subreddit as us
left join subreddit as sub on us.subreddit_id = sub.subreddit_id
group by us.subreddit_id
having count(distinct us.user_id) > 3"
You have two versions of email campaigns. How would you determine which campaigns will lead to higher sales?
Data Scientist
Statistics & Experimentation
Connor W. - "This video is a duplicate of the other video in this lesson, "Design A/B test for New Campaign""
Find Monthly Revenue Growth
Hard
Data Scientist
Coding
Aneesha K. - "-- filter for december and november data
-- the total order amount per department per month
-- department, month, order_amount
with monthly_orders AS (
  SELECT department_id, strftime('%m', order_date) AS month, SUM(order_amount) AS order_amount
  FROM orders
  WHERE order_date >= '2022-11-01' AND order_date < '2023-01-01'
  group by department_id, month
),
--
-- add difference from this month to last ( use lag )
monthly_comp..."
E-commerce (4 of 5)
Easy
Data Scientist
Coding
Jessica C. - "select customer_id, order_date, order_id as earliest_order_id
from (
  select customer_id, order_date, order_id,
    row_number() over (partition by customer_id, order_date order by order_date) as order_rank_per_customer
  from orders
) sub_table
where order_rank_per_customer = 1
order by order_date, customer_id;
The standard solution assumes that order_id indicates which order came in first. However, this is not always the case, and sometimes order_id can be a random number withou..."
Asked at Walmart Labs • 7 months ago
Why do you want to work at Walmart Labs?
Data Scientist
Behavioral
Asked at OpenAI • 7 months ago
Explain deep reinforcement learning.
Data Scientist
Concept
Constantin P. - "Of course. Reinforcement learning is a type of machine learning where an agent learns to make decisions by trying out different actions and receiving rewards or penalties in return. The goal is to learn, over time, which actions yield the highest rewards. There are three core components in RL: the agent, which is the learner or decision-maker (e.g., an algorithm or robot); the environment, which is everything the agent interacts with; and actions and rewards: the agent takes actions, and the..."
Asked at Discord • 7 months ago
What other companies are you interviewing at and why?
Data Scientist
Behavioral
Asked at PayPal • 4 years ago
Why might Venmo be seeing a decrease in users adding their bank accounts?
Data Scientist
Execution
Dev S. - "Clarifying questions: When we say there is a decrease in users adding their bank accounts, I would like to understand how users are making payments within Venmo. I assume they are using either credit cards or debit cards? I would also like to understand why adding bank accounts is integral to Venmo, given that users are paying with debit and credit cards. My understanding is that when payments go through debit card rails, Venmo pays higher interchange fees, and to reduce any losses incurred..."
Asked at Discord • 7 months ago
How do you approach personal growth and learning?
Data Scientist
Behavioral
How do you determine the sample size needed to achieve a desired power?
Data Scientist
Statistics & Experimentation
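One common way to approach this question is the normal-approximation formula for a two-sided, two-sample comparison of means with equal group sizes: n per group = 2 * ((z_(1-alpha/2) + z_power) * sigma / effect)^2. The sketch below implements that formula; the function name, parameter names, and default values are assumptions for illustration, and other test designs call for different formulas.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sigma, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample z-test on means.

    effect: minimum detectable difference in means
    sigma:  assumed common standard deviation of the metric
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sigma / effect) ** 2
    return math.ceil(n)


# Hypothetical inputs: detect a shift of half a standard deviation
n_needed = sample_size_per_group(effect=0.5, sigma=1.0)
```

Raising the desired power or shrinking the minimum detectable effect both increase the required sample size; halving the effect roughly quadruples it.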
Asked at Amazon • 4 years ago
What are common linear regression problems?
Data Scientist
Analytical
Ilnur I. - "I can try to summarize their discussion as I remember it. Linear regression is one of the methods for predicting a target (Y) using features (X). The formula for linear regression is a linear function of the features. The aim is to choose the coefficients (theta) of the prediction function so that the difference between the target and the prediction is smallest on average. This difference between target and prediction is called the loss function. The form of this loss function can depend on the particular real..."
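The coefficient-fitting step described in the answer can be sketched for the simplest case of one feature and a squared-error loss. The choice of mean squared error and the function names `fit_line` and `mse` are assumptions for illustration; the answer itself notes that the loss function can take other forms.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing mean squared error for one feature."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                   # closed-form least-squares slope
    intercept = y_bar - slope * x_bar   # fitted line passes through the means
    return intercept, slope


def mse(xs, ys, intercept, slope):
    """Mean squared difference between targets and linear predictions."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data lying exactly on y = 1 + 2x
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

On data that lies exactly on a line the fitted coefficients recover that line and the loss is zero; with noisy data the same formulas give the line with the smallest average squared error.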