Data Scientist Interview Questions

Review this list of 278 data scientist interview questions and answers verified by hiring managers and candidates.
  • Asked at Meta (Facebook)
    Data Scientist
    Analytical
    +2 more
  • Lucas G. - "To speed up A/B test results with limited sample sizes, we can apply advanced techniques like CUPED to reduce variance for faster statistical significance, interleaving to gather more comparative data per user (e.g., ranking), multi-armed bandits (MAB) to dynamically allocate traffic to winning variations for quicker optimization (e.g., campaigns), and Bayesian A/B testing, which offers probabilistic conclusions that can be reached earlier. Each method, when appropriately applied, allows you to gain m"

    Data Scientist
    Statistics & Experimentation
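
The CUPED idea mentioned in the answer above can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes a single pre-experiment covariate per user, and the function names `cuped_adjust` and `sample_variance` and the simulated data are hypothetical. In a real test, theta is usually estimated on data pooled across arms.

```python
import random


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cuped_adjust(outcome, covariate):
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x)).

    theta = cov(x, y) / var(x), which minimizes the variance of the result.
    """
    n = len(outcome)
    mean_x = sum(covariate) / n
    mean_y = sum(outcome) / n
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, outcome)
    ) / (n - 1)
    theta = cov_xy / sample_variance(covariate)
    return [y - theta * (x - mean_x) for x, y in zip(covariate, outcome)]


# Simulated data (an assumption for illustration): the in-experiment metric
# is strongly correlated with the same user's pre-experiment metric.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(2000)]
post = [p + rng.gauss(0.5, 1) for p in pre]

adjusted = cuped_adjust(post, pre)
print(sample_variance(post), sample_variance(adjusted))
```

The adjustment leaves the mean of the metric unchanged while removing the variance explained by the covariate, which is why the same effect can reach significance with fewer users.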
  • Asked at Adobe
    +8

    Rishi G. - "Problem Statement: The Fibonacci sequence is defined as F(n) = F(n-1) + F(n-2) with F(0) = 1 and F(1) = 1. The solution is given in the problem statement itself. If the value of n = 0, return 1. If the value of n = 1, return 1. Otherwise, return the sum of data at (n - 1) and (n - 2). Explanation: The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, typically starting with 0 and 1. Java Solution: public static int fib(int n"

    Data Scientist
    Data Structures & Algorithms
    +2 more
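
The recurrence in the answer above can be sketched iteratively. The answer's own Java solution is truncated, so this is not a reconstruction of it: it is a Python sketch that assumes the convention stated in the problem (F(0) = 1 and F(1) = 1) and avoids the exponential cost of naive recursion.

```python
def fib(n: int) -> int:
    """Return F(n) with F(0) = 1 and F(1) = 1, as stated in the problem."""
    if n < 0:
        raise ValueError("n must be non-negative")
    current, following = 1, 1  # F(0), F(1)
    for _ in range(n):
        # Slide the window one step: (F(k), F(k+1)) -> (F(k+1), F(k+2)).
        current, following = following, current + following
    return current


print([fib(i) for i in range(6)])  # [1, 1, 2, 3, 5, 8]
```

The loop runs in O(n) time and O(1) extra space.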
  • +5

    Anonymous Roadrunner - "select marketing_channel, avg(purchase_value) as avg_purchase_value from attribution group by 1 order by 2 desc;"

    Data Scientist
    Coding
    +1 more
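
To try the query above locally, one option is an in-memory SQLite database. The schema and the sample rows below are assumptions for illustration (only the table name `attribution` and the two column names come from the answer), and the sketch groups and orders by column name rather than by position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; the real table is not shown in the listing.
conn.execute(
    "CREATE TABLE attribution (marketing_channel TEXT, purchase_value REAL)"
)
conn.executemany(
    "INSERT INTO attribution VALUES (?, ?)",
    [
        ("email", 20.0),
        ("email", 40.0),
        ("search", 50.0),
        ("search", 70.0),
        ("social", 10.0),
    ],
)

QUERY = """
SELECT marketing_channel,
       AVG(purchase_value) AS avg_purchase_value
FROM attribution
GROUP BY marketing_channel
ORDER BY avg_purchase_value DESC
"""

rows = conn.execute(QUERY).fetchall()
for channel, avg_value in rows:
    print(channel, avg_value)
```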
  • +3

    Alexey T. - "-- The text of the task is a bit confusing. If the status is repeated several -- times, then in the end you should show as start_date the date of the first -- occurrence, and in end_date the date of the last occurrence of this status, -- and not the date of the beginning of the next status with t1 as (select order_id, status, order_date as start_date, lead(order_date) over (partition by order_id order by order_date) as end_date, ifnull(lag(status) over (partition by order_id order by or"

    Data Scientist
    Coding
    +1 more
  • Jyoti V. - "Overfitting occurs when a model fails to generalize to new data and has high variance within the training data, whereas in underfitting the model isn't able to uncover the underlying pattern in the training data and has high bias. Tree-based models like decision trees and random forests are likely to overfit, whereas linear models like linear regression and logistic regression tend to underfit. There are many reasons why a random forest can overfit easily: 1. Model has grown to its full depth a"

    Data Scientist
    Concept
    +2 more
  • Lucas G. - "Because testing many engagement metrics at once increases the risk of finding effects that aren't real (the 'multiple comparisons problem'), you must adjust your criteria for statistical significance. For social media data, the Benjamini-Hochberg procedure is often a practical choice as it controls the rate of false discoveries (FDR) while still allowing you to detect genuine changes; however, the ideal adjustment method will vary depending on your specific number of metrics (e.g., use Bonferron"

    Data Scientist
    Statistics & Experimentation
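
The Benjamini-Hochberg procedure named in the answer above can be sketched in a few lines. The function name `benjamini_hochberg`, the 5% FDR level, and the p-values are assumptions chosen for illustration, not values from the answer.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * fdr,
    and reject every hypothesis whose p-value has rank <= k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for eight engagement metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bh = benjamini_hochberg(p_values)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
print(sum(bh), sum(bonferroni))
```

On these example values Benjamini-Hochberg rejects two hypotheses while the Bonferroni threshold of 0.05 / 8 rejects only one, which illustrates the trade-off between controlling FDR and controlling the family-wise error rate.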
  • +4

    Jacky T. - "SELECT order_amount FROM ( SELECT *, rank() OVER(ORDER BY order_amount desc) as ranking FROM departments d LEFT JOIN orders o ON d.department_id = o.department_id LEFT JOIN customers c ON o.customer_id = c.customer_id WHERE department_name = 'Fashion' ) WHERE ranking = 2"

    Data Scientist
    Coding
    +1 more
  • +8

    Lucas G. - "select sub.name subreddit_name, count(distinct us.user_id) total_users from user_subreddit as us left join subreddit as sub on us.subreddit_id = sub.subreddit_id group by us.subreddit_id having count(distinct us.user_id) > 3"

    Data Scientist
    Coding
    +1 more
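
The join, group, and HAVING filter in the query above can be checked against a small in-memory SQLite database. The schema and rows are assumptions for illustration; only the table and column names come from the answer. The duplicate membership row is there to show why `count(distinct ...)` matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data; the real tables are not shown in the listing.
conn.executescript(
    """
    CREATE TABLE subreddit (subreddit_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_subreddit (user_id INTEGER, subreddit_id INTEGER);
    INSERT INTO subreddit VALUES (1, 'python'), (2, 'sql');
    INSERT INTO user_subreddit VALUES
        (10, 1), (11, 1), (12, 1), (13, 1), (13, 1),
        (10, 2), (11, 2);
    """
)

QUERY = """
SELECT sub.name AS subreddit_name,
       COUNT(DISTINCT us.user_id) AS total_users
FROM user_subreddit AS us
LEFT JOIN subreddit AS sub ON us.subreddit_id = sub.subreddit_id
GROUP BY us.subreddit_id
HAVING COUNT(DISTINCT us.user_id) > 3
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Selecting `sub.name` while grouping only by `us.subreddit_id` works in SQLite; stricter dialects may require `sub.name` in the GROUP BY clause as well.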
  • Connor W. - "This video is a duplicate of the other video in this lesson, "Design A/B test for New Campaign""

    Data Scientist
    Statistics & Experimentation
  • +3

    Aneesha K. - "-- filter for December and November data -- the total order amount per department per month -- department, month, order_amount with monthly_orders AS ( SELECT department_id, strftime('%m', order_date) AS month, SUM(order_amount) AS order_amount FROM orders WHERE order_date >= '2022-11-01' AND order_date < '2023-01-01' group by department_id, month ), -- -- add difference from this month to last ( use lag ) monthly_comp"

    Data Scientist
    Coding
    +1 more
  • +5

    Jessica C. - "select customer_id, order_date, order_id as earliest_order_id from ( select customer_id, order_date, order_id, row_number() over (partition by customer_id, order_date order by order_date) as order_rank_per_customer from orders ) sub_table where order_rank_per_customer=1 order by order_date, customer_id; The standard solution assumes that the order_id indicates which order comes in first. However, this is not always the case, and sometimes order_id can be a random number withou"

    Data Scientist
    Coding
    +1 more
  • Asked at Walmart Labs
    Data Scientist
    Behavioral
    +5 more
  • Asked at OpenAI

    Constantin P. - "Of course. Reinforcement Learning is a type of machine learning where an agent learns to make decisions by trying out different actions and receiving rewards or penalties in return. The goal is to learn, over time, which actions yield the highest rewards. There are three core components in RL: The agent: the learner or decision-maker (e.g., an algorithm or robot), The environment: everything the agent interacts with, Actions and rewards: the agent takes actions, and the"

    Data Scientist
    Concept
    +1 more
  • Asked at Discord
    Data Scientist
    Behavioral
    +4 more
  • Asked at PayPal

    Dev S. - "Clarifying questions: When we say there is a decrease in users adding bank accounts, I would like to understand how users make payments within Venmo. I assume they are using either credit cards or debit cards? I would also like to understand why adding bank accounts is integral to Venmo, given that users are already using debit and credit cards. My understanding is that when payments go through debit card rails, Venmo pays higher interchange fees, and to reduce any losses incurred"

    Data Scientist
    Execution
    +1 more
  • Asked at Discord
    Data Scientist
    Behavioral
    +1 more
  • Data Scientist
    Statistics & Experimentation
  • Asked at Amazon
    Video answer for 'What are common linear regression problems?'

    Ilnur I. - "I can try to summarize their discussion as I remember it. Linear regression is one of the methods for predicting a target (Y) using features (X). The formula for linear regression is a linear function of the features. The aim is to choose the coefficients (theta) of the prediction function so that the difference between target and prediction is smallest on average. This difference between target and prediction is called the loss function. The form of this loss function can depend on the particular real"

    Data Scientist
    Analytical
    +2 more
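
The coefficient-fitting step described in the answer above can be sketched for the simplest case. This assumes a single feature and mean squared error as the loss (the answer notes that the loss can vary by problem); the function names and data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared difference between target and prediction."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_simple_linear_regression(xs, ys)
print(intercept, slope, mean_squared_error(xs, ys, intercept, slope))
```

With more than one feature the same objective is usually solved with matrix methods or gradient descent rather than this closed form.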
Showing 141-160 of 278