Data scientist is one of the hottest tech jobs out there, and the interview process reflects this. From the first screening call to the on-site interview, you’re in for what could be a months-long process. Let that sink in, then let your heartbeat return to normal - you can prepare.
With some slight variation, the average data science interview consists of three stages: a technical and behavioral screening, the technical interview itself (a take-home project and/or an on-site evaluation), and a final fit check.
With the help of data science alums from companies like Facebook, Airbnb, and Google, we’ve got the ultimate resource to help you through each stage.
Most data science hires will go through both a technical and a behavioral screening. These can come in many forms, but you’ll come across coding challenges and quick phone interviews with an HR rep most often.
Both the coding challenge (e.g., build a prediction model from some scraped data) and the HR screen (“tell me about yourself”) will be pretty basic; they’re meant to weed out inflated resumes.
The only ways you can fail here are through over-stressing or under-preparing. If you’ve been through practice questions on the basics of data science, and you’ve got a thorough understanding of the company you’re interviewing with and the role in question, you’ll be just fine.
Preparing for the Coding Challenge:
If you do fail the screening questions, don’t stress too much - some companies give ridiculous challenges, or they might be looking for very specific answers. If you’re spending hours and hours on a take-home coding challenge, ask yourself whether that time might be better spent preparing for other interviews.
Preparing for the Behavioral Screening with HR:
This is the meat of the interview. You can expect some combination of a technical call, a take-home project, and an on-site interview. You’re now a legitimate contender, and your interviewer is testing you to figure out whether you can do the job. Expect rigorous questions, and be prepared to demonstrate your thought process.
Preparing for a live evaluation, whether on-site or on a call:
You’ll likely be tested on statistics and regression modeling, coding fundamentals (loops and conditionals), machine learning topics like feature selection, and relational database design.
Here are some sample questions you may come across.
Topic: Linear Regression Modeling
Question: You are modeling marketing return on investment (ROI). You have each month’s revenue on the Y axis and spend on the X axis.
You decide to use a simple linear regression model to evaluate whether spending more would generate more revenue. You find your linear intercept (b) is $1.5MM and gradient (a) is 2.1. Your residual standard error is 79.1 and your adjusted R-squared is 0.72 with a p-value of 1.09e-9.
A. How much of your data’s variance has your model explained and can the result be called significant?
B. Our problem requires more accuracy in modeling the data. How can we alter the linear equation to better fit the data? What regression model would you pick and why?
C. Your new model explains 98% of the data’s variance. How would you determine whether your model is overfitting? How would you evaluate the model’s overall fit and the fit of its parameters?
HINTS BELOW: Stop here to think before moving on :)
Hint: The p-value is very small. What does that tell you?
Hint #2: Theoretically, what would increase the complexity of the model?
Short Answer to part A: The R-squared value is a statistical measure of how close the data are to the fitted regression line. R-squared values range from 0 to 1, with a value of 1 meaning that the model explains ALL of the variability of the response data around its mean. Your adjusted R-squared of 0.72 therefore indicates that your model explains 72% of the variance.
Hypothesis testing checks the validity of a claim made about a population. You want to know whether greater marketing spend will increase revenue, and you want to ensure that your result (whether the answer is “yes” or “no”) isn’t a sampling artifact; that is to say, it’s statistically significant. A statistically significant result is one that is unlikely to have occurred by chance - instead, it’s likely attributable to a specific cause. You’ll typically want a confidence level of at least 95% (a significance level of 0.05), possibly higher. Your p-value, which ranges from 0 to 1, represents the strength of the evidence against the null hypothesis that marketing spend has no impact on revenue. A small p-value (typically less than or equal to 0.05, corresponding to 95% confidence) indicates strong evidence of an effect - that is, statistical significance. Your p-value of 1.09e-9 is far smaller than this, so your result is significant. For further Q&A on this data set and more, check out Exponent’s Data Science Course.
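To make the numbers concrete, here’s a minimal sketch (not the course’s official solution) that fits a simple linear regression on synthetic monthly spend/revenue data with statsmodels and reads off the same quantities the question references:

```python
# Minimal sketch: fit revenue ~ spend on synthetic data and inspect the
# intercept, gradient, adjusted R-squared, and p-value discussed above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
spend = rng.uniform(0.5, 3.0, size=36)                       # monthly marketing spend ($MM)
revenue = 1.5 + 2.1 * spend + rng.normal(0, 0.4, size=36)    # revenue ($MM) plus noise

X = sm.add_constant(spend)            # adds the intercept term
fit = sm.OLS(revenue, X).fit()

print(fit.params)                     # [intercept, gradient] -- compare to b = 1.5, a = 2.1
print(fit.rsquared_adj)               # share of variance explained (0.72 in the question)
print(fit.pvalues[1])                 # p-value for the spend coefficient; <= 0.05 => significant
```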
Topic: Basic tools such as if/else statements and loops
Question: In any language you're comfortable with, write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.
Answer: View sample code and documentation here.
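For reference, one straightforward Python solution (the linked sample may differ) looks like this:

```python
# Classic FizzBuzz: print 1-100, substituting Fizz/Buzz/FizzBuzz for multiples of 3/5/both.
for n in range(1, 101):
    if n % 15 == 0:        # multiple of both three and five
        print("FizzBuzz")
    elif n % 3 == 0:
        print("Fizz")
    elif n % 5 == 0:
        print("Buzz")
    else:
        print(n)
```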
Topic: Feature Selection
Question: Say you’re a data scientist at a used vehicle dealer, and your manager wants to know which vehicles are most likely to have a higher four-year resale value. You have access to a vehicle data set with many attributes associated with valuation, so you can build a predictive model to determine which vehicles have higher four-year resale values. As is often the case, some pre-processing is needed before the data are ready for modeling. In this question, the missing and NA values have already been handled appropriately, and non-numeric data have been converted into numeric or binary dummy features that can be easily processed.
Take a moment to load the relevant Python modules and the data set, available through Exponent’s Data Science Course.
How would you determine which features to include in the model training data set? Consider both simple and more advanced methods of feature selection and dimension reduction. This may include exploratory data analysis, plots, and analysis methods. Include in your answer HOW you decided which features to keep and which to eliminate.
Answer: For a full answer and more in-depth Q&A on this data set and more, check out Exponent’s Data Science Course.
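As a rough illustration of the kind of workflow an interviewer is looking for (not the course’s full answer), here’s a sketch with scikit-learn. The vehicle data below is a small synthetic stand-in, and every column name is hypothetical:

```python
# Hypothetical stand-in for the vehicle data set: all-numeric features and a
# four-year resale value target, so the selection steps below run end to end.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age_years": rng.uniform(0, 10, n),
    "mileage_k": rng.uniform(5, 150, n),
    "engine_l": rng.uniform(1.0, 5.0, n),
    "is_luxury": rng.integers(0, 2, n),
    "noise": rng.normal(size=n),          # an uninformative feature we hope to drop
})
df["resale_value_4yr"] = (30 - 1.5 * df["age_years"] - 0.08 * df["mileage_k"]
                          + 5 * df["is_luxury"] + rng.normal(0, 2, n))

X, y = df.drop(columns=["resale_value_4yr"]), df["resale_value_4yr"]

# 1) Exploratory filter: rank features by absolute correlation with the target.
print(X.corrwith(y).abs().sort_values(ascending=False))

# 2) Univariate selection: keep the k features with the strongest linear relationship.
keep = X.columns[SelectKBest(score_func=f_regression, k=3).fit(X, y).get_support()]
print(list(keep))

# 3) Model-based importance: a tree ensemble also captures non-linear effects.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False))
```

Whatever methods you choose, the interviewer cares most about why each feature stays or goes (redundant, uncorrelated with the target, a potential leak), not just which function you ran.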
Topic: Relational database structure
Question: When building a relational database, describe the difference between a logical data model and a physical data model.
Answer: After a high-level conceptual model has been created and the basic entities defined, the next step is to build a logical data model. The logical data model includes attributes (text, numbers, dates, etc.) and primary and foreign keys: in essence, how the entities relate to one another. The physical data model maps the specific data sources that will be linked together and is the most detailed view of the three (conceptual, logical, and physical). It represents the specific database as implemented. Read through a more comprehensive explanation here.
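As a quick, hypothetical illustration of the distinction, here’s a sketch using Python’s built-in sqlite3: the logical model lives in the comments (entities, attributes, and key relationships), while the DDL the script actually executes is the physical model for one specific database:

```python
# Hypothetical customer/order example, not tied to any particular interview answer.
import sqlite3

# Logical model (database-agnostic):
#   Customer(customer_id [PK], name, email)
#   Order(order_id [PK], customer_id [FK -> Customer], order_date, total)

# Physical model: concrete tables, types, and constraints for SQLite specifically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        order_date  TEXT,
        total       REAL
    );
""")
```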
There are no specific answers here - interviewers are more interested in your thought process. Nowadays, you’re more likely to run into questions that test your business acumen than the “estimation” questions asked years ago. Test yourself on the below:
For a list of recent questions asked by interviewers at Google, Facebook, and Amazon check out Exponent’s Data Science Course.
This component may or may not be included, but if so, the point is to simulate a situation you’ll deal with at work: they’ll give you some data and a simple, if vague, request, e.g., “identify trend(s) and explain them to a non-technical stakeholder.” Some real-life examples of these (per Glassdoor) are:
You’ll have time to organize your thoughts here - most companies will give you at least a few days to complete the project. So be strategic. This means:
This will likely take the form of an onsite lunch interview, or a quick meet-and-greet. The hard part’s over - they’re convinced you can do the job. Now they want to make sure it’s a mutual fit.
Preparing for a Fit Check
If you're stuck on questions to ask, check out this list of 50 questions interviewers ask to check for culture fit. Flip the script and ask a few! A personal favorite is the always-illuminating “What's the best book you've read recently?” You'll learn a lot about that person quickly.
As a data scientist, you’re constantly optimizing. Don’t neglect this tendency in your job search - track how you’re doing and where you’re feeling discomfort. What stood out to you as a weak area within this article? That should be your next area of focus.
And take heart when the process starts to drag. You’re not alone, and there are plenty of opportunities to support each other. This is your community, not the competition. There are more jobs than data scientists. Reach out and keep learning.
Looking for more in-depth preparation? How about a community of thousands of PMs and data scientists at the likes of Google, Facebook, and Amazon? If you’re ready to level up your job search, Exponent’s got you covered.
Exponent is the fastest-growing tech interview prep platform. Get free interview guides, insider tips, and courses.
Create your free account