Top 25 Python Data Science Interview Questions (2025 Guide)

Data science interviews often include Python coding questions and statistical analysis.

These questions test your general Python coding skills and knowledge of popular data science Python libraries such as Pandas and NumPy.

Below, we've compiled a list of the most important Python data science interview questions to help you ace your upcoming interviews.

Each question includes a breakdown of what interviewers expect in your answer and code snippets where applicable.

✅

Expert verified: This guide was written and compiled by Derrick Mwiti, a senior data scientist and course instructor. Satyajit Gupte, a senior machine learning engineer (Pinterest, Target) and interview coach, reviewed it.

Python Fundamentals

Python is listed as an essential skill in data science job descriptions at companies such as Microsoft, Google, and Apple.

1. Which is faster, Python lists or NumPy arrays? Why?

NumPy arrays are generally faster than Python lists for numerical work.

NumPy arrays are specialized for numerical computation and support efficient, vectorized mathematical and statistical operations.

NumPy arrays hold elements of a single data type in one contiguous block of memory.
Python lists hold references to objects that can be of any type and can live anywhere in memory.

Because the elements sit next to each other and share one type, NumPy can run its loops in compiled code without per-element type checks and with better CPU cache use. A list of Python objects also needs more memory per element than a typed array.

2. What is the difference between map() and applymap()?

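map and applymap both apply a function element by element.

However, Series.map works on a single Series, while DataFrame.applymap works on every cell of a DataFrame. Since pandas 2.1, applymap is deprecated in favor of DataFrame.map, which does the same thing.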

3. Explain zip() and enumerate().

Given multiple iterables, zip yields tuples that group the items at the same position, stopping when the shortest iterable is exhausted.

Each tuple has one item per iterable passed in, so the number of tuples equals the length of the shortest iterable.

Python

list1 = [1, 2, 3, 4, 5]
list2 = ['cow', 'goat', 'hen']
list3 = ['the', 'quick', 'brown', 'fox']
list(zip(list1, list2, list3))
# Output: [(1, 'cow', 'the'), (2, 'goat', 'quick'), (3, 'hen', 'brown')]

enumerate yields a tuple for each item in an iterable: the first value is the item's index and the second is the item itself.

This makes it possible to access both the position of an item in a list and its value.

Python

e = enumerate(list3)

list(e)
# Output: [(0, 'the'), (1, 'quick'), (2, 'brown'), (3, 'fox')]
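
As a hedged illustration of the truncation behavior described above, the sketch below contrasts `zip` with `itertools.zip_longest`, a standard-library alternative that pads the shorter inputs, and uses the optional `start` argument of `enumerate`. The list names and the `fillvalue` are illustrative choices, not part of the original answer.

```python
from itertools import zip_longest

animals = ['cow', 'goat', 'hen']
words = ['the', 'quick', 'brown', 'fox']

# zip stops as soon as the shortest iterable is exhausted
truncated = list(zip(animals, words))

# zip_longest keeps going and pads the missing values instead
padded = list(zip_longest(animals, words, fillvalue=None))

# enumerate accepts a start value, which is handy for 1-based numbering
numbered = list(enumerate(animals, start=1))

print(truncated)  # [('cow', 'the'), ('goat', 'quick'), ('hen', 'brown')]
print(padded[-1])  # (None, 'fox')
print(numbered)  # [(1, 'cow'), (2, 'goat'), (3, 'hen')]
```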

4. What is a lambda function?

A lambda function is an anonymous function declared without the def keyword.

A lambda function has only one expression but can have multiple arguments. It can make code more concise but less readable.

Python

def myfunc(n):
  return lambda a, b, c : a + b + c * n

my_func = myfunc(3)

print(my_func(5, 6, 2))
# Output: 17

5. How do map, reduce, and filter functions work?

map: Applies a function to each item in an iterable.

Python

def myfunc(n):
  return n**2

x = map(myfunc, (1, 2, 3))
list(x)
# [1, 4, 9]

filter: Keeps only the items for which the function returns a truthy value and returns them as a new lazy iterable.

Python

names = ["Derrick", "Dennis", "Joe"]

def myFunc(x):
  return x.startswith("D")

final_names = filter(myFunc, names)

for x in final_names:
  print(x)

# Output: Derrick, Dennis

reduce: Applies a function from left to right, reducing the iterable to a single value.

Python

from functools import reduce
reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
# Output: 15

6. You are given an integer array of coins representing different coin denominations and an integer amount representing the total amount of money. Write a function `coin_change` that returns the fewest coins needed to make up that amount. If that amount cannot be made up by any combination of the coins, return -1. Assume that you have infinite coins of different kinds.

Python

from typing import List

def coin_change(coins: List[int], amount: int) -> int:
    # Initialize DP array with a value greater than the maximum possible number of coins needed
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0  # Base case: 0 coins needed to make amount 0

    # Process each amount from 1 to the given amount
    for i in range(1, amount + 1):
        for coin in coins:
            if i - coin >= 0:
                dp[i] = min(dp[i], dp[i - coin] + 1)

    # If dp[amount] is still infinity, it means it's not possible to form the amount
    return dp[amount] if dp[amount] != float('inf') else -1

7. We have a long list of unsorted numbers (potentially millions) and want to find the M largest numbers. Implement a function find_largest(input, m) to find and return the largest M values given an input array or file. Return None (Python) or null if the input array is empty.

Python

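from heapq import heappush, heappushpop

def find_largest(input, m):
  if not input:
    return None

  # Min-heap holding the m largest values seen so far
  max_nums = []
  for i in input:
    value = int(i)
    if len(max_nums) < m:
      heappush(max_nums, value)
    elif max_nums and value > max_nums[0]:
      # Replace the smallest kept value with the larger newcomer
      heappushpop(max_nums, value)
  return max_nums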
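
For comparison, the standard library already offers `heapq.nlargest`, which could serve as a quick cross-check for a hand-written heap solution. The wrapper name `find_largest_builtin` below is hypothetical; treat this as a sketch rather than the expected interview answer.

```python
from heapq import nlargest

def find_largest_builtin(values, m):
    """Return the m largest values in descending order, or None for empty input."""
    if not values:
        return None
    # nlargest avoids sorting the whole input when m is small
    return nlargest(m, (int(v) for v in values))

print(find_largest_builtin([5, 1, 9, 3, 7, 2], 3))  # [9, 7, 5]
print(find_largest_builtin([], 3))                  # None
```

Unlike the heap-based function, this version returns the values already sorted from largest to smallest.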

8. You are given an array of characters with sequences separated by spaces. Each space-delimited sequence of characters defines a word. Implement a function `reverse_words` that reverses the order of the words in the array in the most efficient manner.

Python

def reverse_words(arr):
    # Reverse all characters
    n = len(arr)
    mirror_reverse(arr, 0, n - 1)

    # Reverse each word
    word_start = None
    for i in range(n):
        if arr[i] == ' ':
            if word_start is not None:
                mirror_reverse(arr, word_start, i - 1)
                word_start = None
        elif i == n - 1:
            if word_start is not None:
                mirror_reverse(arr, word_start, i)
        else:
            if word_start is None:
                word_start = i

    return arr

# Helper function - reverses the items of arr between start and end, in place.
# Python passes the list by reference, so the caller sees the swaps;
# in a language that copies arrays on call, return the reversed array instead.

def mirror_reverse(arr, start, end):
    while start < end:
        arr[start], arr[end] = arr[end], arr[start]
        start += 1
        end -= 1

9. What is the difference between return and yield keywords?

return: Ends the function call and hands a value back to the caller. Any code after the return statement in that function does not run.

Python

def tryexponent():
    return "www.tryexponent.com"
    print("Trying exponent!") # This will not be executed

print(tryexponent())
# Output: www.tryexponent.com

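yield: Turns the function into a generator. Each yield hands a value to the caller and pauses the function, keeping its local state so that execution resumes from the same point on the next request.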

Python

def gen_func(x):
    for i in range(x):
        yield i

generator = gen_func(10)
print(next(generator))
# Output: 0
print(next(generator))
# Output: 1
for x in generator:
    print(x)
# Output: 2, 3, 4, 5, 6, 7, 8, 9

10. What are global and local variables in Python?

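A local variable is defined inside a function and can only be accessed within that function.
A global variable is defined at the top level of a module, outside any function or class, and can be read anywhere in that module. To rebind a global variable from inside a function, declare it with the global keyword first; otherwise the assignment creates a new local variable.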
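
The scoping rules above can be sketched as follows. The variable and function names are illustrative only.

```python
counter = 0  # global: defined at module level

def read_global():
    # Reading a global inside a function needs no declaration
    return counter

def shadow_global():
    counter = 100  # local: a new name that hides the global inside this function
    return counter

def update_global():
    global counter  # rebind the module-level name instead of creating a local
    counter += 1
    return counter

print(read_global())    # 0
print(shadow_global())  # 100
print(counter)          # 0 (the global is untouched)
print(update_global())  # 1
print(counter)          # 1
```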

11. Write a function to sample words uniformly from a text file.

Python

import random
import string

class UniformWordSampler:
    def __init__(self, file_path):
        self.file_path = file_path
        self.words = self._load_words_from_file()

    def _load_words_from_file(self):
        """Reads and tokenizes the text file."""
        with open(self.file_path, 'r') as file:
            text = file.read()

        # Remove punctuation and split into words
        translator = str.maketrans('', '', string.punctuation)
        text = text.translate(translator)  # Remove punctuation
        words = text.split()  # Split text into words
        return words

    def sample_words(self, k):
        """Samples 'k' words uniformly from the tokenized words."""
        if k > len(self.words):
            raise ValueError("Sample size exceeds the number of available words.")
        
        return random.sample(self.words, k)

def main():
    # Path to the text file (update this to your actual file path)
    file_path = "text.txt"
    
    # Create the sampler object
    sampler = UniformWordSampler(file_path)
    
    # Sample 5 words uniformly from the text file
    sampled_words = sampler.sample_words(5)
    
    # Display the sampled words
    print("Sampled words:", sampled_words)

if __name__ == "__main__":
    main()
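
The class above reads the whole file into memory. If the file is too large for that, one alternative is reservoir sampling, a different technique that keeps only `k` words at a time while still giving every word the same chance of selection. The sketch below is a minimal version under that assumption; `reservoir_sample` and `iter_words` are hypothetical helper names, and the in-memory `lines` list stands in for an open file.

```python
import random

def reservoir_sample(words, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    If the iterable has fewer than k items, all of them are returned.
    """
    reservoir = []
    for index, word in enumerate(words):
        if index < k:
            # Fill the reservoir with the first k items
            reservoir.append(word)
        else:
            # Keep the new item with probability k / (index + 1)
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = word
    return reservoir

def iter_words(lines):
    """Yield whitespace-separated words one at a time from an iterable of lines."""
    for line in lines:
        yield from line.split()

lines = ["the quick brown fox", "jumps over the lazy dog"]
print(reservoir_sample(iter_words(lines), 3))
```

Because an open file object is itself an iterable of lines, `iter_words(file)` would work the same way on a real file.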

12. What are decorators in Python? How are they used?

A decorator is a design pattern that lets you modify or extend the behavior of a function or class without changing its source code. In Python, a decorator is a callable that takes a function and returns a replacement, usually a wrapper around the original.

This is possible because functions are first-class citizens in Python. They can be returned from a function, passed as an argument, and assigned to a variable.

Python

def titlecase_decorator(function):
    def wrapper():
        func = function()
        make_titlecase = func.title()
        return make_titlecase
    return wrapper

@titlecase_decorator
def make_title():
    return 'learning python decorators'

print(make_title())
# Output: Learning Python Decorators
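
A wrapper that takes no parameters only works on functions without arguments. A more general sketch, assuming the decorated function returns a string, forwards `*args` and `**kwargs` and uses `functools.wraps` to preserve the original function's metadata. The names `titlecase` and `greet` are illustrative.

```python
import functools

def titlecase(function):
    @functools.wraps(function)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # Forward any arguments, then post-process the result
        return function(*args, **kwargs).title()
    return wrapper

@titlecase
def greet(name, greeting="hello"):
    """Build a lowercase greeting."""
    return f"{greeting} {name}"

print(greet("ada lovelace"))          # Hello Ada Lovelace
print(greet("grace", greeting="hi"))  # Hi Grace
print(greet.__name__)                 # greet
```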

13. Implement a character-level language model with a Vanilla Recurrent Neural Network in Python

Python

import numpy as np

# data I/O
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

def sample(h, seed_ix, n):
  """ 
  sample a sequence of integers from the model 
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in range(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes

n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while True:
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0:
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print('----\n %s \n----' % (txt, ))

  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print('iter %d, loss: %f' % (n, smooth_loss)) # print progress
  
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by], 
                                [dWxh, dWhh, dWhy, dbh, dby], 
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move data pointer
  n += 1 # iteration counter

14. Given an integer n, representing the number of steps in a staircase, write a function climb_stairs that returns the number of distinct ways to climb to the top. Each time, you can either climb 1 step or 2 steps. You may assume that n is a non-negative integer.

Python

def climb_stairs(n: int) -> int:
    if n <= 1:
        return 1

    # Initialize the base cases
    ways = [0] * (n + 1)
    ways[0] = 1
    ways[1] = 1

    # Fill the array with the number of ways to reach each step
    for i in range(2, n + 1):
        ways[i] = ways[i - 1] + ways[i - 2]

    return ways[n]

Data Science with Python

These interview questions test your ability to use Python to solve data science problems.

15. What is the difference between indexing and slicing in NumPy?

Indexing accesses a single element at a given position in a NumPy array.
Slicing accesses a subset of the array within a range. Basic slicing returns a view of the original data rather than a copy.

Example of indexing:

Python

import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print('3rd element on 1st row:', arr[0, 2])
# Output: 3rd element on 1st row: 3

Example of slicing:

Python

import numpy as np

matrix = np.arange(1, 17).reshape(4, 4)
print(matrix[2:4, 2:4]) # [start_row:end_row, start_column:end_column]
# Output:
# [[11 12]
#  [15 16]]

16. Implement simulations to estimate probabilities for dice rolling.

Python

import random

class DiceRollSimulator:
    def __init__(self, trials):
        self.trials = trials

    def roll_dice(self):
        """Simulate rolling two six-sided dice and return their sum."""
        die1 = random.randint(1, 6)
        die2 = random.randint(1, 6)
        return die1 + die2

    def simulate(self, target_sum):
        """Simulate rolling dice and estimate the probability of a specific sum."""
        success_count = 0

        for _ in range(self.trials):
            if self.roll_dice() == target_sum:
                success_count += 1
        
        estimated_probability = success_count / self.trials
        return estimated_probability


def main():
    trials = 1000000  # Number of trials for simulation
    target_sum = 7    # Target sum for which we want to estimate the probability
    
    simulator = DiceRollSimulator(trials)
    
    estimated_prob = simulator.simulate(target_sum)
    print(f"Estimated probability of rolling a sum of {target_sum}: {estimated_prob:.4f}")

if __name__ == "__main__":
    main()
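
A simulation is easier to trust when it can be compared with an exact answer. For two fair dice the sample space is small enough to enumerate, so the sketch below counts favorable outcomes directly. The function name `exact_probability` and its default parameters are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def exact_probability(target_sum, sides=6, dice=2):
    """Exact probability that `dice` fair dice with `sides` faces sum to target_sum."""
    # Enumerate every equally likely outcome, e.g. (1, 1), (1, 2), ..., (6, 6)
    outcomes = list(product(range(1, sides + 1), repeat=dice))
    favorable = sum(1 for roll in outcomes if sum(roll) == target_sum)
    return Fraction(favorable, len(outcomes))

p = exact_probability(7)
print(p)  # 1/6
```

With enough trials, the estimate from the simulator should land close to this value, about 0.1667 for a sum of 7.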

17. What is the difference between merge, join, and concatenate?

merge combines DataFrames on one or more key columns (or on indexes). By default it performs an inner join, keeping only the keys present in both DataFrames; the how parameter selects a left, right, outer, or cross join instead.
join combines DataFrames on their index by default and performs a left join unless told otherwise. A left join keeps every key from the left table, so rows with no match in the right table get NaN in the right table's columns.
concat (pd.concat) stacks pandas objects along a particular axis, for example, by rows or columns.

18. Explain list comprehension and dict comprehension.

List comprehension provides a simple interface for creating new lists from an iterable.

Python

fruits = ["boy", "bowtie", "cow", "goat", "boat"]
newlist = [x for x in fruits if "b" in x]
print(newlist)
# Output: ['boy', 'bowtie', 'boat']

Dictionary comprehension provides a simple interface for creating new dictionaries from an iterable.

Python

dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
triple_dict1 = {k: v * 3 for k, v in dict1.items() if v > 2}
print(triple_dict1)
# Output: {'c': 9, 'd': 12, 'e': 15}

19. What is Regex? Can you use Regex to validate an email address?

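A regular expression (regex) is a pattern made of ordinary and special characters that describes a set of strings. Python's re module uses such patterns for searching, matching, and replacing text.

Yes, a regex can perform a basic format check on an email address. The re.match function, which matches from the start of the string, can be used for this exercise. The pattern below is deliberately simple and accepts only lowercase addresses; fully validating real-world addresses takes more than a regex.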

Python

import re

email = 'user@example.com'  # placeholder address

def validate_email(email):
    pattern = r'^([a-z0-9_.-]+)@([a-z0-9-]+)\.([a-z0-9.-]+)$'
    search = re.match(pattern, email)
    if search:
        return f"{search.group()} is okay"
    else:
        return f"{email} is not valid"

print(validate_email(email))
# Output: user@example.com is okay

20. Discuss the pros and cons of random forests in classification and regression.

A Random Forest is a collection of decision trees. It selects the class with the most votes from all the trees in the forest.

You may encounter questions about random forests in your machine learning interviews.

Classification:

Select random samples from the dataset with replacement.
Build a decision tree for each sample.
Obtain a prediction from each tree.
Vote.
Select the prediction with the most votes.

Regression:

Select random samples from the dataset with replacement.
Build a decision tree for each sample.
Average the predictions from all the trees.
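
The sampling and aggregation steps above can be sketched without a machine learning library. This is a simplified illustration, not a full random forest: tree training is left out, and the per-tree predictions are hard-coded hypothetical values. All function names are illustrative.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one sample per tree)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(predictions)

rng = random.Random(42)
rows = [("a", 1), ("b", 0), ("c", 1), ("d", 0)]
print(len(bootstrap_sample(rows, rng)))  # 4 (same size as the input, repeats allowed)

# Hypothetical outputs of already-trained trees for one input
print(majority_vote(["spam", "ham", "spam", "spam", "ham"]))  # spam
print(average_prediction([2.0, 3.0, 4.0]))                    # 3.0
```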

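Advantages:

Reduces overfitting by averaging many decision trees trained on different samples.
Usually more accurate than a single decision tree.
Trees are independent, so training parallelizes well on large datasets.
Provides feature importance.
Can be used for both classification and regression problems.

Disadvantages:

Harder to interpret than a single decision tree.
Slower to predict and more memory-hungry, since every tree must be stored and evaluated.
In regression, predictions cannot extend beyond the range of target values seen in training.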

21. What is the difference between lists, NumPy arrays, and sets in Python? When should you consider one over the other?

Lists, arrays, and sets are data structures for storing data in Python.

Lists: Denoted by [], store an ordered sequence of items of any type. For example, you can store integers, floats, and strings in the same list. List items can be accessed by index and changed in place.
NumPy arrays: Created with np.array(), store items of the same data type only. Very efficient for numerical computation compared to lists.
Sets: Written with braces, such as {1, 2, 3} (an empty pair {} creates a dict, so use set() for an empty set). Sets are unordered, hold only hashable items, and drop duplicates. Items can be added or removed but not accessed by index or changed in place.

Considerations:

Use lists for ordered collections that mix types or change size often.
Use NumPy arrays for numerical computation due to their speed.
Use sets for removing duplicates from a list and for fast membership checks.
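
A short sketch of the list and set behavior described above, using only built-in types. NumPy is left out here to keep the example dependency-free, and the variable names are illustrative.

```python
readings = [3, 1.5, "n/a", 3, 7, 3]  # lists keep order, duplicates, and mixed types

readings[2] = 0                      # items can be replaced by index
print(readings)                      # [3, 1.5, 0, 3, 7, 3]

unique = set(readings)               # sets drop duplicates and have no index
print(len(unique))                   # 4
print(7 in unique)                   # True (hash lookup, no scan of every item)

unique.add(10)                       # sets are mutable: elements can be added...
unique.discard(0)                    # ...and removed
print(sorted(unique))                # [1.5, 3, 7, 10]

frozen = frozenset(unique)           # frozenset is the immutable variant
```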

Advanced Python and Best Practices

22. Explain the most common Python string functions.

The top Python string functions include:

split: Splits a string into a list of substrings, on whitespace by default.

Python

string.split()

strip: Removes leading and trailing characters from a string. It strips whitespace by default, or any characters passed in, such as commas.

Python

string.strip(',')

upper: Converts a string to uppercase.

Python

string.upper()

capitalize: Uppercases the first character of a string and lowercases the rest.

Python

string.capitalize()

count: Counts the non-overlapping occurrences of a substring in a string.

Python

string.count('the')
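
The snippets above assume a variable named `string` already exists. A runnable sketch that applies the same methods to one sample sentence (chosen here for illustration) might look like this:

```python
text = "  the cat, the hat, and the bat,  "

cleaned = text.strip()        # no argument: removes leading/trailing whitespace
print(cleaned)                # the cat, the hat, and the bat,

trimmed = cleaned.strip(",")  # removes the given characters from both ends
print(trimmed)                # the cat, the hat, and the bat

print(trimmed.split(", "))    # ['the cat', 'the hat', 'and the bat']
print(trimmed.upper())        # THE CAT, THE HAT, AND THE BAT
print(trimmed.capitalize())   # The cat, the hat, and the bat
print(trimmed.count("the"))   # 3
```

String methods return new strings because Python strings are immutable, so `text` itself is unchanged after these calls.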

23. Discuss Python unit testing with an example.

The Python unittest module provides the tools needed for running tests. Creating tests ensures that the code runs as expected and prevents accidental bugs when modifying code.

This is done by writing test cases that assert different scenarios, for example, checking that the answer returned by a function is greater than zero.

Python

def name_as_uppercase(name):
    return name.upper()

def check_balance(amount_paid, loan):
    return amount_paid - loan

import unittest

class TestCases(unittest.TestCase):
    def test_upper(self):
        new_name = name_as_uppercase('derrick')
        self.assertEqual('derrick'.upper(), new_name)

    def test_balance(self):
        balance = check_balance(20, 10)
        self.assertGreaterEqual(balance, 0)

if __name__ == '__main__':
    unittest.main()

24. Discuss different types of variables in Python OOP.

Class variables: Defined in the class body, outside any method, and shared by all instances of the class.

Python

class School():
    language = "English" # class attribute

    def __init__(self, name, location): #__init__() sets the initial state of the object 
        self.name = name # instance attribute
        self.location = location # instance attribute

Instance variables: Set on an individual instance, usually in __init__ through self, and unique to that instance.

Python

class School():
    def __init__(self, name, location): #__init__() sets the initial state of the object 
        self.name = name # instance attribute
        self.location = location # instance attribute

Local variables: Defined within methods and only available within those methods.
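
The three kinds of variables can be seen side by side in one sketch. The `describe` method and the school names are hypothetical additions for illustration.

```python
class School:
    language = "English"              # class variable, shared by every instance

    def __init__(self, name):
        self.name = name              # instance variable, unique to each object

    def describe(self):
        label = f"{self.name} ({self.language})"  # local variable
        return label

a = School("North High")
b = School("South High")

School.language = "French"            # rebinding on the class affects all instances
print(a.language, b.language)         # French French

a.language = "Spanish"                # assignment via an instance creates an
                                      # instance attribute that shadows the class one
print(a.language, b.language)         # Spanish French
print(a.describe())                   # North High (Spanish)
```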

25. Differentiate the types of methods in Python OOP.

Class methods: Receive the class itself as the first parameter, conventionally named cls, so they can read and change class state but not the state of a particular instance. They are declared with @classmethod.

Python

class School():
    language = "English" # class attribute

    @classmethod
    def chat_motto(cls, motto):
        return f"The motto is: '{motto}', class is '{cls}'"

Instance methods: Can access both class and instance variables. They receive the instance as the first argument, conventionally named self.

Python

class School():
    def chat_motto(self, motto):
        return f"The motto is: '{motto}'"

Static methods: Don’t have access to class or instance variables and don’t take a specific first parameter such as cls or self.

Python

class School():
    @staticmethod 
    def chat_motto(motto):
        return f"The motto is: '{motto}'"
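
To compare the three method types in use, the sketch below puts them in one class and calls each. The method names `describe`, `set_language`, `from_csv`, and `is_valid_name` are hypothetical and chosen only for illustration.

```python
class School:
    language = "English"

    def __init__(self, name):
        self.name = name

    def describe(self):                 # instance method: receives the instance
        return f"{self.name} teaches in {self.language}"

    @classmethod
    def set_language(cls, language):    # class method: receives the class
        cls.language = language

    @classmethod
    def from_csv(cls, line):            # class methods are also common as
        return cls(line.split(",")[0])  # alternative constructors

    @staticmethod
    def is_valid_name(name):            # static method: no implicit first argument
        return bool(name.strip())

school = School.from_csv("North High,Nairobi")
School.set_language("French")
print(school.describe())                # North High teaches in French
print(School.is_valid_name("   "))      # False
```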

Python Data Science Interview Tips

Hopefully, these Python questions have given you a glimpse into what to expect in your data science interviews.

Explore dozens of mock interviews and practice lessons in our data science interview course.
Schedule a free mock interview session to practice answering questions with peers.
Get data science interview coaching from scientists at top companies with years of experience.

Good luck with your upcoming interview!
