Top Data Engineer Interview Questions

Review this list of 167 Data Engineer interview questions and answers verified by hiring managers and candidates.

Asked at Adobe, Apple, Nvidia • 2 years ago
Build a Basic Regex Parser
IDE
Hard
Data Engineer
Data Structures & Algorithms
+3 more
2 answers
Reno S. - "func isMatch(text: String, pattern: String) -> Bool { // Convert strings to arrays for easier indexing let s = Array(text.characters) let p = Array(pattern.characters) guard !s.isEmpty && !p.isEmpty else { return true } // Create DP table: dp[i][j] represents if s[0...i-1] matches p[0...j-1] var dp = Array(repeating: Array(repeating: false, count: p.count + 1), count: s.count + 1) // Empty pattern matches empty string dp[0]["
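The DP-table idea in the answer preview above can be sketched in Python. This is a minimal sketch that assumes the parser only needs `.` (any single character) and `*` (zero or more of the preceding element) with full-string matching; the listing does not spell out the operator set, and `is_match` is an illustrative name rather than the answer author's code.

```python
def is_match(text: str, pattern: str) -> bool:
    """Full-string match with '.' and '*' (assumed operator set)."""
    n, m = len(text), len(pattern)
    # dp[i][j] is True when text[:i] matches pattern[:j]
    dp = [[False] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = True  # empty pattern matches empty text

    # Empty text can still match patterns such as "a*b*"
    for j in range(2, m + 1):
        if pattern[j - 1] == "*":
            dp[0][j] = dp[0][j - 2]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if pattern[j - 1] == "*":
                # Assumption: '*' always follows some element (never first)
                zero = j >= 2 and dp[i][j - 2]  # use the element zero times
                more = (
                    j >= 2
                    and pattern[j - 2] in (text[i - 1], ".")
                    and dp[i - 1][j]  # consume one more character
                )
                dp[i][j] = zero or more
            else:
                dp[i][j] = (
                    pattern[j - 1] in (text[i - 1], ".") and dp[i - 1][j - 1]
                )
    return dp[n][m]
```

Unlike an early return on empty input, the table handles empty text and empty pattern uniformly, so a pattern like `a*` still matches the empty string.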
Asked at Amazon • a year ago
Create geographic and demographic dashboards for weekly, monthly, and yearly analytics using order data (100M daily records for 5 years) and customer data (1B customers).
Data Engineer
Data Modeling
1 answer
Hayatu H. - "What do all data scientists need to know about how to work with very large datasets? Corrin Lakeland, M.S. Data Science, University of St. Thomas, St. Paul (2018), Data Science consultant and manager. Upvoted by [Tom Halloin](https://www.quora"
Asked at Adobe, Apple, Capital One + 1 more • 2 years ago
Roman to Integer
Data Engineer
Data Structures & Algorithms
+4 more
Asked at Adobe, Amazon, Apple + 1 more • 2 years ago
Top k frequent elements
Data Engineer
Data Structures & Algorithms
+3 more
1 answer
Chen J. - "Leetcode 347: heap + hash table. Follow-up question: create the heap with length K instead of N (higher time complexity but less space)."
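The heap plus hash table approach mentioned above, with the heap capped at K entries, might look like the following. This is a sketch under the assumption that the input values are hashable and mutually comparable (ties on frequency are broken by value); `top_k_frequent` is an illustrative name.

```python
import heapq
from collections import Counter


def top_k_frequent(nums, k):
    """Return the k most frequent values, most frequent first."""
    counts = Counter(nums)  # hash table: value -> frequency
    heap = []  # min-heap of (frequency, value), never larger than k
    for value, freq in counts.items():
        heapq.heappush(heap, (freq, value))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the least frequent entry
    # Sort the k survivors so the most frequent comes first
    return [value for freq, value in sorted(heap, reverse=True)]
```

With u distinct values, the heap holds at most k entries, so the extra space beyond the counter is O(k) and each push or pop costs O(log k).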
Asked at Apple, Google, LinkedIn • 2 years ago
Calculate the height of a binary tree.
Data Engineer
Data Structures & Algorithms
+3 more
3 answers
Mohith J. - "Recursion: 0 if NULL, else 1+max(height(left), height(right))"
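The recursion in the answer above translates almost directly into code. A minimal sketch, assuming a simple `Node` class (hypothetical, not from the listing) and the convention that an empty tree has height 0 and a single node has height 1:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def height(node: Optional[Node]) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    if node is None:
        return 0  # empty subtree contributes nothing
    return 1 + max(height(node.left), height(node.right))
```

For very deep, skewed trees the recursion depth equals the tree height, so an iterative level-order traversal may be preferable in that case.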

Asked at Adobe, Intel, Nvidia + 1 more • 2 years ago
Sort Colors
Data Engineer
Data Structures & Algorithms
+4 more
Asked at Deloitte • a year ago
Explain the key differences between BETWEEN and HAVING clauses in SQL.
Data Engineer
Concept
+4 more
3 answers
Meenakshi D. - "BETWEEN and HAVING clauses in SQL serve different purposes: 1. BETWEEN Clause: Used to filter rows based on a range of values. Works with numeric, date, or text values. Can be used with WHERE or HAVING clauses. The range includes both lower and upper bounds. Example: filtering employees with salaries between 30,000 and 50,000: `SELECT * FROM Employees WHERE salary BETWEEN 30000 AND 50000;` 2. HAVING Clause: Used to filter **groups"
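The distinction can be checked against a small in-memory SQLite database. In this sketch the `Employees` table and `salary` column come from the answer above, while the `name` and `department` columns and all sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 30000),
        ("Ben", "Sales", 50000),
        ("Cal", "Eng", 70000),
        ("Dee", "Eng", 90000),
        ("Eli", "Ops", 45000),
    ],
)

# BETWEEN filters individual rows; both bounds are inclusive.
in_range = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM Employees "
        "WHERE salary BETWEEN 30000 AND 50000 ORDER BY name"
    )
]

# HAVING filters whole groups after GROUP BY has aggregated them.
high_avg_departments = [
    row[0]
    for row in conn.execute(
        "SELECT department FROM Employees "
        "GROUP BY department HAVING AVG(salary) > 60000 ORDER BY department"
    )
]

# The two combine: BETWEEN can be the condition inside HAVING.
mid_avg_departments = [
    row[0]
    for row in conn.execute(
        "SELECT department FROM Employees GROUP BY department "
        "HAVING AVG(salary) BETWEEN 40000 AND 45000 ORDER BY department"
    )
]
```

Here `in_range` includes the employees at exactly 30,000 and 50,000, which shows the inclusive bounds, while the two `HAVING` queries operate on per-department averages rather than on single rows.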
Asked at Apple • 2 years ago
Set Matrix Zeroes
Data Engineer
Data Structures & Algorithms
+2 more
3 answers
Anonymous Wasp - "I was able to provide the optimal approach and coded it up"
Asked at Walmart Labs • 2 years ago
Tell me about your e-commerce experience.
Data Engineer
Behavioral
+2 more
1 answer
Rajeev K. - "I’ve spent over 6 years building and scaling e-commerce products across EMEA and APAC. At Jumia, I led product initiatives on the checkout and payments side. For example, I launched gamified promotions on PDP and checkout that improved engagement and delivered a 2.3x uplift in conversion. I also introduced automated installment payments and order cancellation flows, which not only improved user trust but also reduced complaints by 30% and lowered operational costs. Before that, at Lazada, I work"
Asked at Google • 2 years ago
When is Hadoop better than PySpark?
Data Engineer
Data Pipeline Design
1 answer
Joshua R. - "Hadoop is better than PySpark when you are dealing with extremely large-scale, batch-oriented, non-iterative workloads where in-memory computing isn't feasible or necessary, like log storage or ETL workflows that don't require fast response times. It's also better in situations where the Hadoop ecosystem is already deeply embedded and where there is a need for resource-conscious, fault-tolerant computation without the overhead of Spark's memory constraints. In such scenarios, Hadoop's disk-b"
Explain the differences between multithreading and multiprocessing.
Data Engineer
Concept
4 answers
Erjan G. - "A process can include many threads. Both are good for concurrent and parallel task execution."
Asked at Databricks • 2 years ago
How would you handle scheduling dependencies between two nightly Jobs to ensure the second Job does not fail if the first Job runs longer than expected?
Data Engineer
Data Pipeline Design
1 answer
Anzhe M. - "Two questions come to mind: Does the 2nd job have to kick off at 12:30 AM? Are other jobs depending on the 2nd job? If both answers are no, we may simply postpone the second job to allow sufficient time for the first one to complete. If the answers are yes, we could let the 2nd job retry a limited number of times. Make sure that even reaching the maximum number of retries won't delay or fail the downstream jobs."
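The bounded-retry idea in the answer above can be sketched as a small wrapper around the second job. Everything here is hypothetical scaffolding: `upstream_done` stands for whatever completion signal the first job exposes (a marker file, a status row, a scheduler API), and the default retry count and wait time are placeholders, not recommendations.

```python
import time


def run_after_upstream(
    upstream_done,       # callable returning True once job 1 has finished
    run_job,             # callable that executes job 2
    max_retries=5,       # hypothetical default
    wait_seconds=600,    # hypothetical default
    sleep=time.sleep,    # injectable so the wait can be faked in tests
):
    """Run job 2 only after job 1 completes; give up after max_retries waits."""
    for attempt in range(max_retries + 1):
        if upstream_done():
            run_job()
            return True
        if attempt < max_retries:
            sleep(wait_seconds)  # job 1 is still running, wait and re-check
    # Retries exhausted: report instead of running on incomplete input
    return False
```

Returning `False` rather than raising lets the caller decide whether to alert, skip, or reschedule, which keeps a slow first job from cascading into failures of downstream jobs. Orchestrators with explicit task dependencies can replace this polling entirely.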
Asked at Discord, Two Sigma • 2 months ago
What other companies are you interviewing at and why?
Data Engineer
Behavioral
+4 more
Asked at Apple, Goldman Sachs, Oracle • 2 years ago
Implement a hashmap without using any libraries.
Data Engineer
Data Structures & Algorithms
+2 more
1 answer
Md kamrul H. - "public class HashMap<T, V> { public class Element { T key; V value; Element(T k, V v) { this.key = k; this.value = v; } } private static final int DEFAULT_CAPACITY = 16; private static final float LOAD_FACTOR = 0.75f; private LinkedList[] table = new LinkedList[DEFAULT_CAPACITY]; private int size = 0; private int threshold = (int) (DEFAULT_CAPACITY * LOAD_FACTOR); public void put(T k"
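The same design (an array of buckets with separate chaining, an initial capacity of 16, and a 0.75 load factor) can be sketched in Python. This is an illustration rather than a port of the truncated answer: it relies only on built-ins (`hash()` and lists), and the method names are assumptions.

```python
class HashMap:
    DEFAULT_CAPACITY = 16
    LOAD_FACTOR = 0.75

    def __init__(self):
        # Each bucket is a list of [key, value] pairs (separate chaining)
        self._buckets = [[] for _ in range(self.DEFAULT_CAPACITY)]
        self._size = 0

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value  # overwrite an existing key
                return
        bucket.append([key, value])
        self._size += 1
        if self._size > len(self._buckets) * self.LOAD_FACTOR:
            self._resize()

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self._size -= 1
                return True
        return False

    def _resize(self):
        # Double the table and rehash every entry into its new bucket
        old = self._buckets
        self._buckets = [[] for _ in range(len(old) * 2)]
        for bucket in old:
            for k, v in bucket:
                self._bucket(k).append([k, v])

    def __len__(self):
        return self._size
```

Doubling the table once the load factor is exceeded keeps chains short, so `put`, `get`, and `remove` stay O(1) on average.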
Asked at Databricks • 2 years ago
What is a Medallion Architecture?
Data Engineer
Data Pipeline Design
2 answers
Ramagiri P. - "Medallion architecture is a layered data architecture used in lakehouse systems. Data flows through Bronze, Silver, and Gold layers, where each layer improves data quality. Bronze stores raw data, Silver contains cleaned and validated datasets, and Gold provides aggregated business-ready data for analytics and reporting. bronze_df = spark.read.json("/landing/apidata") bronze_df.write.format("delta").save("/bronze/users")"
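To make the three layers concrete without a Spark cluster, here is a plain-Python stand-in. It is only a sketch of the Bronze to Silver to Gold flow: the records, field names, and the `to_silver` and `to_gold` helpers are invented for illustration, and in a real lakehouse each layer would be a stored table rather than a list.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as they arrived, errors included
bronze = [
    {"user_id": "1", "country": "us", "amount": "10.5"},
    {"user_id": "2", "country": "DE", "amount": "oops"},
    {"user_id": "1", "country": "US", "amount": "4.5"},
    {"user_id": None, "country": "US", "amount": "3.0"},
]


def to_silver(rows):
    """Silver: validate, cast types, and normalize values."""
    clean = []
    for row in rows:
        if row["user_id"] is None:
            continue  # drop records without a key
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # drop records with an unparseable amount
        clean.append(
            {
                "user_id": int(row["user_id"]),
                "country": row["country"].upper(),
                "amount": amount,
            }
        )
    return clean


def to_gold(rows):
    """Gold: business-ready aggregate, here total amount per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is never modified, the Silver and Gold layers can be rebuilt from it whenever the cleaning or aggregation rules change.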
Asked at Walmart Labs • 2 years ago
Why do you want to work at Walmart Labs?
Data Engineer
Behavioral
+5 more
What types of indexes are in a relational database?
Data Engineer
Technical
1 answer
Erjan G. - "I said there are hashed, clustered, and non-clustered indexes."
Explain the differences between Parquet and Avro.
Data Engineer
Technical
2 answers
Dessalew A. - "Parquet = reading only the columns you need in a spreadsheet; Avro = reading full rows one at a time"
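The spreadsheet analogy above corresponds to columnar storage (Parquet) versus row-oriented storage (Avro). The sketch below mimics the two layouts with plain Python structures; it does not read or write either file format, and the sample records are made up.

```python
rows = [
    {"id": 1, "name": "Ana", "amount": 10.0},
    {"id": 2, "name": "Ben", "amount": 20.0},
    {"id": 3, "name": "Cal", "amount": 30.0},
]

# Row-oriented layout (Avro-like): complete records stored one after another
row_store = [tuple(record.values()) for record in rows]

# Column-oriented layout (Parquet-like): each column's values stored together
column_store = {key: [record[key] for record in rows] for key in rows[0]}

# Aggregating one column from the row layout walks every full record...
total_from_rows = sum(record[2] for record in row_store)

# ...while the column layout touches only the one column it needs
total_from_columns = sum(column_store["amount"])
```

This is why columnar files tend to suit analytical queries that scan a few columns, while row-oriented files tend to suit record-at-a-time writes and streaming.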
Asked at Anthropic, Meta • 7 months ago
Tell me about a time you had to learn something quickly.
Data Engineer
Behavioral
+1 more
Asked at Uber • a year ago
Design a rewarding system.
Data Engineer
Coding
1 answer
Anzhe M. - "Not my answer, but rather the details of this question. It should include the following functions: int insertNewCustomer(double revenue) -> returns a customer ID (assume auto-incremented & 0-based) int insertNewCustomer(double revenue, int referrerID) -> returns a customer ID (assume auto-incremented & 0-based) Set getLowestKCustomersByMinTotalRevenue(int k, double minTotalRevenue) -> returns customer IDs Note: The total revenue consists of the revenue that this customer bring"