Home Questions Data Engineer Data Pipeline Design

Data Engineer Data Pipeline Design Interview Questions

Review this list of 16 data pipeline design data engineer interview questions and answers verified by hiring managers and candidates.

+ Share interview

Product

Engineering

Operations

Design

Marketing

Data

Sales

Finance

Consulting

Security

Share interview

Data Engineer Software Engineer Business Analyst

Asked at Databricks • 2 months ago
Design a document processing pipeline.
Data Engineer
Data Pipeline Design
+2 more
2 answers I was asked this
"High Level Architect Client v API Gateway v Object Storage v Message Queue v Worker v Database Client should can document with a web site or directly with API services. API Gateway should be used for upload document,get document info and state. Object storage should be used for original document and send event to Message Queue for starting. Message Queue is neccessary because there are millions of document should be process each time. Worker can get text from document with OCR. Database shoul"
Berk C. - "High Level Architect Client v API Gateway v Object Storage v Message Queue v Worker v Database Client should can document with a web site or directly with API services. API Gateway should be used for upload document,get document info and state. Object storage should be used for original document and send event to Message Queue for starting. Message Queue is neccessary because there are millions of document should be process each time. Worker can get text from document with OCR. Database shoul"See full answer
Data Engineer
Data Pipeline Design
+2 more
Asked at Databricks • a year ago
What's the difference between a data lakehouse and a data warehouse?
Data Engineer
Data Pipeline Design
4 answers I was asked this
+1
"Data Warehouses are purpose built to derive a business insight but datalakes are build to store virtually all data generated by the organization on which meaning can be derived later."
Anonymous Partridge - "Data Warehouses are purpose built to derive a business insight but datalakes are build to store virtually all data generated by the organization on which meaning can be derived later."See full answer
Data Engineer
Data Pipeline Design
Design an ETL Pipeline for a ML Platform for AWS
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Design Netflix's Clickstream Data Pipeline
Data Engineer
Data Pipeline Design
+1 more
Add answer I was asked this
Data Engineer
Data Pipeline Design
+1 more
Asked at Databricks • a year ago
How would you handle slow query performance for a single-user SQL endpoint in Databricks, where all sequentially run queries are affected?
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design

🧠 Want an expert answer to a question? Saving questions lets us know what content to make next.

Asked at Google • a year ago
When is Hadoop better than PySpark?
Data Engineer
Data Pipeline Design
1 answer I was asked this
"Hadoop is better than PySpark when you are dealing with extremely large scale, batch oriented, non-iterative workloads where in-memory computing isn't feasible/ necessary, like log storage or ETL workflows that don't require high response times. It's also better in situations where the Hadoop ecosystem is already deeply embedded and where there is a need for resource conscious, fault tolerant computation without the overhead of Spark's memory constraints. In these such scenarios, Hadoop's disk-b"
Joshua R. - "Hadoop is better than PySpark when you are dealing with extremely large scale, batch oriented, non-iterative workloads where in-memory computing isn't feasible/ necessary, like log storage or ETL workflows that don't require high response times. It's also better in situations where the Hadoop ecosystem is already deeply embedded and where there is a need for resource conscious, fault tolerant computation without the overhead of Spark's memory constraints. In these such scenarios, Hadoop's disk-b"See full answer
Data Engineer
Data Pipeline Design
Design an ETL Pipeline for Slack for School
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Asked at Databricks • a year ago
How would you handle scheduling dependencies between two nightly Jobs to ensure the second Job does not fail if the first Job runs longer than expected?
Data Engineer
Data Pipeline Design
1 answer I was asked this
"There are 2 questions popping into my mind: Should the 2nd job have to kick off at 12:30AM? Are there others depending on the 2nd job? If both answers are no, we may simply postpone the second job to allow sufficient time for the first one to complete. If they are yeses, we could let the 2nd job retry to a certain amount of times. Make sure that even reaching the maximum of retries won't delay or fail the following jobs."
Anzhe M. - "There are 2 questions popping into my mind: Should the 2nd job have to kick off at 12:30AM? Are there others depending on the 2nd job? If both answers are no, we may simply postpone the second job to allow sufficient time for the first one to complete. If they are yeses, we could let the 2nd job retry to a certain amount of times. Make sure that even reaching the maximum of retries won't delay or fail the following jobs."See full answer
Data Engineer
Data Pipeline Design
Asked at Databricks • a year ago
What is a Medallion Architecture?
Data Engineer
Data Pipeline Design
1 answer I was asked this
Data Engineer
Data Pipeline Design
Asked at Amazon • 2 years ago
Design a data pipeline that updates hourly and powers a dashboard showing the most common Alexa user requests, broken down by country.
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Asked at Databricks • a year ago
How would you handle a task in a nightly job that fails unexpectedly during 10 percent of the runs?
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Design a data pipeline that complies with GDPR.
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Design a system to ingest large amounts of JSON data from multiple S3 buckets
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Asked at Databricks • a year ago
When should you use Delta Live Tables over standard data pipelines built on Spark and Delta Lake?
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Asked at Databricks • a year ago
What is delta lake?
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design
Asked at Databricks • a year ago
When should you use a job cluster instead of an all-purpose cluster?
Data Engineer
Data Pipeline Design
Add answer I was asked this
Data Engineer
Data Pipeline Design