Design a Data Warehouse Schema for Stripe
In this video, a data engineer does data modeling for Stripe.
Understanding Stripe's business model is key to creating an effective data model. Stripe facilitates online payments for businesses worldwide, enabling business users to set up products and payment links that allow customers to pay online without needing contracts with merchants or payment processors.
The focus here is to design a data model for Stripe's data warehouse, not the system design of the Stripe app or ETL pipeline. This means defining the warehouse's scope, use cases, and end users. In other words, who will use the warehouse, and what insights will they seek?
Initial Clarifying Steps (2-5 minutes)
Instead of asking direct questions, make statements and propose ideas to guide the interviewer, giving them space to steer the conversation. Keep the process interactive, as back-and-forth engagement signals positively. As you discuss, take notes; this helps the interviewer and yourself keep track of your approach and provides a review record.
Begin by suggesting that Stripe has three potential use cases related to its business model:
- Lifetime Value & Churn Rate: Monitoring these metrics for business users is essential for understanding growth.
- Payment Outcome Monitoring: Payment processes are complex, and a robust data warehouse should track outcomes, such as fraud, chargebacks, refunds, or operational errors by Stripe.
- Revenue Tracking: Monitoring revenue is crucial, especially if Stripe aims to IPO or attract investors.
You can design a data warehouse considering all three use cases but may need to delve deeply into just one.
Since transaction outcomes are critical to Stripe's success, it makes sense to focus on Use Case 2: handling payment outcomes.
Identifying data warehouse users (2 minutes)
The primary users of this data warehouse would be data scientists, analysts, product managers, and engineers. End users include C-level executives, investors, fraud prevention, and customer success teams, who will view the data via dashboards or reports.
Defining the scope (2-3 minutes)
Out of scope:
- ETL Pipeline: The design focuses solely on the data model.
- Financing: Financing options, such as Affirm, are omitted for simplicity.
- Currency Conversion: Assume transactions are only in USD.
- Tax Collection: Tax complexities, especially for global payments, are excluded.
In scope:
- Business Information
- Product Details
- Customer Information
- Transaction Data
High-level data model design (2 minutes)
Begin by identifying relevant entities and events.
Entities:
- Customer: The individual making the purchase.
- Product: The items or services sold.
- Business: The seller of the product.
- Payment: The payment made by the customer.
Events:
- Transaction: When a customer completes a payment to the business for a product or service.
From these entities, you can define the dimensions and facts that form the star schema of the data warehouse:
Dimensions:
- Payment
- Customer
- Product
- Location
- DateTime
Facts:
- Payment Amount
- Collected Fees
Detailed data warehouse design (10-15 minutes)
Transaction fact table:
Dimension tables: payment table
To save time, you may skip detailing other dimensional tables. However, mention that all dimensions, like the business and user, also contain a location field.

Questions and queries (10-15 minutes)
Some example questions and SQL queries based on the data warehouse model:
Question: Which countries have the highest chargeback or fraud rates?
SQLWITH fraud_rate_table AS (
SELECT country,
SUM(CASE WHEN status_code = 2 THEN 1 ELSE 0) * 1.0 / COUNT(*) AS fraud_rate
FROM transaction_facts AS a
INNER JOIN location_dim AS b ON a.location_id = b.location_id
GROUP BY b.country
)
SELECT * FROM fraud_rate_table ORDER BY fraud_rate DESC;
Question: What is the total monthly revenue?
SQLSELECT b.month, b.year, SUM(a.collected_fees) AS monthly_collected_revenue
FROM transaction_fact AS a
INNER JOIN time_dim AS b ON b.time_id = a.time_id
GROUP BY b.month, b.year;
Question: What’s the average lifetime value of a business?
SQLWITH total_revenue_per_business AS ( SELECT business_id, SUM(collected_fees) AS business_revenue FROM transaction_fact GROUP BY business_id ) SELECT AVG(business_revenue) FROM total_revenue_per_business;
Common follow-up questions
- Why have a payment detail table and payment-specific info in the fact table?
- Facts contain aggregatable data, while dimensions provide descriptive data. Aggregatable values like transaction amounts stay in fact tables, allowing calculations like sums and averages.
- How do you handle chargebacks and refunds?
- Update the transaction’s status rather than adding a new record. This approach preserves original values and lets users filter for chargebacks or refunds as needed. For example, adding columns like refunded_amount could track changes without losing information.
- Why is there a location for billing vs. transaction location?
- These can differ: the payment’s location, billing address, and business location each hold distinct analytical value, especially for fraud detection.
Extra considerations
If time permits, discuss additional potential issues or areas to optimize for, showcasing your analytical depth.