Skip to main content

Design a Data Warehouse Schema for Stripe

Premium

In this video, a data engineer does data modeling for Stripe.

Understanding Stripe's business model is key to creating an effective data model. Stripe facilitates online payments for businesses worldwide, enabling business users to set up products and payment links that allow customers to pay online without needing contracts with merchants or payment processors.

The focus here is to design a data model for Stripe's data warehouse, not the system design of the Stripe app or ETL pipeline. This means defining the warehouse's scope, use cases, and end users. In other words, who will use the warehouse, and what insights will they seek?

Initial Clarifying Steps (2-5 minutes)

Instead of asking direct questions, make statements and propose ideas to guide the interviewer, giving them space to steer the conversation. Keep the process interactive, as back-and-forth engagement signals positively. As you discuss, take notes; this helps the interviewer and yourself keep track of your approach and provides a review record.

Begin by suggesting that Stripe has three potential use cases related to its business model:

  1. Lifetime Value & Churn Rate: Monitoring these metrics for business users is essential for understanding growth.
  2. Payment Outcome Monitoring: Payment processes are complex, and a robust data warehouse should track outcomes, such as fraud, chargebacks, refunds, or operational errors by Stripe.
  3. Revenue Tracking: Monitoring revenue is crucial, especially if Stripe aims to IPO or attract investors.

You can design a data warehouse considering all three use cases but may need to delve deeply into just one.

Since transaction outcomes are critical to Stripe's success, it makes sense to focus on Use Case 2: handling payment outcomes.

Identifying data warehouse users (2 minutes)

The primary users of this data warehouse would be data scientists, analysts, product managers, and engineers. End users include C-level executives, investors, fraud prevention, and customer success teams, who will view the data via dashboards or reports.

Defining the scope (2-3 minutes)

Out of scope:

  1. ETL Pipeline: The design focuses solely on the data model.
  2. Financing: Financing options, such as Affirm, are omitted for simplicity.
  3. Currency Conversion: Assume transactions are only in USD.
  4. Tax Collection: Tax complexities, especially for global payments, are excluded.

In scope:

  1. Business Information
  2. Product Details
  3. Customer Information
  4. Transaction Data

High-level data model design (2 minutes)

Begin by identifying relevant entities and events.

Entities:

  • Customer: The individual making the purchase.
  • Product: The items or services sold.
  • Business: The seller of the product.
  • Payment: The payment made by the customer.

Events:

  • Transaction: When a customer completes a payment to the business for a product or service.

From these entities, you can define the dimensions and facts that form the star schema of the data warehouse:

Dimensions:

  • Payment
  • Customer
  • Product
  • Location
  • DateTime

Facts:

  • Payment Amount
  • Collected Fees

Detailed data warehouse design (10-15 minutes)

Transaction fact table:

Column nameTypeDescription
transaction_idbigintUnique ID for the fact table
payment_idbigintPayment transaction ID
datetime_idbigintDateTime ID for transaction timestamp
customer_idbigintID of the customer making the payment
product_idbigintID of the product
location_idbigintLocation ID of the transaction
business_idbigintID of the business
payment_amountdecimalTransaction amount
collected_feesdecimalFees collected from the transaction

Dimension tables: payment table

Column NameTypeDescription
payment_idbigintUnique payment ID
payment_processorvarcharPayment processor (Visa, Mastercard, etc.)
is_subscriptionbooleanWhether the payment is part of a subscription
payment_typevarcharPayment method (financed, credit, or debit)
billing_location_idbigintLocation ID of billing address
status_codebigintStatus outcome of the payment

To save time, you may skip detailing other dimensional tables. However, mention that all dimensions, like the business and user, also contain a location field.

Data modeling schema Stripe

Questions and queries (10-15 minutes)

Some example questions and SQL queries based on the data warehouse model:

Question: Which countries have the highest chargeback or fraud rates?

SQL
WITH fraud_rate_table AS ( SELECT country, SUM(CASE WHEN status_code = 2 THEN 1 ELSE 0) * 1.0 / COUNT(*) AS fraud_rate FROM transaction_facts AS a INNER JOIN location_dim AS b ON a.location_id = b.location_id GROUP BY b.country ) SELECT * FROM fraud_rate_table ORDER BY fraud_rate DESC;

Question: What is the total monthly revenue?

SQL
SELECT b.month, b.year, SUM(a.collected_fees) AS monthly_collected_revenue FROM transaction_fact AS a INNER JOIN time_dim AS b ON b.time_id = a.time_id GROUP BY b.month, b.year;

Question: What’s the average lifetime value of a business?

SQL
WITH total_revenue_per_business AS ( SELECT business_id, SUM(collected_fees) AS business_revenue FROM transaction_fact GROUP BY business_id ) SELECT AVG(business_revenue) FROM total_revenue_per_business;

Common follow-up questions

  • Why have a payment detail table and payment-specific info in the fact table?
    • Facts contain aggregatable data, while dimensions provide descriptive data. Aggregatable values like transaction amounts stay in fact tables, allowing calculations like sums and averages.
  • How do you handle chargebacks and refunds?
    • Update the transaction’s status rather than adding a new record. This approach preserves original values and lets users filter for chargebacks or refunds as needed. For example, adding columns like refunded_amount could track changes without losing information.
  • Why is there a location for billing vs. transaction location?
    • These can differ: the payment’s location, billing address, and business location each hold distinct analytical value, especially for fraud detection.

Extra considerations

If time permits, discuss additional potential issues or areas to optimize for, showcasing your analytical depth.