Skip to main content

Design a Data Warehouse Schema for Airbnb

Premium

In this video, we dive into data modeling for Airbnb, focusing on building a star schema model for analytical purposes. The discussion covers capturing metrics such as customer satisfaction and business profitability through key fact tables like Booking, Revenue, and Review, while exploring important dimensions such as user, listing, and location to optimize Airbnb’s operational insights.

Identifying the problem (2-5 minutes)

Since this is a Data Warehouse, we can assume it will primarily be used for analytical purposes.

To choose between schemas:

  • Star schema: This schema denormalizes data, adding redundant columns to some dimension tables. While this increases storage requirements, it improves query performance, making it a better choice for read-heavy operations.
  • Snowflake schema: This schema normalizes data, reducing redundancy and optimizing storage. However, it requires more joins to generate the same view, which makes it less optimal for read-heavy operations.

Given that our goal is to perform analytics on Airbnb data, we should consider using a star schema. Its read-optimized design aligns well with the analytical needs of a Data Warehouse.

Solving the problem (15 minutes)

Now that we’ve established the goal of performing analytics on Airbnb data and decided on a Star Schema, let’s explore the key metrics we aim to derive from this data.

Analytics focus:

  1. Customer experience: Understanding customer feedback is crucial for improving service quality. Metrics here could include:
    • Reviews and their ratings.
    • Complaint patterns and resolution times.
    • Host response times.
  2. Business profitability: Capturing data around bookings can help identify reservation trends and seasonality. Additional metrics, like pricing data, can uncover opportunities for revenue optimization.

Core fact tables:

Based on these focus areas, we can categorize the data into three primary fact tables:

  • Reservations: To analyze booking patterns and trends.
  • Reviews: To understand customer feedback and satisfaction.
  • Revenue: To track income and payment methods.

Fact tables store numerical data, while dimension tables provide context. Both are necessary for effective data modeling and analytics.

Fact tables

Booking fact table

Captures data about reservations and includes:

  • BookingID (Primary Key)
  • ListingID (Foreign Key to Listing Dimension)
  • UserID (Foreign Key to User Dimension)
  • HostID (Foreign Key to Host Dimension)
  • BookingDate
  • CheckInDate
  • CheckOutDate
  • NightsBooked
  • TotalCost
  • BookingStatus (with optional CancellationReason)
  • BookingSource (e.g., mobile, web)
  • NumberOfGuests

Review fact table

Stores data about customer reviews and ratings:

  • ReviewID (Primary Key)
  • ListingID (Foreign Key to Listing Dimension)
  • UserID (Foreign Key to User Dimension)
  • ReviewDate
  • Rating (e.g., 1–5 stars)
  • ReviewText
  • ResponseTime (time taken by the host to respond)
  • Optional: Add BookingID as a foreign key to tie reviews to specific bookings, ensuring only guests who booked and stayed can leave reviews.

Revenue fact table

Tracks revenue and payment details:

  • RevenueID (Primary Key)
  • BookingID (Foreign Key to Booking Fact)
  • ListingID (Foreign Key to Listing Dimension)
  • Date
  • RevenueAmount
  • PaymentMethod
  • Currency

Dimension tables

User dimension

Contains user-specific details:

  • UserID (Primary Key)
  • Name
  • Email
  • Phone
  • Country
  • JoinDate
  • UserType (e.g., guest, host)
  • Gender, AgeGroup

Host dimension

Captures host-specific details:

  • HostID (Primary Key)
  • Name
  • Email
  • Phone
  • Country
  • JoinDate
  • SuperHostStatus
  • ResponseRate

Listing dimension

Details about listings:

  • ListingID (Primary Key)
  • HostID (Foreign Key to Host Dimension)
  • PropertyType (e.g., apartment, house)
  • RoomType (e.g., entire home, private room)
  • NumberOfRooms
  • City
  • State
  • Country
  • Latitude
  • Longitude
  • PricePerNight
  • CancellationPolicy
  • Amenities (stored as a list or JSON)
  • ListingCreationDate
  • LastUpdatedDate

Date dimension

Captures date-related information:

  • Date (Primary Key)
  • Year
  • Month
  • Day
  • Week
  • Quarter
  • DayOfWeek
  • IsHoliday

Location dimension

Provides geographic details:

  • LocationID (Primary Key)
  • City
  • State
  • Country
  • Region

This structure ensures a clean and efficient design, supporting the analytical needs of the system while maintaining flexibility for future extensions.

Data model

data modelling airbnb 1

data modelling airbnb 2

data modelling airbnb 3

Query analysis (10 minutes)

Circling back to the metrics, let’s see how we can construct our queries within this schema.

1. Revenue analysis

Purpose: Understanding revenue streams from different dimensions.

Key metrics:

  • Total revenue by location: Essential for market segmentation and geographic performance analysis.
  • Revenue trends over time: Track revenue changes over specific time periods (e.g., daily, monthly, quarterly) using the Time Dimension.
  • Revenue by property type: Compare revenue generated from different types of properties (e.g., apartments, houses).
  • Host performance analysis: Helps identify high-performing hosts and design reward/incentive programs.
  • Cancellation impact on revenue: Helps assess the loss due to cancellations and optimize cancellation policies.

Total revenue by cities:

SQL
SELECT l.City, SUM(b.TotalCost) AS TotalRevenue FROM BookingFact b JOIN ListingDimension l ON b.ListingID = l.ListingID GROUP BY l.City

Host with most cancellations:

SQL
SELECT h.HostID, COUNT(b.BookingID) AS TotalCancellations FROM BookingFact b JOIN HostDimension h ON b.HostID = h.HostID WHERE b.CancellationStatus = 'Cancelled' GROUP BY h.HostID ORDER BY TotalCancellations DESC;

2. Reservation analysis

Booking source analysis

Determine which booking sources (e.g., mobile, web) generate the most bookings.

SQL
SELECT b.BookingSource, COUNT(b.BookingID) AS TotalBookings, SUM(b.TotalCost) AS TotalRevenue FROM BookingFact b GROUP BY b.BookingSource

Occupancy rate analysis

Calculate and compare occupancy rates across different property types, and time periods.

SQL
SELECT l.PropertyType, t.Year, t.Month, SUM(b.NightsBooked) AS TotalNightsBooked, SUM(b.TotalAvailableNights) AS TotalAvailableNights, (SUM(b.NightsBooked) / SUM(b.TotalAvailableNights)) * 100 AS OccupancyRate FROM BookingFact b JOIN ListingDimension l ON b.ListingID = l.ListingID JOIN TimeDimension t ON b.DateID = t.DateID GROUP BY l.PropertyType, t.Year, t.Month;

3. Review & rating analysis

  • Rating distribution: Analyze the distribution of ratings (e.g., 1-5 stars) across listings or hosts.
  • Impact of reviews on bookings: Correlate the frequency and quality of reviews with booking rates to determine the impact of guest feedback on listing performance.

4. Host performance metrics

  • Top performing hosts: Identify the top-performing hosts based on revenue, occupancy rates, and guest reviews.
  • Host response rate: Analyze how a host's response rate affects booking rates and guest satisfaction.

Distribution of ratings across listings

SQL
SELECT l.ListingID, l.ListingName, r.Rating, COUNT(r.ReviewID) AS RatingCount FROM ReviewFact r JOIN ListingDimension l ON r.ListingID = l.ListingID GROUP BY l.ListingID, l.ListingName, r.Rating ORDER BY l.ListingID, r.Rating;

Distribution of ratings across hosts

SQL
SELECT h.HostID, h.HostName, r.Rating, COUNT(r.ReviewID) AS RatingCount FROM ReviewFact r JOIN ListingDimension l ON r.ListingID = l.ListingID JOIN HostDimension h ON l.HostID = h.HostID GROUP BY h.HostID, h.HostName, r.Rating ORDER BY h.HostID, r.Rating;

Additional considerations (5-10 minutes)

Asking follow-up questions, identifying potential problems, and theorizing and presenting alternative solutions.

Potential problems:

  1. Scalability: As Airbnb scales, the size of fact tables, especially Booking Fact and Revenue Fact, could become enormous. Queries could slow down unless the data warehouse is optimized with appropriate indexing, partitioning, or other performance-enhancing techniques.
  2. Handling unstructured data: The schema assumes that reviews are structured (e.g., text length, ratings), but unstructured data (e.g., full text of reviews) might require additional processing or even a separate schema design to handle text mining or sentiment analysis effectively.
  3. Over-normalization vs under-normalization: If the schema leans too much toward normalization (especially in the case of snowflake designs), it could lead to complex joins that slow down queries, particularly for large datasets. On the other hand, over-denormalization could lead to data redundancy and update anomalies.

Evaluation & feedback (5 minutes)

Thinking through this exercise, we’ve built a solid foundational data model for Airbnb analytics. Starting with a brainstorming approach allowed us to create a versatile structure.

One of the most valuable takeaways is the importance of thinking backward from business objectives. By focusing on real-world needs—such as reviews, business operations, and potential challenges—we ensure the data model is practical and aligned with the company’s goals.

Given more time, we could extend the model to include additional analytics, such as competitor pricing. This would provide insights into how Airbnb’s pricing compares with competitors, helping identify opportunities to optimize pricing strategies.

Additional analytics: Competitor pricing & dynamic pricing

We could capture competitor-related data to evaluate Airbnb’s performance in comparison to competitors. This would support dynamic pricing strategies by identifying areas of over- or under-pricing.

Proposed fact table: Competitor pricing fact

Captures data for competitor comparison:

  • CompetitorPricingID (Primary Key)
  • CompetitorID (Foreign Key to Competitor Dimension)
  • ListingID (Foreign Key to Listing Dimension)
  • Date (Foreign Key to Date Dimension)
  • CompetitorPricePerNight
  • CompetitorOccupancyRate
  • CompetitorRevenue

Proposed fact table: Dynamic pricing fact

Tracks pricing adjustments and their impact:

  • PricingStrategyID (Primary Key)
  • ListingID (Foreign Key to Listing Dimension)
  • Date (Foreign Key to Date Dimension)
  • PricePerNight (Airbnb’s dynamic price)
  • OccupancyRate
  • RevenueImpact
  • CompetitorComparisonID (Foreign Key to Competitor Pricing Fact)
  • PricingRule (e.g., “discount for early booking,” “peak season surge”)

Proposed competitor dimension

Provides details about competitors:

  • CompetitorID (Primary Key)
  • CompetitorName
  • PricingModel (e.g., fixed, dynamic)
  • BrandPositioning (e.g., luxury, budget)

Relationships:

  1. Competitor pricing fact links to:
    • Competitor Dimension
    • Listing Dimension
    • Date Dimension
  2. Dynamic pricing fact connects to:
    • Listing Dimension
    • Date Dimension
    • Competitor Pricing Fact

By expanding the model to include competitor pricing and dynamic pricing, we can further enhance Airbnb’s ability to adapt to market conditions and maintain a competitive edge.