Design a Data Warehouse Schema for Airbnb
In this video, we dive into data modeling for Airbnb, focusing on building a star schema model for analytical purposes. The discussion covers capturing metrics such as customer satisfaction and business profitability through key fact tables like Booking, Revenue, and Review, while exploring important dimensions such as user, listing, and location to optimize Airbnb’s operational insights.
Identifying the problem (2-5 minutes)
Since this is a Data Warehouse, we can assume it will primarily be used for analytical purposes.
To choose between schemas:
- Star schema: This schema denormalizes data, adding redundant columns to some dimension tables. While this increases storage requirements, it improves query performance, making it a better choice for read-heavy operations.
- Snowflake schema: This schema normalizes data, reducing redundancy and optimizing storage. However, it requires more joins to generate the same view, which makes it less optimal for read-heavy operations.
Given that our goal is to perform analytics on Airbnb data, we should consider using a star schema. Its read-optimized design aligns well with the analytical needs of a Data Warehouse.
Solving the problem (15 minutes)
Now that we’ve established the goal of performing analytics on Airbnb data and decided on a Star Schema, let’s explore the key metrics we aim to derive from this data.
Analytics focus:
- Customer experience: Understanding customer feedback is crucial for improving service quality. Metrics here could include:
- Reviews and their ratings.
- Complaint patterns and resolution times.
- Host response times.
- Business profitability: Capturing data around bookings can help identify reservation trends and seasonality. Additional metrics, like pricing data, can uncover opportunities for revenue optimization.
Core fact tables:
Based on these focus areas, we can categorize the data into three primary fact tables:
- Reservations: To analyze booking patterns and trends.
- Reviews: To understand customer feedback and satisfaction.
- Revenue: To track income and payment methods.
Fact tables store numerical data, while dimension tables provide context. Both are necessary for effective data modeling and analytics.
Fact tables
Booking fact table
Captures data about reservations and includes:
- BookingID (Primary Key)
- ListingID (Foreign Key to Listing Dimension)
- UserID (Foreign Key to User Dimension)
- HostID (Foreign Key to Host Dimension)
- BookingDate
- CheckInDate
- CheckOutDate
- NightsBooked
- TotalCost
- BookingStatus (with optional CancellationReason)
- BookingSource (e.g., mobile, web)
- NumberOfGuests
Review fact table
Stores data about customer reviews and ratings:
- ReviewID (Primary Key)
- ListingID (Foreign Key to Listing Dimension)
- UserID (Foreign Key to User Dimension)
- ReviewDate
- Rating (e.g., 1–5 stars)
- ReviewText
- ResponseTime (time taken by the host to respond)
- Optional: Add BookingID as a foreign key to tie reviews to specific bookings, ensuring only guests who booked and stayed can leave reviews.
Revenue fact table
Tracks revenue and payment details:
- RevenueID (Primary Key)
- BookingID (Foreign Key to Booking Fact)
- ListingID (Foreign Key to Listing Dimension)
- Date
- RevenueAmount
- PaymentMethod
- Currency
Dimension tables
User dimension
Contains user-specific details:
- UserID (Primary Key)
- Name
- Phone
- Country
- JoinDate
- UserType (e.g., guest, host)
- Gender, AgeGroup
Host dimension
Captures host-specific details:
- HostID (Primary Key)
- Name
- Phone
- Country
- JoinDate
- SuperHostStatus
- ResponseRate
Listing dimension
Details about listings:
- ListingID (Primary Key)
- HostID (Foreign Key to Host Dimension)
- PropertyType (e.g., apartment, house)
- RoomType (e.g., entire home, private room)
- NumberOfRooms
- City
- State
- Country
- Latitude
- Longitude
- PricePerNight
- CancellationPolicy
- Amenities (stored as a list or JSON)
- ListingCreationDate
- LastUpdatedDate
Date dimension
Captures date-related information:
- Date (Primary Key)
- Year
- Month
- Day
- Week
- Quarter
- DayOfWeek
- IsHoliday
Location dimension
Provides geographic details:
- LocationID (Primary Key)
- City
- State
- Country
- Region
This structure ensures a clean and efficient design, supporting the analytical needs of the system while maintaining flexibility for future extensions.
Data model



Query analysis (10 minutes)
Circling back to the metrics, let’s see how we can construct our queries within this schema.
1. Revenue analysis
Purpose: Understanding revenue streams from different dimensions.
Key metrics:
- Total revenue by location: Essential for market segmentation and geographic performance analysis.
- Revenue trends over time: Track revenue changes over specific time periods (e.g., daily, monthly, quarterly) using the Time Dimension.
- Revenue by property type: Compare revenue generated from different types of properties (e.g., apartments, houses).
- Host performance analysis: Helps identify high-performing hosts and design reward/incentive programs.
- Cancellation impact on revenue: Helps assess the loss due to cancellations and optimize cancellation policies.
Total revenue by cities:
SQLSELECT
l.City,
SUM(b.TotalCost) AS TotalRevenue
FROM
BookingFact b
JOIN
ListingDimension l ON b.ListingID = l.ListingID
GROUP BY
l.CityHost with most cancellations:
SQLSELECT
h.HostID,
COUNT(b.BookingID) AS TotalCancellations
FROM
BookingFact b
JOIN
HostDimension h ON b.HostID = h.HostID
WHERE
b.CancellationStatus = 'Cancelled'
GROUP BY
h.HostID
ORDER BY
TotalCancellations DESC;2. Reservation analysis
Booking source analysis
Determine which booking sources (e.g., mobile, web) generate the most bookings.
SQLSELECT b.BookingSource, COUNT(b.BookingID) AS TotalBookings, SUM(b.TotalCost) AS TotalRevenue FROM BookingFact b GROUP BY b.BookingSource
Occupancy rate analysis
Calculate and compare occupancy rates across different property types, and time periods.
SQLSELECT
l.PropertyType,
t.Year,
t.Month,
SUM(b.NightsBooked) AS TotalNightsBooked,
SUM(b.TotalAvailableNights) AS TotalAvailableNights,
(SUM(b.NightsBooked) / SUM(b.TotalAvailableNights)) * 100 AS OccupancyRate
FROM
BookingFact b
JOIN
ListingDimension l ON b.ListingID = l.ListingID
JOIN
TimeDimension t ON b.DateID = t.DateID
GROUP BY
l.PropertyType, t.Year, t.Month;3. Review & rating analysis
- Rating distribution: Analyze the distribution of ratings (e.g., 1-5 stars) across listings or hosts.
- Impact of reviews on bookings: Correlate the frequency and quality of reviews with booking rates to determine the impact of guest feedback on listing performance.
4. Host performance metrics
- Top performing hosts: Identify the top-performing hosts based on revenue, occupancy rates, and guest reviews.
- Host response rate: Analyze how a host's response rate affects booking rates and guest satisfaction.
Distribution of ratings across listings
SQLSELECT
l.ListingID,
l.ListingName,
r.Rating,
COUNT(r.ReviewID) AS RatingCount
FROM
ReviewFact r
JOIN
ListingDimension l ON r.ListingID = l.ListingID
GROUP BY
l.ListingID, l.ListingName, r.Rating
ORDER BY
l.ListingID, r.Rating;Distribution of ratings across hosts
SQLSELECT
h.HostID,
h.HostName,
r.Rating,
COUNT(r.ReviewID) AS RatingCount
FROM
ReviewFact r
JOIN
ListingDimension l ON r.ListingID = l.ListingID
JOIN
HostDimension h ON l.HostID = h.HostID
GROUP BY
h.HostID, h.HostName, r.Rating
ORDER BY
h.HostID, r.Rating;Additional considerations (5-10 minutes)
Asking follow-up questions, identifying potential problems, and theorizing and presenting alternative solutions.
Potential problems:
- Scalability: As Airbnb scales, the size of fact tables, especially Booking Fact and Revenue Fact, could become enormous. Queries could slow down unless the data warehouse is optimized with appropriate indexing, partitioning, or other performance-enhancing techniques.
- Handling unstructured data: The schema assumes that reviews are structured (e.g., text length, ratings), but unstructured data (e.g., full text of reviews) might require additional processing or even a separate schema design to handle text mining or sentiment analysis effectively.
- Over-normalization vs under-normalization: If the schema leans too much toward normalization (especially in the case of snowflake designs), it could lead to complex joins that slow down queries, particularly for large datasets. On the other hand, over-denormalization could lead to data redundancy and update anomalies.
Evaluation & feedback (5 minutes)
Thinking through this exercise, we’ve built a solid foundational data model for Airbnb analytics. Starting with a brainstorming approach allowed us to create a versatile structure.
One of the most valuable takeaways is the importance of thinking backward from business objectives. By focusing on real-world needs—such as reviews, business operations, and potential challenges—we ensure the data model is practical and aligned with the company’s goals.
Given more time, we could extend the model to include additional analytics, such as competitor pricing. This would provide insights into how Airbnb’s pricing compares with competitors, helping identify opportunities to optimize pricing strategies.
Additional analytics: Competitor pricing & dynamic pricing
We could capture competitor-related data to evaluate Airbnb’s performance in comparison to competitors. This would support dynamic pricing strategies by identifying areas of over- or under-pricing.
Proposed fact table: Competitor pricing fact
Captures data for competitor comparison:
- CompetitorPricingID (Primary Key)
- CompetitorID (Foreign Key to Competitor Dimension)
- ListingID (Foreign Key to Listing Dimension)
- Date (Foreign Key to Date Dimension)
- CompetitorPricePerNight
- CompetitorOccupancyRate
- CompetitorRevenue
Proposed fact table: Dynamic pricing fact
Tracks pricing adjustments and their impact:
- PricingStrategyID (Primary Key)
- ListingID (Foreign Key to Listing Dimension)
- Date (Foreign Key to Date Dimension)
- PricePerNight (Airbnb’s dynamic price)
- OccupancyRate
- RevenueImpact
- CompetitorComparisonID (Foreign Key to Competitor Pricing Fact)
- PricingRule (e.g., “discount for early booking,” “peak season surge”)
Proposed competitor dimension
Provides details about competitors:
- CompetitorID (Primary Key)
- CompetitorName
- PricingModel (e.g., fixed, dynamic)
- BrandPositioning (e.g., luxury, budget)
Relationships:
- Competitor pricing fact links to:
- Competitor Dimension
- Listing Dimension
- Date Dimension
- Dynamic pricing fact connects to:
- Listing Dimension
- Date Dimension
- Competitor Pricing Fact
By expanding the model to include competitor pricing and dynamic pricing, we can further enhance Airbnb’s ability to adapt to market conditions and maintain a competitive edge.