Design an Architecture for a Self-Serve Insurance Product
You are a solutions architect (SA) working with a client in the insurance industry. Design an architecture for a new insurance application that allows users to:
- Sign up for an individual policy
- Sign up for group family policies (with every member of the family being insured)
- Make claims against existing policies.
What kind of architecture would best suit our requirements, considering that we expect to serve over a million customers in 2 years from now?
What problems might you face if you start with a simple architecture? What would it cost to transition to a highly scalable architecture in the future?
Would you opt for a cloud provider or an on-premise data center to host your data and services?
What does the data look like? Is the data highly relational? What are other requirements, for example, a high number of TPS (transactions per second)?
What are some examples of queries made against the system? Do the frequently-made queries expect data from a single domain or do they expect data consolidated across multiple domains?
Are there other cloud capabilities that might be particularly useful in the insurance industry, for example machine learning or advanced analytics? What would these cost to develop in-house?
Since insurance is a highly regulated sector and data loss must be avoided, do we need our database system to have capabilities like disaster recovery, high availability, etc.?
Clarifying Questions
There's a lot to be defined here, so let's ask some clarifying questions first.
Your goal is to thoroughly define the problem space, the product and business goals (and constraints), and to define a set of both functional and non-functional requirements that will allow you to make good decisions as you design the architecture.
Given this prompt, good questions to ask include:
- What volume of users do we expect, and how does the growth trajectory look over time? For example, do we expect a huge initial spike and then growth? Or steadily increasing adoption?
- How does this product fit in with greater business objectives?
- Given business objectives, what success metrics should we track? This will help us determine what system properties to prioritize. For example, is high availability enough? Or do we need to build for high reliability as well?
- Where will this be implemented? How will that change as we grow?
- Will this product integrate with other services, internal or external? How will that change as we grow?
Let’s assume that we define the following:
- The application is expected to grow significantly in the future. Functionally, the client wants to expand to multiple countries in the future, which would add significant complexity to the system in terms of new tax regulations, payment providers, etc.
- Huge user growth is expected as the application rolls out in more countries, and the client expects that the application will be used by approximately 10 million users in about 2 years.
- The client has a preference for cloud-native services. They understand the risks of vendor lock-in, but considering all factors, an executive decision has been made to go with cloud-native services where possible.
- The application must be highly available since any loss of service will lead to high monetary and reputational costs for the customer.
- The company plans to partner with third parties in the future. The architecture should enable external providers to integrate with our capabilities, such as issuing policies, processing claims, etc. The company wants to let partners issue a small number of policies (for example, 50 per year) on our platform for free, and to charge them if they exceed this free limit.
With these functional and non-functional requirements defined, let's begin designing our architecture.
Our Solution
Step 1: Platform selection (cloud provider / on-premise)
Tip: Check out this lesson for a review of the different types of cloud deployments and how to choose.
Since the solution should scale to millions of users, it makes sense that the application be hosted through a cloud provider in order to leverage the cost-effectiveness and elasticity of the cloud. The number of users signing up and using the application at a particular time would vary greatly, thus scaling up/down should be convenient, which is very hard to achieve in an on-premise setup. In addition, cloud providers provide high-availability/disaster-recovery capabilities out of the box, which is another major advantage.
The client might also want to leverage more advanced cloud capabilities in the future, such as native data lake, machine learning, and business intelligence tools, which would be time-consuming for the client to build on their own.
Luckily, our client has a preference for cloud-native solutions, so the decision is an easy one.
Step 2: Design methodology
The application has three major functions:
- Issuing individual policies
- Issuing group policies
- Making claims against existing policies
Insurance is a highly complex field, and interaction with multiple SaaS services is also expected. For example, the client expects to deploy this product in multiple countries, so we'll need currency conversion, tax calculations, multiple payment methods, digital signatures, etc. - and that's just on the payment side.
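Given this complexity, it helps to group the application's services by business domain. For example, the domain “Policy” would house services such as “ReadPolicy”, “InitiateCreditCheck” and “CreateNewPolicy”, whereas the domain “Claims” would house services such as “ReadClaims”, “InitiateClaim” and “CancelClaim”. The application also appears to be read-heavy, since existing policies and claims are likely to be looked up far more often than new ones are created.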
We have several options to consider. We can start off the application with a monolithic design and later break it up into components if need be. We can also choose a microservices architecture from the start. A microservice architecture would add complexity initially, but it'd enable us to scale efficiently.
Considering these factors, and especially the requirement to serve millions of customers, a microservices architecture seems to be the way to go because it would enable us to scale read-heavy services like “ReadPolicy” and “ReadClaims” as and when required.
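To make the idea of a small, independently scalable read service more concrete, here is a minimal sketch of what “ReadPolicy” might look like as a Lambda-style handler behind an API gateway. The route shape, the `policy_id` path parameter, the sample record, and the in-memory `_POLICIES` store are all assumptions made for illustration; a real service would query the database chosen later in this design.

```python
import json

# Hypothetical in-memory stand-in for the policy store (illustration only).
_POLICIES = {
    "P-1001": {"policy_id": "P-1001", "holder": "Asha Rao", "type": "individual"},
}


def read_policy_handler(event, context):
    """Handle a hypothetical GET /policies/{policy_id} request.

    `event` follows the general shape of an API gateway proxy event,
    where path parameters arrive under the "pathParameters" key.
    """
    policy_id = (event.get("pathParameters") or {}).get("policy_id")
    if policy_id is None:
        return {"statusCode": 400, "body": json.dumps({"error": "policy_id is required"})}

    policy = _POLICIES.get(policy_id)
    if policy is None:
        return {"statusCode": 404, "body": json.dumps({"error": "policy not found"})}

    return {"statusCode": 200, "body": json.dumps(policy)}


if __name__ == "__main__":
    print(read_policy_handler({"pathParameters": {"policy_id": "P-1001"}}, None))
```

Because the handler only reads, many copies of it can run in parallel without coordinating with one another, which is what makes read-heavy services a natural candidate for scaling independently of the write paths.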
Our application requires mature integration capabilities. One option is an integration platform such as MuleSoft, which provides out-of-the-box connectors to systems like NoSQL databases, various SaaS services, and even cloud providers. If the preference is for cloud-native services, a good option is to build our services as serverless functions on AWS Lambda and orchestrate them with AWS Step Functions. AWS Step Functions provides built-in capabilities like error handling and retries, so we don’t have to code them in our application. We can develop our services in any of the languages supported by AWS Lambda.
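As a rough illustration of how retries and error handling can be declared in the workflow rather than coded in each service, the sketch below builds a small Step Functions state machine definition (Amazon States Language) as a Python dictionary. The ARNs, account ID, region, and the “NotifyClaimFailure” function are placeholders invented for this example, and the retry values are arbitrary.

```python
import json

# Placeholder ARNs: the account ID, region and function names are made up.
INITIATE_CLAIM_ARN = "arn:aws:lambda:us-east-1:123456789012:function:InitiateClaim"
NOTIFY_FAILURE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:NotifyClaimFailure"

claim_workflow = {
    "Comment": "Illustrative claim workflow with declarative retries",
    "StartAt": "InitiateClaim",
    "States": {
        "InitiateClaim": {
            "Type": "Task",
            "Resource": INITIATE_CLAIM_ARN,
            # Retry a failed task up to 3 times, doubling the wait each time.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # If retries are exhausted (or any other error occurs), hand off
            # to a separate failure-handling state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": NOTIFY_FAILURE_ARN,
            "End": True,
        },
    },
}

# The JSON string is what would be supplied as the state machine definition.
definition_json = json.dumps(claim_workflow, indent=2)

if __name__ == "__main__":
    print(definition_json)
```

The point of the sketch is that the “InitiateClaim” function itself contains no retry loop; the retry and fallback policy lives in the workflow definition and can be tuned without redeploying the service.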
For the purpose of this architecture and considering the client’s preference for cloud-native services, let’s go for cloud-native services as much as possible.
Step 3: Database selection
Both common database options (SQL vs. NoSQL) can work with our application, though each comes with pros and cons.
While SQL databases have traditionally been difficult to scale, managed services like Amazon RDS have made this considerably easier in recent years. A NoSQL database, for example a document database such as Amazon DocumentDB, can store all related information, such as a user's policy information, their family's policy information, and all previous claims, in a single document, thus greatly reducing the number of joins that have to be made.
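To illustrate the single-document idea, here is a sketch of what one policyholder's document might look like, written as a plain Python dictionary. All field names, identifiers, and sample values are hypothetical; the actual schema would depend on the client's data.

```python
# Hypothetical document for one policyholder: policies, insured family
# members and claims are embedded together instead of being spread
# across several tables.
policy_holder_doc = {
    "_id": "user-42",
    "name": "Asha Rao",
    "policies": [
        {"policy_id": "P-1001", "type": "individual", "insured": ["Asha Rao"]},
        {
            "policy_id": "P-2001",
            "type": "group_family",
            "insured": ["Asha Rao", "Vikram Rao", "Meera Rao"],
        },
    ],
    "claims": [
        {"claim_id": "C-9001", "policy_id": "P-2001", "claimant": "Meera Rao", "status": "settled"},
        {"claim_id": "C-9002", "policy_id": "P-1001", "claimant": "Asha Rao", "status": "open"},
    ],
}


def claims_for_policy(doc: dict, policy_id: str) -> list:
    """Return the claims made against one policy, using only this document."""
    return [claim for claim in doc["claims"] if claim["policy_id"] == policy_id]


def insured_members(doc: dict) -> set:
    """Return everyone covered by any of the holder's policies."""
    return {name for policy in doc["policies"] for name in policy["insured"]}


if __name__ == "__main__":
    print(claims_for_policy(policy_holder_doc, "P-2001"))
    print(sorted(insured_members(policy_holder_doc)))
```

Both lookups are answered from a single document read. In a normalized relational schema, the same questions would typically involve joining user, policy, member, and claim tables.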
Tip: Read up on SQL vs. NoSQL databases here.
Considering that an insurance application is read-heavy and that many of the queries made to the backend will need information from several domains, in addition to the base requirement that the application be able to serve millions of users, a NoSQL database seems to be the right choice.
A high-level overview of the application might look like this:

Step 4: Consider success metrics
We previously determined that we want our system to be highly available. The primary metric for the high availability of a system is typically uptime. While a very high uptime is definitely possible, it can come at a significant cost in the resources needed to achieve it. Thus, it is prudent to agree on an explicit uptime target for any highly available system. The API gateway and Lambda functions provide high availability by default and can scale to serve thousands of requests per second.
For example, Lambda functions run in multiple availability zones (AZs) so that they can continue to process requests if there is a service interruption in a single zone. DocumentDB, in addition to spreading instances across AZs, supports “Global Clusters”, which replicate your data across regions and make disaster recovery easier if an entire region suffers an outage.
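When discussing an uptime target with the client, it can help to translate percentages into the downtime they actually permit. The small helper below does that arithmetic; the function name and the example targets are chosen for illustration and are not figures from the client.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, implied by an uptime target."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Over a 365-day year, a 99.9% target allows roughly 525.6 minutes (a little under 9 hours) of downtime, while 99.99% allows roughly 52.6 minutes. Each additional "nine" cuts the budget by a factor of ten, which is why the cost of availability tends to rise sharply as the target increases.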
Step 5: Discuss integrations / API considerations
One of the client’s requirements is to enable external partners to use its capabilities, such as issuing policies, processing claims, etc.
For example, multiple sites aggregate and compare various insurance providers to enable consumers to effectively compare policies. Our client will definitely want to make integration with such partners as easy as possible. API Gateways, in addition to providing scalability and security to our backend system, can facilitate this.
For example, we could ask each consumer of the API Gateway to create an application that allows the developer to access our platform. We could define tiers, for example:
- A free tier, which enables processing 50 free policies per year
- A commercial tier, which enables processing 500 policies per year
- An enterprise tier, which enables processing 10,000 policies per year.
Most API gateways support this kind of tiering by combining per-client credentials with quotas and rate limits.
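The tiering logic itself is simple. The sketch below shows one way the yearly limits above could be checked per partner application. The `PartnerApp` class, its fields, and the in-memory counters are hypothetical; in practice the gateway or a dedicated metering service would track usage, and the client might bill for overage rather than reject the request.

```python
from dataclasses import dataclass, field

# Yearly policy limits per tier, taken from the tiers listed above.
TIER_LIMITS = {"free": 50, "commercial": 500, "enterprise": 10_000}


@dataclass
class PartnerApp:
    """A hypothetical registered partner application."""
    client_id: str
    tier: str
    issued_by_year: dict = field(default_factory=dict)  # year -> policies issued


def try_issue_policy(app: PartnerApp, year: int) -> bool:
    """Count one policy against the app's yearly quota.

    Returns True if the request is within the tier limit, False otherwise.
    """
    limit = TIER_LIMITS[app.tier]
    used = app.issued_by_year.get(year, 0)
    if used >= limit:
        return False
    app.issued_by_year[year] = used + 1
    return True


if __name__ == "__main__":
    app = PartnerApp(client_id="demo-client", tier="free")
    results = [try_issue_policy(app, 2024) for _ in range(51)]
    print(results.count(True), "accepted,", results.count(False), "rejected")
```

Keeping the limits in a single table like `TIER_LIMITS` means a partner can be moved between tiers by changing one field, without touching the policy-issuing services behind the gateway.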
Tip: Want to learn more about how API gateways work in practice? Check out the developer portal at Slack. When you create an app on this portal, it provides you with a client_id and client_secret, which enable you to call the Slack API and make changes such as creating a Slack channel or sending a message, all programmatically through API calls.
Conclusion
When developing a new application from scratch, it is always beneficial to deliver an MVP version of the application and get stakeholder feedback as frequently as possible. In this case, it is advisable to start on a cloud platform and use native cloud services, since that will enable us to experiment quickly and at the lowest possible cost. Using cloud-native services like Lambda functions also means that scalability and high availability are built into the system, which will be very useful when our application scales to millions of users.
In addition, the choice of database is a very important one, since it is typically very costly and effort-intensive to migrate databases when an application is already in use. In this scenario, since scale is very important, a NoSQL database seems like the right choice. Finally, by using an API gateway, we help protect our system from external attacks and shield internal systems from traffic surges at particularly busy times.