When we explained the control plane approach for Infrastructure SaaS, most developers we talked to understood what SaaS control planes do and often said things like, "Oh! I built this many times! I didn't realize it was called a control plane." This reaction confirmed what we initially suspected: everyone is building control planes, but no one is talking about it.
We also received a few responses like "why is this difficult?"
When we started building Confluent Cloud, we thought that building a control plane would be easy. A few tables for users and their clusters, run everything on K8s... how hard can it be? A two-pizza team effort of 6 months seemed like a very reasonable estimate. We discovered otherwise. Recently, when we talked to other Infra SaaS companies, we learned that regardless of the company's specific architecture or size, building a successful Infra SaaS product required at least 25% of the engineers to work on the control plane.
Building a control plane sounds simple - but it isn't. If you ever said, "oh, I can do this in a weekend," only to keep struggling for weeks or months, you should be familiar with deceptively simple problems.
Chess has only six types of pieces and a handful of rules - it sounds like a pretty simple game. The DynamoDB paper is short, yet every implementation took years to complete. Copying data from one Kafka cluster to another sounds simple - no one expected it to require four rewrites in 7 years. Pick the simplest software engineering topic you can imagine - SQL, JSON, REST, PHP - and you will be able to find a blog explaining why this topic is rather complex.
In this blog, we'll look at the challenges waiting for the engineer who sets out to build a control plane.
As we described in the previous blog, the control plane is responsible for providing the SaaS capabilities, metadata management, and controlling the life cycle across all the data planes.
Building a control plane is a big job, and many of its challenges may not be apparent when you first try to estimate the effort involved. We divided the problems into two parts:
- SaaS Flows: The problems that come up as you try to create a unified customer experience across many services in the control plane and the data plane.
- Control Plane and Data Plane integration: The problems that come up when you need to send information from the control plane to the data plane and vice versa.
Different problems become relevant at various stages of building a control plane. An MVP for a control plane can be relatively simple and become more challenging as requirements add up. Having this map of challenges will help you understand the technical depth involved and the realistic investment level.
While this blog focuses on control planes that serve the customers of Infra SaaS, the challenges involved in building internal control planes are similar. We will address the topic of internal control planes in a future blog post.
Seamless SaaS Flows
Your customers signed up to your SaaS product because they need to get something done and they will use the control plane to do it. They don't care about its architecture and all the services that are involved - they need it to disappear so they can focus on getting things done.
You can probably think of products where, even though the product is complex, getting things done is a seamless experience. The same concepts are used throughout the product and you can't tell which teams own which microservices behind the scenes. Apple's iPhone, Stripe, and Notion all have a seamless user experience.
Compare this to AWS network configuration. All you want is to run a container that can communicate with the public internet. But you have to figure out how to configure EC2, security groups, load balancer, target groups, subnets, routing tables and NATs. Each with its own documentation. If you do it wrong, you won't have connectivity. But because each component is configured separately, it is nearly impossible to understand why the packets don't arrive correctly.
We use the term SaaS Flow to describe common user activities in SaaS products that interact with multiple services.
There are SaaS Flows that are standard in SaaS products - they interact with standard SaaS capabilities.
For example: inviting a user to an organization in a SaaS product is a SaaS flow - a single activity from the user perspective, but the implementation spans the authentication service, user management service, notification service and perhaps an access control service as well. You can see an example diagram of an invite flow below.
There are also SaaS Flows that interact with entities that are specific to your application.
Creating a new set of API keys that grant access to a data plane database provisioned by a customer. Upgrading an account from a free trial to a paid plan and updating the number of concurrent builds that can run in the data plane. Handling an expired credit card, deleting a user, deleting a tenant - all of these are examples of a single user activity that has to be handled across many services, some of which are general (payments, notifications) and others product-specific (pausing all ETL jobs, updating data plane access policies).
There are hundreds of such flows in any SaaS product.
Every control plane is a SaaS Mesh - it is made of many multi-service SaaS Flows
User management, access management, audit, billing, notifications, provisioning, upgrades, metrics, logs... not to mention anything specific to your product. Every SaaS flow will involve several of these services, which means that they continuously need to exchange information, events and actions. Each component has an owner, but who owns the end-to-end flow? Who makes sure that services send each other the relevant information and that each service handles its part of the flow in ways that fit with all the other parts? You can think of this as a SaaS Mesh - a seamless product that is generated from independent components and clear contracts between them. Or it can become a SaaS Mess, if the interfaces are not well defined and the dependencies are introduced ad-hoc.
As an example, think of a scenario where the credit card on file for an organization has expired. The organization has 15 users. Which of the users in the customer's organization will be notified? How will they be notified? Will your sales team or support get notified too? Will the customer's clusters or jobs keep running? For how long? If the cluster is de-provisioned, will you keep storing their data? What about the list of users and their emails? Metrics? Once they update the credit card details, will every service resume its activity? Will the new card get charged for any of the lapsed time?
It is important to also handle all the failure scenarios in each of these flows - what if the notification service is down? What if Salesforce returns an error or throttles your requests? Is it possible to save the current flow state and try again later? Can you restart the flow from the beginning or were some notifications already sent?
SaaS Flows can be modeled with a state machine and each event or API request/response moves the system between states. This model helps you persist the latest state of the flow, so completed steps won't re-run but failed steps can be retried and the flow can continue. In addition, this modeling helps in monitoring the load and health of each flow.
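As a minimal sketch of this idea, here is a flow modeled as a persisted state machine in Python. The states are hypothetical ones for an "invite user" flow, and a plain dict stands in for the database table that would hold the flow state in a real system:

```python
from enum import Enum


class InviteState(str, Enum):
    # Hypothetical states for an "invite user" SaaS flow.
    STARTED = "started"
    USER_CREATED = "user_created"
    EMAIL_SENT = "email_sent"
    ACCEPTED = "accepted"


class InviteFlow:
    """Persists the latest state of the flow so completed steps
    never re-run, while failed steps can simply be retried."""

    # Legal transitions: each state has exactly one next state here.
    TRANSITIONS = {
        InviteState.STARTED: InviteState.USER_CREATED,
        InviteState.USER_CREATED: InviteState.EMAIL_SENT,
        InviteState.EMAIL_SENT: InviteState.ACCEPTED,
    }

    def __init__(self, store: dict, flow_id: str):
        self.store = store  # stand-in for a database table
        self.flow_id = flow_id
        self.store.setdefault(flow_id, InviteState.STARTED)

    @property
    def state(self) -> InviteState:
        return self.store[self.flow_id]

    def advance(self, to: InviteState) -> None:
        # Only legal transitions are persisted; anything else is a bug.
        if self.TRANSITIONS.get(self.state) != to:
            raise ValueError(f"illegal transition {self.state} -> {to}")
        self.store[self.flow_id] = to
```

Because the latest state is persisted, a crash after the user was created resumes at USER_CREATED rather than from the beginning - the user is not created twice, and only the failed email step is retried. Counting flows per state also gives you the monitoring signal mentioned above.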
SaaS Flows across control and data planes
Creating Seamless SaaS Flows that touch both control plane and data plane services is an even bigger challenge. This is especially true when a customer request encounters a failure in the data plane.
Balancing enough abstraction for a seamless experience when things go right, but enough details for meaningful debugging when things go wrong, is an engineering challenge in general. It becomes more difficult when the user interacts with a control plane but problems happen in the data plane.
Think of a scenario where you built a SaaS ETL product. A customer tries to define a new ETL pipeline through your control plane, but something in the process failed. If the failure was due to lack of compute on a K8s cluster, the control plane shouldn't show the exact K8s error, since your service abstracts K8s. But if the failure is in loading data to the customer's DB, showing the exact error will actually help your customer identify the DB issue on their side.
Example of an actionable error message in a SaaS flow:
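To illustrate the distinction in code, here is a small sketch of an error-mapping layer. The error codes and messages are made up for the example - the point is the policy: hide errors that are yours to fix, surface errors the customer can act on:

```python
# Hypothetical internal error codes; the customer can't fix these.
INTERNAL_ERRORS = {"K8sInsufficientCompute", "SchedulerTimeout"}


def customer_facing_message(error_code: str, detail: str) -> str:
    """Abstract away internal infrastructure errors, but pass through
    errors that originate in the customer's own systems."""
    if error_code in INTERNAL_ERRORS:
        # Leaking the raw K8s error would only confuse - our service
        # abstracts K8s, so we own the fix.
        return ("Pipeline creation failed on our side. "
                "We are retrying; no action is needed.")
    # Errors from the customer's side (e.g. their DB rejected the load)
    # are actionable: show the exact detail so they can debug it.
    return f"Pipeline creation failed while writing to your database: {detail}"
```

The same mapping can drive both the error shown in the UI and the status reported through the API.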
If the error is transient, it makes sense to retry the SaaS flow - starting from the point of failure. Does the control plane manage the retries by re-publishing the "create new pipeline" request repeatedly until it successfully completes? Does the pipeline itself persist the event until it is successfully acknowledged? Does the data plane store the in-flight requests locally until they complete? Each of these architectures has its own tradeoffs.
In cases where the user does interact with the data plane directly, we discovered that users' mental model is that all admin activity will still be available in one place and that there will be a consistent permissions and access model between the control plane and the data plane.
A user who just created a database in the control plane will expect to also be able to create tables, insert data into those tables and run queries. The expectation is that the control plane is a single pane of glass that reflects all the data plane systems. It will be a non-ideal experience if they need to use two or three different tools for all those activities, and an even worse experience if the user who created the database doesn't have permission to create a table or to query the table that they created.
SaaS Flows that involve business systems
In addition to the control plane and the data plane, there are other parts of the business that have a relationship with customers.
Support teams will need a view of the current state of the customer's metadata - especially if there were any recent changes or errors. They will need to be able to dig into any relevant metrics or logs on behalf of the customer and perhaps even take action on the customer's behalf (after proper approvals).
Marketing teams may need metrics regarding the customer's engagement or specific activities they took (or did not yet take) in the product. And they may wish to tweak certain aspects of the product experience to drive growth in certain segments or personas.
Sales teams may need to know when the customer's usage passed a certain limit. They may also need to be aware of any serious incidents, SLA misses and significant planned maintenance that will affect their customers. And of course business analytics or data science teams will need access to all the usage, engagement, costs and revenue data in order to prepare dashboards for the executives.
A credit card expiration flow may have a step that updates the sales team via Salesforce, along with many other steps:
All those business requirements indicate the need for ETL and reverse ETL between the control plane and multiple business systems - data warehouse, analytics store, marketing tools, sales tools, support systems and so on. Those integrations also require monitoring and ideally should be part of the integration testing, so you can quickly catch any breaking changes.
When using 3rd party services - you still own the SaaS flows
Since SaaS control planes are large in scope, it makes sense to integrate with 3rd party providers for specific features such as payment processing, authentication or transactional notifications.
Using 3rd party services helps Infra SaaS companies deliver value to their customers faster, but those customers still need seamless SaaS flows. External services can be part of these flows but the flow itself is still owned by the control plane developers.
Let's say you use a 3rd party authentication service. Authentication is solved, but information about users still has to exist throughout the control plane and even the data plane, since it is part of many SaaS flows. There is still a "user data store" and a "user service" which provides APIs and events to every other service that needs information related to users. All the issues we describe in this section are still problems that you own and need to address: designing SaaS flows, error handling, access management between control and data planes, testing and monitoring.
Trust but Test and Monitor
SaaS Flows have to be tested as a flow - coverage of each service alone leaves many gaps for customers to fall through. You will want an integration testing framework that allows you to test all the services, including the 3rd party ones. Testing the "reset password" API will require an environment with the authentication service, user management service and notification service.
It is also important to test all the cross-service APIs. You will want to avoid breaking compatibility between services when possible, and to know when compatibility was broken so you can come up with a deployment plan that involves all services that use the modified API. There are also APIs that were not meant to be used by other services, and yet they are. Breaking those undocumented APIs will break your application just the same. There are service mesh tools that can report which APIs are actually in use, and by which services - use those tools to understand which API contracts you need to maintain.
Make sure you collect detailed metrics about the number of users, payments, notifications or other entities in each step of the flow - a large number of flows stuck in a specific state will be your main indication that there is an error condition that your flow did not take into account.
Most SaaS Flows have implicit user expectations around latency - after clicking "reset password", users will expect the website to update in 100ms, the SMS to arrive in 30 seconds and an email to arrive within a minute or two. You will want to measure the latency of each step in the flow and queuing delays between steps.
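One lightweight way to get per-step measurements is to wrap each step so its latency is recorded under a (flow, step) key. This is a rough sketch - the flow and step names are illustrative, and in production you would emit to a real metrics system (histograms per step) rather than an in-memory dict:

```python
import time
from collections import defaultdict

# In production: a latency histogram per (flow, step) in your metrics system.
step_latencies = defaultdict(list)


def timed_step(flow: str, step: str):
    """Decorator that records how long each step of a SaaS flow takes,
    so slow steps and queuing delays between them show up on a dashboard."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                step_latencies[(flow, step)].append(time.monotonic() - start)
        return inner
    return wrap


@timed_step("reset_password", "send_email")
def send_email(user):
    ...  # call the notification service here
```

With every step instrumented the same way, an alert on "email step of reset_password above 2 minutes" becomes a one-line rule.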
Integrating Control and Data Plane
This is the core challenge of the control plane architecture. We reviewed the overall architecture in the previous blog, but here's the MVP version:
- Design the control plane metadata and use Postgres as your data store. Add a thin REST layer (e.g. PostgREST) on top of Postgres' built-in access controls and you have a minimal backend.
- Use 3rd party integrations where possible. This still requires effort, but it is a good start.
- Capture changes from the control plane that need to be applied on the data plane. With this architecture all changes are persisted to the database, so it makes sense to capture changes at the DB layer. This can be done with logical replication, Debezium, database triggers or a home-grown service.
- Deliver the events to the data plane: The most common pattern is to have the data plane poll the control plane for new events - this can be via API calls, direct database queries, or an event/messaging service.
- Data plane services react to events independently, according to their own business logic
- Data plane services update the control plane on progress and errors
Once you implement all this, make sure you set up integration testing and monitoring.
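To make the polling pattern from the steps above concrete, here is a simplified sketch of one data plane poll iteration. The callbacks (fetch_events, apply_event, ack_event) are hypothetical stand-ins for whatever transport you choose - API calls, database queries, or a message queue:

```python
def poll_control_plane(fetch_events, apply_event, ack_event, cursor: int = 0):
    """One iteration of a data plane poll loop: fetch events newer than
    our cursor, apply each one, and acknowledge progress (or errors)
    back to the control plane. Returns the advanced cursor, which the
    data plane should persist between iterations."""
    for event in fetch_events(after=cursor):
        try:
            apply_event(event)
            ack_event(event["id"], status="done")
            cursor = event["id"]
        except Exception as exc:
            # Report the failure; the control plane decides whether to retry.
            ack_event(event["id"], status="error", detail=str(exc))
            break  # stop here so events are applied in order
    return cursor
```

Persisting the cursor gives you at-least-once delivery: after a crash, the data plane re-fetches from the last acknowledged event, so applying events should be idempotent.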
Beyond this simple architecture, there are additional challenges that result from the different dimensions in which the system can evolve.
If your architecture allows users to interact directly with the data plane, you want to make sure that the data plane availability is either completely decoupled from that of the control plane or that both the control plane and the data plane and the pipelines in-between are designed for a higher SLA than what you offer your customers. If you opt for decoupling the data plane availability from that of the control plane, you'll probably end up with the data plane caching information from the control plane locally. It may sound simple, but keep in mind that cache invalidation is one of the two hardest problems in computer science.
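A minimal sketch of such a local cache, assuming a TTL-based staleness policy: entries past their TTL are still served (availability over freshness, so the data plane survives a control plane outage) but are flagged as stale so a refresh can be scheduled. The class and its policy are illustrative, not a prescription:

```python
import time


class ControlPlaneCache:
    """Data plane's local cache of control plane metadata. Serving
    stale entries instead of failing is a deliberate trade-off:
    the data plane keeps working while the control plane is down."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, fetched_at)

    def put(self, key, value):
        # Called whenever fresh metadata arrives from the control plane.
        self.entries[key] = (value, time.monotonic())

    def get(self, key):
        # Returns (value, stale). Stale values are still usable;
        # the caller should trigger a background refresh for them.
        value, fetched_at = self.entries[key]
        stale = (time.monotonic() - fetched_at) > self.ttl
        return value, stale
```

The hard part this sketch glosses over is exactly the invalidation: deciding when a stale entry (say, a revoked API key) must stop being honored even at the cost of availability.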
If you support enterprise customers, there will be interesting challenges around the security of the communication between the data plane and the control plane. They will need to mutually authenticate each other and the events themselves may need to be signed for authenticity. You'll likely need IP whitelists in both directions, publish approved port lists and support at least one private networking option, possibly more.
Some enterprise customers may also want you to run and manage the data plane, or even the control plane, in their VPC or their cloud vendor account.
You will need support for storing secrets in the control plane. It is very likely that your data plane will need to authenticate to customer resources in other SaaS, so you will ask your users for credentials - and the last thing you need is for those credentials to leak.
As the number of data plane service instances grows, you need to make sure the control plane can handle the case where they all attempt to connect to the control plane at once and retrieve updated state. This can happen as a result of an incident, a recovery plan or a mismanaged upgrade. A meltdown of the control plane under this coordinated DDoS is not going to be helpful in any of these scenarios. A combination of good database design that minimizes hot-spots and a good rate-limiting protocol will help save the day.
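A common client-side mitigation is exponential backoff with full jitter: instead of thousands of instances retrying in lockstep, each picks a random delay within a growing window, spreading reconnections out over time. A sketch (the base and cap values are illustrative):

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter, as used by data plane
    instances reconnecting to the control plane after an incident.
    attempt 0 -> up to 1s, attempt 3 -> up to 8s, capped at 5 minutes."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (a uniform draw over the whole window, rather than a fixed delay plus a little noise) is what actually flattens the reconnection spike; the server-side rate limiter then only has to handle the residual load.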
Many Infra SaaS products have use-cases that are latency sensitive. When the target latency is below 100ms, you have to avoid routing these operations via a central control plane (a regional control plane may be acceptable). The extra latency of the additional network hop will be meaningful, and the risk that the control plane will become a bottleneck is rather high.
Over time, as your product and business evolves, you may end up with multiple pipelines between control and data plane:
- Metrics and logs are often sent from data plane to control plane, so they will be visible to customers via the control plane ("single pane of glass" is a common name for this).
- There may be another system for fleet management and upgrades, one that is integrated with your CI/CD system but also with the control plane front-end and the notification service.
While those may be separate channels of tasks and information, it makes sense to view all those pipelines as part of a single "logical" control plane and standardize on the entities, events and APIs that these systems refer to. The reason is that, as we discussed in the section on SaaS Flows, customers expect a seamless experience with the control plane - not multiple control planes. They may want to subscribe to upgrade notifications or even configure a maintenance schedule. If the fleet management system and the control plane speak different languages, this integrated experience will be a challenge.
Reconciling state between data plane and control plane
Remember that things may happen on the data plane without going through the control plane first. This can be caused by the cloud provider decommissioning machines or upgrading K8s masters with surprise effects, or more often - it can be an engineer acting with the best intentions. Regardless of the cause, operating a system where the control plane has one configuration and the data plane has another is a recipe for failure. Your architecture must include plans for discovering and reconciling divergence.
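A reconciliation pass can be as simple as periodically diffing desired state (what the control plane believes) against actual state (what the data plane reports) and emitting corrective actions. A simplified sketch, keyed by resource id; note that it flags unknown resources rather than deleting them blindly:

```python
def reconcile(desired: dict, actual: dict):
    """Diff the control plane's desired state against the data plane's
    actual state and return the actions needed to converge.
    Keys are resource ids; values are their configurations."""
    actions = []
    for rid, config in desired.items():
        if rid not in actual:
            actions.append(("create", rid, config))
        elif actual[rid] != config:
            actions.append(("update", rid, config))
    for rid in actual:
        if rid not in desired:
            # Something exists that the control plane doesn't know about -
            # flag it for a human rather than deleting it blindly.
            actions.append(("flag_unknown", rid, actual[rid]))
    return actions
```

Run on a schedule, this loop catches both cloud provider surprises and well-intentioned manual changes before the two planes drift far enough apart to cause an outage.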
It is easy to look at a control plane as "just a Postgres DB with some APIs and an auth service" and believe that it is simple to build and grow. However, even at its simplest, the control plane requires careful design, good guard-rails in the form of integration tests and comprehensive monitoring, and quite a bit of toil to build the needed integrations. Systems that look easy but turn out to be a significant investment are quite common in engineering. At the MVP stage, they require balance between keeping the scope minimal while still designing a flexible system that can evolve and address both customer requirements and operational pains. We will introduce more design patterns in later blog posts that will help you in designing and implementing such systems. Join our mailing list to get notified when we publish additional posts.