A few months back, we saw a tweet about how every Infrastructure SaaS company needs to separate the control plane from the data plane to build a successful product. Reading this got us excited since we were working on a platform that would make this really easy. We would love to talk to you if you are already familiar with these patterns and are building an Infrastructure SaaS product.
We spent the last six years at Confluent, helping transform it into a world-class Infrastructure SaaS company. We shared the same sentiment as this tweet - building Infrastructure SaaS products can be much simpler if we have a platform that helps develop a reliable control plane. Companies could save significant costs and time, and they could leverage their engineers to focus more on their core products. We thought it would be helpful to explain the end-to-end architecture of an Infrastructure SaaS product, the role of the data plane and control plane, and the problems that make this challenging.
What is Infrastructure SaaS?
Infrastructure SaaS refers to any infrastructure or platform product provided as a service. It includes data infrastructure, data analytics, machine learning/AI, security, developer productivity, and observability products. Sai Senthilkumar from Redpoint wrote an excellent article on this topic and how these Infrastructure SaaS companies are among the fastest-growing companies.
Infrastructure SaaS companies invest in platform teams to build their SaaS platform. The platform teams are responsible for developing the building blocks needed to build a control plane. The investment in the platform teams continues to grow significantly as the product succeeds and is typically 25-50% of the engineering organization. Based on our experience building large-scale Infrastructure SaaS and talking to other companies, it has become apparent that platform investment is the highest cost to the engineering organization in these companies.
Data plane vs. Control plane - when do we need this?
Control planes are typically responsible for providing the SaaS capabilities, metadata management, and controlling the life cycle across all the data planes. The separation between the control and data planes is common when building an infrastructure SaaS product. There are a few reasons for this:
Productize an open-source infrastructure as a SaaS product
Most open-source infrastructure projects start with only the data plane. The project authors realize that the next step is to productize the open-source infrastructure as a SaaS product. An independent control plane is ideal for achieving the SaaS experience and ensuring that the core open-source data plane is separate. The control plane will help manage multiple data planes across regions and cloud providers.
Building any proprietary Infra SaaS product
The open-source argument is pretty strong. However, the need for a control plane is not limited to open-source infrastructure. It becomes a core need for any infrastructure SaaS product, either closed or open source. Almost all Infrastructure SaaS products need a central management layer that enables tenant management, user management, cluster management, and orchestration of all the data planes. The control plane provides a single pane of glass experience for the end-users, coordinates with all the data planes, and is responsible for the overall life cycle management.
Data locality with customer location
With infrastructure SaaS, there is a general need to keep the data plane close to the customer location for a few reasons.
- The data transfer cost will be prohibitively expensive if the data plane is network intensive. You typically want to eliminate this cost by being in the same region as the customer. There are a few other networking options to mitigate this cost (a post for another day).
- For enterprise customers, the data plane location depends on substantial compliance and regulatory requirements. Extremely security-conscious customers might want the data plane in their account to control access more tightly.
- Mission-critical infrastructure typically has low latency requirements. The data plane must be in the same region as the customer to ensure excellent performance.
High availability
For high availability, you want to avoid connecting to the data plane across geographies and to be more resilient to cross-region or cross-cloud network failures. In addition, a single data plane cluster may be hard to scale for capacity reasons and would need to be sharded. It becomes much easier to scale the data plane by decoupling it from the control plane.
Finally, supporting multiple cloud providers is becoming very popular. One model to support this would be to centralize the control plane in one cloud and deploy the data plane in different cloud providers for the same customer. There are more variants to this which we will look at later.
What does a world-class control plane need?
It would be helpful to understand what capabilities a control plane needs to support. These requirements will influence the architecture of an Infrastructure SaaS product.
User, organization, and metadata management
User and organization management are basic requirements for an Infrastructure SaaS product. User management includes authenticating users, managing users' lifecycle (add, invite, delete, update), and supporting user groups and third-party identity integrations. The control plane needs to ensure that the access controls for a user are reflected on the data plane when the user lifecycle APIs are invoked.
Organization management, sometimes known as tenant management, includes supporting the organization hierarchy data model, applying quotas, SKUs, security policies at an organization's scope, and end-to-end life cycle management. Multitenancy is a basic need for a SaaS application, and Infrastructure SaaS is not any different. For larger customers, organization management becomes pretty complex, including supporting flows to merge two or more organizations, suspending organizations, and implementing clean organization deletions based on regulatory requirements (GDPR, FedRAMP, etc.). The control plane needs to ensure tenant lifecycle management is reflected on the data plane as well. For example, when an organization is suspended, the control plane needs to ensure that the data plane cuts access temporarily.
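The suspension example can be sketched in a few lines. This is a minimal illustration under our own assumptions, not a prescribed design: `ControlPlane`, `DataPlane`, and `OrgStatus` are hypothetical names, and the in-process loop stands in for whatever transport connects the planes in practice.

```python
from dataclasses import dataclass, field
from enum import Enum


class OrgStatus(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"


@dataclass
class DataPlane:
    """Hypothetical stand-in for one regional data plane."""
    region: str
    blocked_orgs: set = field(default_factory=set)

    def apply_org_status(self, org_id: str, status: OrgStatus) -> None:
        # The data plane only enforces what the control plane tells it.
        if status is OrgStatus.SUSPENDED:
            self.blocked_orgs.add(org_id)
        else:
            self.blocked_orgs.discard(org_id)

    def allows(self, org_id: str) -> bool:
        return org_id not in self.blocked_orgs


@dataclass
class ControlPlane:
    """Holds the source of truth and fans changes out to every data plane."""
    data_planes: list
    org_status: dict = field(default_factory=dict)

    def set_org_status(self, org_id: str, status: OrgStatus) -> None:
        self.org_status[org_id] = status
        for dp in self.data_planes:
            dp.apply_org_status(org_id, status)
```

Calling `set_org_status("acme", OrgStatus.SUSPENDED)` blocks the organization on every registered data plane, and setting it back to `ACTIVE` restores access. A real system would also need retries and acknowledgements for data planes that are temporarily unreachable.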
There are many standard SaaS entities that a SaaS application needs - users and organizations are examples of that. At the same time, there is a lot of application-specific metadata. For example, an infrastructure product that lets users manage a set of database clusters could define metadata like ‘cluster’, ‘network,’ and ‘environment.’ The metadata needs to be defined, CRUD APIs need to be written to manage them, and their access needs to be controlled by the same security policies defined for users and the organization. The central control plane needs to be the source of truth for this metadata and support its management.
Orchestration and integration with data planes
The control plane should have near-instantaneous communication with the data plane - whether it manages a single data plane or hundreds of clusters across different regions and cloud providers. It needs to communicate and transfer data securely across the data planes and receive data back. The control plane needs to provide a single pane of glass view of all the metadata of an organization’s data plane. Pushing configuration changes, sharing application metadata, deployment, and maintenance operations are a few examples where the control plane needs to have the ability to orchestrate across the different data planes.
Lifecycle management of the data plane
One of the control plane's core needs is to manage the data plane's end-to-end lifecycle. Typically, this includes creating, updating, and deleting resources in the data plane. For Infrastructure SaaS, all these operations are asynchronous, and you want the control plane to manage the end-to-end flow. It needs to provide an excellent experience for end-users when they invoke these lifecycle operations, ensure low latency and correctness in executing these async lifecycle operations, and do this at scale for hundreds to thousands of data planes across regions and cloud providers.
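One common way to handle asynchronous lifecycle operations is to return an operation ID immediately and track its state as the data plane reports progress. The sketch below assumes that approach; `LifecycleManager`, `Operation`, and the state names are hypothetical and chosen only for illustration.

```python
import itertools
from dataclasses import dataclass

# Allowed state transitions for an asynchronous operation (assumed for this sketch).
TRANSITIONS = {
    "PENDING": {"IN_PROGRESS", "FAILED"},
    "IN_PROGRESS": {"SUCCEEDED", "FAILED"},
}


@dataclass
class Operation:
    op_id: int
    kind: str        # "create", "update", or "delete"
    resource: str
    state: str = "PENDING"


class LifecycleManager:
    def __init__(self):
        self._ids = itertools.count(1)
        self.operations = {}

    def submit(self, kind: str, resource: str) -> int:
        """Record the request and return right away; the work happens later."""
        op = Operation(next(self._ids), kind, resource)
        self.operations[op.op_id] = op
        return op.op_id

    def report(self, op_id: int, new_state: str) -> None:
        """Called when a data plane reports progress on an operation."""
        op = self.operations[op_id]
        if new_state not in TRANSITIONS.get(op.state, set()):
            raise ValueError(f"illegal transition {op.state} -> {new_state}")
        op.state = new_state
```

Rejecting illegal transitions, such as moving a finished operation back to `IN_PROGRESS`, is one small part of the correctness requirement mentioned above. Duplicate or out-of-order reports from data planes would need more careful handling than this sketch provides.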
Security Policies
Security policies include access controls, quotas, and data governance for each organization and user. Access controls can range from simple permissions to complex RBAC support. Typically, these access controls need to apply to both the control plane and the data plane. When there are IDP integrations, the control plane may have to apply access controls based on what is defined in the IDP. Quotas are bounds on the set of operations that users and tenants can perform on the infrastructure. This is typically done to protect against denial-of-service attacks and to build a healthy multitenant system. Quotas, similar to access controls, can apply to both control plane and data plane operations. Data governance is increasingly critical for larger customers. Governance includes the ability to find data easily, store data in compliance with country-specific policies, and purge data based on data retention rules. For all the security policies, the control plane needs to ensure they are applied to all the data planes consistently based on tenant location, policies, and user rules.
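To make access controls and quotas concrete, here is a small sketch of a role-based permission check and a per-organization quota tracker. The roles, action names, and limits are invented for illustration; a real product would load them from its own policy store.

```python
# Hypothetical role-to-permission mapping (simple RBAC).
ROLE_PERMISSIONS = {
    "admin": {"cluster:create", "cluster:delete", "cluster:read"},
    "viewer": {"cluster:read"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role grants the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


class QuotaTracker:
    """Tracks resource usage per organization against fixed limits."""

    def __init__(self, limits: dict):
        self.limits = dict(limits)   # e.g. {"clusters": 2}
        self.usage = {}              # (org_id, resource) -> amount used

    def try_consume(self, org_id: str, resource: str, amount: int = 1) -> bool:
        limit = self.limits.get(resource)
        used = self.usage.get((org_id, resource), 0)
        if limit is not None and used + amount > limit:
            return False             # over quota: reject the operation
        self.usage[(org_id, resource)] = used + amount
        return True
```

Usage is tracked per organization, so one tenant exhausting its quota does not affect another, which is the multitenancy property the quota is meant to protect.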
Metrics, alerting, and insights
For infrastructure SaaS, users typically execute some commands on the infrastructure and would like to know how the execution is progressing. They might want to get notified or alerted when something is not going right or if it needs their attention. For example, a database product will have users execute a set of queries and look at query metrics to understand the response times, errors, and usage. Users may also want to get notified when a query starts failing. Users also need aggregated metrics across their databases in different regions and cloud providers. The control plane needs to aggregate all the metrics across the data planes, provide insights on the metrics to the users, and alert the customers on critical anomalies.
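As a rough sketch of the aggregation step, the functions below sum per-region query metrics for each organization and flag organizations whose overall error rate crosses a threshold. The report shape (`org_id`, `queries`, `errors`) and the 5% threshold are assumptions made for this example.

```python
from collections import defaultdict


def aggregate(reports):
    """Sum query and error counts per organization across all data planes."""
    totals = defaultdict(lambda: {"queries": 0, "errors": 0})
    for report in reports:
        entry = totals[report["org_id"]]
        entry["queries"] += report["queries"]
        entry["errors"] += report["errors"]
    return dict(totals)


def orgs_to_alert(totals, max_error_rate=0.05):
    """Return the organizations whose aggregated error rate exceeds the threshold."""
    return sorted(
        org
        for org, t in totals.items()
        if t["queries"] and t["errors"] / t["queries"] > max_error_rate
    )
```

Aggregating before alerting matters here: a region with a handful of failing queries can look alarming on its own while the organization as a whole is healthy, and the reverse can also be true.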
Subscriptions and Usage-based billing
Infrastructure SaaS has fundamentally changed how billing for SaaS works. Traditional SaaS is billed based on the number of seats/users. For Infrastructure, it is typically a combination of monthly subscription plus billing based on product consumption. For example, a specific infrastructure product might provide customers with three different SKUs with increasing value. The first SKU could be a free tier with limited quota, and the rest could have a base monthly subscription fee. In addition, the users would pay based on their usage of the product that month. Consumption could include throughput, number of queries, storage, number of compute instances, etc.
The control plane must be able to compute the billing based on the user SKU, monthly rate, and usage. The usage component must be calculated based on the metadata or usage metrics aggregated from all the data planes. The metrics and insights shown to the user (explained above) should match the billing data to ensure the user has a consistent experience.
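The billing computation described above boils down to a base fee plus usage charges summed over all data planes. Here is a minimal sketch, assuming made-up SKU names, rates, and metric names; `Decimal` is used to avoid floating-point rounding in money calculations.

```python
from decimal import Decimal

# Hypothetical SKUs: a monthly fee plus a per-unit rate for each usage metric.
SKUS = {
    "free": {"monthly_fee": Decimal("0"), "rates": {}},
    "standard": {
        "monthly_fee": Decimal("100"),
        "rates": {"storage_gb": Decimal("0.10"), "queries": Decimal("0.001")},
    },
}


def aggregate_usage(per_data_plane_usage):
    """Sum each usage metric across the reports from all data planes."""
    totals = {}
    for usage in per_data_plane_usage:
        for metric, amount in usage.items():
            totals[metric] = totals.get(metric, Decimal("0")) + Decimal(str(amount))
    return totals


def monthly_invoice(sku, per_data_plane_usage):
    """Base subscription fee plus usage charges for one billing month."""
    plan = SKUS[sku]
    usage = aggregate_usage(per_data_plane_usage)
    usage_charge = sum(
        (plan["rates"].get(metric, Decimal("0")) * amount
         for metric, amount in usage.items()),
        Decimal("0"),
    )
    return plan["monthly_fee"] + usage_charge
```

For a `standard` customer with 200 GB of storage and 1,000 queries spread over two data planes, this gives 100 + 200 × 0.10 + 1,000 × 0.001 = 121. Feeding the same aggregated usage into both the invoice and the user-facing metrics is one way to keep the two consistent.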
Back office integration
SaaS companies must integrate customer metadata with all the back-office systems. When a user signs up, the marketing team will want the user information in their campaign tool to start including the new users as part of their marketing campaigns. In a sales-led company, the sales rep must create a new production account for the customer once the deal is successfully closed in their CRM. The customer metadata must be pushed to the data warehouse for business insights. These examples need a reliable pipeline that integrates data between the production database and the back-office systems. For Infrastructure SaaS, the control plane is the central customer metadata store in production. It needs to provide a reliable pipeline that integrates both ways with all the back-office systems. In addition, the data needs to be available in all the different systems within an acceptable SLA agreed to by all stakeholders.
Infrastructure SaaS architecture
With our understanding of the control plane and infrastructure SaaS requirements, let us delve into the architecture of a typical Infrastructure SaaS product. We will review the basic building blocks and discuss a few architectural considerations.
The basic building blocks of a control plane
SaaS fundamentals (aka The SaaS Mesh)
Infrastructure SaaS products need a world-class SaaS experience for their customers. It includes authenticating users and user management, organization management, providing a permission model for access controls, defining different SKUs for different product offerings, and the ability to bill based on subscription or usage of the product. These are basic expectations from end-users for all Infrastructure SaaS products. A basic version of all these features listed may be good enough for a free tier offering with some quotas, and they get complex as you serve higher segments (e.g., enterprise). For example, mid to enterprise customers may need to integrate with their identity management system for authentication instead of the default offering that the product provides.
There is also a complex interconnect between the SaaS features that we call ‘SaaS flows’. The SaaS experience of an infrastructure SaaS product consists of a bunch of SaaS flows. For example, when a user signs up for the product, you may also want to create an entry in a marketing tool to send campaigns. A more complex example of a SaaS flow could be when a credit card for a specific organization expires. On expiry, you want to notify the customer a few times to update the credit card information. If there is no response from the customer, you might want to temporarily suspend the account, disable access to the data plane, and eventually reclaim the account after waiting long enough, to avoid incurring infrastructure cost. This SaaS flow example connects user management, org management, billing, notifications, and the data plane. We call this interconnect between the different SaaS features a ‘SaaS mesh.’ The SaaS mesh is needed to build the different SaaS flows. The SaaS flows include customer-facing experiences and back-office flows for the other stakeholders in a company.
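The expired credit card flow can be written down as a small decision function. This is only a sketch: the reminder days, the suspension and reclaim thresholds, and the action names are all hypothetical values picked for the example.

```python
def next_action(days_since_expiry, card_updated,
                reminder_days=(0, 3, 7), suspend_after=14, reclaim_after=44):
    """Decide what the expired-card flow should do today for one organization.

    All thresholds are illustrative defaults, not recommendations.
    """
    if card_updated:
        return "none"                            # customer fixed it; end the flow
    if days_since_expiry >= reclaim_after:
        return "reclaim_account"                 # stop incurring infrastructure cost
    if days_since_expiry >= suspend_after:
        return "suspend_and_disable_data_plane"  # org management + data plane
    if days_since_expiry in reminder_days:
        return "send_reminder"                   # notifications
    return "wait"
```

Even this tiny function touches billing (card state), notifications, organization management, and the data plane, which is the interconnect we refer to as the SaaS mesh.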
Orchestrating the data planes
One of the core responsibilities of the central control plane is to orchestrate the different data planes. Typically, a single customer could have applications or clusters in multiple regions or cloud providers. With more customers and data planes, a few things have to be supported:
- Propagating the SaaS metadata to all the data planes
- Pushing new configurations and application versions across the fleet
- Defining maintenance windows and sequence of deployment based on customer priorities
- Capacity management of all the data planes to ensure infrastructure is within the cloud limits
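As one illustration of the maintenance window and deployment sequencing item above, the sketch below picks the data planes whose window is currently open and orders them by priority. The field names, and the choice that a lower `priority` number is deployed first, are assumptions for this example.

```python
def in_window(window, hour):
    """True if `hour` (0-23) falls in the [start, end) window, which may wrap past midnight."""
    start, end = window
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


def rollout_order(data_planes, hour):
    """Names of data planes eligible for deployment now, in rollout order."""
    eligible = [dp for dp in data_planes if in_window(dp["window"], hour)]
    # Lower priority number goes first; the name breaks ties deterministically.
    eligible.sort(key=lambda dp: (dp["priority"], dp["name"]))
    return [dp["name"] for dp in eligible]
```

A real scheduler would also account for time zones, rollout health checks between batches, and the cloud capacity limits mentioned in the list.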
Data plane management
The data plane is where the actual customer application or cluster is deployed. The deployment can happen on a Kubernetes cluster or directly on cloud instances. The cluster or application could be co-located on one Kubernetes cluster or separate. There is usually an agent running on these data planes that helps execute the local life cycle operation on the data plane based on the commands from the control plane. The agent, in a sense, acts like a mini control plane co-located in the data plane. Like any architecture, there are different ways to manage the data plane. Kubernetes, Terraform, or Temporal are tools that could be used to manage the lifecycle of each data plane.
Closed feedback loop
No control plane architecture is complete without a closed feedback loop with the data plane. As mentioned previously, the control plane is the source of truth about the current state of customer applications or clusters. The data plane needs to report the status of the operations to the control plane. In addition, the control plane would also want to collect application metrics and metadata to show insights to the users about the infrastructure.
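One way to use the reported status is to compare it against the state the control plane expects and derive the corrective actions. The sketch below assumes both sides are simple name-to-version mappings; `diff_state` is a hypothetical helper, not part of any specific tool.

```python
def diff_state(desired, observed):
    """Compare desired state with what a data plane reported.

    Both arguments map a resource name to its version or spec.
    Returns the resources to create, update, and delete.
    """
    desired_names, observed_names = set(desired), set(observed)
    return {
        "create": sorted(desired_names - observed_names),
        "update": sorted(
            name for name in desired_names & observed_names
            if desired[name] != observed[name]
        ),
        "delete": sorted(observed_names - desired_names),
    }
```

Running this comparison every time a data plane reports in closes the loop: drift is detected from the reports themselves rather than assumed away.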
Customer account vs. Fully hosted
In the fully hosted model, the data plane is deployed in the cloud account of the service provider. Due to compliance requirements, some companies and customers demand that the infrastructure be deployed in their own cloud accounts. Some infrastructure SaaS companies need to support both models. It is possible to unify the architecture for these different deployment models. Deploying infrastructure in the customer account raises several additional considerations: the permission model in the customer account, billing plans (the customer gets usage cost in their cloud bill), support (who gets access), and development overhead.
Testing
Testing new data plane changes against the central control plane adds complexity. Mocking the entire control plane for end-to-end tests is not desirable since you typically have to make changes in the control plane to enable new data plane features, which need to be tested. A reasonable solution is to provide each developer their own local sandbox of the control plane with only their changes. It will help them to test their changes locally before pushing the changes to pre-production. Without a good testing strategy for the control plane, every change gets harder to stabilize as the product and teams scale.
Disaster recovery
An essential part of control plane design is to have a sound strategy when the control plane becomes entirely unavailable. From a user perspective, the data plane needs to be available even if the control plane is unavailable. In addition, there needs to be a plan to bring the control plane back up in the same region or another region (sometimes in another cloud). Restoring the data without any data loss is critical. You can provide a highly available service if you can bootstrap a control plane automatically from the backup data.
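One common way to keep the data plane serving during a control plane outage is to cache the last configuration it received. The sketch below assumes that approach; `DataPlaneConfigCache` and `ControlPlaneUnavailable` are hypothetical names, and `fetch` stands in for whatever call retrieves configuration from the control plane.

```python
class ControlPlaneUnavailable(Exception):
    """Raised by the fetch function when the control plane cannot be reached."""


class DataPlaneConfigCache:
    """Serves the last known configuration when the control plane is down."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable returning the latest config
        self._last_known = None

    def current(self):
        try:
            self._last_known = self._fetch()
        except ControlPlaneUnavailable:
            if self._last_known is None:
                raise              # never had a config: nothing safe to serve
        return self._last_known
```

The cache only covers the read path. Changes made while the control plane is down still have to wait, or be queued, until it is restored from backup.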
This is hard!
We plan to publish a post soon covering the complex parts of building and scaling a control plane. We have listed a few questions below that take significant time and cost to design and build.
- How do you build the SaaS fundamentals for your product? How do you support the different SaaS flows?
- How do you manage all the SaaS and application metadata to provide a single pane of glass experience for your users across all the data planes?
- What mechanisms do you use to ensure the metadata changes are available to all the data planes?
- How is everything designed to support multitenancy? How do you enforce metadata access, quotas, and SKUs at the tenant scope?
- How does the architecture change when you orchestrate thousands of data planes?
- How can you collect all the metrics and metadata from all the data planes to provide insights to the users, compute billing based on usage, and integrate with the data warehouse for business intelligence?
- How can the control plane scale, and what SLAs do you provide?
Building Infrastructure SaaS?
In the next eight weeks, we will release a series of blog posts describing different aspects of the control plane architecture for Infrastructure SaaS. We hope this will help all the companies building, scaling, or rearchitecting their control planes to provide their infrastructure as a service in the cloud.
We would love to talk to you if this problem sounds similar to what you are tackling! We are building a platform that will make it easy to build and scale Infrastructure SaaS products. We hope to provide a world-class platform that Infrastructure SaaS companies can leverage to develop their control plane.