
· 16 min read

At Nile, we’re making it easier for companies to build world-class control planes for their infra SaaS products. Multi-tenancy is core to all SaaS products and especially those with control-plane architectures. At Nile, we’ve built multi-tenancy into our product from day one. If you are working on an infra SaaS product and need a multi-tenant control plane, you should talk to us.

From previous experience, we’re familiar with multiple multi-tenant SaaS architecture options. We decided to store everything in a single Postgres schema since it provides a balance of scalability, cost optimization, and flexibility. However, this requires serious investment in database authorization to guarantee that we never leak customer data.

Authorization in a multi-tenant db is something many companies have to deal with, and in previous companies, I saw authorization implemented in probably the most common way: appending WHERE user_id = $USER_ID to queries. This is also the way things started out at Nile, but as we added more features we noticed that we were forced to add many branching and repetitive WHEREs to our code. We needed a solution that would allow us to add features quickly and confidently, and using custom filters in every single query was error-prone and hard to evolve if our data model changed.

RLS code excerpt

One solution that I knew about was Postgres Row-Level Security (RLS), a db-level mechanism that filters rows on a per-user basis. I expected it would allow us to iterate faster and dramatically reduce security risks. You can learn the basics with these two blogs that show how to build multi-tenant apps using Postgres RLS. As with most solutions, the blog version was easy to implement, but there was an especially long tail to ship to production.

In this blog post, I’ll talk about the alternatives we considered - both for multi-tenant architecture and for securing data access - why we chose RLS, and the various challenges we encountered and overcame while shipping it to production.

Existing multi-tenancy solutions

Schema-per-tenant and database-per-tenant

We considered both of these approaches but went with the single-schema approach for its minimal operational complexity, low cost, and ability to scale later on. I won’t go into detail about these approaches, as there are countless resources on the topic. Here are two resources I’ve found to be helpful:

  1. Multi-tenant SaaS patterns - Azure SQL Database | Microsoft Docs
  2. A great paper from Microsoft - Multi-Tenant Data Architecture

Single schema with dynamic WHERE queries

Pros

  1. Easiest and most straightforward zero-to-one solution.
  2. Transparent and easy to reason about.

Cons

  1. Possibility of forgetting to add a filter to a query. Since queries are permissive by default, this is easy to miss and hard to detect without extensive testing. There are some solutions to this (e.g., @Filter in Hibernate), but I find that ORMs make simple querying easier and complicated querying harder. At Nile, our authorization model is complicated enough that we didn’t want to rely on Hibernate for this.
  2. Repetitive, ugly, and annoying to implement. Imagine you have 20 API endpoints that require authorization and 2 different types of roles, USER and ADMIN. The access controls for these two roles are different, so you might have to define 40 WHEREs across your codebase. This doesn't scale well when adding new roles or modifying existing ones across many API endpoints.

External authorization systems

Pros

  1. Highly flexible
  2. (Claim to be) scalable

Cons

  1. $$$ cost, if managed. Operational cost, if self-hosted.
  2. Unnecessary if the permissioning model isn’t particularly complicated. At Nile, so far it’s not.
  3. External dependencies often make testing more difficult and reduce engineering velocity. The benefits have to outweigh these costs.
  4. As a control plane, multi-tenancy is core to our product. We believe in building foundational capabilities in-house so that we can push the envelope rather than be constrained by external solutions.

What might a better solution look like?

After we chose to use a single multi-tenant schema, we were looking for a solution that would be cleaner and less error-prone than dynamic queries and lighter than an external authorization system.

In the rest of this blog post, I’ll lay out what I discovered about RLS in the few weeks I spent researching and implementing it at Nile, and how it solved our problem (at least for now) of building authorization with speed, confidence, and maintainable architecture.

A quick overview of RLS

The high-level process to set up RLS is:

  1. Define your data model as usual, but include a tenant identifier in every table
  2. Define RLS policies on your tables (e.g., “only return rows for the current tenant”)
  3. Define a db user (e.g., app_user) with all the privileges your application will need to interact with the db, but without any superuser roles. In Postgres, this is necessary since superuser roles bypass all permission checks, including RLS (more on that later).

A simple org access control example

Imagine your API has an /orgs endpoint that should only return organizations the calling user is a member of. To achieve this via RLS, you’d define your tables, policies, and db user as follows:

rls_policy_setup.sql
CREATE TABLE users(
    id SERIAL PRIMARY KEY
);

CREATE TABLE orgs(
    id SERIAL PRIMARY KEY
);

-- "user" is a reserved word in Postgres, so the columns are
-- named user_id and org_id.
CREATE TABLE org_members(
    user_id INTEGER REFERENCES users NOT NULL,
    org_id INTEGER REFERENCES orgs NOT NULL
);

-- ** RLS setup **
ALTER TABLE orgs ENABLE ROW LEVEL SECURITY;

-- Create a function, current_app_user(),
-- that returns the user to authorize against.
CREATE FUNCTION current_app_user() RETURNS INTEGER AS $$
    SELECT NULLIF(current_setting('app.current_app_user', TRUE), '')::INTEGER
$$ LANGUAGE SQL SECURITY DEFINER;

CREATE POLICY org_member_policy ON orgs
    USING(
        EXISTS(
            SELECT 1
            FROM org_members
            WHERE user_id = current_app_user()
            AND org_id = orgs.id
        )
    );

-- Create the db user that'll be used in your application.
CREATE USER app_user;

GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO app_user;

GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO app_user;

The above RLS policy will only return true for organizations that the current user is a member of. Simple enough. Later on, we’ll see how things can get more complicated.

Note the current_app_user() function. In the traditional use case of direct db access, RLS works by defining policies on tables that filter rows based on the current db user. For a SaaS application, however, defining a new db user for each app user is clunky. For an application use case, you can dynamically set and retrieve users using Postgres’ current_setting() function (e.g., SET app.current_app_user = '42' and SELECT current_setting('app.current_app_user')).
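
Concretely, the per-request flow looks something like this (a sketch against the schema above, with an integer user id):

set_app_user.sql
-- Run by the application at the start of each request, on the connection it
-- checked out from the pool.
SET app.current_app_user = '42';

-- RLS policies read the value back through current_app_user().
SELECT current_setting('app.current_app_user'); -- returns '42'

-- From here on, org_member_policy filters every query against orgs.
SELECT * FROM orgs;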

What it looks like from request to response

Request to response diagram

Why we chose RLS

It fails by default - and is therefore secure by default

The biggest benefit of RLS is that if you define a policy that’s too restrictive, or forget to define a policy, things just fail. Compared to dynamic queries, where forgetting to add a WHERE will leak data, this is a big win for security. I didn’t appreciate this until I wrote some integration tests for access patterns (e.g., testing whether a user can access orgs they’re a part of). Initially, all the tests failed, and tests for cases where users should have access only passed once I added the appropriate RLS policies.

RLS is, of course, not a silver bullet. Accidentally defining an overly permissive policy is hard to catch without extensive tests so it’s important to still be careful.

Defined once, applied everywhere

One of the main challenges with dynamic queries in single-schema multi-tenancy is that changes to tables often require touching many different queries. RLS solves this problem since policies are tied to tables and not queries. After modifying a table, all you need to do is to change its access policies, which will be applied to all queries.

Composability

With RLS, it’s easy to add more access rules as your multi-tenant data model evolves. According to the Postgres docs:

“When multiple policies apply to a given query, they are combined using either OR (for permissive policies, which are the default) or using AND (for restrictive policies).”

Since policies are combined with OR by default, it’s easy to define more policies as your access rules get more complex. This isn’t so straightforward with dynamic queries, where you might have to define your own logic for combining access rules. Or, as many of us have probably seen before, just create monster WHERE statements.
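
For example, granting read access to a support role is just one more permissive policy next to org_member_policy, ORed in with no changes to existing policies or queries. A sketch (is_support() is a hypothetical helper, not something defined in this post):

support_policy.sql
-- Combined with org_member_policy using OR, since both are permissive.
-- is_support() is a hypothetical helper that returns whether the current
-- app user is on the support team.
CREATE POLICY support_read_policy ON orgs
    FOR SELECT
    USING (is_support(current_app_user()));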

Separation of Concerns

Instead of mixing filters that are related to our application logic with filters that are related to the multi-tenant database design in the same WHERE clauses, we now have a clean separation:

  • Our application applies all the filters that are requested by users through APIs and other application logic.
  • RLS is responsible for filters that are required due to the multi-tenant database design.

Cases where RLS isn’t a great fit

Every technology has its tradeoffs and cases where you shouldn’t use it. Here are two cases where we think RLS isn’t a great fit:

If you need stronger isolation between tenants

RLS in a multi-tenant db isolates access to database rows, but all other database resources are still shared between tenants. It doesn’t help with limiting the disk space, CPU, or db cache used per tenant. If you need stronger isolation at the db level, you will need to look elsewhere.

If you have sophisticated access policies

As you will see in the next section, our current access policy is fairly simple - tenants are isolated from each other, and within a tenant, you have administrators with additional access. More mature access control policies such as RBAC/ABAC require their own schema design and can be more challenging to integrate with RLS and even more challenging to make performant.

We’ve recently started the design for the RBAC/ABAC feature in Nile (talk to us if you are interested in joining the conversation), and we will have a follow-up blog with recommendations on best practices for adding RBAC/ABAC to multi-tenant SaaS.

Implementation challenges

A few gotchas

One gotcha we encountered was that RLS doesn’t apply to superusers and table owners. According to the Postgres docs:

“Superusers and roles with the BYPASSRLS attribute always bypass the row security system when accessing a table. Table owners normally bypass row security as well, though a table owner can choose to be subject to row security with ALTER TABLE ... FORCE ROW LEVEL SECURITY.”

Both of the blogs I shared earlier create a user called app_user that’s used in the application. We did this as well, locally, but didn’t change the database user when deploying to our testing environment. Thankfully, we caught and fixed this issue quickly.

Another issue we caught during testing was that some requests were being authorized with a previous request’s user id. We discovered that since the user id for RLS was being stored in thread-local storage and threads were being reused for requests, it was necessary to set up a post-response handler to reset thread-local storage.
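
One way to make this class of bug harder to hit at the database level is to scope the setting to a single transaction (a sketch; it trades per-request SETs for wrapping each request in a transaction):

scoped_app_user.sql
BEGIN;
-- SET LOCAL lasts only until COMMIT/ROLLBACK, so a reused connection can
-- never carry a previous request's user id.
SET LOCAL app.current_app_user = '42';
-- ... run the request's queries here ...
COMMIT;

-- Alternatively, clear the value explicitly before returning a connection
-- to the pool.
RESET app.current_app_user;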

Overall, so far the gotchas haven’t been too tricky to diagnose and resolve, but as one might expect with anything security-related, they do have serious consequences if not addressed.

Initial widespread code changes

Although RLS addresses the problem of continuous widespread changes well (see “Defined once, applied everywhere”), initially switching from dynamic queries to RLS requires more code changes than you might think. Here’s an example of how RLS might affect an API endpoint to update an organization that’s only callable by users in that org:

before_and_after_rls.java
/*
** ---- Without RLS ---- **

1. Check if user is a member of the org
   a. If so, execute the update query
   b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
    if (dao.isOrgMember(userId, orgId)) {
        return dao.updateOrg(updatePayload);
    } else {
        throw new NotFoundException();
    }
}

/* -- DAO layer -- */

boolean isOrgMember(userId, orgId) {
    return query("EXISTS(SELECT 1 ...)");
}

Org updateOrg(updatePayload) {
    return query("UPDATE orgs SET ... RETURNING *");
}

/*
** ---- With RLS ---- **

1. Execute the update query
   a. If the org was returned from the db, return the org in the response
   b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
    Optional<Org> maybeOrg = dao.updateOrg(updatePayload);
    if (maybeOrg.isPresent()) {
        return maybeOrg.get();
    } else {
        throw new NotFoundException();
    }
}

/* -- DAO layer -- */

Optional<Org> updateOrg(updatePayload) {
    return query("UPDATE orgs SET ... RETURNING *");
}

In this example, authorization without RLS is done before writing to the db. With RLS, since authorization is determined at query time, write queries might fail so error handling has to be pushed down to the db level. This isn’t a mind-boggling change but is one you should keep in mind when planning to add RLS in any project that involves a multi-tenant db.

The gaps between blog-ready and production-ready RLS

Recursive permission policies

Let’s say you want to add an admin user type and implement the following access rules:

  1. Users can read, update, and delete their own user profiles.
  2. Users can read the profiles of other users who belong to the same tenant.
  3. Users with admin access can read, update, and delete other users who belong to the same tenant.

The first two use cases are possible with straightforward RLS policies, but the third isn’t. This is because we must query the users table to see if the user in question is an admin (e.g., SELECT 1 FROM users WHERE id = current_app_user() AND is_admin = TRUE). Since querying a table triggers its RLS policy checks, executing this query within a users RLS policy will trigger the users RLS policy checks, which will run this query again, which will trigger the policy checks again, resulting in an infinite loop. Postgres will catch this error instead of timing out, but you should make sure to test your policies so this doesn’t happen at runtime. You can avoid this problem by defining a function with SECURITY DEFINER permissions to be used in the RLS policy. According to the Postgres docs:

"SECURITY DEFINER specifies that the function is to be executed with the privileges of the user that owns it."

In our case, the owner is the superuser that you probably used to set up your database, so the function bypasses RLS.

note

By using SECURITY DEFINER you are allowing users to bypass the security policy and use superuser privileges regardless of who they really are, so you must be careful. I recommend reviewing the “Writing SECURITY DEFINER Functions Safely” section of the Postgres documentation before using this capability.
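
One safeguard from that section worth calling out here: pin the search_path of SECURITY DEFINER functions so that callers can’t hijack name resolution with objects in their own schemas. Applied to the setup from earlier, it might look like this (a sketch):

pin_search_path.sql
-- pg_temp goes last so temporary objects can never shadow trusted ones.
ALTER FUNCTION current_app_user() SET search_path = public, pg_temp;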

Here’s an example of how to implement RLS policies that satisfy the three use cases above:

complex_rls_policy.sql
CREATE TABLE users(
    id SERIAL PRIMARY KEY,
    is_admin BOOLEAN
);

ALTER TABLE users ENABLE ROW LEVEL SECURITY;

-- Users can do anything to themselves.
CREATE POLICY self_policy ON users
    USING(id = current_app_user());

CREATE FUNCTION is_user_admin(_user_id INTEGER) RETURNS bool AS $$
    SELECT EXISTS(
        SELECT 1
        FROM users
        WHERE id = _user_id
        AND is_admin = TRUE
    )
$$ LANGUAGE SQL SECURITY DEFINER;

CREATE FUNCTION do_users_share_org(
    _user_id_1 INTEGER,
    _user_id_2 INTEGER
) RETURNS bool AS $$
    SELECT EXISTS(
        SELECT 1
        FROM org_members om1,
             org_members om2
        WHERE om1.user_id != om2.user_id
        AND om1.org_id = om2.org_id
        AND om1.user_id = _user_id_1
        AND om2.user_id = _user_id_2
    )
$$ LANGUAGE SQL SECURITY INVOKER;

-- Non-admins can only read users in their orgs.
CREATE POLICY read_in_shared_orgs_policy ON users
    FOR SELECT
    USING(
        do_users_share_org(current_app_user(), id)
    );

CREATE POLICY admin_policy ON users
    USING(
        do_users_share_org(current_app_user(), id)
        AND is_user_admin(current_app_user())
    );

Note the use of the do_users_share_org() SECURITY INVOKER function. According to the Postgres docs:

“SECURITY INVOKER indicates that the function is to be executed with the privileges of the user that calls it.”

In our case, this is app_user (who doesn’t bypass RLS), so we just define these functions for reusability purposes.

Logging

It’s important to set up logging before shipping any feature to production. This is especially true with RLS, where logging the execution of the actual policies isn’t directly possible. For each request, it’s helpful to log the user and tenant IDs to be used for RLS when:

  • Parsing them from auth headers
  • Setting and getting them from thread-local storage
  • Setting them in the db connection
  • Resetting them in thread-local storage after the response (this makes it easier to identify bugs related to thread-local storage)

It’s also a good idea to enable more detailed logging in the db, at least initially, to see the values actually being inserted/retrieved. If policies return too few/many results, or inserts fail unexpectedly, it’s easier to figure out what went wrong.
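
At the Postgres level, detailed statement logging can be enabled per database while you shake out your policies (a sketch; mydb is a placeholder):

db_logging.sql
-- 'mod' logs INSERT/UPDATE/DELETE and other data-modifying statements;
-- 'all' also logs reads, which is useful while debugging policies.
ALTER DATABASE mydb SET log_statement = 'mod';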

Testing

In multi-tenant SaaS, guaranteeing the security of each tenant is critical. We have an extensive suite of integration tests that test every access pattern to make sure that nothing ever leaks. The tests spin up a Postgres Testcontainer and call the relevant API endpoints, checking that proper access is always enforced.

In order to minimize the execution time of a large suite of integration tests, we avoid setup and teardown of the database between tests and annotate the order in which tests run to make sure the results are deterministic even without a full cleanup in between tests. As we scale, we’ll look into other options like property-based testing and parallelizing our tests.

The switch from dynamic queries to RLS has been seamless in our integration tests. All we had to do was to make sure our tests were using the newly-created app_user that doesn’t bypass RLS.
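
The kind of assertion these tests make can be sketched directly in SQL against the earlier schema (run by a superuser; SET ROLE drops to app_user, which doesn’t bypass RLS):

rls_isolation_check.sql
SET ROLE app_user;

SET app.current_app_user = '1';
SELECT count(*) FROM orgs; -- only the orgs user 1 belongs to

SET app.current_app_user = '2';
SELECT count(*) FROM orgs; -- a different user's view; no rows leak across

RESET ROLE;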

Conclusion

Every modern SaaS product is multi-tenant, but the good ones are also scalable, cost-effective, and maintainable. Scalability and cost-effectiveness are the results of careful system design. Maintainability includes design considerations such as the DRY principle (don’t repeat yourself) and a separation of concerns, which make mistakes less likely and testing and troubleshooting easier.

As we’ve shown, a single-schema multi-tenant database with RLS ticks all the checkboxes for scalable, cost-effective, and maintainable architecture. This blog includes everything you need to get started with your own multi-tenant SaaS architecture. But if this seems like too much and you’d rather have someone else handle this for you - talk to us :)

· 18 min read

When we explained the control plane approach for Infrastructure SaaS, most developers we talked to understood what SaaS control planes do and often said things like "oh! I built this many times! I didn't realize it is called a control plane." This reaction confirmed what we initially suspected - that everyone is building control planes, but no one is talking about it. If you are interested in SaaS control planes - sign up to our mailing list. We'll send you our latest content and reach out for a chat.

We also received a few responses like "why is this difficult?"

Legally Blonde meme - "What, like it's hard?"

When we started building Confluent Cloud, we thought that building a control plane would be easy. A few tables for users and their clusters, run everything on K8s... how hard can it be? A two-pizza team effort of 6 months seemed like a very reasonable estimate. We discovered otherwise. Recently, when we talked to other Infra SaaS companies, we learned that regardless of the company's specific architecture or size, building a successful Infra SaaS product required at least 25% of the engineers to work on the control plane.

Building a control plane sounds simple - but it isn't. If you ever said, "oh, I can do this in a weekend," only to keep struggling for weeks or months, you should be familiar with deceptively simple problems.

Chess only has six pieces and ten rules. It sounds like a pretty simple game. The DynamoDB paper is short, yet every implementation took years to complete. Copying data from one Kafka cluster to another sounds simple. No one expected it to require four rewrites in 7 years. Pick the simplest software engineering topic you can imagine - SQL, JSON, REST, PHP - and you will be able to find a blog explaining why this topic is rather complex.

tweet about "I can do this in a weekend"

In this blog, we'll look at the challenges waiting for the engineer who sets out to build a control plane.

As we described in the previous blog, the control plane is responsible for providing the SaaS capabilities, metadata management, and controlling the life cycle across all the data planes.

Building a control plane is a big job. In this blog post we'll discuss the many challenges that may not be apparent when you first try to estimate the effort involved. We divided the problems into two parts:

  • SaaS Flows: The problems that come up as you try to create a unified customer experience across many services in the control plane and the data plane.
  • Control Plane and Data Plane integration: The problems that come up when you need to send information from the control plane to the data plane and vice versa.

Different problems become relevant at various stages of building a control plane. An MVP for a control plane can be relatively simple and become more challenging as requirements add up. Having this map of challenges will help you understand the technical depth involved and the realistic investment level.

While this blog focuses on control planes that serve the customers of Infra SaaS, the challenges involved in building internal control planes are similar. We will address the topic of internal control planes in a future blog post.

Seamless SaaS Flows

Your customers signed up for your SaaS product because they need to get something done, and they will use the control plane to do it. They don't care about its architecture and all the services that are involved - they need it to disappear so they can focus on getting things done.

You can probably think of products where, even though the product is complex, getting things done is a seamless experience. The same concepts are used throughout the product, and you can't tell which teams own which microservices behind the scenes. Apple's iPhone, Stripe, and Notion all have a seamless user experience.

Compare this to AWS network configuration. All you want is to run a container that can communicate with the public internet. But you have to figure out how to configure EC2, security groups, load balancer, target groups, subnets, routing tables and NATs. Each with its own documentation. If you do it wrong, you won't have connectivity. But because each component is configured separately, it is nearly impossible to understand why the packets don't arrive correctly.

We use the term SaaS Flow to describe common user activities in SaaS products that interact with multiple services.

There are SaaS Flows that are standard in SaaS products - they interact with standard SaaS capabilities.

For example: inviting a user to an organization in a SaaS product is a SaaS flow - a single activity from the user perspective, but the implementation spans the authentication service, user management service, notification service and perhaps an access control service as well. You can see an example diagram of an invite flow below.

There are also SaaS Flows that interact with entities that are specific to your application.

Creating a new set of API keys that give access to the data plane database that was provisioned by a customer. Upgrading an account from a free trial to a paid version and updating the number of concurrent builds that can run in the data plane. Handling an expired credit card, deleting a user, deleting a tenant - all these are examples of one user activity that has to be handled across many services, some general (payments, notifications) and others product-specific (pausing all ETL jobs, updating data plane access policies).

There are hundreds of such flows in any SaaS product.

state machine for user invite flow

Every control plane is a SaaS Mesh - it is made of many multi-service SaaS Flows

User management, access management, audit, billing, notifications, provisioning, upgrades, metrics, logs... not to mention anything specific to your product. Every SaaS flow will involve several of these services, which means that they continuously need to exchange information, events, and actions. Each component has an owner, but who owns the end-to-end flow? Who makes sure that services send each other the relevant information and that each service handles its part of the flow in ways that fit with all the other parts? You can think of this as a SaaS Mesh - a seamless product that is generated from independent components and clear contracts between them. Or it can become a SaaS Mess, if the interfaces are not well defined and the dependencies are introduced ad-hoc.

As an example, think of a scenario where the credit card on file for an organization has expired. The organization has 15 users. Which one of the users in the customer's organization will be notified? How will they be notified? Will your sales team or support get notified too? Will the customer's clusters or jobs keep running? For how long? If the cluster is de-provisioned, will you keep storing their data? What about the list of users and their emails? Metrics? Once they update the credit card details, will every service resume its activity? Will the new card get charged for any of the lapsed time?

It is important to also handle all the failure scenarios in each of these flows - what if the notification service is down? What if Salesforce returns an error or throttles your requests? Is it possible to save the current flow state and try again later? Can you restart the flow from the beginning or were some notifications already sent?

note

SaaS Flows can be modeled with a state machine and each event or API request/response moves the system between states. This model helps you persist the latest state of the flow, so completed steps won't re-run but failed steps can be retried and the flow can continue. In addition, this modeling helps in monitoring the load and health of each flow.
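
A minimal persistence layer for such flow state might look like this (a sketch; the table and column names are illustrative, not a prescribed schema):

saas_flow_state.sql
CREATE TABLE saas_flows (
    id          BIGSERIAL PRIMARY KEY,
    flow_type   TEXT NOT NULL,  -- e.g. 'user_invite', 'card_expired'
    tenant_id   TEXT NOT NULL,
    state       TEXT NOT NULL,  -- the state machine's current node
    payload     JSONB,          -- whatever the next step needs to resume
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Flows piling up in one state are the clearest signal of an error
-- condition the flow did not take into account.
SELECT flow_type, state, count(*)
FROM saas_flows
WHERE updated_at < now() - interval '1 hour'
GROUP BY flow_type, state;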

SaaS Flows across control and data planes

Creating Seamless SaaS Flows that touch both control plane and data plane services is an even bigger challenge. This is especially true when a customer request encounters a failure in the data plane.

Balancing enough abstraction for a seamless experience when things go right, but enough details for meaningful debugging when things go wrong, is an engineering challenge in general. It becomes more difficult when the user interacts with a control plane but problems happen in the data plane.

Think of a scenario where you built a SaaS ETL product. A customer tries to define a new ETL pipeline through your control plane, but something in the process fails. If the failure is due to lack of compute on a K8s cluster, the control plane shouldn't show the exact K8s error, since your service abstracts K8s. But if the failure is in loading data to the customer's DB, showing the exact error will actually help your customer identify the DB issue on their side.

Example of an actionable error message in a SaaS flow:

is this error useful?

If the error is transient, it makes sense to retry the SaaS flow - starting from the point of failure. Does the control plane manage the retries by re-publishing the "create new pipeline" request repeatedly until it successfully completes? Does the pipeline itself persist the event until it is successfully acknowledged? Does the data plane store the in-flight requests locally until they complete? Each of these architectures has its own tradeoffs.

In cases where the user does interact with the data plane directly, we discovered that users' mental model is that all admin activity will still be available in one place and that there will be a consistent permissions and access model between the control plane and the data plane.

A user who just created a database in the control plane will expect to also be able to create tables, insert data into those tables, and run queries. The expectation is that the control plane is a single pane of glass that reflects all the data plane systems. It will be a non-ideal experience if they need to use two or three different tools for all those activities, and an even worse experience if the user who created the database doesn't have permission to create a table or to query the table that they created.

SaaS Flows that involve business systems

In addition to the control plane and the data plane, there are other parts of the business that have a relationship with customers.

Support teams will need a view of the current state of the customer's metadata - especially if there were any recent changes or errors. They will need to be able to dig into any relevant metrics or logs on behalf of the customer and perhaps even take action on the customer's behalf (after proper approvals).

Marketing teams may need metrics regarding the customer's engagement or specific activities they took (or did not yet take) in the product. And they may wish to tweak certain aspects of the product experience to drive growth in certain segments or personas.

Sales teams may need to know when the customer's usage passed a certain limit. They may also need to be aware of any serious incidents, SLA misses and significant planned maintenance that will affect their customers. And of course business analytics or data science teams will need access to all the usage, engagement, costs and revenue data in order to prepare dashboards for the executives.

A credit card expiration flow may have a step that updates the sales team via Salesforce, along with many other steps:

services involved in credit card expiration

All those business requirements indicate the need for ETL and reverse ETL between the control plane and multiple business systems - data warehouse, analytics store, marketing tools, sales tools, support systems, and so on. Those integrations also require monitoring and ideally should be part of the integration testing, so you can quickly catch any breaking changes.

When using 3rd party services - you still own the SaaS flows

Since SaaS control planes are large in scope, it makes sense to integrate with 3rd party providers for specific features such as payment processing, authentication, or transactional notifications.

Using 3rd party services helps Infra SaaS companies deliver value to their customers faster, but those customers still need seamless SaaS flows. External services can be part of these flows but the flow itself is still owned by the control plane developers.

Let's say you use a 3rd party authentication service. Authentication is solved, but information about users still has to exist throughout the control plane and even the data plane, since it is part of many SaaS flows. There is still a "user data store" and a "user service" that provides APIs and events to every other service that needs information related to users. All the issues we describe in this section are still problems that you own and need to address: designing SaaS flows, error handling, access management between control and data planes, testing and monitoring.

Trust but Test and Monitor

SaaS Flows have to be tested as a flow - coverage of each service alone leaves many gaps for customers to fall through. You will want an integration testing framework that allows you to test all the services, including the 3rd party ones. Testing the "reset password" API will require an environment with the authentication service, user management service and notification service.

It is also important to test all the cross-service APIs. You will want to avoid breaking compatibility between services when possible, and to know when compatibility was broken so you can come up with a deployment plan that involves all services that use the modified API. There are also APIs that were not meant to be used by other services, and yet they are. Breaking those undocumented APIs will break your application just the same. There are service mesh tools that can report which APIs are actually in use, and by which services - use those tools to understand which API contracts you need to maintain.

Make sure you collect detailed metrics about the number of users, payments, notifications or other entities in each step of the flow - a large number of flows stuck in a specific state will be your main indication that there is an error condition that your flow did not take into account.

Most SaaS Flows have implicit user expectations around latency - after clicking "reset password", users will expect the website to update in 100ms, the SMS to arrive in 30 seconds and an email to arrive within a minute or two. You will want to measure the latency of each step in the flow and queuing delays between steps.

diagram of spans in SaaS flow

Integrating Control and Data Plane

This is the core challenge of the control plane architecture. We reviewed the overall architecture in the previous blog, but here's the MVP version:

  1. Design the control plane metadata and use Postgres as your data store. With a REST layer over Postgres (e.g., PostgREST) and Postgres access controls, you have a minimal backend.
  2. Use 3rd party integrations where possible. This still requires effort, but it is a good start.
  3. Capture changes from the control plane that need to be applied on the data plane. With this architecture all changes are persisted to the database, so it makes sense to capture changes at the DB layer. This can be done with logical replication, Debezium, database triggers or a home-grown service.
  4. Deliver the events to the data plane: The most common pattern is to have the data plane poll the control plane for new events - this can be via API calls, direct database queries, or an event / messages service.
  5. Data plane services react to events independently, according to their own business logic.
  6. Data plane services update the control plane on progress and errors.

Once you implement all this, make sure you set up integration testing and monitoring.
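
As an illustration of step 3, a trigger-based capture into an outbox table might look like this (a sketch; the clusters table and all names here are assumptions, and logical replication or Debezium would replace the trigger entirely):

change_capture.sql
CREATE TABLE dataplane_events (
    id           BIGSERIAL PRIMARY KEY,
    entity       TEXT NOT NULL,
    entity_id    TEXT NOT NULL,
    op           TEXT NOT NULL,  -- INSERT / UPDATE / DELETE
    payload      JSONB,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at TIMESTAMPTZ    -- set once the data plane acks the event
);

CREATE FUNCTION capture_cluster_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO dataplane_events(entity, entity_id, op, payload)
    VALUES ('cluster', NEW.id::TEXT, TG_OP, to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cluster_change_capture
AFTER INSERT OR UPDATE ON clusters
FOR EACH ROW EXECUTE FUNCTION capture_cluster_change();

-- Step 4, polling variant: the data plane fetches unprocessed events.
-- SELECT * FROM dataplane_events WHERE processed_at IS NULL ORDER BY id;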

Beyond this simple architecture, there are additional challenges that result from the different dimensions in which the system can evolve.

Availability

If your architecture allows users to interact directly with the data plane, you want to make sure that the data plane availability is either completely decoupled from that of the control plane, or that the control plane, the data plane, and the pipelines in between are all designed for a higher SLA than what you offer your customers. If you opt for decoupling the data plane availability from that of the control plane, you'll probably end up with the data plane caching information from the control plane locally. It may sound simple, but keep in mind that cache invalidation is one of the two hardest problems in computer science.

Security

If you support enterprise customers, there will be interesting challenges around the security of the communication between the data plane and the control plane. They will need to mutually authenticate each other and the events themselves may need to be signed for authenticity. You'll likely need IP whitelists in both directions, publish approved port lists and support at least one private networking option, possibly more.

Some Enterprise customers may also want you to run and manage the data plane, or even the control plane in their VPC or their cloud vendor account.

You will need support for storing secrets in the control plane. It is very likely that your data plane will need to authenticate to customer resources in other SaaS, so you will ask your users for credentials - and the last thing you need is for those credentials to leak.

Scale

As the number of data plane service instances grows, you need to make sure the control plane can handle the case where they all attempt to connect to the control plane at once and retrieve updated state. This can happen as a result of an incident, a recovery plan, or a mismanaged upgrade. A meltdown of the control plane under this coordinated DDoS is not going to be helpful in any of these scenarios. A combination of good database design that minimizes hot spots and a good rate limiting protocol will help save the day.

Many Infra SaaS products have use cases that are latency sensitive. When the target latency is below 100ms, you have to avoid routing these operations via a central control plane (a regional control plane may be acceptable). The extra latency of the additional network hop will be meaningful, and the risk that the control plane will become a bottleneck is rather high.

Over time, as your product and business evolves, you may end up with multiple pipelines between control and data plane:

  • Metrics and logs are often sent from data plane to control plane, so they will be visible to customers via the control plane ("single pane of glass" is a common name for this).
  • There may be another system for fleet management and upgrades, one that is integrated with your CI/CD system but also with the control plane front-end and the notification service.

While those may be separate channels of tasks and information, it makes sense to view all those pipelines as part of a single "logical" control plane and standardize on the entities, events and APIs that these systems refer to. The reason is that as we discussed when we talked about SaaS Flows, customers expect a seamless experience with the control plane - not multiple control planes. They may want to subscribe to upgrade notifications or even configure a maintenance schedule. If the fleet management and control plane speak different languages, this integrated experience will be a challenge.

Upgrade flow

Reconciling state between data plane and control plane

Remember that things may happen on the data plane without going through the control plane first. This can be caused by the cloud provider decommissioning machines or upgrading K8s masters with surprise effects, or more often - it can be an engineer acting with the best intentions. Regardless of the cause, operating a system where the control plane has one configuration and the data plane has another is a recipe for failure. Your architecture must include plans for discovering and reconciling divergence.

Summary

It is easy to look at a control plane as "just a Postgres DB with some APIs and an auth service" and believe that it is simple to build and grow. However, even at its simplest, the control plane requires careful design, good guard-rails in the form of integration tests and comprehensive monitoring, and quite a bit of toil to build the needed integrations. Systems that look easy but turn out to be a significant investment are quite common in engineering. At the MVP stage, they require balance between keeping the scope minimal while still designing a flexible system that can evolve and address both customer requirements and operational pains. We will introduce more design patterns in later blog posts that will help you in designing and implementing such systems. Join our mailing list to get notified when we publish additional posts.

· 17 min read

A few months back, we saw a tweet about how every Infrastructure SaaS company needs to separate the control plane from the data plane to build a successful product. Reading this got us excited since we were working on a platform that would make this really easy. We would love to talk to you if you are already familiar with these patterns and are building an Infrastructure SaaS product.

Twitter-Snapshot

We spent the last six years at Confluent, helping transform it into a world-class Infrastructure SaaS company. We shared the same sentiment as this tweet - building Infrastructure SaaS products can be much simpler if we have a platform that helps develop a reliable control plane. Companies could save significant costs and time, and they could leverage their engineers to focus more on their core products. We thought it would be helpful to explain the end-to-end architecture of an Infrastructure SaaS product, the role of the data plane and control plane, and the problems that make this challenging.

What is Infrastructure SaaS?

Infrastructure SaaS refers to any infrastructure or platform product provided as a service. It includes data infrastructure, data analytics, machine learning/AI, security, developer productivity, and observability products. Sai Senthilkumar from Redpoint wrote an excellent article on this topic and how these Infrastructure SaaS companies are among the fastest-growing companies.

Infra-SaaS

Infrastructure SaaS companies invest in platform teams to build their SaaS platform. The platform teams are responsible for developing the building blocks needed to build a control plane. The investment in the platform teams continues to grow significantly as the product succeeds and is typically 25-50% of the engineering organization. Based on our experience building large-scale Infrastructure SaaS and talking to other companies, it has become apparent that platform investment is the highest cost to the engineering organization in these companies.

Data plane vs. Control plane - when do we need this?

Control planes are typically responsible for providing the SaaS capabilities, metadata management, and controlling the life cycle across all the data planes. The separation between the control and data planes is common when building an infrastructure SaaS product. There are a few reasons for this:

Infra-Relevant

Productize an open-source infrastructure as a SaaS product

Most open-source infrastructure projects start with only the data plane. The project authors realize that the next step is to productize the open-source infrastructure as a SaaS product. An independent control plane is ideal for achieving the SaaS experience and ensuring that the core open-source data plane is separate. The control plane will help manage multiple data planes across regions and cloud providers.

Building any proprietary Infra SaaS product

The open-source argument is pretty strong. However, the need for a control plane is not just limited to open-source infrastructure. It becomes a core need for any infrastructure SaaS product, either closed or open source. Almost all Infrastructure SaaS products need a central management layer that enables tenant management, user management, cluster management, and orchestration of all the data planes. The control plane provides a single pane of glass experience for the end-users, coordinating with all the data planes and taking responsibility for the overall life cycle management.

Data locality with customer location

With infrastructure SaaS, there is a general need to keep the data plane close to the customer location for a few reasons.

  • Cost
    The data transfer cost will be prohibitively expensive if the data plane is network intensive. You typically want to eliminate this cost by being in the same region as the customer. There are a few other networking options to mitigate this cost (a post for another day).
  • Security
    For enterprise customers, the data plane location depends on substantial compliance and regulatory requirements. Extremely security-conscious customers might want the data plane in their account to control access more tightly.
  • Latency
    Mission-critical infrastructure typically has low latency requirements. The data plane must be in the same region as the customer to ensure excellent performance.
  • High availability
    For high availability, you want to avoid connections to the data plane that cross geographies and to be more resilient to cross-region or cross-cloud network failures. In addition, a single data plane cluster may be hard to scale due to capacity reasons and would need to be sharded. It becomes much easier to scale the data plane by decoupling it from the control plane.
  • Multi-cloud
    Finally, supporting multiple cloud providers is becoming very popular. One model to support this would be to centralize the control plane in one cloud and deploy the data plane in different cloud providers for the same customer. There are more variants to this which we will look at later.

What does a world-class control plane need?

It would be helpful to understand what capabilities a control plane needs to support. These requirements will influence the architecture of an Infrastructure SaaS product.

Infra-Requirements

User, organization, and metadata management

Users and organization management are basic requirements for an Infrastructure SaaS product. User management includes authenticating users, managing users' lifecycle (add, invite, delete, update), and supporting user groups and third-party identity integrations. The control plane needs to ensure the access controls for a user are reflected on the data plane when the user lifecycle APIs are invoked.

Organization management, sometimes known as tenant management, includes supporting the organization hierarchy data model, applying quotas, SKUs, and security policies at an organization's scope, and end-to-end life cycle management. Multitenancy is a basic need for a SaaS application, and Infrastructure SaaS is not any different. For larger customers, organization management becomes pretty complex, including supporting flows to merge two or more organizations, suspending organizations, and implementing clean organization deletions based on regulatory requirements (GDPR, FedRAMP, etc.). The control plane needs to ensure tenant lifecycle management is reflected on the data plane as well. For example, when an organization is suspended, the control plane needs to ensure that the data plane cuts access temporarily.

There are many standard SaaS entities that a SaaS application needs - users and organizations are examples of that. At the same time, there is a lot of application-specific metadata. For example, an infrastructure product that lets users manage a set of database clusters could define metadata like ‘cluster’, ‘network,’ and ‘environment.’ The metadata needs to be defined, CRUD APIs need to be written to manage them, and their access needs to be controlled by the same security policies defined for users and the organization. The central control plane needs to be the source of truth for this metadata and support its management.
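
As a sketch of what such application-specific metadata might look like in the control plane's store (all names here are illustrative; orgs stands in for the tenant table):

app_metadata.sql
CREATE TABLE environments (
    id     SERIAL PRIMARY KEY,
    org_id INTEGER NOT NULL REFERENCES orgs, -- every entity is tenant-scoped
    name   TEXT NOT NULL
);

CREATE TABLE clusters (
    id             SERIAL PRIMARY KEY,
    environment_id INTEGER NOT NULL REFERENCES environments,
    region         TEXT NOT NULL,
    status         TEXT NOT NULL DEFAULT 'provisioning'
);

-- The same tenant-scoped security policies that protect built-in entities
-- (users, orgs) should apply to these tables as well.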

Orchestration and integration with data planes

The control plane should have near-instantaneous communication with the data plane - whether it manages a single data plane or hundreds of clusters across different regions and cloud providers. It needs to communicate and transfer data securely across the data planes and receive data back. The control plane needs to provide a single pane of glass view of all the metadata of an organization’s data plane. Pushing configuration changes, sharing application metadata, deployment, and maintenance operations are a few examples where the control plane needs to have the ability to orchestrate across the different data planes.

Lifecycle management of the data plane

One of the control plane's core needs is to manage the data plane's end-to-end lifecycle. Typically, this includes creating, updating, and deleting resources in the data plane. For Infrastructure SaaS, all these operations are asynchronous, and you want the control plane to manage the end-to-end flow. It needs to provide an excellent experience for the end-user when they invoke these lifecycle operations, ensure low latency and correctness in executing these async lifecycle operations, and be able to do this management at scale for hundreds to thousands of data planes across regions and cloud providers.

Security Policies

Security policies include access controls, quotas, and data governance for each organization and user. Access controls can range from simple permissions to complex RBAC support. Typically, these access controls need to apply to both the control plane and the data plane. When there are IDP integrations, the control plane may have to apply access controls based on what is defined in the IDP. Quotas are bounds on the set of operations that users and tenants can perform on the infrastructure. This is typically done to protect against denial-of-service attacks and build a healthy multitenant system. Quotas, similar to access controls, can apply to control plane and data plane operations. Data governance is increasingly critical for larger customers. Governance includes the ability to find data easily, store data in compliance with country-specific policies, and purge data based on data retention rules. For all the security policies, the control plane needs to ensure they are applied to all the data planes consistently based on tenant location, policies, and user rules.

Metrics, alerting, and insights

For infrastructure SaaS, users typically execute some commands on the infrastructure and would like to know how the execution is progressing. They might want to get notified or alerted when something is not going right or if it needs their attention. For example, a database product will have users execute a set of queries and look at query metrics to understand the response times, errors, and usage. Users may also want to get notified when a query starts failing. Users also need aggregated metrics across their databases in different regions and cloud providers. The control plane needs to aggregate all the metrics across the data planes, provide insights on the metrics to the users, and alert the customers on critical anomalies.

Subscriptions and Usage-based billing

Infrastructure SaaS has fundamentally changed how billing for SaaS works. Traditional SaaS is billed based on the number of seats/users. For Infrastructure, it is typically a combination of monthly subscription plus billing based on product consumption. For example, a specific infrastructure product might provide customers with three different SKUs with increasing value. The first SKU could be a free tier with limited quota, and the rest could have a base monthly subscription fee. In addition, the users would pay based on their usage of the product that month. Consumption could include throughput, number of queries, storage, number of compute instances, etc.

The control plane must be able to compute the billing based on the user SKU, monthly rate, and usage. The usage component must be calculated based on the metadata or usage metrics aggregated from all the data planes. The metrics and insights shown to the user (explained above) should match the billing data to ensure the user has a consistent experience.

Back office integration

SaaS companies must integrate customer metadata with all the back-office systems. When a user signs up, the marketing team will want the user information in their campaign tool to start including the new users as part of their marketing campaigns. In a sales-led company, the sales rep must create a new production account for the customer once the deal is successfully closed in their CRM. The customer metadata must be pushed to the Data Warehouse for business insights. These examples need a reliable pipeline that integrates data between the production database and the back-office systems. For Infrastructure SaaS, the control plane is the central customer metadata store in production. It needs to provide a reliable pipeline that integrates both ways with all the back-office systems. In addition, the data needs to be available in all the different systems within an acceptable SLA agreed by all stakeholders.

Infrastructure SaaS architecture

With our understanding of the control plane and infrastructure SaaS requirements, let us delve into the architecture of a typical Infrastructure SaaS product. We will review the basic building blocks and discuss a few architectural considerations.

Controlplane-arch

The basic building blocks of a control plane

SaaS fundamentals (aka The SaaS Mesh)

Infrastructure SaaS products need a world-class SaaS experience for their customers. It includes authenticating users and user management, organization management, providing a permission model for access controls, defining different SKUs for different product offerings, and the ability to bill based on subscription or usage of the product. These are basic expectations from end-users for all Infrastructure SaaS products. A basic version of all these features listed may be good enough for a free tier offering with some quotas, and they get complex as you serve higher segments (e.g., enterprise). For example, mid to enterprise customers may need to integrate with their identity management system for authentication instead of the default offering that the product provides.

There is also a complex interconnect between the SaaS features that we call ‘SaaS Flows’. The SaaS experience of an infrastructure SaaS consists of a bunch of SaaS flows. For example, when a user signs up for the product, you may also want to create an entry in a marketing tool to send campaigns. A more complex example of a SaaS Flow could be when a credit card for a specific organization expires. On expiry, you want to notify the customer a few times to update the credit card information. If there is no response from the customer, you might want to temporarily suspend the account, disable access to the data plane and eventually reclaim the account after waiting for a sufficient amount of time to avoid incurring infrastructure cost. This SaaS Flow example connects user management, org management, billing, notifications and the data plane. We call this interconnect between the different SaaS features a ‘SaaS mesh.’ SaaS mesh is needed to build the different SaaS flows. The SaaS flows include customer-facing experience, and back-office flows for the other stakeholders in a company.

Orchestrating the data planes

One of the core responsibilities of the central control plane is to orchestrate the different data planes. Typically a single customer could have applications or clusters in multiple regions or cloud providers. With more customers and data planes, a few things have to be supported -
  • Propagating the SaaS metadata to all the data planes
  • Pushing new configurations and application versions across the fleet
  • Defining maintenance windows and sequence of deployment based on customer priorities
  • Capacity management of all the data planes to ensure infrastructure is within the cloud limits

The control plane is the source of truth across all the data planes. Managing the data plane information centrally helps provide a single pane of glass experience for the end customers to access all the information about their applications or clusters.

Data plane management

The data plane is where the actual customer application or cluster is deployed. The deployment can happen on a Kubernetes cluster or directly on cloud instances. The cluster or application could be co-located on one Kubernetes cluster or separate. There is usually an agent running on these data planes that helps execute the local life cycle operation on the data plane based on the commands from the control plane. The agent, in a sense, acts like a mini control plane co-located in the data plane. Like any architecture, there are different ways to manage the data plane. Kubernetes, Terraform, or Temporal are tools that could be used to manage the lifecycle of each data plane.

Closed feedback loop

Any control plane architecture is not complete without a closed feedback loop with the data plane. As mentioned previously, the control plane is the source of truth about the current state of customer applications or clusters. The data plane needs to report the status of the operations to the control plane. In addition, the control plane would also want to collect application metrics and metadata to show insights to the users about the infrastructure.

Other considerations

Customer account vs. Fully hosted

In the fully hosted model, the data plane is deployed in the cloud account of the service provider. Due to compliance requirements, some companies and customers demand that the infrastructure be deployed in their own cloud accounts. Some Infrastructure SaaS companies need to support both models. It is possible to unify the architecture for these different deployment models, but deploying infrastructure in the customer account raises questions around the permission model in the customer account, billing plans (the customer gets usage cost in their cloud bill), support (who gets access), and development overhead.

Testing

Testing new data plane changes against the central control plane adds complexity. Mocking the entire control plane for end-to-end tests is not desirable since you typically have to make changes in the control plane to enable new data plane features, which need to be tested. A reasonable solution is to provide each developer their own local sandbox of the control plane with only their changes. It will help them to test their changes locally before pushing the changes to pre-production. Without a good testing strategy for the control plane, every change gets harder to stabilize as the product and teams scale.

Disaster recovery

An essential part of control plane design is to have a sound strategy when the control plane becomes entirely unavailable. From a user perspective, the data plane needs to be available even if the control plane is unavailable. In addition, there needs to be a plan to bring the control plane back up in the same region or another region (sometimes in another cloud). Restoring the data without any data loss is critical. You can provide a highly available service if you can bootstrap a control plane automatically from the backup data.

This is hard!

We plan to publish a post soon covering the complex parts of building and scaling a control plane. We have listed a few questions below that take significant time and cost to design and build for.

Controlplane-hard

  • How do you build the SaaS fundamentals for your product? How do you support the different SaaS flows?
  • How do you manage all the SaaS and application metadata to provide a single pane of glass experience for your users across all the data planes?
  • What mechanisms do you use to ensure the metadata changes are available to all the data planes?
  • How is everything designed to support multitenancy? How do you enforce metadata access, quotas, and SKUs at the tenant scope?
  • How does the architecture change when you orchestrate thousands of data planes?
  • How can you collect all the metrics and metadata from all the data planes to provide insights to the users, compute billing based on usage, and integrate with the data warehouse for business intelligence?
  • How can the control plane scale, and what SLAs do you provide?

Look out for a blog post that will discuss the challenging problems of building a control plane in more detail.

Building Infrastructure SaaS?

In the next eight weeks, we will release a series of blog posts describing different aspects of the control plane architecture for Infrastructure SaaS. We hope this will help all the companies building, scaling, or rearchitecting their control planes to provide their infrastructure as a service in the cloud.

We would love to talk to you if this problem sounds familiar to what you are tackling! We are building a platform that will make it easy to build and scale Infrastructure SaaS products. We hope to provide a world-class platform that Infrastructure SaaS companies can leverage to develop their control planes.