
· 20 min read

This blog is based on the speaker notes for my talk at Data Day Texas 2023, and improved with feedback and ideas from my colleagues Ram Subramanian and Ewen Cheslack-Postava. If any of these missing database capabilities resonate with you and you'd like to try Nile's database, please sign up to our waitlist.

Writing or talking about things databases don't do may sound a bit silly. There are obviously many things databases don't do. For example, I run many databases and not a single one of them made me coffee this morning.

Cup of coffee

It also sounds a bit useless. Knowing about things you can do is obviously useful - it helps you do things. But what can you do with information about things databases tend to not do? Other than build a new database, that is.

Forewarned, forearmed; to be prepared is half the victory

In this blog, I'll point out functionality that is very often needed in data platforms and that, more likely than not, you will need to build yourself because your DB won't handle it for you. Even though it really should. I know you will need to build all of this yourself, because I've seen it in almost every project I was part of over 20+ years.

You can take this list of things DBs don't do and use it as a checklist for your application or data platform design - make sure you don't forget about them, because in all likelihood, you will need them.

In some cases, there are DBs that do these things. You can use this list to guide your choice of a DB. These things are not key criteria for choosing a DB, but if you have several close options, they may tip the scale.

And finally, it may be an entertaining rant to read. There is some relief in knowing that you are not alone in asking yourself "Why do I still need to implement this in every project? Why doesn't the DB just take care of it?"

A Database can’t do everything - but a data platform could

Let's start with one thing that databases don't do but everyone wishes they did and almost every vendor claims that they do:

If you pick a database, there will be a set of use-cases that it shines in, a set of use-cases that it is ok at, and some stuff that it was really not built to do. A good analogy is that of the grain of the wood. The grain of the wood is the way its fibers are arranged. When doing woodwork, if you work with the grain of the wood, the work will be easy and the result will be clean. If you work against the grain, the work will be difficult, the results will be of poor quality and you are likely to hurt yourself in the process. In a similar manner, if you work with the grain of the database - work with its architecture the way it was intended - your work will be easier and the results will be of high quality.

Unfortunately, pretty much every database vendor, at some point, claimed that their DB is good for everything. It is very tempting to believe them! It sounds so much easier to have a single DB for relations, documents, graphs, 5-year financial reports, real-time truck locations in the app and the kitchen sink.

But the moment you dig into the details - who is going to run it, what are their priorities, how will they tune it, what hardware will it run on, etc. - you discover that, shockingly, different use-cases and different people have different needs, and one size really fits none.

Because one database can't be great at everything, we need data platforms. And for the most part, we need to build them ourselves for our specific use and vertical. However, this doesn't mean that we can't ask our databases to do a lot more - without going against their grain.

Things (Nearly) All Applications Need

Version Control for Schema Changes

What do you do if you need to change the database schema?

When I gave this talk at Data Day Texas 23, about 25% of the audience raised their hands when I asked "Do you just connect to the database with an admin tool and run alter table?". The other 75% likely use version control and migration scripts.

The migration script is a file with the DDL/DML of the changes you want to apply to the database schema. Developers typically test the changes locally, then post them to a source control system for someone to review. This usually happens in a branch and this branch can have many different changes related to a specific feature in it - both schema and application code changes. Sometimes the branch is deployed to a pre-production system for more validation, especially of dependencies and integration points. When it is time to take changes to production, the developer merges the branch with the changes and an automated system runs all kinds of tests before pushing this to production. Each engineering organization is responsible for all this automation - the deployment scripts, the tests, etc.
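To make that concrete, a migration script is usually just a small, versioned SQL file. Here is a minimal, hypothetical example (the table, columns and Flyway-style file name are illustrative, not from the talk):

-- V42__add_billing_address.sql (hypothetical migration)
-- Forward change: add an optional column so existing application code keeps working.
ALTER TABLE customers ADD COLUMN billing_address text;

-- Backfill so newer application code can rely on the column being populated.
UPDATE customers SET billing_address = shipping_address WHERE billing_address IS NULL;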

“Schema version changes from localhost to production”

Databases are only aware of the current version of the schema that exists in them at a particular moment in time, and have the capability to apply the DDL/DML in the migration script. Sometimes even without too much locking or downtime.

But this is where the database's assistance in schema migrations pretty much ends. Databases are not aware of compatibility considerations between the application code and the DB schema, so they will not help you catch changes that break compatibility.

In cases where you do need an incompatible change, you'll need a thoughtful migration strategy. It will likely involve additional columns, indexes and views to allow application code to work with both the old and the new versions until the migration is completed. Most databases don't help with those more involved migrations. Many don't even support checking whether a column or a view is still in use and discovering who uses it. Marking columns or views as deprecated to warn current or new users is practically unheard of.
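As a sketch of what a "more involved" migration looks like in practice (an expand/contract style rename, with hypothetical table and column names), this is the kind of multi-step work that is left entirely to you:

-- Expand: add the new column while keeping the old one in place.
ALTER TABLE orders ADD COLUMN customer_ref bigint;

-- Backfill, and keep both columns in sync while old and new application versions coexist
-- (in production this usually happens in batches, or via triggers).
UPDATE orders SET customer_ref = customer_id WHERE customer_ref IS NULL;

-- A view can keep presenting the old shape to readers that haven't migrated yet.
CREATE VIEW orders_v1 AS SELECT id, customer_id, total FROM orders;

-- Contract: drop the old column only after every consumer has moved to customer_ref.
-- ALTER TABLE orders DROP COLUMN customer_id;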

Ideally, a DB would also be aware of multiple schema versions, allow us to flip between them and warn us before we push something incompatible to production (i.e. the main version used by the most critical applications).

When I presented this at Data Day Texas, one of the audience members asked about the feasibility of doing this: compatibility is an application concern, so how would the DB detect compatibility issues?

One way to assess feasibility is to look at ways compatibility is already handled in less traditional data stores. There are multiple data lakes and event streaming systems that allow the developer to define the type of compatibility guarantee they want - forward, backward, transitive. Given this requirement, alter table operations can be assessed with known rules - dropping columns is rarely compatible, some type changes are compatible and others are not. This isn't rocket science. Marking columns as deprecated or tracking their usage is not beyond our current technical capabilities either.
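To illustrate with a couple of hypothetical ALTER TABLE statements, the rules a database could apply are mostly mechanical:

-- Usually backward compatible: widening a column's type, existing readers keep working.
ALTER TABLE events ALTER COLUMN payload TYPE text;

-- Rarely compatible: any reader that still selects this column breaks.
ALTER TABLE events DROP COLUMN legacy_id;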

More complex problems that are harder to detect automatically in the DB itself, like semantic changes, can be addressed by integrating the DB-level schema versioning with CI/CD and the various levels of tests that automated releases typically involve.

Once we go in this direction, there are more exciting possibilities. For example, running multiple schemas in parallel on the same data, perhaps using schema-on-write capabilities. This is great for both A/B testing and more gradual rollouts.

Tenant-Awareness

Production databases tend to contain data that belongs to someone else - your customers and users. And you typically need to isolate them from each other. You absolutely need to guarantee that no customer will ever see someone else's data - this is critical, and no mistakes are allowed. Messing up this part will cost you your customers' trust and may put the business at risk. But there is a bunch of other capabilities you will likely need.

“Tenant Awareness Capabilities”

All these capabilities are important, and it would take a separate blog or talk to get into all of them in detail. Independent upgrades and restores, for example, are something few companies think about in advance, and are hard to implement when they are actually needed. About a year ago, Atlassian had a multi-day outage for some customers because they needed to roll back a change for a subset of their customers, and they had to go through a very manual process.

Since multi-tenant applications are so common, there are design patterns on how to model tenants in a database. The common models are either Isolated, where each customer gets a separate database, or Pooled, where a single database is used for all tenants and each table has a column that maps to the tenant that owns the data. Generally speaking, the pooled model is more cost effective and scalable. But because databases don't actually do anything to support this model, it requires building a lot of the tenant-aware capabilities yourself and hoping no one ever misses a "where" clause. The isolated model gives you all the tenant-aware capabilities "for free" but is challenging to scale effectively.

“Pooled vs Isolated tenant models”
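To make the pooled model concrete, here is a minimal sketch (the invoices table is hypothetical): every table carries a tenant identifier, and every single query has to remember to filter on it.

CREATE TABLE invoices (
  id        bigserial PRIMARY KEY,
  tenant_id uuid NOT NULL,
  amount    numeric NOT NULL
);

-- Every query must be scoped to the current tenant...
SELECT id, amount FROM invoices WHERE tenant_id = $1;

-- ...because forgetting the "where" clause silently returns every tenant's data.
SELECT id, amount FROM invoices;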

When you start with the pooled model, after a while you outgrow the capacity of a single DB, or perhaps some of your customers require better isolation, better performance or better availability. So you end up sharding your database and placing different tenants on different shards. Depending on the extent to which your database supports sharding, you may or may not need to manually build a layer that determines the right shard for new tenants, directs each request to the right shard, adds new shards when needed and balances load between shards.

“Sharded DB”

Democratizing Change Events

If you execute select * from my_table on your database, you only get the latest state of information in my_table. The records that exist right now. But DBs also have a record of all the changes that happened in the DB. All the inserts, updates, deletes. It is called a change log, redo log, write-ahead log or a binlog - depending on your database.

The historical record of changes in a database is very useful information, and you can build a lot of useful applications by listening or subscribing to change events.

“Use-cases for data change history”

Some databases let you perform some type of "time travel queries" where you run a query against an older state of the database. But this isn't the same as accessing the change events themselves, or getting notified about them when they happen. For change event use-cases, you would typically use Debezium to capture changes from the database, stream them to Apache Kafka, and from there use a combination of Connectors, Stream processing and Kafka clients to get the right changes in the right format to all these use-cases.

As you can see, those architectures aren't the simplest to build and maintain. They include multiple new components to learn, configure, monitor and troubleshoot.

In addition, in order to query the change history or listen to new change events, there are new clients and APIs that developers need to learn, in addition to the original database. This new ecosystem is not equally accessible in all languages and models. Web UI and FE developers in particular lack ways to access these real-time events, which is a shame because real-time updates are such an important UI feature.

Of course all those components also have their own security and access models. Remember the issue we mentioned earlier about isolating access to data between tenants? You now need to solve the same problem in the change capture system as well.

This pattern is so common and so many companies have implemented similar architectures... why is this not a normal part of the database?

Soft Deletes

You must have implemented this a million times when building apps. When a customer clicks “delete”, you don’t actually delete anything. Why? Because there’s a good chance that in the next 10 minutes, hours or days, they will call support and ask to get it back. Recovering the entire DB is a bit messy.

So instead, we have a field called “deleted” or “delete date” and when the customer clicks "delete", we update it. If the customer later calls support, we update it back. Much quicker and easier than a database recovery.

In many cases we do eventually need to delete the data for real. GDPR typically gives us 30 days to actually delete things. So we also write a process that does the real cleanup.
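A minimal sketch of the pattern (with a hypothetical documents table): a nullable deleted_at column, an UPDATE instead of a DELETE, and a scheduled job that does the real delete within the retention window.

ALTER TABLE documents ADD COLUMN deleted_at timestamptz;

-- "Delete", from the user's point of view. Every normal read now needs
-- a "WHERE deleted_at IS NULL" filter.
UPDATE documents SET deleted_at = now() WHERE id = 42;

-- "Undelete", when support gets the call.
UPDATE documents SET deleted_at = NULL WHERE id = 42;

-- The cleanup job: the real delete, after the retention window.
DELETE FROM documents WHERE deleted_at < now() - interval '30 days';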

Now, you'll notice that I wrote that we have all implemented this many times. So, why do we still do it? Why can't databases add SOFT DELETE functionality? And maybe even add the cleanup job?

The annoying thing is that databases already do this internally. You've heard of "compaction", "garbage collection" or "vacuuming". These refer to a similar process - some DB changes involve a quick part that happens when you run the command and a part that happens in batches later. This design adds a lot of efficiency to many processes - especially deletes. So why not give all of us the same benefits?

APIs (In addition to SQL)

Databases, for the most part, talk in SQL. Pretty much every database that does not start out with SQL eventually adds SQL. I've been through this at Cloudera, when we added SQL on Hadoop, then at Confluent when we added KSQL, and I now see the same demand in my own company. SQL has been the lingua franca of data for the last 40 years, and it seems well set to continue being that language for the next 40 years.

SQL is a great data language. But it is a terrible programmatic API. Why? Because it isn’t composable! Allow me to demonstrate.

Let's say that you have a table with user profiles, including their age and gender. Counting the number of users by age is a simple SQL query:

SELECT age, count(*) FROM people
GROUP BY age

And the same operation in Scala (with Spark's DataFrame API):

val ageGroup = people.groupBy("age")
ageGroup.count().show()

I'd say both are fairly readable, perhaps SQL a bit more so. But now let's say that we are working on a different method in the same module, one that requires counting only women.

In SQL, you'll write:

SELECT age, count(*) FROM people
WHERE gender = 'female'
GROUP BY age

Note that you can't really reuse the existing SQL or method. Either you write a new query, or if you really want to reuse another method, you need to do some creative string manipulation.

Meanwhile in Scala:

val ageGroup = people.groupBy("age")
val femaleOnly = people.filter(col("gender") === "female").groupBy("age")

ageGroup.count().show()
femaleOnly.count().show()

The data APIs make it easy to take parts of the query and add more filters. Unlike SQL, these APIs are composable. And composability is key for reuse, minimizing redundancies and maintainability. This is why most developers use ORMs, even though there are many reasons not to use ORMs. SQL is not developer friendly.

If databases had native, optimized developer APIs, we wouldn’t need to trade off between maintainable code and performance.

Modern Protocols

The mismatch between databases and modern engineering goes deeper than just SQL. There is a big mismatch between the protocols applications use and the protocols databases use. And every project includes some engineering effort to work around this gap.

On one hand we have the database, which holds all the data, typically has its own binary protocol and uses LDAP or Kerberos protocols for authentication.

On the other hand, we have the browser. The browser needs to get data, so it can show it to customers. Browsers speak various HTTP-based protocols and typically authenticate with OAuth2 or SAML.

The mismatch seems pretty obvious - a system that only speaks HTTP needs to get data from a system with a native binary protocol. The solution is so obvious that we've been implementing it over and over for decades now - you write a backend.

The backend typically has a lot of important business functionality, but it also serves as a translator. It accepts HTTP connections from browsers on one side, and has a connection pool to a database on the other side. A lot of what a backend does is receive REST requests over HTTP, translate them into SQL and send them to the DB for processing, then get the result, translate it again and send it back. We all wrote a lot of code that basically does that.

“Backend serving as a translator between browser and DB”

In the best cases, this adds value. Maybe we abstract a bunch of internal DB details nicely. But in too many cases, there isn't much to abstract and the API pretty much exposes the table as-is. One use of GraphQL is to auto-generate APIs from a DB schema. I suspect that GraphQL is so popular because developers tend to feel a bit silly writing another bit of translation code whenever a product manager wants another button in the UI. I'd argue that in some cases, it even introduces risk.

One way the backend can add risk is authentication. Users authenticate to your website, presumably with a standard secure protocol like SAML. But the DB doesn't know about this user; the DB typically has service-account type users that identify entire applications. So any kind of access control needs to be enforced by the application or with custom code in the database.

To make things a bit worse, it is a controversial but common practice to have more than one backend working with the same database. In these scenarios enforcing security in each backend introduces redundancy and even more surface area for problems.

If browsers could talk to the DB and users could authenticate to the DB with HTTP-based protocols, we'd have less low-value backend code to test and maintain, and the DB could enforce data access directly, reducing the surface area for access control issues. Combine this with a DB that is actually aware of tenant boundaries, and you get stronger guarantees for less effort.

DB Features from the Future

Global Databases

These days, even tiny companies find themselves with a global business. As a result, keeping all the data in one datacenter is no longer a feasible plan. Both data-related regulations and the low latency that users demand from their SaaS applications require globally distributed databases.

Typically this global distribution translates to having relational databases in 2-5 main datacenters, with selective replication between them. In addition to DBs in main datacenters, it is common to have a much larger distribution of caches that speed up performance of read operations.

“Global DB with main datacenters and caches”

Databases these days typically have several types of replication built in, which helps a lot in setting up the selective and async replication required between main datacenters.
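In Postgres, for example, the built-in piece might look like this sketch of logical replication (table and host names are hypothetical) - the database gives you the plumbing, but deciding which tables may leave which region is still your job:

-- On the source datacenter: publish only the tables allowed to be copied.
CREATE PUBLICATION shared_reference_data FOR TABLE products, pricing;

-- On the destination datacenter: subscribe to that publication.
CREATE SUBSCRIPTION eu_to_us
  CONNECTION 'host=eu-db.example.com dbname=app user=replicator'
  PUBLICATION shared_reference_data;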

Complying with regulations, on the other hand, is 100% on you. Which data are you allowed to copy? Who is allowed to access which data? How long are you allowed to keep it? You need to implement all those rules for each location in which you have either data or users.

When it comes to caches, developers are also left to their own devices. Keeping caches up to date while minimizing traffic is a challenging problem. And one that database vendors have only solved within the database.

In an ideal architecture, applications will not even know if they are talking to a cache or a DB. They will use the same clients to read/write and the data will be fresh when needed. The clients and infrastructure will route them to the best location.

Intelligent and Adaptive Databases

Tuning a database isn't easy. Figuring out which indexes would improve performance and which indexes would slow you down is an important full-time job in many companies. The indexes themselves are a series of tradeoffs - each database has a small collection of index types, each with its own benefits, problems and a lot of folk wisdom on when to use it.

We are starting to see proofs of concept for databases that use machine learning to adapt to the access patterns and the data stored in them, and create indexes specifically to make the most common access patterns smarter. One example is a paper from Google published in SIGMOD 2018, titled "The Case for Learned Index Structures". Later research introduced additional index operations, join optimizations and query plan optimizations. It will be exciting to see a generation of data engineers spend more time building data products and data platforms that serve their users, and less time tuning indexes.

An area where I didn't find a lot of research, but which seems to have a lot of potential for intelligence, is capacity prediction. Can the DB predict its own growth with high accuracy? If it can, perhaps it can also provision its own future capacity directly from the cloud provider and either migrate itself to a larger server or balance its load to new nodes in the growing cluster. These capabilities would give us a "serverless" experience because the DB would handle all those pesky servers itself.

Summary

This was just the tip of the iceberg as far as missing database features go. One could include tiered storage, better query optimization methods, more usable backup/recovery methods and built-in cronjobs.

While we shouldn't expect a single DB to be a perfect fit for any situation, we can and should expect databases to do a lot more for us.

Almost all the "missing features" that I mentioned here are necessary in most projects, and many wouldn't require a complete re-architecture to introduce as an option in an existing DB. I believe the reason every developer implements these features again and again is that we've resigned ourselves to that reality.

Jeff Bezos famously said: "One thing I love about customers is that they are divinely discontent. Their expectations are never static – they go up. It’s human nature."

Technology keeps improving because users always ask for faster and better products. And when it comes to databases, we are the users. If we don't ask for more - we won't get it. In an age where GitHub Copilot can write our boilerplate code and ChatGPT can write our CV, we can't just accept that soft delete, version control or reasonable data APIs are impossible.

Feel free to download the full slide deck that I presented at Data Day Texas. If any of these missing database capabilities resonate with you and you'd like to try Nile's database, please sign up to our waitlist.

· 6 min read

Last week, I saw Gunnar Morling post a good question on Twitter: "Separating storage and compute" vs. "Predicate push-down" -- I can't quite square these two with each other. Is there a world where they co-exist, or is it just two opposing patterns/trends in DB tech. ?

It is the kind of question I love because there are probably many developers who are wondering the same thing, while also many developers who feel the answer is obvious. This is because the words "storage", "compute" and "separate" mean different things to different people. So let me clarify things a bit.

Let's start at the beginning - a single machine with local SSDs was pretty common for running DBs in the early days of the cloud. In this scenario, compute and storage are definitely not separate. In the case of a MacBook - I can't even open the box!

Compute-Storage Separation

Today, both in data-centers and the cloud, you typically have the machine the DB is running on, and then the storage on separate devices - Netapp, EMC, EBS, etc.

Note that all these storage devices have compute. They are not just a disk with a network card attached.

So, did we separate compute and storage? We did, in two important ways:

  1. Compute and storage nodes have different workloads and can be optimized for them. Compute nodes use their CPU for DB things - parsing queries, optimizing, hash joins, aggregation, etc. Storage nodes use CPUs for storage things - encryption, snapshots, replicas, etc.
  2. They fail independently and have different reliability guarantees. In a single box, if the machine dies, you can lose the DB and the data and everything. Hope you have backups. If compute and storage are separate and a compute machine dies, you can provision a new machine and connect the old disks. The storage cluster is typically built for extreme reliability so data is never lost. The compute cluster can be designed for your availability SLA - do you need a hot standby? Or can you wait to start a new compute node and connect it to the storage?

But you can take this concept even farther. Compute-storage separation is a key step toward being able to scale them independently. Storage solutions today are very scalable. DB compute, on the other hand, is more challenging to scale out, so you often scale up.

Scaling up works extremely well and is an often underrated technique - until it is no longer possible. At this point the usual pattern is to shard. Unfortunately, most DBs aren't aware of shards and therefore everyone needs to write their own sharding logic.

In an ideal world, you'll be able to scale by adding DB machines and dynamically + transparently move workload between them. If storage is tightly coupled with each DB machine, then adding machines means copying data around, which takes time and resources. Decoupling allows for minimal data movement when expanding the compute layer.

Hopefully I convinced you that separating compute and storage is a good thing even in its simple form, and that it opens the door to elasticity. But what about predicate pushdown? Can we still do it? Or is it an opposing pattern?

Adding Predicate Pushdown

Let's take a simple form of storage/compute separation - single DB machine, and its data files are placed on a storage cluster.

As we already said, storage clusters have compute and they use it for stuff like encryption, compression, snapshots. If you have a general purpose storage cluster, like EBS, this is pretty much all you can do.

But what if your DB has a storage cluster that was built specifically for the database, in a way similar to the AWS Aurora architecture? In this case each storage node has some of the data, but not all of it, and a processor. What can you do with those?

It turns out, quite a bit. If you are familiar with the map/reduce model, anything that you can do in a mapper, you can do in these storage nodes - filter, mask sensitive fields, modify formatting, etc. But it turns out that filtering is especially powerful.

The network between storage devices and DBs can become a bottleneck after storage and compute are separated. This can be a very difficult bottleneck to solve. If you reach the limits of your network, adding more compute nodes or more storage nodes won't necessarily improve the DB throughput, and you have to start sharding the DB cluster in order to limit cross-node traffic. As we mentioned earlier - unless your DB was built for sharding, this requires implementing custom logic.

With predicate pushdown, you send the query "where" clause down to the storage cluster. Each storage node filters the data and only sends a subset over the network to the compute layer. The difference in network traffic is meaningful and allows the system to avoid the network bottleneck. This solution is extra nifty because it uses the strengths of the architecture - the fact that storage has its own compute - to solve the bottleneck that the architecture created. A bit of a judo move.

Predicate Pushdown as a Design Pattern

The predicate pushdown pattern is useful in many other systems. Between frontend and backend, between a service and its DB, and between microservices - by designing APIs that allow you to send filters to the system that holds the data and only get back the data you really need, you make the system faster and more scalable.

If you build the system with predicate pushdown in mind, a user clicks something in the webapp UI, the request with the filters is sent to the backend, maybe to another service, then to the DB, then to the storage - the storage applies the filters, and only the minimum required data is sent all the way back to the user. This is a real end-to-end system optimization.

So, hopefully you learned about the different meanings of compute/storage separation, why storage still has compute and why storage/compute separation doesn't conflict with predicate pushdown - in fact they are better together!

· 5 min read

Every Thanksgiving, I take time to reflect on everything I have to be thankful for - looking at the past year and taking the time to reflect on my blessings. This year I am especially appreciative of the help and support of an unusual (for me) number of people. I didn't realize until now how many people are part of Nile's journey, and how grateful I am to all of them.

Hopefully this post doesn’t miss anyone who deserves my thanks, and please don’t read too much into the order, because I’m having trouble figuring out where to even start.

Thank you

The first customer discovery call for Nile took place in September 2021 with our friends at RocketLane. Since then Ram and I have talked to over 100 companies - the smallest had just 2 founders and was founded days before we talked to them, the largest was a 40-year-old public company with 25,000 employees. We talked about their journey, their challenges and future plans - and we learned so much from every conversation. We are thankful to everyone who spent an hour or more on the phone, in coffee shops or in their garage. We are especially grateful to everyone who said "why would we want this?" or "Interesting idea, but I can't see us using this". I know that saying anything negative isn't easy or natural, but all feedback was useful in helping us carve our product direction.

From those who talked to us, some became the early users and design partners of our product. They asked smart questions, gave us concrete feedback, discovered a bug or two, shared ideas and even contributed a pull request. They inspired us with the problems they are tackling and the products they are building. We know that investing time and effort in a brand new product isn’t an easy decision to make, and we are doing our best to repay this generosity by rapidly delivering a world class product to everyone who took this step.

A year ago, Nile was just the 3 of us - Ram, Norwood and myself - and now we are eight Nilians (Nilers? Nilists?). It is always a big leap of faith to join a tiny startup with barely any users or product, to build the MVP and help shape a future. It requires self-confidence, trust in the team and belief in a possibly crazy idea. Everyone who joined us had other options - join a FAANG, join another startup, start their own company or perhaps retire early. We are grateful for their decision to join Nile and even more for what everyone gave Nile - their creativity, passion and effort.

Of course, we couldn’t have hired these wonderful humans without funding. As part of our fundraising efforts, Ram and I talked to many investors - from the large and famous VCs to individuals who wanted Nile to be their first angel investment. We appreciate the time of everyone who talked to us. We appreciate the candid feedback from those who heard us out and said “not a good fit”. We really appreciate those who heard us and said “oh yeah, this is a great idea and we want to be part of this journey”. Some investors went above and beyond in introducing us to their portfolio companies, or in sharing their research with us, and we are grateful for this. We are especially grateful to everyone who ended up on our cap table with investments large and small. We appreciate how you believe in our vision, the time you spent with us and all the support.

Over the last year, we’ve built not just a company but also a community. The SaaS Developer Community grew to over 1500 members. We have a Slack, a YouTube channel, a podcast and a bi-weekly newsletter/blog. Slowly but surely, our vision for a place for those who are building SaaS products to share their experience is becoming a reality. I met great people through this community and learned quite a bit. I am grateful to everyone who contributed to our Slack discussions and to everyone who joined me for the YouTube / podcast recordings.

The SaaS Developer Community is not the only community that I’m part of, and I’m grateful for the support I got from other communities as well. Communities that I was already part of were very supportive of my new direction, and I was also invited to join several new communities that quickly became part of my daily routine - a place to give and receive help and support. I know what it takes to create and run a great community, and I appreciate the community admins, members and those who invited me to join.

And of course, a list of gratitude can't be complete without sharing my deep appreciation to my close friends, family, and especially my husband. Those closest to me were more supportive than I can ever thank them for - patiently listening to me talk endlessly about Nile, reassuring me that I'm not doing anything too crazy, making sure I get breaks, workouts, sleep and food, and hanging out with me when my head is so full that I can barely talk at all.

For the last year, every evening I asked myself whether I did my best today, worried that I’ll fail all those people who believed in me and supported me, hoping that my best was enough.

Saying thank you is not enough, but today, it is a good start 🙏

· 18 min read

An infrastructure Software-as-a-Service (infrastructure SaaS) enables users to self-serve infrastructure and get it with minimal effort, without buying the underlying infrastructure themselves. If an infrastructure SaaS offering isn't more compelling than its competitors (who may be other infrastructure SaaS companies offering similar services that are easier to use) or cheaper than open source projects (that may be self-managed but free), then people just won't use it. So a viable infrastructure SaaS company has to do it better and deliver it faster. (After you finish reading the blog post, talk to us if this problem space sounds familiar.)

Let’s consider a hypothetical infrastructure SaaS company: an ETL as a service that for its most basic service enables users to deploy jobs that connect to a source system, read data from it, store it, and write it to some other destination. The end users own their end systems, and the SaaS owns the underlying pipeline infrastructure. A generalized user workflow may look like this:

  1. Provision a test job through a web UI with minimal clicks: provide end system connection information and credentials, and go!
  2. Get notifications of real-time updates on the status of the pipeline: connected to database, volume data copied, last read timestamp, etc.
  3. Provision production jobs programmatically through an API and integrate it with their own CI/CD
  4. View existing jobs of teammates in the same organization
  5. Monitor a pre-built dashboard of relevant job metrics that show pipeline performance
  6. Get billed based on consumption of resources

Infra SaaS architecture

The architecture usually comprises a control plane to manage the user and ETL metadata and to integrate with the data plane that actually does the job execution. Developing this kind of infrastructure SaaS product can have a lot of complexity in lifecycle management and codifying best practices. (Watch Infrastructure SaaS - A control plane first architecture for a deep conversation with Ram Subramanian and Gwen Shapira on SaaS control planes.) These SaaS architectures have to solve a lot of really hard problems, including:

Since a new SaaS company will have strong pressures to find product-market fit and to deliver something as quickly as possible, sometimes shortcuts are taken on some of these problems in order to avoid delaying product launch. But these shortcuts can be 10x more expensive than addressing them the right way in your architecture from Day 0. They eventually catch up to you because shortcuts expose even bigger security/scalability/velocity problems that are more expensive to resolve, and customers always demand them anyway. So addressing these solutions early on enables you to keep moving fast. In this blog post, we will use self-serve provisioning in the ETL infrastructure SaaS example as a way to explore these problems more in-depth.

Self-serve in Infra SaaS

Multi-tenant control plane

A user starts by logging into the frontend UI (or maybe API) and sending a request to self-provision a pipeline. The backend receives this request and persists it to a database. Recording user CRUD events in a database ensures that there is a source of truth representing the desired state. Having a database on the backend is a basic thing that all SaaS companies do, but it’s not as simple as persisting the ETL job info directly into a single database and moving on. A critical security concern for cloud companies is isolating tenant data to ensure that tenants don’t see data from another tenant that they’re not supposed to. Companies don’t win customers unless they have tenant isolation to secure customer data. Tenant isolation is also important to ensure that one "noisy" tenant who might be creating a new job every second does not cause delays in provisioning for other tenants.

Record request in a database

Since all control planes have a database, achieving organization awareness and tenant isolation raises questions around database design, schema management, and multi-tenancy. Is it one monolithic database? What is the tenancy model? Is it soft isolation or hard isolation? What are the APIs between the applications and the database? How does it scale up as the company grows, etc?

Many infrastructure SaaS companies make a business decision for time-to-market reasons to have hard isolation because they think it's faster than implementing soft isolation. Although it makes some tasks like backups and migration easier in the short term, it tends to come with higher cost, poor scalability, and negative impact on tenant onboarding because new infrastructure needs to be provisioned. It also doesn't anticipate surprise customer demands ("hey, we just merged with another company…").

Soft isolation can be at various levels of infrastructure and resources, but it often starts in the database. For example, some databases offer row-level security (RLS), a mechanism that includes a tenant identifier in the schema for every table. This intrinsically isolates tenant data from each other because data retrieval is based on the tenant identifier. This is a bit of a simplification, but the idea with RLS is that the application code is simplified and can make a call without validating user permissions first:

// Without RLS: code must check tenant membership
if (dao.isOrgMember(userId, orgId)) {
  dao.updateOrg(updatePayload);
}

// With RLS: make the call, RLS restricts response to tenant data-only
dao.updateOrg(updatePayload);
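At the database level, such a policy might look roughly like the following Postgres sketch (the jobs table, tenant_id column and app.tenant_id setting are hypothetical; the application would set the setting after authenticating the request):

ALTER TABLE jobs ENABLE ROW LEVEL SECURITY;

-- Each row is only visible to the tenant identified by the per-connection setting.
CREATE POLICY tenant_isolation ON jobs
  USING (tenant_id = current_setting('app.tenant_id')::uuid);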

There are some complexities in soft isolation that need to be considered. When resources are pooled, how do you do tenant data recovery ("I accidentally deleted all my data, can you please restore to 6 hours ago?") or data migrations for end users ("hey, can you please move all my data to eu-west-2, and by the way, there are new regulatory requirements there")? It may also be a sharded multi-tenant solution where tenants are distributed across multiple databases to better distribute load, but that raises additional operational issues. While implementing multi-tenancy isn't a small feat (we barely scratched the surface), from a business standpoint, it's a day 0 security requirement and should be done well to simplify management, scale up, and get cost efficiencies.

Event handling and synchronization with the data plane

The core infrastructure SaaS offering is the set of resources which encompasses the product, deployment configuration, customer data, data processing, and supporting infrastructure. This is the data plane. When a user provisions a new ETL pipeline, they get a slice of underlying shared or pooled infrastructure that is managed through an automated deployment platform like Kubernetes, or whatever the data plane platform of choice is. It might spin up the right connectors to read and write data per the pipeline specifications.

After the user request is recorded in a database, some service needs to detect and fulfill each request, reconciling it with the data plane so the actual state in the data plane matches the desired state in the control plane. How do you keep track of the changes in the database? How do you process them in order and only once?

Synchronize to the data plane

This is typically done by implementing something like an Apache Kafka pipeline with a CDC connector that streams changes from a database, but there is overhead to build and maintain those solutions. Other solutions tie directly to the database: database triggers can be configured to fire when there is a data change made to a table. Or, flipping the responsibility to the data plane, an agent can run every few minutes to query the database for new or updated rows derived from a flag column in a table. Triggers may initially seem simple enough to implement, but they have their own overhead and they couple logic to the database itself, which can eventually result in unmanageable complexity. What happens if a schema changes, or if there are multi-statement transactions across more than one table, or multiple databases that require additional cross-database and privilege management?
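For reference, the trigger approach is typically a small amount of SQL - a sketch like the following (the jobs table and channel name are hypothetical), with delivery guarantees, payload limits and error handling still left to you:

CREATE OR REPLACE FUNCTION notify_job_change() RETURNS trigger AS $$
BEGIN
  -- Push the changed row to any listener on the 'job_changes' channel.
  PERFORM pg_notify('job_changes', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER job_change_trigger
  AFTER INSERT OR UPDATE ON jobs
  FOR EACH ROW EXECUTE FUNCTION notify_job_change();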

Actually, an events pipeline or database trigger is an implementation detail—what the developer wants at the end of the day is to focus on the events themselves, not the pipeline. So if the implementation of an event service can be abstracted away with a robust API, then the developer can just call a method to listen for events as they happen. They can spend their time thinking about event processing instead of the pipeline details of brokers, connectors, topic design, partition count, choosing keys, etc. (Note that even with an event service that abstracts away the events pipeline, it doesn't obviate the need for data plane management of its own infrastructure and controllers, cluster resourcing, scheduling, healthchecks, load balancing, etc.) An event service baked directly into their data platform also serves as an audit log which captures history of changes made to the service, who initiated those changes, etc. So when a user requests to provision a new ETL job, it generates an event, and an application listening for events receives it and can take appropriate action, synchronizing the data plane to the desired state:

events.on({ type: entityType }, async (e) => {
  // received an event…
  if (e.after.deleted) {
    // ...destroy resource & update status
  } else {
    // ...create resource & update status
  }
});

The event service also needs to be resilient to any types of failure because they aren’t “if” scenarios, they are “when” scenarios. Whether there’s a failure in the data plane itself, orchestration, cloud provider, event service, etc., desired state needs to be persisted so that whenever a system recovers from a failure, it can act on the user requests. An event service also helps in these cases because a system can pick up listening to events where it left off.

Eventually, the resource is provisioned and the data plane can update the source of truth with its latest status. The platform records the status back into the database and then the service sends a notification to the end user about the pipeline status.

This workflow demonstrates how events can be used for synchronizing the data platform to the control plane. But an event record can potentially contain a lot of detailed information about different resources and state changes. Coupled with filtering to consume at various levels (per tenant, entity type, or specific instance), it makes an event service flexible for different scenarios. It could be used to process updates to any type of entity, like acting on authentication tokens changes, invalidations, etc., or a real-time messaging application that distributes messages to appropriate users or channels. An event service that aggregates data plane and control plane events and provides a great interface for delivering the events makes it useful for deployment, notification, or troubleshooting any aspect of the infrastructure SaaS. For example, Datadog Events provides this kind of rich experience for an event service, programmatically through a browser or Datadog Events API.

Datadog Events API

Metrics and consumption-based billing

Infrastructure SaaS companies have various billing models, often reflecting whether a user has dedicated or shared resources, but ideally in the multi-tenancy deployments with soft isolation, they charge based on “pay-as-you-go” cloud resource consumption. Whatever the billing model, a viable cloud business needs to be entirely transparent and show users all the costs and the metrics from the underlying entities in the data plane. The metrics can be anything: compute time, API calls, latency, ingress/egress throughput, workload capacity, features enabled, etc. They should be able to provide usage-based billing, answer every user’s ultimate question “What did I get charged for?”, and provide a more flexible billing system with dashboards, bill breakdowns, alerts for reaching quota, etc.

Metrics and usage-based billing
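The query behind "What did I get charged for?" can be conceptually simple - a hedged sketch over a hypothetical usage_events table that the data plane writes into:

-- Roll up raw consumption into billable units per tenant for the current month.
SELECT tenant_id, metric, sum(quantity) AS billable_units
FROM usage_events
WHERE recorded_at >= date_trunc('month', now())
GROUP BY tenant_id, metric;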

In addition to using metrics for usage-based billing, the business uses metrics and KPIs that tell how healthy the company is, show customer activity, churn rate, annual growth rate, and project future growth (see The 8 KPIs That Actually Matter—And How To Measure Them). Even more so, your CEO may want to experiment with pricing models, tiering structures, or tenant customizations, to see which approach maximizes subscriber growth or recurring revenue or profits. In fact, this is precisely the type of experimentation that enables any SaaS to move quickly and adapt to and expand their target user base.

KPIs

(source: https://databox.com/dashboard-software/business)

Infrastructure SaaS companies report user-consumption metrics into frontend dashboards built directly into the web application. They also integrate the operational or business-relevant metrics into industry-standard tools like Grafana or Prometheus. To serve up those metrics, some companies build a telemetry architecture resembling something like what is described in Scaling Apache Druid for Real-Time Cloud Analytics at Confluent. That architecture has a data plane that emits metrics, which are fed into an event messaging system, sent to a database like Apache Druid that is optimized for time-series data, offered up by an API, and then consumable by downstream applications.

Metrics and billing APIs

Developers shouldn’t have to build from scratch a metrics pipeline or support all these extra components. But don’t just pull the metrics from the business data warehouse or from Prometheus or any other metrics collection tool! They will not enforce the tenant isolation and access controls needed in your product, and they often aren’t maintained or optimized as a production database. A data platform that already handles the control plane metadata and has built-in access control and multi-tenancy can provide these metrics to end users and internal operations alike. If the backend has an endpoint that easily serves up metrics, a developer can focus on writing the business logic for processing the metrics and sharing the usage-based billing with customers. So having metrics be a first-class built-in capability of any infrastructure SaaS allows the business to more quickly launch their product.

Control plane access control

Authorization within an organization allows a group of users to belong to the same organization but have access only to the subset of resources that they need. Developers need to create access policies that should follow principles of least privilege and zero trust security. To achieve that, the policies can get quite granular, configurable for a variety of attributes (or "signals" as Netflix calls them, see Authorization at Netflix Scale) like entity properties, user location, user role, suspicious fraudulent activity, etc.

When a user or service account provisions an ETL job, the request gets sent to the backend to validate that they have the appropriate permissions to create/view/edit the requested resource. Actually, permission validation should happen before the option is presented to the user in the app—don’t even let a user try to create a resource if the request is going to fail because that is inefficient and just bad UX.

Control plane access control

Sometimes access control is defined at the application layer by adding in an authorization middleware. But there is development work here, and each downstream microservice application may choose their own authorization tools. In the Netflix architecture circa 2018, every application uses the same authorization tool, which means lots of code duplication! This complexity scales up with the number of applications. As a result, any time you add a new application or microservice, there is additional cost to add in authorization. Since developing a secure middleware is sometimes less interesting than working on the product itself, it gets postponed "to the next sprint", which leads to security vulnerabilities.

A cleaner architecture would be to put access control directly into the data layer. Because data is shared across different applications and multiple tenants, applying access policies on the data itself ensures that policies are applied consistently to all applications. Access control at the data layer also abstracts it away from application implementation, so in theory an application can evolve with new business requirements without changing the security model.

The following code demonstrates a way to apply an access policy to data itself, by granting a specific user access to entities in the development environment, and it’ll be checked whenever any applications tries to retrieve data.

req = CreatePolicyRequest(
    actions=[Action.ALLOW],
    resource=Resource(type=entity.name,
                      properties={'environment': 'dev'}),
    subject=Subject(email=user.email),
)
policy = create_policy.sync(
    ...
    workspace=workspace,
    org=org.id,
    json_body=req,
)

In practice, there is a bit of complexity in configuring access control on the data because it requires both a database that supports rich access policies and a team of DBAs with strong skills to build and maintain. As the database schemas and services grow, it can get harder to support and troubleshoot. But if it’s designed into the service from the beginning and not bolted on later, it can provide a cool differentiator in your infrastructure SaaS.

Consistent user experience for UIs and APIs

The discussion so far has focused on the backend workflow, but end users experience the product through workflows like self-serve signup, login, creating a space, the proverbial "3 click" provisioning, programmatic management of resources, monitoring usage, billing, etc. All these workflows need to interact with the backend, and if there are new features or changes rolled out in the backend, then the UI and APIs must be updated too. So to move fast, you really need a robust interface to the backend so users can have a consistently stellar experience through any of them.

A really good UI starts with a stylized look-and-feel to build a brand, but a common problem to figure out is how does the web application interact with the backend. Reusable web components and micro frontend architectures accelerate building the web application, but working with the backend has more complex dependencies and as the backend changes, so must the frontend. In all likelihood, there’s already some kind of API to the backend, but it also has to be robust enough to build a web application on top of and to let customers automate against. A simple task might be creating and managing a new pipeline, with a connection to a new database. From the web application, it might look like this:

Web form with components

A developer could custom-build a frontend to capture the name and other properties with text fields, get the form values, format them the way the API expects, then send it to the back end. But prebuilt components and hooks for common user workflows are specially designed to handle both UX and API interaction and abstract away the API calls. For example, instead of coding up a new organization form from scratch, a developer can drop in a OrganizationForm component (refer to this PR to see how this was handled in a Nile example) that automatically handles interacting with the backend.

components/CreateOrg/index.tsx
<OrganizationForm
  onSuccess={(data) => {
    router.push(
      paths.entities({ org: data.id, entity: entity.name }).index
    );
  }}
/>

Easy-to-deploy and pre-built, fully customizable web components and simple filters also reduce the development work to serve up events, report metrics, and configure access policies in the frontend. So a great backend API paired with customizable web components really helps the frontend developers provide the slick web application that differentiates their product offering. A public API also can provide programmatic access to the backend for any custom application. Especially for more infrastructure use cases, end users expect to interact only through APIs so that they can automate their own deployments and integrate with their CI/CD. Providing these robust APIs help differentiate from its competitors and other open source projects.

Summary

Walking through the infrastructure SaaS workflow of provisioning an ETL pipeline highlighted some complex problems that need to be solved, how to:

  • provide a database as a source of truth with built-in multi-tenancy
  • give developers an event service to reconcile with the data plane
  • serve up metrics for consumption-based billing, experimentation, and other business operations
  • authorize users with a flexible access control model
  • provide great UIs and APIs, along with a slick frontend with web components customized to the backend

This set of problems is common for all infrastructure SaaS and they get solved over and over again by each new company. Nile addresses these complex problems OOTB by providing a tenant-aware, serverless database that is used to build control planes, just like the one discussed in this blog, which enables companies to iterate quickly and deliver their product to the market as quickly as possible.

Nile control plane

Launching an infrastructure SaaS product should be easier than it is today with codification of the infrastructure SaaS lifecycle management. Companies should be able to focus on their business logic and let someone else handle the complexities. If you’re interested in learning more about building a SaaS on Nile, talk to us to learn more and run our GitHub examples to see it in action.

· 16 min read

At Nile, we’re making it easier for companies to build world-class control planes for their infra SaaS products. Multi-tenancy is core to all SaaS products and especially those with control-plane architectures. At Nile, we’ve built multi-tenancy into our product from day one. If you are working on an infra SaaS product and need a multi-tenant control plane, you should talk to us.

From previous experience, we’re familiar with multiple multi-tenant SaaS architecture options. We decided to store everything in a single Postgres schema since it provides a balance of scalability, cost optimization, and flexibility. However, this requires serious investment in database authorization to guarantee that we never leak customer data.

Authorization in a multi-tenant db is something many companies have to deal with, and in previous companies, I saw authorization implemented in probably the most common way: appending WHERE user_id = $USER_ID to queries. This is also the way things started out at Nile, but as we added more features we noticed that we were forced to add many branching and repetitive WHEREs to our code. We needed a solution that would allow us to add features quickly and confidently, and using custom filters in every single query was error-prone and hard to evolve if our data model changed.

RLS code excerpt

One solution that I knew about was Postgres Row-Level Security (RLS), a db-level mechanism that filters rows on a per-user basis. I expected it would allow us to iterate faster and dramatically reduce security risks. You can learn the basics with these two blogs that show how to build multi-tenant apps using Postgres RLS. As with most solutions, the blog version was easy to implement, but there was an especially long tail to ship to production.

In this blog post, I’ll talk about the alternatives we considered - both for multi-tenant architecture and for securing data access - why we chose RLS, and the various challenges we encountered and overcame while shipping it to production.

Existing multi-tenancy solutions

Schema-per-tenant and database-per-tenant

We considered both of these approaches but went with the single-schema approach for its minimal operational complexity, low cost, and ability to scale later on. I won’t go into detail about these approaches, as there are countless resources on the topic. Here are two resources I’ve found to be helpful:

  1. Multi-tenant SaaS patterns - Azure SQL Database | Microsoft Docs
  2. A great paper from Microsoft - Multi-Tenant Data Architecture

Single schema with dynamic WHERE queries

Pros

  1. Easiest and most straightforward zero-to-one solution.
  2. Transparent and easy to reason about.

Cons

  1. Possibility of forgetting to add a filter to a query. Since queries are permissive by default, this is easy to miss and hard to detect without extensive testing. There are some solutions to this (e.g., @Filter in Hibernate), but I find that ORMs make simple querying easier and complicated querying harder. At Nile, our authorization model is complicated enough that we didn’t want to rely on Hibernate for this.
  2. Repetitive, ugly, and annoying to implement. Imagine you have 20 API endpoints that require authorization and 2 different types of roles, USER and ADMIN. The access controls for these two roles are different, so you might have to define 40 WHEREs across your codebase. This doesn't scale well when adding new roles or modifying existing ones across many API endpoints.

External authorization systems

Pros

  1. Highly flexible
  2. (Claim to be) scalable

Cons

  1. $$$ cost, if managed. Operational cost, if self-hosted.
  2. Unnecessary if the permissioning model isn’t particularly complicated. At Nile, so far it’s not.
  3. External dependencies often make testing more difficult and reduce engineering velocity. The benefits have to outweigh these costs.
  4. As a control plane, multi-tenancy is core to our product. We believe in building foundational capabilities in-house so that we can push the envelope rather than be constrained by external solutions.

What might a better solution look like?

After we chose to use a single multi-tenant schema, we were looking for a solution that would be cleaner and less error-prone than dynamic queries and lighter than an external authorization system.

In the rest of this blog post, I’ll lay out what I discovered about RLS in the few weeks I spent researching and implementing it at Nile, and how it solved our problem (at least for now) of building authorization with speed, confidence, and maintainable architecture.

A quick overview of RLS

The high-level process to set up RLS is:

  1. Define your data model as usual, but include a tenant identifier in every table
  2. Define RLS policies on your tables (e.g., “only return rows for the current tenant”)
  3. Define a db user (e.g., app_user) with all the privileges your application will need to interact with the db, but without any superuser roles. In Postgres, this is necessary since superuser roles bypass all permission checks, including RLS (more on that later).

A simple org access control example

Imagine your API has an /orgs endpoint that should only return organizations the calling user is a member of. To achieve this via RLS, you’d define your tables, policies, and db user as follows:

rls_policy_setup.sql
CREATE TABLE users(
  id SERIAL PRIMARY KEY
);

CREATE TABLE orgs(
  id SERIAL PRIMARY KEY
);

-- "user" is a reserved word in Postgres, so the columns are named
-- user_id and org_id instead.
CREATE TABLE org_members(
  user_id INTEGER REFERENCES users NOT NULL,
  org_id  INTEGER REFERENCES orgs NOT NULL
);

-- ** RLS setup **
ALTER TABLE orgs ENABLE ROW LEVEL SECURITY;

-- Create a function, current_app_user(),
-- that returns the user to authorize against.
CREATE FUNCTION current_app_user() RETURNS INTEGER AS $$
  SELECT NULLIF(
    current_setting('app.current_app_user', TRUE),
    ''
  )::INTEGER
$$ LANGUAGE SQL SECURITY DEFINER;

CREATE POLICY org_member_policy ON orgs
  USING(
    EXISTS(
      SELECT 1
      FROM org_members m
      WHERE m.user_id = current_app_user()
        AND m.org_id = id
    )
  );

-- Create the db user that'll be used in your application.
CREATE USER app_user;

GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO app_user;

GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO app_user;

The above RLS policy will only return true for organizations that the current user is a member of. Simple enough. Later on, we’ll see how things can get more complicated.

Note the current_app_user() function. In the traditional use case of direct db access, RLS works by defining policies on tables that filter rows based on the current db user. For a SaaS application, however, defining a new db user for each app user is clunky. For an application use case, you can dynamically set and retrieve the current user with Postgres’ current_setting() function (e.g., SET app.current_app_user = '123' and SELECT current_setting('app.current_app_user')).
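Here is a minimal sketch of that per-request pattern, reusing the example schema and current_app_user() function above (the user id 42 is arbitrary):

set_user_per_request.sql
-- Run while connected as app_user (defined above).
SET app.current_app_user = '42';                        -- set once per request, after authentication
SELECT current_setting('app.current_app_user', true);   -- returns '42'
SELECT * FROM orgs;                                     -- RLS now filters to orgs user 42 belongs to
RESET app.current_app_user;                             -- clear before returning the connection to the pool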

What it looks like from request to response

Request to response diagram

Why we chose RLS

It fails by default - and is therefore secure by default

The biggest benefit of RLS is that if you define a policy that’s too restrictive, or forget to define a policy altogether, things just fail. Compared to dynamic queries, where forgetting to add a WHERE will leak data, this is a big win for security. I didn’t appreciate this until I wrote some integration tests for access patterns (e.g., testing whether a user can access orgs they’re a part of). Initially, all the tests failed, and tests for cases where users should have access only passed once I added the appropriate RLS policies.

RLS is, of course, not a silver bullet. Accidentally defining an overly permissive policy is hard to catch without extensive tests so it’s important to still be careful.

Defined once, applied everywhere

One of the main challenges with dynamic queries in single-schema multi-tenancy is that changes to tables often require touching many different queries. RLS solves this problem since policies are tied to tables and not queries. After modifying a table, all you need to do is to change its access policies, which will be applied to all queries.

Composability

With RLS, it’s easy to add more access rules as your multi-tenant data model evolves. According to the Postgres docs:

“When multiple policies apply to a given query, they are combined using either OR (for permissive policies, which are the default) or using AND (for restrictive policies).”

Since by default policies are combined with OR, this makes it super easy to define more policies as your access rules get more complex. This isn’t so straightforward with dynamic queries, where you might have to define your own logic for combining access rules. Or, as probably many of us have seen before, just create monster WHERE statements.
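To make this concrete, here is a hedged sketch building on the orgs example above; the is_public column and public_orgs_policy are hypothetical additions, not part of our actual schema. Because both policies are permissive, Postgres combines them with OR, so members keep seeing their own orgs and everyone can read public ones, without touching any existing query:

composability_example.sql
-- Hypothetical addition to the earlier orgs setup.
ALTER TABLE orgs ADD COLUMN is_public BOOLEAN NOT NULL DEFAULT FALSE;

-- A second permissive policy: anyone may read orgs marked public.
-- Permissive policies are OR'ed with org_member_policy automatically.
CREATE POLICY public_orgs_policy ON orgs
  FOR SELECT
  USING (is_public);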

Separation of Concerns

Instead of mixing filters that are related to our application logic with filters that are related to the multi-tenant database design in the same WHERE clauses, we now have a clean separation:

  • Our application applies all the filters that are requested by users through APIs and other application logic.
  • RLS is responsible for filters that are required due to the multi-tenant database design.

Cases where RLS isn’t a great fit

Every technology has its tradeoffs and cases where you shouldn’t use it. Here are two cases where we think RLS isn’t a great fit:

If you need stronger isolation between tenants

RLS in a multi-tenant db isolates access to database rows, but all other database resources are still shared between tenants. It doesn’t help with limiting the disk space, CPU, or db cache used per tenant. If you need stronger isolation at the db level, you will need to look elsewhere.

If you have sophisticated access policies

As you will see in the next section, our current access policy is fairly simple - tenants are isolated from each other, and within a tenant, you have administrators with additional access. More mature access control policies such as RBAC/ABAC require their own schema design and can be more challenging to integrate with RLS and even more challenging to make performant.

We’ve recently started the design for the RBAC/ABAC feature in Nile (talk to us if you are interested in joining the conversation), and we will have a follow-up blog with recommendations on best practices for adding RBAC/ABAC to multi-tenant SaaS.

Implementation challenges

A few gotchas

One gotcha we encountered was that RLS doesn’t apply to superusers and table owners. According to the Postgres docs:

“Superusers and roles with the BYPASSRLS attribute always bypass the row security system when accessing a table. Table owners normally bypass row security as well, though a table owner can choose to be subject to row security with ALTER TABLE ... FORCE ROW LEVEL SECURITY.”

Both of the blogs I shared earlier create a user called app_user that’s used in the application. We did this as well, locally, but didn’t change the database user when deploying to our testing environment. Thankfully, we caught and fixed this issue quickly.
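Two defensive statements that can help here (a sketch, assuming the orgs table and app_user role from earlier): force row security for the table owner as well, and check that the application’s role cannot bypass RLS:

force_rls_check.sql
-- Apply row security even to the table owner.
ALTER TABLE orgs FORCE ROW LEVEL SECURITY;

-- Confirm the application role is neither a superuser nor marked BYPASSRLS.
SELECT rolname, rolsuper, rolbypassrls
FROM pg_roles
WHERE rolname = 'app_user';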

Another issue we caught during testing was that some requests were being authorized with a previous request’s user id. We discovered that since the user id for RLS was being stored in thread-local storage and threads were being reused for requests, it was necessary to set up a post-response handler to reset thread-local storage.
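A complementary option on the database side (a sketch, not the only fix) is SET LOCAL, which scopes the setting to a single transaction so a reused connection cannot carry a stale user id past COMMIT or ROLLBACK:

set_local_example.sql
BEGIN;
SET LOCAL app.current_app_user = '42';   -- visible only inside this transaction
SELECT * FROM orgs;                      -- RLS evaluated for user 42
COMMIT;                                  -- the setting is discarded here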

Overall, so far the gotchas haven’t been too tricky to diagnose and resolve, but as one might expect with anything security-related, they do have serious consequences if not addressed.

Initial widespread code changes

Although RLS addresses the problem of continuous widespread changes well (see “Defined once, applied everywhere”), initially switching from dynamic queries to RLS requires more code changes than you might think. Here’s an example of how RLS might affect an API endpoint to update an organization that’s only callable by users in that org:

before_and_after_rls.java
/*
** ---- Without RLS ---- **

1. Check if user is a member of the org
   a. If so, execute the update query
   b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
    if (dao.isOrgMember(userId, orgId)) {
        return dao.updateOrg(updatePayload);
    } else {
        throw new NotFoundException();
    }
}

/* -- DAO layer -- */

boolean isOrgMember(userId, orgId) {
    return query("EXISTS(SELECT 1 ...)");
}

Org updateOrg(updatePayload) {
    return query("UPDATE orgs SET ... RETURNING *");
}

/*
** ---- With RLS ---- **

1. Execute the update query
   a. If the org was returned from the db, return the org in the response
   b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
    Optional<Org> maybeOrg = dao.updateOrg(updatePayload);
    if (maybeOrg.isPresent()) {
        return maybeOrg.get();
    } else {
        throw new NotFoundException();
    }
}

/* -- DAO layer -- */

Optional<Org> updateOrg(updatePayload) {
    return query("UPDATE orgs SET ... RETURNING *");
}

In this example, authorization without RLS is done before writing to the db. With RLS, since authorization is determined at query time, write queries might fail so error handling has to be pushed down to the db level. This isn’t a mind-boggling change but is one you should keep in mind when planning to add RLS in any project that involves a multi-tenant db.

The gaps between blog-ready and production-ready RLS

Recursive permission policies

Let’s say you want to add an admin user type and implement the following access rules:

  1. Users can read, update, and delete their own user profiles.
  2. Users can read the profiles of other users who belong to the same tenant.
  3. Users with admin access can read, update, and delete other users who belong to the same tenant.

The first two use cases are possible with straightforward RLS policies, but the third isn’t. This is because we must query the users table to see if the user in question is an admin (e.g., SELECT 1 FROM users WHERE id = current_app_user() AND is_admin = TRUE). Since querying a table triggers its RLS policy checks, executing this query within a users RLS policy will trigger the users RLS policy checks, which will run this query again, which will trigger the policy checks again, resulting in an infinite loop. Postgres will catch this error instead of timing out, but you should make sure to test your policies so this doesn’t happen at runtime. You can avoid this problem by defining a function with SECURITY DEFINER permissions to be used in the RLS policy. According to the Postgres docs:

"SECURITY DEFINER specifies that the function is to be executed with the privileges of the user that owns it."

In our case, this owner is the superuser that you probably used to set up your database, so the function bypasses RLS.

note

By using SECURITY DEFINER you are allowing users to bypass the security policy and use superuser privileges regardless of who they really are, so you must be careful. I recommend reviewing the “Writing SECURITY DEFINER Functions Safely” section of the Postgres documentation before using this capability.

Here’s an example of how to implement RLS policies that satisfy the three use cases above:

complex_rls_policy.sql
CREATE TABLE users(
  id SERIAL PRIMARY KEY,
  is_admin BOOLEAN
);

ALTER TABLE users ENABLE ROW LEVEL SECURITY;

-- Users can do anything to themselves.
CREATE POLICY self_policy ON users
  USING(
    id = current_app_user()
  );

CREATE FUNCTION is_user_admin(
  _user_id INTEGER
) RETURNS bool AS $$
  SELECT EXISTS(
    SELECT 1
    FROM users
    WHERE id = _user_id
      AND is_admin = TRUE
  )
$$ LANGUAGE SQL SECURITY DEFINER;

CREATE FUNCTION do_users_share_org(
  _user_id_1 INTEGER,
  _user_id_2 INTEGER
) RETURNS bool AS $$
  SELECT EXISTS(
    SELECT 1
    FROM org_members om1,
         org_members om2
    WHERE om1.user_id != om2.user_id
      AND om1.org_id = om2.org_id
      AND om1.user_id = _user_id_1
      AND om2.user_id = _user_id_2
  )
$$ LANGUAGE SQL SECURITY INVOKER;

-- Non-admins can only read users in their orgs.
CREATE POLICY read_in_shared_orgs_policy ON users
  FOR SELECT
  USING(
    do_users_share_org(
      current_app_user(),
      id
    )
  );

CREATE POLICY admin_policy ON users
  USING(
    do_users_share_org(
      current_app_user(),
      id
    )
    AND is_user_admin(
      current_app_user()
    )
  );

Note the use of the do_users_share_org() SECURITY INVOKER function. According to the Postgres docs:

“SECURITY INVOKER indicates that the function is to be executed with the privileges of the user that calls it.”

In our case, this is app_user (which doesn’t bypass RLS), so we define these functions purely for reusability.

Logging

It’s important to set up logging before shipping any feature to production. This is especially true with RLS, where logging the execution of the actual policies isn’t directly possible. For each request, it’s helpful to log the user and tenant IDs to be used for RLS when:

  • Parsing them from auth headers
  • Setting and getting them from thread-local storage
  • Setting them in the db connection
  • Resetting them in thread-local storage after the response is sent
Logging at each of these points makes it easier to identify bugs related to thread-local storage.

It’s also a good idea to enable more detailed logging in the db, at least initially, to see the values actually being inserted/retrieved. If policies return too few/many results, or inserts fail unexpectedly, it’s easier to figure out what went wrong.
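For example, one way (among several) to temporarily turn on statement logging while debugging policies, assuming you have superuser access to the database:

temporary_statement_logging.sql
ALTER SYSTEM SET log_statement = 'all';  -- log every statement; noisy, for debugging only
SELECT pg_reload_conf();

-- Revert once you're done.
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();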

Testing

In multi-tenant SaaS, guaranteeing the security of each tenant is critical. We have an extensive suite of integration tests that test every access pattern to make sure that nothing ever leaks. The tests spin up a Postgres Testcontainer and call the relevant API endpoints, checking that proper access is always enforced.

In order to minimize the execution time of a large suite of integration tests, we avoid setup and teardown of the database between tests and annotate the order in which tests run to make sure the results are deterministic even without a full cleanup in between tests. As we scale, we’ll look into other options like property-based testing and parallelizing our tests.

The switch from dynamic queries to RLS has been seamless in our integration tests. All we had to do was to make sure our tests were using the newly-created app_user that doesn’t bypass RLS.
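Beyond the automated suite, a quick manual spot-check (a sketch reusing the example schema above, with arbitrary user ids) is to connect as app_user and confirm that changing the user id changes what is visible:

manual_isolation_check.sql
-- Run while connected as app_user.
SET app.current_app_user = '42';
SELECT count(*) FROM orgs;               -- only orgs that user 42 is a member of

SET app.current_app_user = '43';
SELECT count(*) FROM orgs;               -- should reflect user 43's memberships only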

Conclusion

Every modern SaaS product is multi-tenant, but the good ones are also scalable, cost-effective, and maintainable. Scalability and cost-effectiveness are the results of careful system design. Maintainability includes design considerations such as the DRY principle (don’t repeat yourself) and a separation of concerns, which make mistakes less likely and testing and troubleshooting easier.

As we’ve shown, a single-schema multi-tenant database with RLS ticks all the checkboxes for scalable, cost-effective, and maintainable architecture. This blog includes everything you need to get started with your own multi-tenant SaaS architecture. But if this seems like too much and you’d rather have someone else handle this for you - talk to us :)

· 18 min read

When we explained the control plane approach for Infrastructure SaaS, most developers we talked to understood what SaaS control planes do and often said things like oh! I built this many times! I didn't realize it is called a control plane. This reaction confirmed what we initially suspected - that everyone is building control planes, but no one is talking about it. If you are interested in SaaS control plane - sign up to our mailing list. We'll send you our latest content and reach out for a chat

We also received a few responses like "why is this difficult?"

Legally Blonde meme - "What, like it's hard?"

When we started building Confluent Cloud, we thought that building a control plane would be easy. A few tables for users and their clusters, run everything on K8s... how hard can it be? A two-pizza team effort of 6 months seemed like a very reasonable estimate. We discovered otherwise. Recently, when we talked to other SaaS Infra companies, we learned that regardless of the company's specific architecture or size, building a successful Infra SaaS product required at least 25% of the engineers to work on the control plane.

Building a control plane sounds simple - but it isn't. If you ever said, "oh, I can do this in a weekend," only to keep struggling for weeks or months, you should be familiar with deceptively simple problems.

Chess only has six kinds of pieces and ten rules. It sounds like a pretty simple game. The DynamoDB paper is short, yet every implementation took years to complete. Copying data from one Kafka cluster to another sounds simple. No one expected it to require four rewrites in 7 years. Pick the simplest software engineering topic you can imagine - SQL, JSON, REST, PHP - and you will be able to find a blog explaining why this topic is rather complex.

Tweet about "I can do this in a weekend"

In this blog, we'll look at the challenges waiting for the engineer who sets out to build a control plane.

As we described in the previous blog, the control plane is responsible for providing the SaaS capabilities, metadata management, and controlling the life cycle across all the data planes.

Building a control plane is a big job. In this blog post we'll discuss the many challenges that may not be apparent when you first try to estimate the effort involved. We divided the problems into two parts:

  • SaaS Flows: The problems that come up as you try to create a unified customer experience across many services in the control plane and the data plane.
  • Control Plane and Data Plane integration: The problems that come up when you need to send information from the control plane to the data plane and vice versa.

Different problems become relevant at various stages of building a control plane. An MVP for a control plane can be relatively simple and become more challenging as requirements add up. Having this map of challenges will help you understand the technical depth involved and the realistic investment level.

While this blog focuses on control planes that serve the customers of Infra SaaS, the challenges involved in building internal control planes are similar. We will address the topic of internal control planes in a future blog post.

Seamless SaaS Flows

Your customers signed up for your SaaS product because they need to get something done, and they will use the control plane to do it. They don't care about its architecture and all the services that are involved - they need it to disappear so they can focus on getting things done.

You can probably think of products where, even though the product is complex, getting things done is a seamless experience. The same concepts are used throughout the product and you can't tell which teams own which microservices behind the scenes. Apple's iPhone, Stripe, and Notion all have a seamless user experience.

Compare this to AWS network configuration. All you want is to run a container that can communicate with the public internet. But you have to figure out how to configure EC2, security groups, load balancer, target groups, subnets, routing tables and NATs. Each with its own documentation. If you do it wrong, you won't have connectivity. But because each component is configured separately, it is nearly impossible to understand why the packets don't arrive correctly.

We use the term SaaS Flow to describe common user activities in SaaS products that interact with multiple services.

There are SaaS Flows that are standard in SaaS products - they interact with standard SaaS capabilities.

For example: inviting a user to an organization in a SaaS product is a SaaS flow - a single activity from the user perspective, but the implementation spans the authentication service, user management service, notification service and perhaps an access control service as well. You can see an example diagram of an invite flow below.

There are also SaaS Flows that interact with entities that are specific to your application.

Creating a new set of API keys that give access to the data plane database that a customer provisioned. Upgrading an account from free trial to a paid version and updating the number of concurrent builds that can run in the data plane. Handling an expired credit card, deleting a user, deleting a tenant - all of these are examples of a single user activity that has to be handled across many services, some of them general (payments, notifications) and others product-specific (pausing all ETL jobs, updating data plane access policies).

There are hundreds of such flows in any SaaS product.

state machine for user invite flow

Every control plane is a SaaS Mesh - it is made of many multi-service SaaS Flows

User management, access management, audit, billing, notifications, provisioning, upgrades, metrics, logs... not to mention anything specific to your product. Every SaaS flow will involve several of these services, which means that they continuously need to exchange information, events and actions. Each component has an owner, but who owns the end-to-end flow? Who makes sure that services send each other the relevant information and that each service handles its part of the flow in ways that fit with all the other parts? You can think of this as a SaaS Mesh - a seamless product that is generated from independent components and clear contracts between them. Or it can become a SaaS Mess, if the interfaces are not well defined and the dependencies are introduced ad-hoc.

As an example, think of a scenario where the credit card on file for an organization has expired. The organization has 15 users. Which one of the users in the customer's organization will be notified? How will they be notified? Will your sales team or support get notified too? Will the customer's clusters or jobs keep running? For how long? If the cluster is de-provisioned, will you keep storing their data? What about the list of users and their emails? Metrics? Once they update the credit card details, will every service resume its activity? Will the new card get charged for any of the lapsed time?

It is important to also handle all the failure scenarios in each of these flows - what if the notification service is down? What if Salesforce returns an error or throttles your requests? Is it possible to save the current flow state and try again later? Can you restart the flow from the beginning or were some notifications already sent?

note

SaaS Flows can be modeled with a state machine and each event or API request/response moves the system between states. This model helps you persist the latest state of the flow, so completed steps won't re-run but failed steps can be retried and the flow can continue. In addition, this modeling helps in monitoring the load and health of each flow.
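For illustration, here is a hedged sketch of what persisting flow state could look like; the table and state names are hypothetical, not a prescribed design. The guarded UPDATE only advances the flow from the expected state, so a retried or duplicate step matches zero rows and can be treated as already done:

saas_flow_state.sql
-- Hypothetical flow-state table in the control plane database.
CREATE TABLE saas_flows (
  id         BIGSERIAL PRIMARY KEY,
  flow_type  TEXT NOT NULL,              -- e.g. 'user_invite'
  state      TEXT NOT NULL,              -- e.g. 'invited', 'email_sent', 'accepted'
  payload    JSONB NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Advance the flow only from the expected state.
UPDATE saas_flows
SET state = 'email_sent', updated_at = now()
WHERE id = 123 AND state = 'invited';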

SaaS Flows across control and data planes

Creating Seamless SaaS Flows that touch both control plane and data plane services is an even bigger challenge. This is especially true when a customer request encounters a failure in the data plane.

Balancing enough abstraction for a seamless experience when things go right, but enough details for meaningful debugging when things go wrong, is an engineering challenge in general. It becomes more difficult when the user interacts with a control plane but problems happen in the data plane.

Think of a scenario where you built a SaaS ETL product. A customer tries to define a new ETL pipeline through your control plane, but something in the process failed. If the failure was due to lack of compute on a K8s cluster, the control plane shouldn't show the exact K8s error, since your service abstracts K8s. But if the failure is in loading data to the customer's DB, showing the exact error will actually help your customer identify the DB issue on their side.

Example of an actionable error message in a SaaS flow:

is this error useful?

If the error is transient, it makes sense to retry the SaaS flow - starting from the point of failure. Does the control plane manage the retries by re-publishing the "create new pipeline" request repeatedly until it successfully completes? Does the pipeline itself persist the event until it is successfully acknowledged? Does the data plane store the in-flight requests locally until they complete? Each of these architectures has its own tradeoffs.

In cases where the user does interact with the data plane directly, we discovered that the user's mental model is that all admin activity will still be available in one place and that there will be a consistent permissions and access model between control plane and data plane.

A user who just created a database in the control plane will also expect to be able to create tables, insert data into those tables and run queries. The expectation is that the control plane is a single pane of glass that reflects all the data plane systems. It will be a non-ideal experience if they need to use two or three different tools for all those activities, and an even worse experience if the user who created the database doesn't have permission to create a table or to query the table that they created.

SaaS Flows that involve business systems

In addition to the control plane and the data plane, there are other parts of the business that have a relationship with customers.

Support teams will need a view of the current state of the customer's metadata - especially if there were any recent changes or errors. They will need to be able to dig into any relevant metrics or logs on behalf of the customer, and perhaps even take action on the customer's behalf (after proper approvals).

Marketing teams may need metrics regarding the customer's engagement or specific activities they took (or did not yet take) in the product. And they may wish to tweak certain aspects of the product experience to drive growth in certain segments or personas.

Sales teams may need to know when the customer's usage passed a certain limit. They may also need to be aware of any serious incidents, SLA misses and significant planned maintenance that will affect their customers. And of course business analytics or data science teams will need access to all the usage, engagement, costs and revenue data in order to prepare dashboards for the executives.

A credit card expiration flow may have a step that updates the sales team via Salesforce, along with many other steps:

services involved in credit card expiration

All those business requirements indicate the need for ETL and reverse ETL between the control plane and multiple business systems - data warehouse, analytics store, marketing tools, sales tools, support systems and so on. Those integrations also require monitoring, and ideally should be part of the integration testing, so you can quickly catch any breaking changes.

When using 3rd party services - you still own the SaaS flows

Since SaaS control planes are large in scope, it makes sense to integrate with 3rd party providers for specific features such as payment processing, authentication or transactional notifications.

Using 3rd party services helps Infra SaaS companies deliver value to their customers faster, but those customers still need seamless SaaS flows. External services can be part of these flows but the flow itself is still owned by the control plane developers.

Let's say you use a 3rd party authentication service. Authentication is solved, but information about users still has to exist throughout the control plane and even the data plane, since it is part of many SaaS flows. There is still a "user data store" and a "user service" which provides APIs and events to every other service that needs information related to users. All the issues we describe in this section are still problems that you own and need to address: designing SaaS flows, error handling, access management between control and data planes, testing and monitoring.

Trust but Test and Monitor

SaaS Flows have to be tested as a flow - coverage of each service alone leaves many gaps for customers to fall through. You will want an integration testing framework that allows you to test all the services, including the 3rd party ones. Testing the "reset password" API will require an environment with the authentication service, user management service and notification service.

It is also important to test all the cross-service APIs. You will want to avoid breaking compatibility between services when possible, and to know when compatibility was broken so you can come up with a deployment plan that involves all services that use the modified API. There are also APIs that were not meant to be used by other services, and yet they are. Breaking those undocumented APIs will break your application just the same. There are service mesh tools that can report which APIs are actually in use, and which services use them - use those tools to understand which API contracts you need to maintain.

Make sure you collect detailed metrics about the number of users, payments, notifications or other entities in each step of the flow - a large number of flows stuck in a specific state will be your main indication that there is an error condition that your flow did not take into account.

Most SaaS Flows have implicit user expectations around latency - after clicking "reset password", users will expect the website to update in 100ms, the SMS to arrive in 30 seconds and an email to arrive within a minute or two. You will want to measure the latency of each step in the flow and queuing delays between steps.

diagram of spans in SaaS flow

Integrating Control and Data Plane

This is the core challenge of the control plane architecture. We reviewed the overall architecture in the previous blog, but here's the MVP version:

  1. Design the control plane metadata and use Postgres as your data store. Layer REST APIs on top of it (for example, with a tool like PostgREST) and use Postgres access controls, and you have a minimal backend.
  2. Use 3rd party integrations where possible. This still requires effort, but it is a good start.
  3. Capture changes from the control plane that need to be applied on the data plane. With this architecture all changes are persisted to the database, so it makes sense to capture changes at the DB layer. This can be done with logical replication, Debezium, database triggers or a home-grown service (see the sketch after this list).
  4. Deliver the events to the data plane: The most common pattern is to have the data plane poll the control plane for new events - this can be via API calls, direct database queries, or an event / messages service.
  5. Data plane services react to events independently, according to their own business logic
  6. Data plane services update the control plane on progress and errors
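As a hedged sketch of steps 3 and 4 for the trigger-based option (the clusters table, event table, and ids here are hypothetical), changes can be captured into an outbox-style events table and the data plane agent can poll for anything past the last event id it processed:

control_plane_events.sql
-- Hypothetical outbox table for control plane changes.
CREATE TABLE control_plane_events (
  id         BIGSERIAL PRIMARY KEY,
  entity     TEXT  NOT NULL,             -- e.g. 'cluster'
  entity_id  TEXT  NOT NULL,
  operation  TEXT  NOT NULL,             -- 'INSERT' or 'UPDATE' in this sketch
  payload    JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE FUNCTION capture_cluster_change() RETURNS trigger AS $$
BEGIN
  INSERT INTO control_plane_events(entity, entity_id, operation, payload)
  VALUES ('cluster', NEW.id::text, TG_OP, to_jsonb(NEW));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER clusters_capture
AFTER INSERT OR UPDATE ON clusters       -- hypothetical control plane table
FOR EACH ROW EXECUTE FUNCTION capture_cluster_change();

-- Data plane poller: fetch everything after the last event id it acknowledged.
SELECT *
FROM control_plane_events
WHERE id > 1041                          -- last processed id, stored by the agent
ORDER BY id
LIMIT 100;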

Once you implement all this, make sure you set up integration testing and monitoring.

Beyond this simple architecture, there are additional challenges that result from the different dimensions in which the system can evolve.

Availability

If your architecture allows users to interact directly with the data plane, you want to make sure that the data plane availability is either completely decoupled from that of the control plane, or that the control plane, the data plane and the pipelines in between are all designed for a higher SLA than what you offer your customers. If you opt for decoupling the data plane availability from that of the control plane, you'll probably end up with the data plane caching information from the control plane locally. It may sound simple, but keep in mind that cache invalidation is one of the two hardest problems in computer science.

Security

If you support enterprise customers, there will be interesting challenges around the security of the communication between the data plane and the control plane. They will need to mutually authenticate each other and the events themselves may need to be signed for authenticity. You'll likely need IP whitelists in both directions, publish approved port lists and support at least one private networking option, possibly more.

Some Enterprise customers may also want you to run and manage the data plane, or even the control plane in their VPC or their cloud vendor account.

You will need support for storing secrets in the control plane. It is very likely that your data plane will need to authenticate to customer resources in other SaaS, so you will ask your users for credentials - and the last thing you need is for those credentials to leak.

Scale

As the number of data plane service instances grows, you need to make sure the control plane can handle the case where they all attempt to connect to the control plane at once and retrieve updated state. This can happen as a result of an incident, a recovery plan or a mis-managed upgrade. A meltdown of the control plane under this coordinated DDoS is not going to be helpful in any of these scenarios. A combination of good database design which minimizes hot spots and a good rate limiting protocol will help save the day.

Many Infra SaaS products have use-cases that are latency-sensitive. When the target latency is below 100ms, you have to avoid routing these operations via a central control plane (a regional control plane may be acceptable). The extra latency for the additional network hop will be meaningful and the risk that the control plane will become a bottleneck is rather high.

Over time, as your product and business evolves, you may end up with multiple pipelines between control and data plane:

  • Metrics and logs are often sent from data plane to control plane, so they will be visible to customers via the control plane ("single pane of glass" is a common name for this).
  • There may be another system for fleet management and upgrades, one that is integrated with your CI/CD system but also with the control plane front-end and the notification service.

While those may be separate channels of tasks and information, it makes sense to view all those pipelines as part of a single "logical" control plane and standardize on the entities, events and APIs that these systems refer to. The reason is that, as we discussed in the section on SaaS Flows, customers expect a seamless experience with the control plane - not multiple control planes. They may want to subscribe to upgrade notifications or even configure a maintenance schedule. If the fleet management system and the control plane speak different languages, this integrated experience will be a challenge.

Upgrade flow

Reconciling state between data plane and control plane

Remember that things may happen on the data plane without going through the control plane first. This can be caused by the cloud provider decommissioning machines or upgrading K8s masters with surprise effects, or more often - it can be an engineer acting with the best intentions. Regardless of the cause, operating a system where the control plane has one configuration and the data plane has another is a recipe for failure. Your architecture must include plans for discovering and reconciling divergence.

Summary

It is easy to look at a control plane as "just a Postgres DB with some APIs and an auth service" and believe that it is simple to build and grow. However, even at its simplest, the control plane requires careful design, good guard-rails in the form of integration tests and comprehensive monitoring, and quite a bit of toil to build the needed integrations. Systems that look easy but turn out to be a significant investment are quite common in engineering. At the MVP stage, they require a balance between keeping the scope minimal and designing a flexible system that can evolve to address both customer requirements and operational pains. We will introduce more design patterns in later blog posts that will help you in designing and implementing such systems. Join our mailing list to get notified when we publish additional posts.

· 17 min read

A few months back, we saw a tweet about how every Infrastructure SaaS company needs to separate the control plane from the data plane to build a successful product. Reading this got us excited since we were working on a platform that would make this really easy. We would love to talk to you if you are already familiar with these patterns and are building an Infrastructure SaaS product.

Twitter-Snapshot

We spent the last six years at Confluent, helping transform it into a world-class Infrastructure SaaS company. We shared the same sentiment as this tweet - building Infrastructure SaaS products can be much simpler if we have a platform that helps develop a reliable control plane. Companies could save significant costs and time, and they could leverage their engineers to focus more on their core products. We thought it would be helpful to explain the end-to-end architecture of an Infrastructure SaaS product, the role of the data plane and control plane, and the problems that make this challenging.

What is Infrastructure SaaS?

Infrastructure SaaS refers to any infrastructure or platform product provided as a service. It includes data infrastructure, data analytics, machine learning/AI, security, developer productivity, and observability products. Sai Senthilkumar from Redpoint wrote an excellent article on this topic and how these Infrastructure SaaS companies are among the fastest-growing companies.

Infra-SaaS

Infrastructure SaaS companies invest in platform teams to build their SaaS platform. The platform teams are responsible for developing the building blocks needed to build a control plane. The investment in the platform teams continues to grow significantly as the product succeeds and is typically 25-50% of the engineering organization. Based on our experience building large-scale Infrastructure SaaS and talking to other companies, it has become apparent that platform investment is the highest cost to the engineering organization in these companies.

Data plane vs. Control plane - when do we need this?

Control planes are typically responsible for providing the SaaS capabilities, metadata management, and controlling the life cycle across all the data planes. The separation between the control and data planes is common when building an infrastructure SaaS product. There are a few reasons for this:

Infra-Relevant

Productize an open-source infrastructure as a SaaS product

Most open-source infrastructure projects start with only the data plane. The project authors realize that the next step is to productize the open-source infrastructure as a SaaS product. An independent control plane is ideal for achieving the SaaS experience and ensuring that the core open-source data plane is separate. The control plane will help manage multiple data planes across regions and cloud providers.

Building any proprietary Infra SaaS product

The open-source argument is pretty strong. However, the need for a control plane is not just limited to open-source infrastructure. It becomes a core need for any infrastructure SaaS product, either closed or open source. Almost all Infrastructure SaaS products need a central management layer that enables tenant management, user management, cluster management, and orchestration of all the data planes. The control plane provides a single pane of glass experience for the end-users, coordinating with all the data planes and taking responsibility for the overall life cycle management.

Data locality with customer location

With infrastructure SaaS, there is a general need to keep the data plane close to the customer location for a few reasons.

  • Cost
    The data transfer cost will be prohibitively expensive if the data plane is network intensive. You typically want to eliminate this cost by being in the same region as the customer. There are a few other networking options to mitigate this cost (a post for another day).
  • Security
    For enterprise customers, the data plane location depends on substantial compliance and regulatory requirements. Extremely security-conscious customers might want the data plane in their account to control access more tightly.
  • Latency
    Mission-critical infrastructure typically has low latency requirements. The data plane must be in the same region as the customer to ensure excellent performance.
  • High availability
    For high availability, you want to avoid connections to the data plane that span geographies, and be more resilient to cross-region or cross-cloud network failures. In addition, a single data plane cluster may be hard to scale due to capacity reasons and would need to be sharded. It becomes much easier to scale the data plane by decoupling it from the control plane.
  • Multi-cloud
    Finally, supporting multiple cloud providers is becoming very popular. One model to support this would be to centralize the control plane in one cloud and deploy the data plane in different cloud providers for the same customer. There are more variants to this which we will look at later.

What does a world-class control plane need?

It would be helpful to understand what capabilities a control plane needs to support. These requirements will influence the architecture of an Infrastructure SaaS product.

Infra-Requirements

User, organization, and metadata management

Users and organization management are basic requirements for an Infrastructure SaaS product. User management includes authenticating users, managing users' lifecycle (add, invite, delete, update), and supporting user groups and third-party identity integrations. The control plane needs to ensure that the access controls for a user are reflected on the data plane when the user lifecycle APIs are invoked.

Organization management, sometimes known as tenant management, includes supporting the organization hierarchy data model, applying quotas, SKUs, and security policies at an organization's scope, and end-to-end life cycle management. Multitenancy is a basic need for a SaaS application, and Infrastructure SaaS is not any different. For larger customers, organization management becomes pretty complex, including supporting flows to merge two or more organizations, suspending organizations, and implementing clean organization deletions based on regulatory requirements (GDPR, FedRAMP, etc.). The control plane needs to ensure tenant lifecycle management is reflected on the data plane as well. For example, when an organization is suspended, the control plane needs to ensure that the data plane cuts access temporarily.

There are many standard SaaS entities that a SaaS application needs - users and organizations are examples of that. At the same time, there is a lot of application-specific metadata. For example, an infrastructure product that lets users manage a set of database clusters could define metadata like ‘cluster’, ‘network,’ and ‘environment.’ The metadata needs to be defined, CRUD APIs need to be written to manage them, and their access needs to be controlled by the same security policies defined for users and the organization. The central control plane needs to be the source of truth for this metadata and support its management.

Orchestration and integration with data planes

The control plane should have near-instantaneous communication with the data plane - whether it manages a single data plane or hundreds of clusters across different regions and cloud providers. It needs to communicate and transfer data securely across the data planes and receive data back. The control plane needs to provide a single pane of glass view of all the metadata of an organization’s data plane. Pushing configuration changes, sharing application metadata, deployment, and maintenance operations are a few examples where the control plane needs to have the ability to orchestrate across the different data planes.

Lifecycle management of the data plane

One of the control plane's core needs is to manage the data plane's end-to-end lifecycle. Typically, this includes creating, updating, and deleting resources in the data plane. For Infrastructure SaaS, all these operations are asynchronous, and you want the control plane to manage the end-to-end flow. It needs to provide an excellent experience for the end-user when they invoke these lifecycle operations, ensure low latency and correctness in executing these async operations, and be able to do this management at scale for hundreds to thousands of data planes across regions and cloud providers.

Security Policies

Security policies include access controls, quotas and data governance for each organization and user. Access controls can range from simple permissions to complex RBAC support. Typically, these access controls need to apply to both the control plane and the data plane. When there are IDP integrations, the control plane may have to apply access controls based on what is defined in the IDP. Quotas are bounds on the set of operations that users and tenants can perform on the infrastructure. This is typically done to protect against denial-of-service attacks and build a healthy multitenant system. Quotas, similar to access controls, can apply to both control plane and data plane operations. Data governance is increasingly critical for larger customers. Governance includes the ability to find data easily, store data in compliance with country-specific policies and purge data based on data retention rules. For all the security policies, the control plane needs to ensure they are applied to all the data planes consistently based on tenant location, policies and user rules.

Metrics, alerting, and insights

For infrastructure SaaS, users typically execute some commands on the infrastructure and would like to know how the execution is progressing. They might want to get notified or alerted when something is not going right or if it needs their attention. For example, a database product will have users execute a set of queries and look at query metrics to understand the response times, errors, and usage. Users may also want to get notified when a query starts failing. Users also need aggregated metrics across their databases in different regions and cloud providers. The control plane needs to aggregate all the metrics across the data planes, provide insights on the metrics to the users, and alert the customers on critical anomalies.

Subscriptions and Usage-based billing

Infrastructure SaaS has fundamentally changed how billing for SaaS works. Traditional SaaS is billed based on the number of seats/users. For Infrastructure, it is typically a combination of monthly subscription plus billing based on product consumption. For example, a specific infrastructure product might provide customers with three different SKUs with increasing value. The first SKU could be a free tier with limited quota, and the rest could have a base monthly subscription fee. In addition, the users would pay based on their usage of the product that month. Consumption could include throughput, number of queries, storage, number of compute instances, etc.

The control plane must be able to compute the billing based on the user SKU, monthly rate, and usage. The usage component must be calculated based on the metadata or usage metrics aggregated from all the data planes. The metrics and insights shown to the user (explained above) should match the billing data to ensure the user has a consistent experience.
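As a rough sketch of that computation (all table names, columns, and rates here are hypothetical), if aggregated usage lands in a usage_metrics table and per-SKU prices live in a sku_rates table, a monthly amount per tenant could be derived with a query along these lines:

monthly_billing_sketch.sql
-- Hypothetical schema:
--   tenants(id, sku, base_monthly_fee)
--   usage_metrics(tenant_id, metric, quantity, recorded_at)  -- aggregated from data planes
--   sku_rates(sku, metric, unit_price)
SELECT t.id AS tenant_id,
       t.base_monthly_fee
         + COALESCE(SUM(u.quantity * r.unit_price), 0) AS amount_due
FROM tenants t
LEFT JOIN usage_metrics u
       ON u.tenant_id = t.id
      AND u.recorded_at >= date_trunc('month', now())
LEFT JOIN sku_rates r
       ON r.sku = t.sku
      AND r.metric = u.metric
GROUP BY t.id, t.base_monthly_fee;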

Back office integration

SaaS companies must integrate customer metadata with all the back-office systems. When a user signs up, the marketing team will want the user information in their campaign tool to start including the new users as part of their marketing campaigns. In a sales-led company, the sales rep must create a new production account for the customer once the deal is successfully closed in their CRM. The customer metadata must be pushed to the Data Warehouse for business insights. These examples need a reliable pipeline that integrates data between the production database and the back-office systems. For Infrastructure SaaS, the control plane is the central customer metadata store in production. It needs to provide a reliable pipeline that integrates both ways with all the back-office systems. In addition, the data needs to be available in all the different systems within an acceptable SLA agreed by all stakeholders.

Infrastructure SaaS architecture

With our understanding of the control plane and infrastructure SaaS requirements, let us delve into the architecture of a typical Infrastructure SaaS product. We will review the basic building blocks and discuss a few architectural considerations.

Controlplane-arch

The basic building blocks of a control plane

SaaS fundamentals (aka The SaaS Mesh)

Infrastructure SaaS products need a world-class SaaS experience for their customers. It includes authenticating users and user management, organization management, providing a permission model for access controls, defining different SKUs for different product offerings, and the ability to bill based on subscription or usage of the product. These are basic expectations from end-users for all Infrastructure SaaS products. A basic version of all these features listed may be good enough for a free tier offering with some quotas, and they get complex as you serve higher segments (e.g., enterprise). For example, mid to enterprise customers may need to integrate with their identity management system for authentication instead of the default offering that the product provides.

There is also a complex interconnect between the SaaS features that we call ‘SaaS Flows’. The SaaS experience of an infrastructure SaaS consists of a bunch of SaaS flows. For example, when a user signs up for the product, you may also want to create an entry in a marketing tool to send campaigns. A more complex example of a SaaS Flow could be when a credit card for a specific organization expires. On expiry, you want to notify the customer a few times to update the credit card information. If there is no response from the customer, you might want to temporarily suspend the account, disable access to the data plane and eventually reclaim the account after waiting for a sufficient amount of time to avoid incurring infrastructure cost. This SaaS Flow example connects user management, org management, billing, notifications and the data plane. We call this interconnect between the different SaaS features a ‘SaaS mesh.’ SaaS mesh is needed to build the different SaaS flows. The SaaS flows include customer-facing experience, and back-office flows for the other stakeholders in a company.

Orchestrating the data planes

One of the core responsibilities of the central control plane is to orchestrate the different data planes. Typically a single customer could have applications or clusters in multiple regions or cloud providers. With more customers and data planes, a few things have to be supported -
  • Propagating the SaaS metadata to all the data planes
  • Pushing new configurations and application versions across the fleet
  • Defining maintenance windows and sequence of deployment based on customer priorities
  • Capacity management of all the data planes to ensure infrastructure is within the cloud limits
The control plane is the source of truth across all the data planes. Managing the data plane information centrally helps provide a single pane of glass experience for the end customers to access all the information about their applications or clusters.

Data plane management

The data plane is where the actual customer application or cluster is deployed. The deployment can happen on a Kubernetes cluster or directly on cloud instances. The cluster or application could be co-located on one Kubernetes cluster or separate. There is usually an agent running on these data planes that helps execute the local life cycle operation on the data plane based on the commands from the control plane. The agent, in a sense, acts like a mini control plane co-located in the data plane. Like any architecture, there are different ways to manage the data plane. Kubernetes, Terraform, or Temporal are tools that could be used to manage the lifecycle of each data plane.

Closed feedback loop

Any control plane architecture is not complete without a closed feedback loop with the data plane. As mentioned previously, the control plane is the source of truth about the current state of customer applications or clusters. The data plane needs to report the status of the operations to the control plane. In addition, the control plane would also want to collect application metrics and metadata to show insights to the users about the infrastructure.

Other considerations

Customer account vs. Fully hosted

In the fully hosted model, the data plane is deployed in the cloud account of the service provider. Due to compliance requirements, some companies and customers demand that the infrastructure be deployed in their own cloud accounts. Some infrastructure SaaS companies need to support both models. It is possible to unify the architecture for these different deployment models, but deploying infrastructure in the customer account requires thinking through the permission model in the customer account, billing plans (the customer gets usage cost in their cloud bill), support (who gets access), and the development overhead.

Testing

Testing new data plane changes against the central control plane adds complexity. Mocking the entire control plane for end-to-end tests is not desirable since you typically have to make changes in the control plane to enable new data plane features, which need to be tested. A reasonable solution is to provide each developer their own local sandbox of the control plane with only their changes. It will help them to test their changes locally before pushing the changes to pre-production. Without a good testing strategy for the control plane, every change gets harder to stabilize as the product and teams scale.

Disaster recovery

An essential part of control plane design is to have a sound strategy when the control plane becomes entirely unavailable. From a user perspective, the data plane needs to be available even if the control plane is unavailable. In addition, there needs to be a plan to bring the control plane back up in the same region or another region (sometimes in another cloud). Restoring the data without any data loss is critical. You can provide a highly available service if you can bootstrap a control plane automatically from the backup data.

This is hard!

We plan to publish a post soon covering the complex parts of building and scaling a control plane. We have listed below a few questions that take significant time and cost to design and build for.

Controlplane-hard

  • How do you build the SaaS fundamentals for your product? How do you support the different SaaS flows?
  • How do you manage all the SaaS and application metadata to provide a single pane of glass experience for your users across all the data planes?
  • What mechanisms do you use to ensure the metadata changes are available to all the data planes?
  • How is everything designed to support multitenancy? How do you enforce metadata access, quotas, and SKUs at the tenant scope?
  • How does the architecture change when you orchestrate thousands of data planes?
  • How can you collect all the metrics and metadata from all the data planes to provide insights to the users, compute billing based on usage, and integrate with the data warehouse for business intelligence?
  • How can the control-plane scale, and what SLAs do you provide?
Look out for a blog post that will discuss the challenging problems of building a control plane in more detail.

Building Infrastructure SaaS?

In the next eight weeks, we will release a series of blog posts describing different aspects of the control plane architecture for Infrastructure SaaS. We hope this will help all the companies building, scaling, or rearchitecting their control planes to provide their infrastructure as a service in the cloud.

We would love to talk to you if this problem sounds familiar to what you are tackling! We are building a platform that will make it easy to build and scale Infrastructure SaaS products. We hope to provide a world-class platform that Infrastructure SaaS companies can leverage to develop their control plane.