
Launching an Infrastructure SaaS Product, An Example Walkthrough


An infrastructure Software-as-a-Service (infrastructure SaaS) lets users self-serve infrastructure with minimal effort, without buying or operating the underlying systems themselves. If an infrastructure SaaS offering isn’t more compelling than its competitors (other infrastructure SaaS companies offering similar, possibly easier-to-use services) or than open source projects (free, but self-managed), then people just won’t use it. So a viable infrastructure SaaS company has to do it better and deliver it faster. (After you finish reading this blog post, talk to us if this problem space sounds familiar.)

Let’s consider a hypothetical infrastructure SaaS company: an ETL-as-a-service whose most basic offering lets users deploy jobs that connect to a source system, read data from it, store it, and write it to some other destination. The end users own their end systems, and the SaaS owns the underlying pipeline infrastructure. A generalized user workflow may look like this:

  1. Provision a test job through a web UI with minimal clicks: provide end system connection information and credentials, and go!
  2. Get notifications of real-time updates on the status of the pipeline: connected to database, volume data copied, last read timestamp, etc.
  3. Provision production jobs programmatically through an API and integrate them with their own CI/CD (see the sketch after this list)
  4. View existing jobs of teammates in the same organization
  5. Monitor a pre-built dashboard of relevant job metrics that show pipeline performance
  6. Get billed based on consumption of resources
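
As a rough illustration of the programmatic path in step 3, provisioning a job through the API might look something like the following sketch; the endpoint, payload shape, and token handling here are hypothetical, not a specific product API.

// Sketch: provisioning a pipeline programmatically, e.g., from a CI/CD job.
// The endpoint and payload are illustrative only.
async function provisionPipeline(apiToken: string) {
  const response = await fetch("https://api.example-etl.com/v1/pipelines", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "orders-to-warehouse",
      source: { kind: "postgres", host: "db.internal", database: "orders" },
      destination: { kind: "s3", bucket: "analytics-landing" },
    }),
  });
  if (!response.ok) throw new Error(`provisioning failed: ${response.status}`);
  return response.json(); // contains the job id to poll or subscribe to for status
}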

Infra SaaS architecture

The architecture usually comprises a control plane that manages the user and ETL metadata and integrates with the data plane that actually executes the jobs. Developing this kind of infrastructure SaaS product involves a lot of complexity in lifecycle management and codifying best practices. (Watch Infrastructure SaaS - A control plane first architecture for a deep conversation with Ram Subramanian and Gwen Shapira on SaaS control planes.) These SaaS architectures have to solve a lot of really hard problems, including:

  • multi-tenancy and tenant isolation in the control plane
  • event handling and synchronization with the data plane
  • metrics and consumption-based billing
  • control plane access control
  • a consistent user experience for UIs and APIs

Since a new SaaS company is under strong pressure to find product-market fit and deliver something as quickly as possible, shortcuts are sometimes taken on some of these problems to avoid delaying the product launch. But these shortcuts can be 10x more expensive than addressing the problems the right way in your architecture from day 0. They eventually catch up to you: shortcuts expose even bigger security, scalability, and velocity problems that are more expensive to resolve later, and customers end up demanding proper solutions anyway. So addressing these problems early on enables you to keep moving fast. In this blog post, we will use self-serve provisioning in the ETL infrastructure SaaS example as a way to explore these problems more in depth.

Self-serve in Infra SaaS

Multi-tenant control plane

A user starts by logging into the frontend UI (or perhaps calling the API) and sending a request to self-provision a pipeline. The backend receives this request and persists it to a database. Recording user CRUD events in a database ensures there is a source of truth representing the desired state. Having a database on the backend is a basic thing that all SaaS companies do, but it’s not as simple as persisting the ETL job info into a single database and moving on. A critical security concern for cloud companies is isolating tenant data so that no tenant can see another tenant’s data. Companies don’t win customers unless they have tenant isolation to secure customer data. Tenant isolation is also important to ensure that one "noisy" tenant, who might be creating a new job every second, does not cause provisioning delays for other tenants.

Record request in a database
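
To make the idea of “desired state” concrete, here is a minimal sketch of the kind of record the backend might persist for a provisioning request; the field names are illustrative, not a specific schema.

// Sketch: the desired-state record written to the control plane database
// when a provisioning request arrives. Field names are illustrative.
interface PipelineRecord {
  id: string;
  orgId: string;                          // the tenant that owns this pipeline
  requestedBy: string;                    // user or service account that made the request
  source: { kind: string; host: string; credentialsRef: string };
  destination: { kind: string; target: string };
  desiredState: "RUNNING" | "DELETED";    // what the user asked for
  actualState: "PENDING" | "PROVISIONING" | "RUNNING" | "FAILED"; // reconciled later by the data plane
  updatedAt: string;
}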

Since all control planes have a database, achieving organization awareness and tenant isolation raises questions around database design, schema management, and multi-tenancy. Is it one monolithic database? What is the tenancy model? Is isolation soft or hard? What are the APIs between the applications and the database? And how does it scale as the company grows?

Many infrastructure SaaS companies make a business decision, for time-to-market reasons, to use hard isolation because they think it’s faster than implementing soft isolation. Although it makes some tasks like backups and migrations easier in the short term, it tends to come with higher cost, poorer scalability, and a negative impact on tenant onboarding, because new infrastructure needs to be provisioned for each tenant. It also doesn’t anticipate surprise customer demands ("hey, we just merged with another company…").

Soft isolation can be implemented at various levels of infrastructure and resources, but it often starts in the database. For example, some databases offer row-level security (RLS), a mechanism that uses a tenant identifier column in every table to scope data access. This intrinsically isolates tenants’ data from each other because data retrieval is restricted by the tenant identifier. This is a bit of a simplification, but the idea with RLS is that the application code is simplified and can make a call without validating user permissions first:

// Without RLS: code must check tenant membership
if (dao.isOrgMember(userId, orgId)) {
  dao.updateOrg(updatePayload);
}

// With RLS: make the call, RLS restricts the response to tenant data only
dao.updateOrg(updatePayload);
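
For context, here is a minimal sketch of what the database side of this could look like in PostgreSQL, one of the databases that supports RLS, run here through the node-postgres client; the jobs table, policy name, and session setting are illustrative assumptions.

import { Pool } from "pg";

const pool = new Pool();

// One-time setup (e.g., in a migration): every row carries a tenant identifier,
// and a policy restricts reads and writes to the current tenant.
async function enableTenantIsolation() {
  await pool.query("ALTER TABLE jobs ENABLE ROW LEVEL SECURITY");
  await pool.query(`
    CREATE POLICY tenant_isolation ON jobs
    USING (tenant_id = current_setting('app.current_tenant')::uuid)
  `);
}

// Per-request: set the tenant once for the transaction, then query without
// tenant filters; RLS ensures only that tenant's rows are visible.
async function listJobs(tenantId: string) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const { rows } = await client.query("SELECT * FROM jobs");
    await client.query("COMMIT");
    return rows;
  } finally {
    client.release();
  }
}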

There are some complexities in soft isolation that need to be considered. When resources are pooled, how do you do tenant data recovery ("I accidentally deleted all my data, can you please restore to 6 hours ago?") or data migrations for individual end users ("hey, can you please move all my data to eu-west-2, and by the way, there are new regulatory requirements there")? It may also be a sharded multi-tenant solution, where tenants are distributed across multiple databases to better spread the load, but that raises additional operational issues. While implementing multi-tenancy isn’t a small feat (we barely scratched the surface), from a business standpoint it’s a day 0 security requirement, and doing it well simplifies management, makes it easier to scale up, and yields cost efficiencies.

Event handling and synchronization with the data plane

The core infrastructure SaaS offering is the set of resources that encompasses the product, deployment configuration, customer data, data processing, and supporting infrastructure. This is the data plane. When a user provisions a new ETL pipeline, they get a slice of underlying shared or pooled infrastructure that is managed through an automated deployment platform like Kubernetes (or whatever the data plane platform of choice is). The data plane might, for example, spin up the right connectors to read and write data per the pipeline specifications.

After the user request is recorded in a database, some service needs to detect and fulfill each request, reconciling it with the data plane so the actual state in the data plane matches the desired state in the control plane. How do you keep track of the changes in the database? How do you process them in order and only once?

Synchronize to the data plane

This is typically done by implementing something like an Apache Kafka pipeline with a CDC connector that streams changes from the database, but there is overhead to build and maintain those solutions. Other solutions tie directly to the database: database triggers can be configured to fire whenever a data change is made to a table. Or, flipping the responsibility to the data plane, an agent can run every few minutes and query the database for new or updated rows, identified by a flag column in a table (a sketch of this polling approach follows below). Triggers may initially seem simple enough to implement, but they have their own overhead, and they couple logic to the database itself, which can eventually result in unmanageable complexity. What happens if a schema changes, or if there are multi-statement transactions across more than one table, or multiple databases that require additional cross-database and privilege management?
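
To make the polling option concrete, here is a minimal sketch of a data plane agent scanning for unreconciled rows; the jobs table, synced flag, and provisioning stub are assumptions for illustration.

import { Pool } from "pg";

const pool = new Pool();

// Stub: create or update data plane resources per the pipeline spec.
async function applyToDataPlane(spec: unknown) {
  /* spin up connectors, etc. */
}

// Sketch of the flag-column polling approach: find rows that have not been
// reconciled yet, apply them to the data plane, then mark them as synced.
async function pollOnce() {
  const { rows } = await pool.query(
    "SELECT id, spec FROM jobs WHERE synced = false ORDER BY updated_at"
  );
  for (const job of rows) {
    await applyToDataPlane(job.spec);
    await pool.query("UPDATE jobs SET synced = true WHERE id = $1", [job.id]);
  }
}

// Run every few minutes, as described above.
setInterval(() => pollOnce().catch(console.error), 5 * 60 * 1000);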

Actually, an events pipeline or database trigger is an implementation detail: what the developer wants at the end of the day is to focus on the events themselves, not the pipeline. So if the implementation of an event service can be abstracted away behind a robust API, then the developer can just call a method to listen for events as they happen. They can spend their time thinking about event processing instead of pipeline details like brokers, connectors, topic design, partition counts, choosing keys, etc. (Note that even with an event service that abstracts away the events pipeline, the data plane still has to manage its own infrastructure: controllers, cluster resourcing, scheduling, health checks, load balancing, etc.) An event service baked directly into the data platform also serves as an audit log that captures the history of changes made to the service, who initiated those changes, and so on. So when a user requests a new ETL job, that request generates an event, and an application listening for events receives it and can take the appropriate action, synchronizing the data plane to the desired state:

events.on({ type: entityType }, async (e) => {
  // received an event…
  if (e.after.deleted) {
    // ...destroy the resource & update status
  } else {
    // ...create the resource & update status
  }
});

The event service also needs to be resilient to all types of failure, because failures aren’t “if” scenarios, they are “when” scenarios. Whether the failure is in the data plane itself, the orchestration layer, the cloud provider, or the event service, the desired state needs to be persisted so that whenever a system recovers, it can still act on the user’s requests. An event service also helps in these cases because a consumer can pick up listening to events where it left off.
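
One way to get that behavior is for the consumer to persist a cursor for the last event it fully processed so that, after a failure, it resumes exactly where it left off. The event stream and cursor store interfaces below are hypothetical, not a specific event service API.

// Sketch: resumable event consumption. Interfaces are illustrative.
interface ChangeEvent { id: string; payload: unknown }

interface CursorStore {
  load(consumer: string): Promise<string | null>;
  save(consumer: string, cursor: string): Promise<void>;
}

async function runReconciler(
  listen: (opts: { after: string | null }) => AsyncIterable<ChangeEvent>,
  cursors: CursorStore,
  reconcile: (e: ChangeEvent) => Promise<void>
) {
  // Start from wherever the previous run stopped (or from the beginning).
  const start = await cursors.load("reconciler");
  for await (const e of listen({ after: start })) {
    await reconcile(e);                       // apply desired state to the data plane
    await cursors.save("reconciler", e.id);   // commit progress only after success
  }
}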

Eventually, the resource is provisioned and the data plane updates the source of truth with its latest status: the platform records the status back into the database, and the service then sends a notification to the end user about the pipeline status.

This workflow demonstrates how events can be used to synchronize the data plane with the control plane. But an event record can potentially contain a lot of detailed information about different resources and state changes. Coupled with filtering so consumers can subscribe at various levels (per tenant, entity type, or specific instance), this makes an event service flexible for many scenarios. It could be used to process updates to any type of entity, such as acting on authentication token changes and invalidations, or to power a real-time messaging application that distributes messages to the appropriate users or channels. An event service that aggregates data plane and control plane events and provides a great interface for delivering them is useful for deployment, notification, and troubleshooting across any aspect of the infrastructure SaaS. For example, Datadog Events provides this kind of rich experience for an event service, both through a browser and programmatically through the Datadog Events API.

Datadog Events API

Metrics and consumption-based billing

Infrastructure SaaS companies have various billing models, often reflecting whether a user has dedicated or shared resources, but ideally, in multi-tenant deployments with soft isolation, they charge based on “pay-as-you-go” cloud resource consumption. Whatever the billing model, a viable cloud business needs to be entirely transparent and show users all the costs and the metrics from the underlying entities in the data plane. The metrics can be anything: compute time, API calls, latency, ingress/egress throughput, workload capacity, features enabled, etc. The business should be able to provide usage-based billing, answer every user’s ultimate question, “What did I get charged for?”, and provide a flexible billing system with dashboards, bill breakdowns, alerts for reaching quota, and so on.

Metrics and usage-based billing
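
At its core, usage-based billing comes down to aggregating consumption metrics per tenant and pricing them. Here is a minimal sketch; the metric names and rates are made up for illustration.

// Sketch: turn raw consumption metrics into a per-tenant, line-item bill.
// Metric names and rates are illustrative only.
type UsageRecord = { tenantId: string; metric: string; quantity: number };

const ratePerUnit: Record<string, number> = {
  compute_seconds: 0.00002, // $ per second of compute
  egress_gb: 0.09,          // $ per GB of egress
  api_calls: 0.000001,      // $ per API call
};

function buildInvoice(tenantId: string, usage: UsageRecord[]) {
  const lineItems = usage
    .filter((u) => u.tenantId === tenantId)
    .map((u) => ({
      metric: u.metric,
      quantity: u.quantity,
      amount: u.quantity * (ratePerUnit[u.metric] ?? 0),
    }));
  const total = lineItems.reduce((sum, item) => sum + item.amount, 0);
  // The line items are what let you answer "What did I get charged for?"
  return { tenantId, lineItems, total };
}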

In addition to using metrics for usage-based billing, the business uses metrics and KPIs to tell how healthy the company is, show customer activity, churn rate, and annual growth rate, and project future growth (see The 8 KPIs That Actually Matter—And How To Measure Them). Even more, your CEO may want to experiment with pricing models, tiering structures, or tenant customizations to see which approach maximizes subscriber growth, recurring revenue, or profits. In fact, this is precisely the type of experimentation that enables any SaaS to move quickly, adapt, and expand its target user base.

KPIs

(source: https://databox.com/dashboard-software/business)

Infrastructure SaaS companies report user-consumption metrics in frontend dashboards built directly into the web application. They also integrate operational and business-relevant metrics into industry-standard tools like Grafana or Prometheus. To serve up those metrics, some companies build a telemetry architecture resembling the one described in Scaling Apache Druid for Real-Time Cloud Analytics at Confluent: the data plane emits metrics, which are fed into an event messaging system, sent to a database like Apache Druid that is optimized for time-series data, exposed through an API, and then consumed by downstream applications.

Metrics and billing APIs
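
For a sense of what flows through such a telemetry pipeline, here is a sketch of the kind of metric event a data plane task might emit; the field and metric names are illustrative.

// Sketch: a metric event emitted by the data plane into the telemetry pipeline.
// Field names are illustrative.
interface MetricEvent {
  tenantId: string;       // ties the measurement back to a tenant for billing and isolation
  resourceId: string;     // e.g., the ETL job that produced the metric
  metric: "compute_seconds" | "egress_bytes" | "records_written";
  value: number;
  timestamp: string;      // event time, used by the time-series store
}

// In the architecture described above, this would publish to the event
// messaging system; here it is just a stub.
function emitMetric(event: MetricEvent) {
  console.log(JSON.stringify(event));
}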

Developers shouldn’t have to build a metrics pipeline from scratch or support all these extra components. But don’t just pull the metrics from the business data warehouse, from Prometheus, or from any other metrics collection tool! Those systems don’t enforce the tenant isolation and access controls needed in your product, and they often aren’t maintained or optimized like a production database. A data platform that already handles the control plane metadata and has built-in access control and multi-tenancy can serve these metrics to end users and internal operations alike. If the backend has an endpoint that easily serves up metrics (a sketch follows below), a developer can focus on writing the business logic for processing the metrics and sharing usage-based billing with customers. So making metrics a first-class, built-in capability of any infrastructure SaaS allows the business to launch its product more quickly.
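
Here is a sketch of what such an endpoint could look like, assuming Express and an in-memory store standing in for the data platform; tenant resolution would come from the auth layer in a real system.

import express from "express";

// Illustration-only store; in practice this is the data platform that already
// enforces multi-tenancy and access control.
type Point = { ts: string; value: number };
const metricsByTenant = new Map<string, Record<string, Point[]>>();

const app = express();

app.get("/v1/metrics/:metric", (req, res) => {
  const tenantId = req.header("x-tenant-id"); // would be derived from authentication in practice
  if (!tenantId) return res.status(401).end();

  // Only this tenant's series is ever returned.
  const points = metricsByTenant.get(tenantId)?.[req.params.metric] ?? [];
  res.json(points);
});

app.listen(8080);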

Control plane access control

Authorization within an organization allows a group of users to belong to the same organization while having access only to the subset of resources they need. Developers need to create access policies that follow the principles of least privilege and zero trust security. To achieve that, the policies can get quite granular, configurable on a variety of attributes (or "signals," as Netflix calls them; see Authorization at Netflix Scale) like entity properties, user location, user role, suspicious or fraudulent activity, etc.

When a user or service account provisions an ETL job, the request is sent to the backend to validate that they have the appropriate permissions to create/view/edit the requested resource. Actually, permission validation should happen before the option is even presented to the user in the app: don’t let a user try to create a resource if the request is going to fail, because that is inefficient and just bad UX.

Control plane access control
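
For example, the frontend can ask the backend up front whether an action would be permitted and only render it if so. The permission-check endpoint and permission name below are hypothetical.

// Sketch: only show the "create job" action if the backend says it would succeed.
// The permission-check endpoint and permission name are hypothetical.
async function canCreateJob(orgId: string): Promise<boolean> {
  const res = await fetch(`/api/orgs/${orgId}/permissions/check?action=job:create`);
  if (!res.ok) return false;
  const { allowed } = await res.json();
  return allowed;
}

// In the UI, gate rendering on the result instead of letting the request fail later.
async function renderCreateButton(orgId: string): Promise<string> {
  return (await canCreateJob(orgId)) ? `<button>Create ETL job</button>` : "";
}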

Sometimes access control is defined at the application layer by adding an authorization middleware. But there is development work here, and each downstream microservice application may choose its own authorization tools. Even in the Netflix architecture circa 2018, where every application uses the same authorization tooling, that logic is duplicated across services! This complexity scales with the number of applications: any time you add a new application or microservice, there is additional cost to add authorization. And since developing secure middleware is sometimes less interesting than working on the product itself, it gets postponed "to the next sprint," which leads to security vulnerabilities.

A cleaner architecture would be to put access control directly into the data layer. Because data is shared across different applications and multiple tenants, applying access policies on the data itself ensures that policies are applied consistently to all applications. Access control at the data layer also abstracts it away from application implementation, so in theory an application can evolve with new business requirements without changing the security model.

The following code demonstrates one way to apply an access policy to the data itself, by granting a specific user access to entities in the development environment; the policy is checked whenever any application tries to retrieve data.

req = CreatePolicyRequest(
    actions=[Action.ALLOW],
    resource=Resource(
        type=entity.name,
        properties={'environment': 'dev'},
    ),
    subject=Subject(email=user.email),
)
policy = create_policy.sync(
    ...
    workspace=workspace,
    org=org.id,
    json_body=req,
)

In practice, there is a bit of complexity in configuring access control on the data, because it requires both a database that supports rich access policies and a team of DBAs with strong skills to build and maintain them. As the database schemas and services grow, it can get harder to support and troubleshoot. But if it’s designed into the service from the beginning rather than bolted on later, it can be a real differentiator in your infrastructure SaaS.

Consistent user experience for UIs and APIs

The discussion so far has focused on the backend workflow, but end users experience the product through workflows like self-serve signup, login, creating a space, the proverbial "3-click" provisioning, programmatic management of resources, monitoring usage, billing, etc. All these workflows need to interact with the backend, and if new features or changes are rolled out in the backend, then the UI and APIs must be updated too. So to move fast, you really need a robust interface to the backend so users have a consistently stellar experience through any of these paths.

A really good UI starts with a stylized look-and-feel to build a brand, but a common problem to figure out is how the web application interacts with the backend. Reusable web components and micro-frontend architectures accelerate building the web application, but working with the backend brings more complex dependencies, and as the backend changes, so must the frontend. In all likelihood, there’s already some kind of API to the backend, but it also has to be robust enough to build a web application on top of and to let customers automate against. A simple task might be creating and managing a new pipeline with a connection to a new database. From the web application, it might look like this:

Web form with components

A developer could custom-build a frontend to capture the name and other properties with text fields, get the form values, format them the way the API expects, and then send them to the backend. But prebuilt components and hooks for common user workflows are specifically designed to handle both UX and API interaction, abstracting away the API calls. For example, instead of coding up a new organization form from scratch, a developer can drop in an OrganizationForm component (refer to this PR to see how this was handled in a Nile example) that automatically handles interacting with the backend:

components/CreateOrg/index.tsx
<OrganizationForm
  onSuccess={(data) => {
    router.push(
      paths.entities({ org: data.id, entity: entity.name }).index
    );
  }}
/>

Easy-to-deploy, pre-built, and fully customizable web components and simple filters also reduce the development work needed to serve up events, report metrics, and configure access policies in the frontend. So a great backend API paired with customizable web components really helps frontend developers deliver the slick web application that differentiates the product offering. A public API can also provide programmatic access to the backend for any custom application. Especially for infrastructure use cases, end users expect to interact only through APIs so that they can automate their own deployments and integrate with their CI/CD. Providing these robust APIs helps differentiate the product from its competitors and from open source alternatives.

Summary

Walking through the infrastructure SaaS workflow of provisioning an ETL pipeline highlighted some complex problems that need to be solved; namely, how to:

  • provide a database as a source of truth with built-in multi-tenancy
  • give developers an event service to reconcile with the data plane
  • serve up metrics for consumption-based billing, experimentation, and other business operations
  • authorize users with a flexible access control model
  • provide great UIs and APIs, along with a slick frontend built from web components customized to the backend

This set of problems is common to all infrastructure SaaS companies, and they get solved over and over again by each new company. Nile addresses these complex problems out of the box by providing a tenant-aware, serverless database for building control planes just like the one discussed in this blog post, which enables companies to iterate quickly and deliver their product to market as fast as possible.

Nile control plane

Launching an infrastructure SaaS product should be easier than it is today, with the infrastructure SaaS lifecycle management codified. Companies should be able to focus on their business logic and let someone else handle the complexities. If you’re interested in building a SaaS on Nile, talk to us to learn more and run our GitHub examples to see it in action.