Name: Starforge: Simplifying Multi-Cloud Kubernetes Deployment for StarTree Pinot
Uploaded: 2025-05-14T23:58:05.737Z
Duration: 27 min 4 s
Description: Starforge: Simplifying Multi-Cloud Kubernetes Deployment for StarTree Pinot

Transcript for "Starforge: Simplifying Multi-Cloud Kubernetes Deployment for StarTree Pinot": Hi, everyone. I'm here today to present StarForge, StarTree's next-gen cloud architecture for managing StarTree Pinot deployments for our customers. I am Chris Kellogg. I'm a software engineer at StarTree working on cloud and platform services, and I'll be presenting with one of my colleagues, who I'll let introduce himself.
Hello. I'm Krishna. I've been with StarTree for about three years. I work alongside Chris on the cloud team, and we developed StarForge along with a few of our other colleagues.
So I'll give an overview of what we're going to present today. I'll give a brief overview of StarTree and what we work on, and then I'll go into our next-generation cloud architecture. Then we'll talk about some of the challenges that we had before and some of the challenges we've had working on this new architecture. And at the end, we'll walk through our solutions and how we solved some of those challenges. So I'll start with a quick overview of StarTree and what we do. StarTree is a company that works on an open source Apache project, Apache Pinot. It's a real-time analytics database that came out of LinkedIn. What we do is offer managed services of Apache Pinot for our customers in various deployment models. So, on to the next slide. We offer this managed service in a variety of deployment models. One is BYOC, which stands for bring your own cloud. Another is dedicated SaaS, which is similar to BYOC. Another one is multitenant SaaS. And then we have another one called BYOK, which is bring your own Kubernetes and run Apache Pinot.
We also develop and work on all three of the major clouds: AWS, Azure, and GCP. So that's a brief overview. Now I want to get into this next-generation architecture, why we did it, and what it's about. When we set out to design this next-generation architecture, we had a few requirements. One is we wanted to build a single stack that could handle multiple different product offerings. As we work with our customers and provide them the services they need, we've discovered we need to package things in different ways. That's either a SaaS offering, a multitenant offering, a BYOC offering, or a BYOK offering. We didn't want to build individual solutions for all of those, so we tried to come up with a single-stack architecture that can support all of these product offerings. That way, if something new comes in tomorrow, or a different requirement, we don't have to develop a lot of code or change things to support a new deployment model or offering for our customers. We also wanted to provide fast iteration, so as new requirements come in and things change, we're able to add and remove different features that the customer needs. It also needed to be highly scalable, to support tens, hundreds, or thousands of clusters across all the different clouds. It also needed to be decoupled, so that we're able to modify or change things without affecting other pieces of the stack. And security was also very important.
One of the things we discovered is that there are customers who are very security conscious, and we need to limit access to their clusters. We also need to be able to limit outbound access and control who and what is accessing the cluster. We'll explain in the design how we handle that. So I want to start with a high-level view of our new architecture. You can see in this diagram there are, I would say, four major components, and we'll drill down into some of those areas as we go along. At the top, we have what we call the cloud portal. This is where people come to interact and create their different types of environments, whether it's SaaS or BYOC. It's how they configure where they want to deploy, what type of cloud, and how they want to scale or configure Pinot. That interacts with some of the underlying services that you see. We have support services that do configuration management, services that do upgrade management, and secret management. Then we have the control plane, which is the heart of the architecture, the brain of everything. It offers APIs, and it has a controller and a task management framework, which we'll dig into later on. Then we have the infra-as-a-service components that provision resources in the different cloud environments. And then we have the data plane agent framework, which runs in the customer's account or in our account, outside of our control plane, and actually manages and deploys the Pinot deployments and software that the customers run and use.
In the cloud portal, someone comes and configures a Pinot environment. Underneath, that gets pieced together into individual entities that the control plane manages. The cloud portal works with the config service to generate these entities, and then sends them to the control plane. Once they're in the control plane, that's where all of the work and processing starts to happen as we provision and bring up these environments for our customers. So if I now dive into the control plane: I talked about the entities. We have this concept of entities within this Cloud v2 design. An entity is essentially a model that abstracts software and infrastructure. With these entities, we're able to model relationships so that things can get stitched together, and we essentially form a DAG of these stitched-together entities. Once they're stitched together and sent to the control plane, we have a controller framework within the control plane that monitors and watches these entities and takes them through a state machine process. From that process, it decides when it needs to generate tasks to be worked on. Then we have a task management framework that processes and manages the life cycle of the different tasks. I'll dive into that in a couple of slides. We also have secret management: we have a way to securely store credentials, and a pipeline to securely ship them down into the data plane for use. And we also have access control, so there are limited ways to generate access for people to reach the cluster. So now I'm going to dive into the modeling of the StarTree entities.
At a high level, people come to the control plane. At the control plane level, or the product level, you think in accounts, organizations, deployments, and projects: I want these pieces of software to run. This diagram shows how we think about configuring this at the product level. The portal deals with these product-level pieces, and at the end of the day it maps these concepts into logical entities that the control plane understands and can work on. So this is an example of mapping an environment in the cloud portal into a DAG. As you can see here, we have different entities that we need to create for a customer's environment. If I walk you through it, it needs an AWS account, then we need a VPC, and then we need a Kubernetes cluster. Then we have this notion of cells, where we deploy packages of software. Within the cell, we have Pinot, data portal, and some authorization, and then we also have storage. So the portal knows it's creating this BYOC environment. It creates these individual entities, going through the config service, creates these relationships, and then sends it all to the control plane. Within the control plane, a controller works on the entities and generates a task for an entity once its dependencies have been completed, because within the framework there are outputs of the parent that are needed by the child. The control plane and the task framework handle all of this. So now I'll dive into some of the key aspects of this task framework.
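As a rough illustration of the entity DAG just described, here is a minimal Python sketch. It is not StarForge code: the `Entity` class, the state names, the helper functions, and the exact parent edges (for example, where storage hangs in the graph) are assumptions made for the example. It shows a controller-style loop that only releases a task for an entity once all of its parents are done, and hands the parents' outputs to the child.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical entity model: a node in the deployment DAG."""
    name: str
    kind: str
    parents: list[str] = field(default_factory=list)
    state: str = "PENDING"  # assumed states: PENDING -> DONE
    outputs: dict = field(default_factory=dict)


def build_byoc_environment() -> dict[str, Entity]:
    """Stitch entities into a DAG; the edges are illustrative assumptions."""
    entities = [
        Entity("aws-account", "account"),
        Entity("vpc", "network", parents=["aws-account"]),
        Entity("k8s", "kubernetes", parents=["vpc"]),
        Entity("storage", "storage", parents=["aws-account"]),
        Entity("cell", "cell", parents=["k8s", "storage"]),
        Entity("pinot", "software", parents=["cell"]),
        Entity("data-portal", "software", parents=["cell"]),
        Entity("auth", "software", parents=["cell"]),
    ]
    return {e.name: e for e in entities}


def ready_tasks(entities: dict[str, Entity]) -> list[tuple[str, dict]]:
    """Return (entity name, parent outputs) for every pending entity
    whose parents have all completed."""
    tasks = []
    for entity in entities.values():
        if entity.state != "PENDING":
            continue
        if all(entities[p].state == "DONE" for p in entity.parents):
            inputs = {p: entities[p].outputs for p in entity.parents}
            tasks.append((entity.name, inputs))
    return tasks


def complete(entities: dict[str, Entity], name: str, outputs: dict) -> None:
    """Record that a task finished and store its outputs for children."""
    entities[name].state = "DONE"
    entities[name].outputs = outputs


def run_to_completion(entities: dict[str, Entity]) -> list[list[str]]:
    """Simulate the controller: release tasks wave by wave until none remain."""
    waves = []
    while True:
        tasks = ready_tasks(entities)
        if not tasks:
            break
        wave = []
        for name, _inputs in tasks:
            # A real worker would provision infrastructure here.
            complete(entities, name, {"id": f"{name}-id"})
            wave.append(name)
        waves.append(wave)
    return waves
```

Calling `run_to_completion(build_byoc_environment())` releases the account first, then the entities that only depend on it, and the software inside the cell last, which is the parent-before-child ordering the talk describes.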
As I mentioned, we have these individual entities that can be stitched together into different deployment models, and they are sent to the control plane. Once they're sent to the control plane, we have this task framework that actually does the work on these entities. Within the control plane, we have different services that work on different entities. As an example, we'll have a network service, a storage service, and a Kubernetes service. These services poll the control plane every so often, asking whether there are new tasks to work on. Tasks are generated by the controller working inside the control plane. Once the services have gotten these tasks, they go out and create the infrastructure in the cloud, and then they send back status information. Once those tasks have been completed, the controller recognizes that and is able to schedule the child jobs, the dependent jobs, now that the parents have been completed. So we built this generic task framework and this generic entity model, where we can stitch entities together, or leave things out, for different deployment models. As customer needs change, or as we need to experiment or do a different deployment model, we can really just make configuration changes and stitch these entities together, as opposed to having to make significant code changes. So now I'm going to pass it off to Krishna, and he's going to walk you through a little more of the architecture and then talk about our challenges and solutions.
Thank you, Chris, for walking us through the control plane and the task framework. Now let's dive deeper into the data plane. The data plane is composed of multiple interconnected components, each handling a specific responsibility to ensure that the platform runs smoothly, securely, and reliably.
Now we'll walk through a few of the components and features that we currently have inside the data plane. First off, we'll start with the Uber agent, which acts as a central coordination point within the data plane. Its job is to retrieve the latest cluster specifications from the control plane and distribute them to all the cell agents. You can think of it as a command relay, which ensures that every part of the system knows what it needs to do and stays aligned with the desired state as defined in the control plane. Another feature of the Uber agent is the status pipeline. While the Uber agent pushes instructions downstream, the status pipeline works in the other direction. It pulls health metrics and status updates from the components and channels them back to the control plane through the Uber agent. This continuous feedback loop gives us real-time visibility into component health and performance, helping us proactively address issues before they escalate. Next is the cell agent. Once the Uber agent distributes the specifications, the cell agent takes over locally. The cell agent is responsible for actually executing those instructions: creating, updating, and managing components as per the cluster specifications it received from the Uber agent. Each cell agent controls a few namespaces, and all of the components running in those namespaces are managed by it. It ensures that the desired state, which is set in the control plane, is realized and running in the cluster. This means that all of the components are fully decoupled, and we can upgrade them individually without any dependency concerns. Along with the cell agent, there are also a bunch of StarTree suite apps and pods running inside the cluster.
These include data manager, apps manager, and ThirdEye, along with the Pinot components and the Pinot operator. These services provide the core platform functionality, from maintaining data to orchestrating applications and enforcing path- and role-based access control. The Pinot operator, which runs inside the cluster, takes care of all operations on Pinot components, like scaling them up and down based on the workload and the configuration provided. Together, these StarTree apps act as a backbone for our platform services, ensuring security, compliance, and operational efficiency. In the previous slide, we spoke about the status pipeline, but we also have monitoring components inside the cluster that complement the status pipeline. These tools collect metrics from the entire cluster, giving both the internal teams and the customer a holistic view of system health and performance. These metrics improve operational transparency and help with optimizing performance and planning for future growth. No platform is complete without global security controls. This includes how secrets and TLS certs are handled across the platform. By integrating these controls directly into the data plane, we ensure that data in transit and at rest is secured, and all sensitive credentials are protected at all times. Next, we have configuration management. This feature uses template-based deployment specifications that flow through the agent hierarchy. This approach not only streamlines deployments but also ensures consistency across environments, making it easier to manage large-scale clusters with confidence. Finally, we come to the operational workflows, which are structured processes that guide how upgrades happen and how components are added into the cluster.
By having all of these well-defined workflows, we ensure that any changes to the system happen in a controlled and repeatable way. To wrap it up, the data plane is tightly coordinated, where each component plays a distinct yet interconnected role, from central coordination and real-time monitoring to robust security. And, of course, these are just highlights, and each of these components has its own deck. Before we go any further, let's take a moment to understand some of the challenges we face in managing these environments. The first major challenge is around operational changes. In any large-scale system, we are not dealing with a handful of components. We are talking about upgrading tens or even thousands of customer clusters at a time. Whether it's rolling out new configurations, updating software versions, or making adjustments to the platform services, the scale adds complexity. Coordinating and applying changes to that many customer clusters while maintaining consistency and minimizing downtime is a nontrivial task. Next, we have challenges with configuration management. Given the scale and diversity of the system, we need an easy and reliable way to generate entity payloads, the configuration data that actually defines how each component should behave. But it's not just about generating the configurations quickly; it's also about making sure we process them and maintain customizations over time. Different environments and customers have different use cases, and we need a system that preserves these customizations even as we push broad updates. Without this, we'd risk overriding local changes, leading to potential outages and misconfigurations. So how do we plan on solving this? We introduced a new service called the config service, which is at the center of our solution. The config service is a purpose-built service to simplify the management of StarTree entities at scale.
First, it uses templates to define and manage entities. These templates give us a structured and repeatable way to generate configurations. So instead of manually creating thousands of individual configurations, we simply define a template and reuse it. What's powerful here is that the config service doesn't stop at templating. It also allows the UI to dynamically render inputs for each entity. This means users aren't hand-coding any JSON. They get a user-friendly interface that adapts based on the template, making it much easier to manage configurations without needing deep technical knowledge. Another critical feature is persisting customizations. Do you remember the challenge we spoke of in the previous slide, where updates could potentially wipe local changes? The config service ensures that customizations made at the entity level are retained even during upgrades or bulk operations, so teams can safely upgrade components without losing any tweaks or overrides. Once the templates are created and inputs are captured, the config service renders these templates into JSON payloads for the entities. It also stores details of template usage, giving us full visibility into which configuration was applied across the system or in a specific cluster. This in turn enables bulk updates: whether we need to push new versions of a component or roll out a broad policy change, the config service allows us to do this efficiently and reliably. Coming to batch upgrades. We've now covered how we manage configurations and ensure flexibility at scale. You know, they say you don't really understand scale until you try to push an update to a few thousand clusters at once and hope they all listen. This is where our batch upgrade framework comes into play. We have designed it to be both flexible and safe, giving us full control over how upgrades are rolled out across the clusters.
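Going back to the config service for a moment: the template rendering, persisted customizations, and usage tracking described above can be sketched roughly as follows. This is a minimal illustration only. The `ConfigService` class, its method names, the template contents, and the field names (`servers`, `heap_gb`) are all hypothetical and are not taken from StarForge. The idea shown is that overrides are stored per entity, separately from the template, and re-applied on every render, so moving an entity to a new template version does not wipe them.

```python
import json
from string import Template

# Hypothetical versioned templates that render to JSON payloads.
PINOT_TEMPLATES = {
    "v1": Template(
        '{"component": "pinot", "version": "1.0.0", '
        '"servers": $servers, "heap_gb": 8}'
    ),
    "v2": Template(
        '{"component": "pinot", "version": "1.1.0", '
        '"servers": $servers, "heap_gb": 16}'
    ),
}


class ConfigService:
    """Illustrative stand-in for a template-driven config service."""

    def __init__(self, templates):
        self.templates = templates
        # entity_id -> template version, captured inputs, persisted overrides
        self.records = {}

    def create(self, entity_id, template_version, inputs):
        self.records[entity_id] = {
            "template": template_version,
            "inputs": dict(inputs),
            "overrides": {},
        }

    def customize(self, entity_id, overrides):
        """Persist entity-level customizations separately from the template."""
        self.records[entity_id]["overrides"].update(overrides)

    def render(self, entity_id):
        """Render the template to a payload, then re-apply the overrides."""
        record = self.records[entity_id]
        template = self.templates[record["template"]]
        payload = json.loads(template.substitute(record["inputs"]))
        payload.update(record["overrides"])
        return payload

    def bulk_upgrade(self, from_version, to_version):
        """Move every entity on one template version to another.
        Template usage is tracked, so we know which entities to touch."""
        changed = []
        for entity_id, record in self.records.items():
            if record["template"] == from_version:
                record["template"] = to_version
                changed.append(entity_id)
        return changed
```

In this sketch, an entity customized with `customize("cust-a", {"heap_gb": 32})` keeps `heap_gb` at 32 after `bulk_upgrade("v1", "v2")`, while an uncustomized entity picks up the new template default.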
First, the upgrades are entirely configurable. This means that the workflows themselves are defined in configuration and not hardcoded, giving us the flexibility to adjust and tune the process as needed. We can upgrade the entire fleet to a target version in a controlled way, whether that's rolling out a critical patch or moving to a new major release. Safety is built in at every stage. We run preflight checks before any upgrade begins, making sure that the environment is ready and all of its dependencies are in place. Once the upgrade completes, we perform post-upgrade validations to confirm that the system is in a healthy state and the upgrade was successful. To add an extra layer of caution, we have a preview mode. This allows us to simulate the upgrade plan without making any changes, so we can verify the actions, anticipate any issues, and plan safely. When it comes to execution, we have flexible options to suit different environments. We can adjust the batch size and also control the level of parallelism, meaning how many environments are upgraded in each batch and how many are upgraded in parallel at any point in time. Error handling is also pretty robust. We support a fail-fast mode, so that at the first sign of trouble we just stop, or we can continue in a best-effort mode to upgrade the unaffected entities. To avoid wasting time and resources, we have also built in smart-skip capabilities, where the system identifies and skips entities that are already at the desired version and state, preventing redundant work. Of course, tooling and monitoring are critical in these operations. Throughout the upgrade, we have real-time status updates, detailed logs, and live metrics to monitor progress and quickly identify any issues.
This level of observability means operators have full confidence in the process, and they can track any in-flight upgrades and intervene if needed. So overall, the batch upgrade framework gives us the flexibility to handle large-scale changes and do it safely, without risking system stability. Thanks a lot for attending this talk. We are more than happy to answer any questions, or please join our Slack channel if you want to find out more about StarForge and all of its capabilities.
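The batch upgrade behavior described in the talk (smart skip, preview mode, preflight and post-upgrade checks, batch size, parallelism, and fail-fast versus best-effort) could look something like the Python sketch below. It is only an illustration under stated assumptions: `UpgradeConfig`, `batch_upgrade`, the status strings, and the callback signatures are hypothetical names invented for the example, not the StarForge API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class UpgradeConfig:
    """Hypothetical upgrade settings, defined as configuration."""
    target_version: str
    batch_size: int = 2
    parallelism: int = 2
    fail_fast: bool = True
    preview: bool = False


def batch_upgrade(clusters, cfg, upgrade_one,
                  preflight=lambda name: True,
                  validate=lambda name: True):
    """Upgrade clusters (name -> current version) toward cfg.target_version.
    Returns a dict of name -> status string."""
    results = {}
    pending = []
    for name, version in clusters.items():
        if version == cfg.target_version:
            results[name] = "skipped"  # smart skip: already at target
        else:
            pending.append(name)

    if cfg.preview:
        # Preview mode: report the plan without changing anything.
        for name in pending:
            results[name] = "would-upgrade"
        return results

    def run(name):
        try:
            if not preflight(name):
                return "preflight-failed"
            upgrade_one(name, cfg.target_version)
            if not validate(name):
                return "validation-failed"
            clusters[name] = cfg.target_version
            return "upgraded"
        except Exception as exc:
            return f"failed: {exc}"

    for start in range(0, len(pending), cfg.batch_size):
        batch = pending[start:start + cfg.batch_size]
        # At most cfg.parallelism clusters of this batch run at once.
        with ThreadPoolExecutor(max_workers=cfg.parallelism) as pool:
            statuses = list(pool.map(run, batch))
        results.update(zip(batch, statuses))
        if cfg.fail_fast and any(s != "upgraded" for s in statuses):
            # Fail fast: stop before starting any further batch.
            for name in pending[start + cfg.batch_size:]:
                results[name] = "not-attempted"
            break
    return results
```

With `fail_fast=False` the loop keeps going after a failed batch, which corresponds to the best-effort mode: clusters that fail are reported, and the unaffected ones are still upgraded.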