Multi-tenant Kubernetes reaches a maturity milestone
The best part of KubeCon 2021 was undoubtedly being in person with colleagues again and experiencing a camaraderie we’d lost over the course of the pandemic. While we’ve found myriad ways to cope virtually, it doesn’t compare to being in the same room (or expo hall). Huge kudos to CNCF for putting on an event with such extensive and thorough COVID protocols — I never once felt unsafe or uncomfortable. While the sponsors’ floor was massively spread out and the crowds were lighter than expected, that didn’t result in a lack of productive conversations.
We talked to developers and operators from organizations of all sizes about their incident management posture and what sorts of processes they wished they could automate. We asked, “how often is a runbook buried in the depths of Confluence and proven to be outdated once it’s finally found?” The answers didn’t surprise us — as companies grow and processes balloon across n systems, teams spend an ever-increasing amount of time managing the sprawl. We had so much fun showing folks how we could solve that for them.
Exhibitors were the usual crowd: the big names with enterprise-grade k8s solutions; the mid-tier folks carving out a niche in delivery, security, or operations within the k8s world; the start-ups charting new courses or pitching a better way in a known space; and the generalists, like us, whose product isn’t k8s-specific but has a lot of appeal for both devs and ops managing containerized workloads. More on that in a moment.
This year I decided to focus on a single theme for the sessions I attended because I was also going to spend a fair share of my time giving demos at our booth and wanted to go deep on at least one topic. I chose multi-tenancy because it’s something I worked on in a previous life and a challenge every organization running workloads on k8s at scale will eventually face. Multi-tenancy asks: how do you best accommodate many teams, each with multiple services, in a large Kubernetes environment?
There are three fundamental approaches:
Namespaces, which let many tenants share a cluster while preventing collisions among them by confining each tenant’s resources to its own logical boundary, the namespace (see the sketch after this list)
Virtual clusters, which are nested clusters with their own control planes within the physical cluster, and finally,
Fully separate clusters
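To make the namespace model concrete, here’s a minimal sketch of the boundary it gives you: a namespace for a hypothetical “payments” team plus a ResourceQuota that caps what that tenant can consume inside it (the names and numbers are illustrative, not from any real cluster).

```yaml
# Hypothetical tenant namespace for an example "payments" team.
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
---
# Cap what this tenant can consume inside its boundary.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
```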
Each of these has its benefits and drawbacks across a range of factors like security, networking, cost, and resource consumption. Not all of them are possible or fully featured in commercially managed versions of k8s, like EKS, so you need to understand where your platform will be running and which factors are most important for your business.
Start with whether you’ll be managing bare metal Kubernetes or if you can use one of the major cloud providers. AWS and GCP offer fully managed cluster implementations that free up a ton of operational energy, but the tradeoff is you’re bound by their feature set and their implementation. Rolling your own gives you the flexibility to be as cutting edge as you’d like, but make sure you can support it. Then think about your business requirements — do you have strict security and access control needs? Do services need to be rigidly separated on your network? Is cost a primary consideration, or do you have flexibility there? Most organizations will find a combination of namespacing and a fully managed version of Kubernetes to be the best mix of service separation, cost, and manageable operations overhead. I would recommend starting there unless you have a hard requirement otherwise.
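If you do land on namespaces atop a managed provider, the access-control and network questions above are typically answered per namespace. As a rough illustration (the group name and policy here are assumptions, not from any real environment), a RoleBinding can scope a team to its own namespace and a default NetworkPolicy can keep other tenants’ traffic out:

```yaml
# Grant a hypothetical team edit rights only inside its own namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-team-edit
  namespace: team-payments
subjects:
  - kind: Group
    name: payments-devs            # group name from your identity provider (assumed)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                       # built-in Kubernetes role
  apiGroup: rbac.authorization.k8s.io
---
# Only allow traffic from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: team-payments
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}          # i.e., pods in team-payments only
```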
On the cutting-edge end of things, the talk from the SIG on multi-tenancy covered a lot of ground on the state of the community versions. Hierarchical namespacing is gaining adoption, bringing further flexibility to the namespacing construct, and commercialized open-source virtual cluster solutions are hitting the market as well. While those developments bring more options into the mix, the maturity curve for multi-tenancy is flattening out, as the three models have proven to be the best core options.
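For the curious, hierarchical namespaces come from the Hierarchical Namespace Controller (HNC): you create an anchor object in a parent namespace and the controller creates the child namespace beneath it, propagating RBAC objects like Roles and RoleBindings down by default. A minimal sketch, assuming HNC is installed and that the v1alpha2 API version matches the release you run:

```yaml
# Requires the Hierarchical Namespace Controller (HNC) to be installed;
# check which API version your HNC release serves.
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-payments-staging     # the anchor's name becomes the child namespace
  namespace: team-payments        # created inside the parent namespace
```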
One of the talks closest to my former life as a platform product manager came from a Salesforce dev whose team implemented an internal Kubernetes PaaS. They struggled with all the same sorts of problems we worked to solve. You have to ask yourself a lot of questions about the culture of your development teams and what they want out of orchestration while balancing that against the direction the organization needs to move in as a whole. Questions like whether your platform should be opinionated or not, and how much of the operations should be abstracted away in the name of ease of use vs. how much should remain the responsibility of the development teams. It sounded like they were having a lot of success with the choices they made.
In any environment promoting DevOps practices, the ops component can introduce a lot of complexity, especially in multi-tenant Kubernetes, which has a steep learning curve. My former organization saw this firsthand as we pushed for a world in which the ops of DevOps truly became the realm of the dev teams. From Helm Charts to Terraform, we advocated for as much of a declarative posture as possible, but it could quickly overwhelm developers.
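To give a sense of what that declarative posture asks of a dev team, here’s a sketch of the kind of Deployment spec a team ends up owning: replica counts, resource requests and limits, health probes, and so on, and that’s before any Helm templating or Terraform wiring around it (every name and value below is illustrative).

```yaml
# Illustrative Deployment a dev team might own end to end.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api              # hypothetical service
  namespace: team-payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.4.2   # hypothetical image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```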
The fact of the matter is: devs want to dev! DevOps pours ops responsibility onto developers who love their apps but don’t want to live in their operations. You have to be careful and listen to your teams, especially when the business is prioritizing features and expects ops to just work. Good ops may not win you customers, but bad ops will drive them away. This is the same question the Salesforce team had to wrestle with — how much ops do we believe our teams should own, and how much should we as the platform team manage for them?
While every organization will fall somewhere differently along that spectrum, all of them could benefit from the platform we’re building at Transposit. Even with a massive amount of abstraction, dev teams will still experience incidents and even with a separate, dedicated ops team, they’ll still get paged. So how can Transposit help?
We simplify rote ops work through repeatable, automated runbooks. We connect runbooks to webhook triggers to set the table for investigating alerts and managing incidents. We help dev teams codify the collective knowledge of the team by backing runbooks and actions in git. While this is by no means the limit of our capabilities, just these three examples give a DevOps team a platform on which to build a wide range of process optimization and management that would otherwise be spread across a dozen tools and probably as many team members’ heads.
Think of all the time you could reclaim for releasing features! I’m certainly biased, but when I was managing our Kubernetes platform, I would have loved to include Transposit as one of our offerings to help pull together our many systems and make ops easier. If any of this sounds interesting to you and your team, you can join our Early VIP Access waitlist.