An Explosion of Complexity From On-Prem to the Cloud
"Automate Everything" is a popular mantra in modern DevOps and SRE, yet human judgment is still a key factor in every stack's management, from daily operations to incidents.
This week, Niall Murphy, instigator and co-author of the Google SRE books and Azure SRE lead, and Tina Huang, our Transposit co-founder and CTO, met up via Zoom from California and Ireland for a fireside chat about how to make the most of automation in today’s complex engineering environments across DevOps, SRE, and beyond. They touched upon a number of interesting topics, from how the move from on-prem to multi-cloud has changed the role of automation in a stack, to the constraints and considerations around using machine learning to replace human judgment.
In this series, we’ll share our favorite highlights of what Niall and Tina had to say, or you can watch the whole conversation directly:
Stephen Trusheim (Moderator): Let’s all get the same mental model about history here. Even over the last five or ten years, we’ve seen major changes to the way applications are deployed. What do you think are the most impactful changes to tooling and the applications stack, and how do those impact the organization?
Niall Murphy: It’s a huge question. Depending on how far back you’re going or how much of a 50ft view you want to take, my one sentence summary is: no one cares about the single machine case anymore. Single machine is dead and has been for a long while.
"No one cares about the single machine case anymore. Single machine is dead and has been for a long while."
What we experience now is that your call stack is externalized across a large, essentially arbitrary number of machines. Your application, even if it is what they used to call embarrassingly parallel back in the old days, which means it can be linearly split and composed, even if you are in that situation, there is still some kind of cross talk or reason for the machines to intercommunicate, in particular for debugging, state inspection, etc. Observability, distributed tracing, and all of these phenomena which really only exist because of the multiple machine case and cloud environments - all of those are essentially mandatory for doing anything significant these days…
Tina Huang: I have two different pieces I’d add here. One is that the move to the public cloud has shifted a lot of things. What’s interesting about it to me is that it’s not just the fact that we have cloud-managed VMs instead of VMs on our own hosted infrastructure, but that what the cloud has really meant (and this is what Google really paved the path towards) is increased efficiency at the cost of greater complexity. One of the ways I often think about this is the example of load balancers. Load balancers used to be hardware load balancers, and the logic wasn’t something that you’d ever touch or have to deal with. You’d restart the load balancer if there was a problem. And we shifted to a model where we now have AWS ELBs (Elastic Load Balancers) and ALBs (Application Load Balancers), and that application load balancing allows us to do much more sophisticated logic in terms of how we balance the load across our servers. But that means that if you’re the ops person debugging why that load balancer is kicking off a certain machine, then there’s a lot more complexity there.
"These days with the teetering stacks of software that we’re all dealing with, the abstractions have gotten leakier and leakier as we’ve moved up to the multi-machine case."
But the other piece is the increase of SaaS tooling, and a lot of this is because of what Niall just talked about. Because we have multiple machines, we have this whole ecosystem of SaaS observability that sits on top of them and that we need in order to manage them. In general, with SaaS applications, what we have are highly specialized applications, which is great, but any time we need to transfer data between different SaaS applications, that’s almost always a manual process. The easiest example is: I go to my observability stack. I go to Datadog or something and grab the graph, and I need to share it with my team in Slack and copy it into Jira, and all of these things are highly manual now. That’s one of the ways we have this opening for automation in the operations use case.
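As a rough illustration of the manual hand-off Tina describes, here is a minimal Python sketch that formats one graph snapshot for a chat channel and for a ticket, then hands each payload to a sender. Everything in it is an assumption made for illustration: `GraphSnapshot`, `to_chat_message`, `to_ticket_comment`, and `fan_out` are hypothetical names, the payload shapes are placeholders rather than any vendor's actual API, and the senders are plain callables so that no real service is contacted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class GraphSnapshot:
    """Hypothetical record of a graph copied out of an observability tool."""
    title: str
    url: str
    note: str


# A formatter turns a snapshot into a payload; a sender delivers that payload.
Formatter = Callable[[GraphSnapshot], dict]
Sender = Callable[[dict], None]


def to_chat_message(snap: GraphSnapshot) -> dict:
    """Format the snapshot as a one-message chat payload (placeholder shape)."""
    return {"text": f"{snap.title}: {snap.note}\n{snap.url}"}


def to_ticket_comment(snap: GraphSnapshot) -> dict:
    """Format the snapshot as a ticket comment payload (placeholder shape)."""
    return {"body": f"Graph: {snap.title}\nLink: {snap.url}\nNote: {snap.note}"}


def fan_out(snap: GraphSnapshot, routes: Iterable[Tuple[Formatter, Sender]]) -> int:
    """Send one snapshot to every destination; return how many were sent."""
    count = 0
    for fmt, send in routes:
        send(fmt(snap))
        count += 1
    return count


if __name__ == "__main__":
    snap = GraphSnapshot(
        title="p99 latency",
        url="https://example.invalid/graph/1",  # placeholder URL
        note="spike at 14:05",
    )
    # print stands in for real chat and ticket clients in this sketch.
    fan_out(snap, [(to_chat_message, print), (to_ticket_comment, print)])
```

In a real setup, each sender would wrap whatever client the team already uses for its chat and ticketing tools; the point of the sketch is only that the copy-and-paste step becomes one call.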
Stephen: You’ve spent a lot of time in that movement from on-prem into the cloud, and then into multi-cloud and multi-machine environments. How have you seen that change the need for automation or the adoption of automation as SREs are trying to put a lid on this complexity and keep it where they don’t have to pay attention to it anymore?
Niall: A lot of this is sectoral, actually, to the extent that in the very beginning of the multi-machine transition, folks from a Microsoft background would be accustomed to automation coming with a product, or to buying a product that gives you automation capabilities of various kinds, like Systems Management Server, etc. Now, if you're from a Unix background, or if you're experienced with the development of ISP-style services growing from the mid-90s up to what cloud is today, I think you probably have a very different perspective on automation, one that is more continuous. But I think multi-machine automation comes into Windows with PowerShell and with cloud APIs.
"Today so much of software development is about the question of what you can successfully ignore."
Putting that to one side for a moment, and concentrating on a unitary community view of this, today so much of software development is about the question of what you can successfully ignore. So you can think of object-oriented programming, information hiding, and so on as being questions about what I can successfully ignore as a software engineer writing an application: the fact that this object encapsulates some complexity for me. I can poke it to do stuff and read from it to see what it did and so on, but actually these days, with the teetering stacks of software that we’re all dealing with, the abstractions have gotten leakier and leakier as we’ve moved up to the multi-machine case, and we’re attempting to treat that with an even leakier abstraction at an even higher level, so the question of what we can successfully ignore as software engineers has become more nuanced. That is particularly true for folks in a multi-cloud configuration where, aside from their VMs launching and unlaunching, etc., there’s actually quite a significant degree of variation in what the products offer across the cloud providers. You might go here for data analytics and here for storage and here for VMs, etc. And so I feel that today software engineers have, genetically speaking, probably about the same amount of horsepower that they had ten years ago to wrestle with complexity, but the increased levels of complexity and the increased leakiness of the abstractions have impaired productivity in one sense, on the debugging and system model sides, and have increased productivity in another, in terms of having a much more massive fleet that you can, in theory, move the levers on.
Tina: I think what Niall hit on - the leakiness of abstractions - is at the center of everything. We’re talking about lots of different kinds of leakiness, from the developers and engineers and the product side into the operations of that stack, where that model shifts the abstraction between what it means to run a support org and SRE. If you imagine the old world, what are these fundamental shifts that have happened in the last five or ten years? In the old world, what we had were these long bake periods. When I started my career at Apple, we were literally shipping boxed software. And so we had these gold master dates, and a month before that there was very strict change control, feature freezes, code freezes, etc., and part of that was because we knew it would be months before we could even patch the code. And so in that world the support org is really there to help the user work around whatever problem is there or figure out their user error, and even when you move this to the web world in the early days before agile, you also had these long manual QA bake periods.
"Incidents are timely and critical in a way that they weren’t before."
Even in my early days at Google, we would have two-week-long manual testing periods, because the idea of hot swapping some fix in-line was hard back then. So, as a result, operations was also operating on its own, and the expectation wasn’t that the engineers had to be involved to figure out the code change and code deploy. So we’ve just muddled the whole world. Now, a customer support ticket comes in and an engineer is actively paged at that time. It’s possibly an infrastructure problem, but it’s just as likely a code problem that you have to fix, so all three of these organizations have to work really closely together. It also means that incidents are timely and critical in a way that they weren’t before. So the expectation is now that when an incident happens you’d better fix it within minutes or hours, not weeks or months, and so all of this pressure of having all of these people across the organization with different skillsets and also having a stronger urgency and timeliness for fixes – all of that muddles together to mean that you need more automation.
Next time, Tina and Niall will discuss the implications of operations as code, cross-functional collaboration in a modern engineering org, and how teams can overcome the challenges that are keeping them from achieving the goal of maximum automation in their stacks.