An exploration of incidents that changed how we operate since Y2K
This is the second post in the series on IT Operations and DevOps since Y2K. Check out the first and third parts of the series.
When I started writing this series on Y2K, we hadn’t yet crossed over into 2020. I wondered: would any Y2020 bugs pop up? Some thought we would have learned our lesson from Y2K, while others had a little less faith. As Twitter showed, there were indeed a few of these 2020 bugs lurking out there. The reality is, our software is full of ticking time bombs.
As Andy Piper, who was a developer on the middleware infrastructure team at the UK Post Office during Y2K, says, “it’s the nature of software and code and technology AND HUMANS.” It is “inevitably an issue.” I talked to various people who were sysadmins and developers at the time of Y2K, and I asked them:
What events have happened or might happen that reminded you of Y2K?
Matt Stratton, DevOps Advocate at PagerDuty, who was working as a sysadmin at Heller Financial during Y2K, said it best, “I think we’ve probably had several Y2Ks and we just didn’t think about them in that way, but pretty massive underlying bugs based around the way we used to think about a thing.” Throughout my conversations, two events kept coming up: left-pad and Heartbleed.
One of the significant differences between 1999 and today in how our software operates is the use of open source software and shared libraries. Much more of our software today is built on top of libraries we don’t have total control over, because we need to deliver products to users quickly. This has brought about the use of centralized package repositories for different languages and frameworks, like npm, PyPI, RubyGems, and so on.
If you are not familiar with the left-pad incident, the short version is: a Node.js package that many libraries depended on, left-pad, was unpublished, and that caused cascading failures in a vast number of dependent projects. The security risks of using libraries from centralized package repositories are already well covered, so I am not going to rehash them. The bigger, more important message is, as Matt McLarty, Global Leader of API Strategy at MuleSoft, who was working as a developer at a bank in Toronto, Canada during Y2K, says, “[we’re] constantly near these situations. But the only time you ever become super aware is when something went bad” and “all of a sudden, people started becoming more concerned about centralized package repositories.”
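To give a sense of how small the dependency at the center of all this was, here is a rough TypeScript sketch of what a left-pad-style function does. (This is an illustration of the idea, not the original package’s code.)

```typescript
// Pad a string on the left with a fill character until it reaches a target length.
// A sketch of the left-pad idea, not the original package's implementation.
function leftPad(input: string, targetLength: number, fillChar: string = " "): string {
  let padded = input;
  while (padded.length < targetLength) {
    padded = fillChar + padded;
  }
  return padded;
}

console.log(leftPad("42", 5, "0")); // "00042"
```

A vast number of projects pulled in a dependency for roughly this much code, which is why unpublishing it broke so many builds at once.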
At the time, many people depended on left-pad and had no idea. Anyone running continuous integration (CI) during the left-pad incident might have seen builds that relied on the package suddenly start failing. This is an excellent example of how interconnected our software is, which was discussed in the first part of this series.
Speed to market is often cited as one of the benefits of practicing the DevOps philosophy. So while centralized package repositories with shared libraries allow us to deliver products faster than before, it is crucial for developers practicing DevOps to be prepared for the tradeoffs that come with that speed and to understand the potential for cascading failures or vulnerabilities.
Questions to think about:
And then came Heartbleed. I think the Heartbleed website (yes, it got its own website) sums it up best: “The Heartbleed Bug is a serious vulnerability in the popular OpenSSL cryptographic software library. This weakness allows stealing the information protected, under normal conditions, by the SSL/TLS encryption used to secure the Internet. SSL/TLS provides communication security and privacy over the Internet for applications such as web, email, instant messaging (IM) and some virtual private networks (VPNs).” It was later revealed that there were no unit tests for the change that introduced the vulnerability, and that the project was severely understaffed, another adjacent and well-covered issue in open source software that I won’t go into here.
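To make the class of bug concrete, here is a deliberately simplified TypeScript sketch of the pattern behind Heartbleed: a heartbeat handler that trusts the payload length claimed by the peer instead of checking it against the payload it actually received. (OpenSSL is written in C and the real bug leaked adjacent process memory; this toy model only illustrates the missing bounds check.)

```typescript
import { Buffer } from "node:buffer";

// Toy model: the peer sends a 7-byte payload plus a claimed payload length, and the
// server echoes back that many bytes. Whatever sits next to the payload in memory
// stands in for the private data the real bug leaked.
const memory = Buffer.from("PAYLOADsecret-session-keys-and-other-private-data");
const ACTUAL_PAYLOAD_LENGTH = 7;

// Vulnerable pattern: the claimed length is never validated.
function heartbeatResponse(claimedLength: number): Buffer {
  return memory.subarray(0, claimedLength);
}

// Fixed pattern: discard heartbeats whose claimed length doesn't match reality.
function safeHeartbeatResponse(claimedLength: number): Buffer | null {
  if (claimedLength > ACTUAL_PAYLOAD_LENGTH) return null;
  return memory.subarray(0, claimedLength);
}

console.log(heartbeatResponse(40).toString()); // echoes far more than "PAYLOAD"
console.log(safeHeartbeatResponse(40));        // null: the request is rejected
```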
I want to focus more on the response to Heartbleed from sysadmins and operations folks across the internet. It quickly became an all-hands-on-deck situation, similar to Y2K but on a very different timescale. One of my favorite takeaways from this blog post by Stephen Kuenzli was that “Heartbleed did bang into my head that every team needs to be able to go fast, safely with the capability to release at-will, possibly Right Now.” Stephen also mentions that it is part of his “motivation to help teams adopt continuous delivery and standardized application delivery systems like containers.” I also really enjoy this Twitter thread from Colm MacCárthaigh on AWS’ response to Heartbleed: AWS was able to update millions of customer and internal network connection endpoints in hours, without any significant customer impact, all because of their deployment pipeline.
In Programmability in the Network: Stop a Bleeding Heart, Lori MacVittie, Principal Technical Evangelist at F5 Networks, points out that “by enabling a view of “infrastructure as code,” programmability not only enables a more software-oriented approach to deploying services, but also a software-oriented approach to quickly creating and deploying an appropriate response to emerging threats. A DevOps approach can facilitate this capability, and push protection into the network as code.”
For many teams, modern operations practices such as continuous integration and continuous deployment (CI/CD) pipelines made Heartbleed a faster fix. CI/CD emphasizes automation, which helps set teams up for future DevOps success: it moves IT away from batch deployments toward a culture where releases can happen at will, and it reduces the errors that creep in when teams rely on manual testing and manual process. Automation that did not exist in 1999 also made it much less difficult to track down all of the places where the Heartbleed vulnerability existed.
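As a small illustration of the kind of automation that helped, here is a hedged TypeScript sketch that walks an inventory of hosts and the OpenSSL versions they report, and flags the ones in the range affected by Heartbleed (1.0.1 through 1.0.1f). The inventory here is hypothetical; in practice it would come from a configuration management tool or an agent on each host.

```typescript
// Hypothetical inventory: host name -> OpenSSL version string reported by that host.
const inventory: Record<string, string> = {
  "web-01": "OpenSSL 1.0.1e", // affected
  "web-02": "OpenSSL 1.0.1g", // first patched release
  "api-01": "OpenSSL 0.9.8y", // predates the heartbeat extension code, not affected
};

// Heartbleed affected OpenSSL 1.0.1 through 1.0.1f; 1.0.1g contained the fix.
const vulnerable = /OpenSSL 1\.0\.1([a-f]\b|$)/;

for (const [host, version] of Object.entries(inventory)) {
  if (vulnerable.test(version)) {
    console.log(`${host} is running ${version} and needs patching (and new certificates)`);
  }
}
```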
I would be remiss not to mention that both of these incidents happened in the open. We were able to have an open debate about bugs, incidents, and vulnerabilities in commonly used libraries that would not be possible with closed source software. (Related: Jessie Frazelle has a great talk on why open source firmware is important.) Matt Stratton speculates that we’ll continue to surface decades-old bugs in core libraries, and that “the good news is, the other reason why maybe we’re not seeing it as impactful is because it’s so much easier to share information and get fixes out there faster.”
Questions to think about:
As I said before, there are a lot of ticking time bombs out there. One that has gotten a bit more attention is the Y2038 bug. As far as potential global computer system failures go, Y2K was pretty easy to explain to a general audience. Since we handled it mostly with grace, it is also easy to forget. I believe these are two reasons why time-related bugs like the Y2038 bug still lie ahead of us rather than behind us. The Y2038 bug is this: the latest time that can be represented in Unix’s signed 32-bit integer time format falls on January 19, 2038, and times beyond that cause a signed integer overflow. Some people call it Unix’s Y2K. Explaining Y2038 to someone outside of technology can be a bit tricky. (Pun intended.)
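To see the exact cutoff, here is a small TypeScript sketch that computes the last second a signed 32-bit Unix timestamp can represent, and what the date looks like once the counter wraps around:

```typescript
// The largest value a signed 32-bit integer can hold, interpreted as seconds
// since the Unix epoch (January 1, 1970, UTC).
const MAX_INT32 = 2 ** 31 - 1; // 2,147,483,647

console.log(new Date(MAX_INT32 * 1000).toISOString());
// 2038-01-19T03:14:07.000Z (the last representable second)

// One second later, a signed 32-bit counter wraps to a large negative number,
// which a naive reader will interpret as a date in 1901.
const wrapped = (MAX_INT32 + 1) | 0; // -2,147,483,648
console.log(new Date(wrapped * 1000).toISOString());
// 1901-12-13T20:45:52.000Z
```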
I’ll spare you the ugly details, but even if you are living in a 64-bit world, using a 64-bit kernel, and running 64-bit applications (whose time counters won’t overflow for around 292 billion years), you cannot escape the fact that many legacy data formats, network protocols, and file systems still specify 32-bit time fields. Fixing them will likely require changes that aren’t only on your side of the code.
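Even if everything in your own stack is 64-bit, any record layout that reserves only four bytes for a timestamp hits the same wall. A minimal sketch, assuming a hypothetical legacy format with a big-endian signed 32-bit timestamp field:

```typescript
import { Buffer } from "node:buffer";

// Hypothetical legacy record: a 4-byte, big-endian, signed timestamp field.
function writeTimestampField(epochSeconds: number): Buffer {
  const field = Buffer.alloc(4);
  // Node refuses to write values outside the signed 32-bit range, so this throws
  // ERR_OUT_OF_RANGE for any time after 2038-01-19T03:14:07Z.
  field.writeInt32BE(epochSeconds, 0);
  return field;
}

writeTimestampField(Math.floor(Date.UTC(2030, 0, 1) / 1000)); // fine
writeTimestampField(Math.floor(Date.UTC(2040, 0, 1) / 1000)); // throws: out of range
```

Widening the field to 64 bits fixes the writer, but every reader of the format has to agree on the new layout, which is why the fixes usually aren’t only on your side of the code.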
It’s a bit hard to track down how widespread the issue is. Given how many more embedded systems exist now than at the time of Y2K, a significant number of systems are vulnerable. Luckily, not all of them rely on datetimes. The bigger problem is that these systems are often harder to update because they are designed to last the lifetime of the machine they are built into. In some cases, we assumed these machines would be out of commission by 2038, but that might not be entirely true.
When I brought it up, people often joked that they would be retired by then. Even though I know it is said in jest, it feels like the familiar technology pattern of passing the buck and making it someone else’s problem.
You would hope that we would have learned our lesson the first time, but the Y2038 bug is already here. People have already seen related bugs in the last decade. For example, changing the datetime on an Android ZTE Blade phone could essentially brick it, and a PowerPC Mac running OS X 10.4.11 Tiger won’t even let you set a datetime beyond December 31, 2037. (Rest assured: the Y2038 bug has been fixed in Mac OS since OS X 10.6 Snow Leopard.)
Who knows what IT operations will look like in 2038? (That’s if the robots don’t take over.) There will be more left-pads, Heartbleeds, and other bugs of large proportions that affect huge swaths of the internet, and their extent will often be unknown until they happen. They will be unknown unknowns. As Matt McLarty says, “there has been a big mind shift away from, we have to be afraid of change, we can’t make any changes, to, we know things are going to fail all the time, so how do we deal with that failure?”
Building systems with resilience engineering in mind, performing chaos engineering experiments, and establishing a collaborative DevOps culture are great places to start. In my final post of the series, I’ll explore how these ideas prepare us for change in our systems.