Artificial Chaos

Hey Look Ma, I Broke It! [Chaos Engineering]

August 16, 2021 Season 1 Episode 1

In episode 1 of Artificial Chaos, Holly and Morgan introduce the concept and benefits of chaos engineering, and how to get started running your own experiments. Holly does some impressively dumb things with code, and Morgan adds some whimsy with weird analogies about boats.


Morgan:

Absolutely do not go into your place of work and destroy a bunch of servers and tell them that Holly and Morgan told you to do that. We accept no responsibility for any service outages that are caused as a result of you listening to this podcast.

Holly:

Next time you accidentally break a server, just tell your boss, it's Chaos Engineering! So how would you describe Chaos Engineering? If you had to give a quick definition?

Morgan:

Chaos Engineering is this concept that we experiment on production systems in order to build confidence in how those systems will perform under duress.

Holly:

Does it have to be production?

Morgan:

Not strictly speaking, but, um, unless you have pretty much an identical copy of your production system, it's never going to respond in exactly the same way as a production system would. So it doesn't really give you the same confidence.

Holly:

Yeah. So if your test environment doesn't match your production environment, then testing in test is not going to be a very good test. I know that it sounds dumb, but there's so many organizations that are in that situation. Um, you say experiments, so what are we, what are we talking about here? So earlier when we were talking, I made a distinction between tests, meaning for me, unit tests and integration tests. And then you're saying chaos experiments. So why, why are we using a different word here? What does experiments mean?

Morgan:

Um, I think experiments are more open-ended. Rather than considering it as a specific test, the parameters won't always be the same, and you also don't know exactly how a system is going to respond. You can't really hypothesize well in advance for these. So there's a concept of blast radius, limiting the potential impact that your experiment will have, and starting off with something small.

Holly:

So I wrote a blog post a little while ago, where I was talking about just kind of introducing this idea of Chaos Engineering to people who hadn't talked about it before. Who hadn't maybe seen that term before. And for me in that context, I was being a little bit facetious, but one of the things that I said is like, hey, I turn systems off to see what happens. And for me that is pretty much Chaos Engineering. Not that it is limited to just turning things off, but that idea of let's disrupt the system and then see how it reacts. So when I was doing those experiments, one of the things that I was looking at was really just answering the question of: we have a system where we know that when an instance goes down a new instance will be spun up and the system will handle that disruption. But what we didn't know was how long that would take. So one of the ways that we tested that was, hey, let's turn a web server off and just time it: how long does that instance take to come back? And then we were also looking at things like, hey, when we turn an instance off, how visible is that to the user? I mean, ideally not at all, right, but if you turn enough things off, eventually something's going to be visible to the user, or maybe you've built the system in such a way that when you cause some disruption, something unexpected happens. But that was really how I started with Chaos Engineering: we've built a system, we think it's resilient, let's turn some stuff off and see not only how it breaks, but how long it takes to come back. Is that your experience with Chaos Engineering? Are you thinking about this in a different way?
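
A minimal sketch of the kind of experiment Holly describes here, assuming an EC2 instance running behind an auto scaling group; the region, instance ID and group name are hypothetical placeholders, not anything from the show:

```python
# Sketch only: terminate one web server and time how long the auto scaling
# group takes to bring a healthy replacement into service. Assumes boto3
# credentials are configured and the names below are placeholders.
import time
import boto3

ASG_NAME = "web-asg"                  # hypothetical auto scaling group
VICTIM_ID = "i-0123456789abcdef0"     # hypothetical instance to kill

ec2 = boto3.client("ec2", region_name="eu-west-2")
asg = boto3.client("autoscaling", region_name="eu-west-2")

start = time.time()
ec2.terminate_instances(InstanceIds=[VICTIM_ID])

while True:
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    # Count instances that are up and passing health checks, excluding the victim.
    healthy = [
        i for i in group["Instances"]
        if i["LifecycleState"] == "InService"
        and i["HealthStatus"] == "Healthy"
        and i["InstanceId"] != VICTIM_ID
    ]
    if len(healthy) >= group["DesiredCapacity"]:
        break
    time.sleep(10)

print(f"Recovered in {time.time() - start:.0f} seconds")
```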

Morgan:

Um, so I think you really need to define what you mean by resilience before you can begin to experiment in order to achieve resilience. And there are a bunch of different interpretations of resilience and different metrics that you can use to define it. So do you mean absorption or recoverability or, you know, mitigation of incident or event in the first place?

Holly:

So "what do we mean by resilience?" is then going to be my next question, because for me, I would try and define resilience as something like a system's ability to continue functioning even if components fail. For me, resilience means something very different to availability, but in my head they're somewhat linked. And whenever anybody says availability, I think cybersecurity, right? It's one of the pillars of cybersecurity: confidentiality, integrity, and availability. So I would think of denial of service attacks, but that is still linked to resilience, right? If the system is very resilient, then having an availability impact would be difficult. So for me, resilience is how well can it handle some component level outages? But you've used some different terms though. So absorption could be how well a system can handle a high traffic load. That could be one thing.

Morgan:

What about elasticity?

Holly:

Oh, elasticity's an easy one. So elasticity is a big thing for me because our systems, um, scale up based on user load, right? But elasticity is not just scaling up. It's the ability to scale down when that user load reduces as well. So one of the things that we're trying to do with our systems is, effectively, when the system is quiet, run them as small as possible. And the reason for that is of course, with public cloud systems we're running on OPEX, right? So we're effectively paying per usage with public cloud providers. There's a lot of different ways of managing that, you can use reserved instances and things, but keeping things simple: if users aren't using the system, we want to scale things down so that the cost is reduced. And elasticity is part of that, and for me, elasticity is really when you're talking about scaling down, as opposed to scaling up.
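
As a rough illustration of scaling down as well as up, here's a hedged sketch of a target tracking policy on an auto scaling group; the group name and target value are made up. AWS adds and removes instances to hold average CPU near the target, so the group shrinks on its own when the system goes quiet:

```python
# Sketch: target tracking scaling on a hypothetical auto scaling group.
# AWS scales out when average CPU rises above the target and scales back
# in when load drops, which is the cost-saving half of elasticity.
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-2")

asg.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # hypothetical group name
    PolicyName="keep-cpu-around-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
        "DisableScaleIn": False,             # allow it to scale back down
    },
)
```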

Morgan:

A handy part of auto scaling in AWS is that you can scale based on instance number or size. Interestingly, the way that I learned about auto scaling and how handy that can be for business: I was actually studying for my Solutions Architect Associate exam at the time, and there was a, um, medical supply conference in America. And I think it was like a couple of days before the conference, that website was hit by a DDoS attack. AWS basically said, we can scale up your instances, like your provisioned instances, and we'll just absorb this as much as we possibly can and see what our systems can handle. And they scaled up and they didn't scale back down and just sort of absorbed the attack until it phased off. And then they kind of split the bill at the end of that. But I think that sort of covers scalability and absorption, which would be two metrics of resilience, or what I would personally define as resilience.

Holly:

I think most people who deploy auto scaling groups, so this idea that when the user load increases more instances are deployed, people think about that in the context of elasticity, where it's like, hey, if we suddenly have a bunch of users come in, we'll scale up to handle that, and then when those users leave, we'll scale down again. That isn't, for me, the major benefit of elasticity. I think just because it's such a core part of it now, it's almost like I don't even think about it. That's just a part of what it is. What I like about auto scaling groups is that they can monitor the health of instances. So if an instance has a failure for some reason, even if that failure is something really dumb, like, oh, that instance's log filled the disk and it failed because its disk is full now, auto scaling can detect that that instance has an outage, and then can tear it down and spin up another one and handle that. So for me, one of the things is also not just the ability to handle user load, but to handle unexpected outages in that way. Now some people might be listening to that and thinking that's, um, that's crazy because your systems should never get into that position, but that's the whole point of what we're talking about today, right? It's like, hey, sometimes unexpected things happen. We should build systems in such a way that they can handle unexpected things. Um, especially from the context of the user, it's like, I never want the user to know that something failed, if I could possibly help it. It's like, can we, you know, aggressively handle it by some other means? So, yeah, it's almost funny now that auto scaling groups, to me, have almost got little to do with scaling anymore and have just kind of become more to do with resilience. Um, so Chaos Engineering, I think we've defined pretty well now. It's like we're experimenting on our systems. So we're going to introduce faults to see how our systems, um, handle those. I think for a lot of people who are listening to this, maybe they've still got an old school way of thinking about this. And when they think about outages, they might be thinking in the way of like disaster recovery, business continuity type scale. And they might be thinking about like, hey, you have an outage and then you go and get your incident response plan out and you start going through the incident response plan. But what we're talking about is trying to build systems in a way that they're resilient, by which I mean that outage is just handled automatically, right? It's like we don't need human intervention here. We want to build a system in such a way that it's just handled.
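
A small sketch of the health check behaviour Holly describes, assuming an auto scaling group behind a load balancer; again, the names are placeholders. With ELB health checks enabled, an instance that is technically running but failing its health check (full disk and all) gets terminated and replaced:

```python
# Sketch: make a hypothetical auto scaling group replace instances that
# fail load balancer health checks, not just ones that have stopped.
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-2")

asg.update_auto_scaling_group(
    AutoScalingGroupName="web-asg",      # hypothetical group name
    HealthCheckType="ELB",               # trust the load balancer's checks
    HealthCheckGracePeriod=120,          # seconds to let new instances boot
)

# You can also flag an instance unhealthy yourself as a low blast radius
# experiment and watch the group tear it down and rebuild it.
asg.set_instance_health(
    InstanceId="i-0123456789abcdef0",    # hypothetical instance ID
    HealthStatus="Unhealthy",
)
```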

Morgan:

I would agree, but there are other benefits to Chaos Engineering rather than just building self-healing infrastructure. So you can architect systems to ensure graceful degradation of services. For example, LinkedIn's Project LinkedOut, which is their application infrastructure Chaos Engineering project, actually introduces this concept of graceful degradation. And it identifies, like, core workflows and processes on a particular page that will be accessed by a user. And if there is some component failure or an error on the backend, it will prioritize those core workflows and processes over, say, running ads or loading, kind of, third party content. But again, you would need to define what your core workflows were before you could really put that into practice.

Holly:

I think that it's like, you need to define these things, it's such a big part. I was working with a company recently where I was talking through their, really, their disaster recovery process, like, hey, what would happen if you had an outage and how would you recover from that? And one of the things that was very interesting to me with that company is they hadn't even defined which systems were critical. Like they had no priority for systems, and they were absolutely adamant that they could recover from an outage because they kept pointing at backups. And they're like, hey, if this server disappears, here are the backups for that server. And what I was trying to point out was, you might not have an outage of that level. It might not be, hey, that one server explodes. It could be like, hey, the entire network is down. How are you going to recover now? And I'm sure that they would have been able to recover because they have the backups; like, the technical process of doing that would have been okay. But what I was worrying about was that priority that you mentioned there, just like, hey, you want to bring up the important systems first, and they didn't have that. So I guess, can we look a little bit at some of the definitions we have around, first, disaster recovery? And then maybe build up from that. So I've talked about system criticality there, but there are the metrics as well, right? So there's things like RTO and RPO. Should we bring those in?

Morgan:

Yeah. So RTO, recovery time objective, and RPO, recovery point objective. Broadly speaking, they're defined as the maximum tolerable downtime that your systems can deal with, and the point in time that you can withstand, as a business, from a data loss perspective.

Holly:

Yeah. That's exactly how I think of it. It's like, RTO is how quickly do you want the systems to become available again? But RPO is how much data loss are you willing to accept? So I almost think of it in terms of, RTO would be counting forward from now: how long are you happy for these systems to be down for? Whereas RPO would be how far back in time you want to go. And I think a lot of companies have maybe not thought about these things. And they should be like the absolute maximum that the company can accept, and then there should also be, in my opinion, like a desirable value, right? So one of the things that I've been playing around with recently in terms of experiments is not only can the system recover from an instance failure, for example, but how long does that take? One of the reasons there just being, knowing exactly how much resilience we have in that system. If an instance goes down, how long should that take? That's useful for knowing if something else has gone wrong. You know, hey, this instance usually rebuilt itself in like one minute forty-five or something. It's been three minutes. Maybe there's something else that's going wrong. But I think from a business point of view as well, it just makes you really think of that criticality side of things. If you're saying, hey, all of these systems are going to go down, maybe you have a different RTO depending on how important that system is to the company. When I was talking to that company I mentioned a second ago, one of the things that they were prioritizing based on was whether it was publicly facing or not. So the thing that they wanted to bring up as quickly as possible was their email systems, so that, like, hey, if somebody emails us, we don't want to miss that. Whereas internal systems that were only used by their staff, even if they're important, they're like, oh, we're going to have a lower criticality for that because nobody will notice. It's like, they will feel the pain, as opposed to external people like customers feeling the pain.

Morgan:

Yeah, absolutely. I think it takes quite a lot of maturity in terms of risk management for an organization to have defined, firstly, what, you know, assets they've got, because there are so many organizations that don't even know what's on their estate, and then to have defined critical services and assets and processes that are running. Whether you decide to prioritize things that are client or customer facing, or internal kind of finance processes or payroll systems and things like that, is going to be individual to each organization. But then experimenting beyond your business continuity and disaster recovery strategies is sort of a level above, really, when it comes to maturity of risk management and operational resilience management as well.

Holly:

It's so funny to hear you say that, 'cause it's absolutely true. And I know it's true, but I almost don't think about it anymore from our systems. Where you say there are many companies out there that don't know what assets they've got, they don't have any kind of asset management system, we can pull this up into other topics as well, story for another time. But one of the things that I always like to see from customers is, can you tie your asset register to your vulnerability management platform? So not only do you know what assets you have deployed, but can you tie vulnerabilities to those assets, and then also, can you tie dependencies to those assets? So it's like, hey, this service is really important to us because it stores all of our data, for example. A lot of companies don't think about, yeah, but if this other system is down, if the authentication system is down, for example, we can't log into that system. So it's like, yes, the asset is important, but also its dependency is important. And I think that's really where we start getting back into Chaos Engineering: it's like building, as you said, that maturity. Where it's like, not only do we know what assets we have, we know what's supposed to be communicating with other things, what their dependencies are, but now we're going to test that in a really interesting way. So I guess we should talk a little bit about testing, because I think when you use that term test, instead of experiment as we were using earlier, it's like, test means something really different to different people, right? So I do a lot of software development. So when somebody says test to me, I immediately think like unit tests and integration tests. And it's like, is this piece of code in isolation working the way that I expect it to? Is this piece of code in situ working the way that I expect it to? But that is very much something that I have defined rigorously, right? So for the unit test, it might be something like, when I type a username, can the system accurately tell me whether that username is taken or not. Whereas what we're talking about here with Chaos Engineering is more experiments. So it's like the unexpected side of things.

Morgan:

Yeah. So if I switch this server off, what's going to happen? Like, you don't have a predefined outcome, like a binary result that you can rely on to kind of prove that your supposition was correct. Um, yeah. It's much more open-ended. Um, which is why I suppose chaos experiments is more accurate and more relevant than chaos testing.

Holly:

Yes. Because like, you should still, you should still have a hypothesis of what's going to happen and you're testing that hypothesis, but I guess we're going into it knowing that what we're doing could break something unexpected and we should be prepared for that. Right?

Morgan:

Yeah, absolutely. Um, but then also you've got the, I guess, consideration that with cloud infrastructure, part of the benefit of the scalability and how quickly you can deploy new resources and things is that you can kind of just spin things up without maybe much knowledge or understanding of that cloud provider's catalog of services and things. So AWS have this framework, the Well-Architected Framework, which will, broadly speaking, teach you how to deploy an application that is, uh, like, resilient. So, um,

Holly:

Yeah. I was just listening to you try and say that, and it's one of those things where you don't want to use one of the words that they've used as a pillar. It's like, you don't want to say the Well-Architected Framework allows you to build a system that will perform well, because performance is one of those pillars. It helps you build a system that is well architected, is one way of putting it. Another way is it helps you build a system that is good. Broadly good.

Morgan:

Cost-effective?

Holly:

So the Well-Architected Framework has five pillars that help you, I guess, from my point of view, it's just like, make sure you've covered everything, right? So, you know, historically we might have talked about people building systems that are functional but maybe not secure. That's one weakness; the Well-Architected Framework tries to help you mitigate that. But there's also building a system that's maybe cost ineffective; the Well-Architected Framework tries to help you mitigate that too. So the way that I think of the Well-Architected Framework is just like, almost like safety rails: you're building a thing and it's gonna set out a way that you can build it and consider the major aspects.

Morgan:

Yeah, no, I would agree with that. And I think as long as you've kind of followed that you do have like the base capability there to be able to perform some of these experiments on your infrastructure, it doesn't really take a great deal. So I said before that it takes maturity and your risk management approach. But I think for a lot of startup companies and like new tech companies that are using cloud infrastructure that maybe don't have kind of 30 or 40 years of in-house like risk management experience, you can still get involved with this and it's still something that can benefit your organization.

Holly:

I was just cheating there and quietly searching for: "How does AWS actually describe the AWS Well-Architected Framework?" And by the look of things they cheat as well, and just point at the five pillars anyway. Like, AWS Well-Architected helps cloud architects build secure, high-performing, resilient, and efficient infrastructure. It's pretty much just listing the pillars there, aren't you? Missing, of course, operational excellence, which I always think sounds great. It's like, helps you build excellent things.

Morgan:

So how do tiny companies, um, tech native, cloud native companies do Chaos Engineering Holly?

Holly:

Wow. So many things to break down from that one sentence. Two things that I want to talk about: tiny companies, let's dig into that in a second, but also you used the term cloud native. And this is something that I'm very, very passionate about. I'm very passionate about this concept of cloud native, because I think a lot of organizations at the moment are moving to the cloud, but they're doing that in, I don't want to be rude, but like maybe a naive approach to going to the cloud. For example, maybe they had systems that were on prem, where maybe they, over time, had gone from having physical servers to virtual servers, and then they had lifted and shifted those virtual servers into the cloud. So now they're just running effectively VMs on somebody else's hardware. So, you know, Virtual Machines for Azure or EC2 for AWS. I don't really, in my head, and this is going to sound a little bit surprising to some people, I don't really consider those cloud systems, and people might point at them and go, but they're hosted in the public cloud. It's like, it's not a cloud system though.

Morgan:

They're legacy systems that you've basically containerized in a roundabout fashion. Um, I think about it like this as well, so I sort of think about moving to the cloud or deploying infrastructure in the cloud as getting on a boat. That was my English accent, but that, that accent...

Holly:

Pause to appreciate the word "boat" there for a second.

Morgan:

Boat.

Holly:

Also. Sorry, what?

Morgan:

Yeah. So if you think about, um, there's like a little fishing boat or something like a rowing boat at the end of a dock and somebody who is set up to appropriately use cloud infrastructure or to begin deploying infrastructure and assets in the cloud, will just get in the boat and kind of row away. But there are lots of sort of older companies who are really kind of sentimentally attached to all of their on-prem crap and really don't want to get on the boat. So they've got one foot on the dock and one foot on the boat and they've like un-moored it and it's starting to drift away a little bit and they're going to fall in the water.

Holly:

I love that analogy. Allow me to try and make it technically coherent.

Morgan:

Please do that.

Holly:

So the way that I think about it is the first step to moving to the cloud is hosting in the cloud, right? So that would be absolutely virtual servers hosted on somebody else's infrastructure, maybe looking at things like IaaS, so infrastructure as a service. But what I think of when somebody says a cloud system is really the extreme right-hand side of that. So function as a service, so Azure Functions or AWS Lambda, that kind of thing, where we're looking at technologies like serverless. And there's a whole pathway there, right from moving from virtualized systems to microservices, function as a service, serverless services. Absolutely. But that's what I think of as cloud services. And I think the way to kind of summarize that, if somebody is trying to follow along with what we're saying here, is: has it been moved to the cloud, or was it built for the cloud? That's the real distinction to me. Yeah, absolutely. You mentioned tiny companies though. So I guess there's two sides to this. Like, I'll rant about tiny companies in a second because I love startups, but in the context of Chaos Engineering, I guess the implicit question there is, is Chaos Engineering something that you build up to through scale? Because we have, through this discussion, repeatedly used the term maturity. Is it that an organization gets to a certain age, a certain size, a certain number of employees before Chaos Engineering becomes important? Or is it something that you can do from day one?

Morgan:

No, I absolutely think you can do it from day one. So I think if you have cloud native infrastructure, you are at an advantage because your systems, dependencies, interactions, processes are all going to be mapped in a more predictable fashion. You're likely going to have a better understanding of your estate and how it functions. Whereas if you have sort of some on-prem legacy infrastructure, and then you've got some kind of cloud deployed infrastructure as well, it is much more difficult to see or to predict how it's going to react. So I think a lot of kind of newer companies that maybe don't have that maturity already are actually in a better position to kind of run these sorts of experiments.

Holly:

The way that it works is, we all know how virtual servers work, right? Get a computer, you cut it up into pieces. That's a virtual server. On the other side of things, we have serverless. That's magic, right?

Morgan:

It's just somebody else's hypervisor.

Holly:

You say that. It's just like, I send code out into the cloud, something somewhere runs it, and then it gives me a response. That's it.

Morgan:

It's not really serverless. It's just not your server.

Holly:

Somebody else's server. So, like, we're going to have to come back to serverless in the context of supplier security questionnaires, because I've recently had some significant pain with, in particular, auditors who don't understand some of the weird stuff that we're doing with infrastructure. I'd like to talk about that another day, because I want to talk about the cool stuff we're doing with the infrastructure, because that's awesome. But also just some of the questions we get asked from a security point of view are pretty crazy. But small companies. I think the thing here is exactly like we had a second ago with what does cloud system mean: it can mean a lot of different things. Small companies can mean a lot of different things. A sole trader is a small company, a micro company where they're just small by nature, you know, they're doing something, they have no intention of scaling, maybe it's just a few people. You could think of something like a solicitors or an accountants, where they're just a company that is small and intends on staying small. And then you could also think about things like startups. We'll have to have a rant about startups at some point, because I think startup for a lot of people means any new company. I hear the term startup applied in so many different ways, where somebody would say like, oh yeah, I'm working for this startup, and then they'll be like 10 years old. And it's like, is that really a startup? It's like, I'm working for this startup, they're on series H. And it's like, is tha-, they're on series F. Is that really a startup? Um, the reason I did that is, I don't want the internet to know how I pronounce the word H.

Morgan:

It can't be worse than how I pronounce boat.

Holly:

Uh, you know, they're on series F, is that a startup? Or you might be talking to somebody about, like, um, oh, uh, unicorns, right? A startup that's valued at over a billion dollars. So it's like, is that a startup? So for me, the word startup is so poorly applied, because it somehow applies to everything from a micro company that is never going to scale to a billion dollar company that's 10 years old. It doesn't make sense. So we'll have to, at some point, have an episode about startups and have that rant. But bringing us back to Chaos Engineering: yeah, my experience, you know, um, I run a startup, our systems are built for the cloud, we run experiments all the time to test how things work. And in fact, I've had some conversations with you recently in terms of just like, have we built this thing well? Is there anything that you can think of that we've missed? And like I said earlier in the show, my experience with that was really just turning things off and then seeing what happens.

Morgan:

I think an important thing to note though at this point is that doesn't work if you don't have monitoring in place.

Holly:

That was the pause. What I was trying to prepare for was somebody, like, accidentally rm -rf-ing a server in production, and then just, like, emailing their boss: oh, we're doing Chaos Engineering today. Has anybody told you? Server's on fire.

Morgan:

Yeah. This is not a get out of jail free card. You can't just break things and tell them it's Chaos Engineering.

Holly:

Remember we had a call the other day and I blew a server up? Um, do you remember I destroyed one of the databases? So it was like, oops, database is gone. It's like, I'm just going to wait for two minutes.

Morgan:

You were surprisingly chill about that as well.

Holly:

This is Chaos Engineering though! Like, I was surprisingly chill about that because I've had that kind of failure before. We have tested for what happens if a database has a major outage, and I know that as a first step auto-scaling will take over. So the way that that would work in that context is, if I've done something bad to the database, stopped the database service, auto-scaling will pick up on that because its health metrics will go into a failure state, and then it will kill the database and spin it up from a known good copy. So yeah, it is that weird thing. I guess that's the whole journey that we're talking about here with Chaos Engineering, isn't it? It's like, you start with some testing, where almost the beginning of the testing journey is: is this functioning? It's like, do I have some way of knowing that this is working? So for example, if we push some new code to production, is that code just working? Have we broken anything? And then you go from that to like user acceptance testing and those kinds of things. It's like, not only is it functioning, but is it functioning in the way that the user is expecting? And then as you build up in this chaos journey, you do get to the point where you're like, oh, that thing blew up, but it's fine, because we've tested it so many times and we know that the system is just going to recover from there.

Morgan:

The really cool thing about the example that you just gave about the database is that in a typical or traditional company, maybe, um, in the business continuity plan, that recovery time or point objective might be several hours for a critical service. And it might be that they can tolerate up to 24 hours worth of data loss as their RPO. But the health metrics monitoring for auto-scaling on your database would pick that up really quickly, which brings your RPO back down to a few minutes, just as long as it takes to pick up on that and then deploy a new database. So it really does change business continuity completely.

Holly:

There's also just this ability to observe things happening. So even if you don't necessarily change the business's risk appetite in terms of where your RPO is, just knowing that it's happening is better, right? I guess we can talk a little bit about AWS game days in a second. Um, but just like, it makes you feel more comfortable if you know what's happening and if you know what the state of the system is. I guess, again, there's a different maturity journey here, in terms of, like, organizations on one side of the spectrum not even knowing what assets they've got, let alone if those assets are working correctly. And then building up from that, it's like, we know what everything is, we know where everything is. And then building up from that, it's like, you accidentally blow up a database and then you hope that your auto scaling groups are working. Um, and then there's building up from that, and it's like, hey, if there is, uh, an outage, it's fine, but also we know it's happened and we've picked up on that pretty quickly. I mean, I'd love to know how many organizations' public facing websites could just go down, and then how long it would take for them to even notice that they've gone down.

Morgan:

I bet there's something on the internet that tracks that kind of thing. There's gotta be some stats for that.

Holly:

There is. There's some companies that, um, track that stuff, where that is a service that they offer: they'll track systems going down. Also, it depends on what you mean by that. And that's the whole thing with Chaos Engineering, right? It's like, we're not just talking about a system outage, it could be a component level failure. So, hey, your public facing website is up, but nobody can log in. Like that is, you know, that's an issue.

Morgan:

Yeah.

Holly:

People come to this podcast to hear "that's an issue, your website's not working". I'll tell you something else, which has been on my mind because we're talking about monitoring, something that I've been playing around with. It's not directly Chaos Engineering, but it's just another thing to think of when you're thinking through different kinds of testing, from unit testing, to integration testing, to Chaos Engineering: what if somebody made a failure that would be difficult to detect in code, but you'd really want to know about it? I can give you an example of this. What if somebody changed the color of the login button to match the background? So if your test code comes in and checks, you know, is there a login button? Yes. Does the login form work? Yes. Can you type in the input box? Yes. But users can't see the button, so nobody can log in. It's like, how would your organization detect that? This is brought to you by a recent code change that I made.

Morgan:

Did you do that?

Holly:

Uh, no, I didn't. That's actually a textbook example, but one of the things that I actually did do was, um, I'm not going to go into the long story of how this occurred, and some people no doubt are going to contact me and be like, how on Earth did you do this? But what I did was I changed the z-index of the input fields so that they displayed correctly, but a user couldn't click them, because they were effectively behind the forward div. So all of our system tests passed, because the page loaded correctly, the server responded quickly, the login form was loaded, and you could navigate the page with the browser. If you were, for example, to use the tab key to select inputs, that would select them, but you couldn't click them. So the automated testing that we have, we use effectively a headless web browser, so we drive it through Selenium. So we can actually test, like, not only does this page load, but this page loads in a browser that users are using, and all these example user activities would work correctly. Didn't work for that one, though. So yeah, testing, like, you can get really, really far into testing. If people are interested in how we detected that: logins per second. We have a metric that is logins per second, and when that falls off a cliff, you know something's broken in your login box. It doesn't really matter what it is; nobody's logging in anymore, something has gone wrong and you might want to investigate that. So I guess that brings us back to metrics around Chaos Engineering. And like, we've talked about RTO and RPO, but I presume there are other metrics that might be useful from an experiment's point of view.
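
For anyone curious what a check for that z-index bug might look like, here's a hedged sketch using Selenium with headless Chrome; the URL and element ID are invented, and the exception name reflects how Chrome's driver typically reports an element covered by something else:

```python
# Sketch: a headless browser check that actually tries to click the login
# field, so an input hidden behind another element fails the test even
# though it renders and can still be reached with the tab key.
from selenium import webdriver
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.test/login")            # hypothetical URL
    username = driver.find_element(By.ID, "username")   # hypothetical element ID
    try:
        username.click()   # a real click, not just "is the element present?"
    except ElementClickInterceptedException:
        raise AssertionError("Login field is rendered but not clickable")
finally:
    driver.quit()
```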

Morgan:

Yeah, there really are. But the point that you just made I think is, is kind of more valid. Like it, it links back to another kind of value add that you can get out of Chaos Engineering is that it identifies problems with your monitoring. So you can drive maturity in that space and improve thresholds, metrics, what your tolerances are and just your kind of fine tuning of those things, which makes it easier to detect incidents and outages and failure in future.

Holly:

So we keep saying monitoring, and I gave an example of how we monitor our system. But I guess in the context of just, like, broader cloud systems, what do you mean by monitoring? Do we have like a constant ping running? Is it just, like, the server exists, it's all probably fine? Or is there a little bit more to monitoring? How do you know your system's working?

Morgan:

[Obscure English Literature reference].

Holly:

You have to explain that reference to somebody who doesn't watch TV.

Morgan:

That's Shakespeare! That's the Merchant of Venice. I don't watch TV. They didn't either.

Holly:

Literature degrees. Honestly, that's something that's going to come up on the show at some point is that I have a real degree- Information Security.

Morgan:

I have a Mickey Mouse English degree.

Holly:

What's, what's the full title of your degree?

Morgan:

English literature with French.

Holly:

That's not even an English degree!

Morgan:

That's English!

Holly:

It's got French in the name!

Morgan:

But it was a minor. It was only a little bit of French.

Holly:

Do you know what my minor was?

Morgan:

No.

Holly:

Privacy.

Morgan:

Oh.

Holly:

That's the correct way to pronounce that word for all the Americans whose brains just exploded- privacy. Uh, yeah. Information security with privacy. Oh man. Could you imagine doing like Information Security with French? That would be...

Morgan:

I think that'd be really fun actually, because then you could actually understand that French security conference that you go to every year.

Holly:

Yeah. Hack in Paris, or Nuit du Hack. Nuit du Hack. There's no H in French. I would do much better in French than I would in English, because I don't pronounce that letter very well. So yeah, I guess from a monitoring point of view, I mentioned logins per second being one of the things that we track, but really, there's a whole host of things that you should be tracking, right? I mentioned earlier an instance failing because its disk got full; surely that should never happen, right? You should be monitoring those kinds of things, just like, um, system health should be quite a broad thing that's monitored.

Morgan:

Yeah. Well, I think you need to, again, define what healthy looks like for your systems so that you can implement some baseline monitoring. Um, and then you can use these sorts of experiments to tune that.
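
As a rough sketch of the "logins per second falls off a cliff" style of monitoring mentioned earlier, this publishes a hypothetical custom metric and alarms when it drops below a made-up baseline; the namespace, metric name and threshold are all assumptions for illustration:

```python
# Sketch: publish a custom "logins" metric and alarm when it collapses.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")

# Somewhere in the login handler: record each successful login.
cloudwatch.put_metric_data(
    Namespace="App/Auth",
    MetricData=[{"MetricName": "SuccessfulLogins", "Value": 1, "Unit": "Count"}],
)

# One-off setup: alarm if logins drop below the expected baseline.
cloudwatch.put_metric_alarm(
    AlarmName="logins-fell-off-a-cliff",
    Namespace="App/Auth",
    MetricName="SuccessfulLogins",
    Statistic="Sum",
    Period=60,                       # one-minute buckets
    EvaluationPeriods=5,             # sustained for five minutes
    Threshold=1,                     # made-up baseline
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",    # no data at all also counts as a failure
)
```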

Holly:

So we've talked about experiments, generally speaking, and I gave the example of occasionally I turn production servers off to see what happens. Um, I actually want to do a conference talk at some point where I like get on stage to talk about Chaos Engineering and then turn some production systems off as kind of like a mic drop, case in point, you absolutely know that's not going to go well just like...

Morgan:

The demo gods will not smile on you that day.

Holly:

Bringing production down at the one time where it's the hardest to fix it, because I'm literally on a stage. I just think it's one of those, like, you know, practice what you preach kind of things. It's like, if I'm going to talk about Chaos Engineering, I should be able to demonstrate that I can blow systems up and the systems are going to recover.

Morgan:

So there are a bunch of other experiments that you can do, which I think is what you are getting at. So there are two projects, um, that I can think of that have notable examples in this space. Um, and the first would be Netflix, who basically pioneered Chaos Engineering. Um, and they wrote a tool suite called Simian Army. Can you say that, Holly?

Holly:

Simian army?

Morgan:

Yep. That's right.

Holly:

What's been got at there is the fact that I was today years old when I realized that the word is Simian and not Symbian. I don't know how that happened, but I presume, I was really into mobile devices when I was younger, and Symbian is a mobile operating system. I remember playing around with that a lot, certainly prior to like iPhone applications and that kind of thing. And it was kind of looking at how those systems could be built and programming on them and those kinds of things. And that word apparently got locked in my brain. And now any time anybody has talked about Simian Army, I have heard Symbian.

Morgan:

So yeah, Simian Army. Um, there are a few notable kind of functions in Simian Army, and the, I guess, most friendly would be Chaos Monkey, which pretty much will just shoot a server. It'll take down an EC2 instance. That's quite a small place to start if you've got auto scaling turned on and...
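
A very small chaos-monkey-style sketch, in the spirit of the tool Morgan describes rather than the actual Netflix code; it picks one random in-service instance out of a hypothetical auto scaling group and terminates it:

```python
# Sketch: pick one random in-service instance from a hypothetical auto
# scaling group and terminate it, then let the group heal itself.
import random
import boto3

ASG_NAME = "web-asg"   # hypothetical; keeps the blast radius to one group

asg = boto3.client("autoscaling", region_name="eu-west-2")
ec2 = boto3.client("ec2", region_name="eu-west-2")

group = asg.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]

candidates = [
    i["InstanceId"] for i in group["Instances"]
    if i["LifecycleState"] == "InService"
]

victim = random.choice(candidates)
print(f"Chaos monkey is shooting {victim}")
ec2.terminate_instances(InstanceIds=[victim])
```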

Holly:

As well, like what we've been talking about, and it's mainly because of me bringing in recent experiments that I've been running, we keep talking about like a system outage, like an EC2 instance goes down. There's that, but there's also just like an EC2 instance fails to respond, or it has latency. So components don't have to fail entirely; they can, like, fail partially as well. Right?

Morgan:

Yeah. So, um, the LinkedIn project, we'll get into that a little bit more. And I guess the Netflix piece, because it's older, was a bit more rudimentary at the time, like the concept of trying to deliberately break your production systems to check if your engineers had architected something for resilience. Um, it was pretty crazy then. Like, can you imagine going to a board and saying, 'we want to break our customer facing website just to see if our engineers have built it properly'? I can't imagine most companies being okay with that.

Holly:

Yeah. I think one of the things, again, it's like how any time anybody talks about public cloud, my brain immediately goes, oh, we're talking about AWS, right? Cause that's my bias and that's what I build in. But, um, I think a lot of people in organizations who test well and have kind of bought into Chaos Engineering, or even if it's not necessarily chaos, other kinds of testing, would be very surprised at how many organizations are not. I can give you another example from another area that I work in. Um, I recently read a statistic about how few companies perform penetration testing, and that statistic was incredibly surprising to me. Um, it was delivered in two parts, and it's one of those where you read the first part and you're like, oh, that kind of makes sense, then you read the second part and you're like, oh. What it said was 52% of large companies perform penetration testing; 13% of all companies perform penetration testing. So really what this statistic is trying to give you is, like, more than half of big companies do this, but when you account for small companies as well, it's that not very many people are doing it. And I think that's the thing with testing: there's a huge number of companies out there who aren't doing really a great deal of testing, let alone like this cool testing stuff that we're talking about. I'll give you a funny example from the pen testing side of things. And that was a company who had a main internet connection into their office, and then they had a backup internet connection for their office as well. The idea being, if the main line goes down, they've got a much slower, smaller pipe, but they've still got some connectivity. So you can imagine like broadband-pipe, dial-up-pipe kind of thing, but they've got this backup plan. What we actually found through testing was the backup line had no firewall. So the main line, like this gigabit link, goes into the firewall, into the main office. And then if that link fails, they've got just like an Any-Any, just clear internet access.

Morgan:

So that fails open.

Holly:

It fails open! Absolutely. And the thing that was interesting about that was they just never tested it. And you can look at it from, like, adversarial mechanisms, of like, could somebody DoS the system in some way to cause it to fail over, to then use that to get some security leverage? Like, yeah, that's fair enough. But in the context that we're looking at it, it's just like, you've never actually tested this, have you? It's like, they just never looked into it. So I think that is, you know, what we're talking about here is, for a lot of organizations, they're not doing any testing, or certainly not at this level.

Morgan:

Yeah, absolutely. I think it comes back to the maturity conversation that we had earlier and it does take a particular kind of approach and mindset and definitely a culture that is open to absorbing and kind of accounting for failure.

Holly:

And just being okay with it, isn't it? It's just like accepting the fact that we should test these things, and if something goes wrong, it's better to find out about it in a controlled environment than it is because production's down.

Morgan:

Well, absolutely. And that kind of links into a couple of different things. So Werner Vogels, the CTO of AWS, says that "Everything fails all the time", and it's quite a well-known, often-repeated quote of his.

Holly:

So jumping back into talking about Netflix then, and the Simian Army. We talked about Chaos Monkey, and you said that that was an easy way to get started. You kind of implied that that's just how they started, and that's one way in, and we talked about that being instance failure. So, you know, testing an EC2 instance goes down or something like that. What's bigger than that? What's the next step?

Morgan:

Um, Chaos Gorilla is the next step up from that, which simulates an entire availability zone failure. And then there's one above that called Chaos Kong, and that would simulate a region outage. I'm not sure if Chaos Gorilla is still available on GitHub; Chaos Kong isn't. Obviously that would be incredibly destructive, um, material in the wrong hands. Accidentally destroy your entire company because, you know, the BC and DR planning wasn't there and the architecture wasn't resilient enough.

Holly:

One thing to jump in with at that point as well is, when we talk about maturity, one of the things that this means for organizations, at whatever level we're talking about, whether we're talking about security, resilience, or whatever, is that the organization needs to define what their risk appetite is in that space. Because, you know, we're talking here about handling an instance going down, handling an availability zone going down, or handling a region going down. It might be the case that for some organizations, depending on scales, budgets and intentions, you might not want to test for everything. You might not want to test to the level of, like, hey, what happens if a nuclear bomb hits a data center or something like that. There is, um, an argument to say that organizations should specify not only the RTO/RPO, but just, like, the risk appetite in terms of how much are we willing to handle. I was talking to one company recently who were testing at the instance failure level and testing at the availability zone level, but they had effectively no ability to recover if a region went down, you know, if it was bigger than an availability zone. That was it, they were going offline. And talking to them about that, they were actually quite happy with that, because they basically said that they're not a big enough customer with a big enough user base for anyone to really care. It's like, hey, if eu-west-2 goes down, nobody is looking at us. And I felt that that was interesting, that the organization had had those discussions internally. It's never going to be eu-west-2 anyways, it's going to be us-west-1, clearly.

Morgan:

No.

Holly:

us-west-1. That's the one that always goes down. Which is it? I'm doubting myself now. Is it us-east-, oh, sorry. Cardinal directions are really difficult. It is us-east-1, isn't it?

Morgan:

You're a vet. You're supposed to have a really good sense of direction.

Holly:

I've never tried to invade America.

Morgan:

Good. Don't do that. They have guns.

Holly:

us-east-1, isn't it? Um, man, I gotta fix that in post. Haven't we. I'm just going to say us-east-1 and then I can, I can transpose that. Yeah. Um, yeah, it was just really interesting that that organization kind of, um, thought up to that level. Exactly. As you said previously, where we're talking about resilience down to a minute level, it's like, Hey, if this instance goes down, the system will be back up fully functioning within minutes. Some organizations might say, Hey, you know what? We can handle 24 hours downtime. That doesn't impact us.

Morgan:

Yeah, absolutely. I think it depends on the criticality of the service that's being kind of provisioned.

Holly:

I think another thing as well is seasonality. It depends what's happening. You know, if you are a media or I guess maybe like a gambling company, something like that. And it's the Grand National, your risk appetite might be different at those periods of time. So it really depends what's happening, you know, for some organizations, it might be like, Hey, we've had an outage, but it's a weekend. So it doesn't matter.

Morgan:

Yeah, definitely. That's actually the only reason we even have Amazon Web Services, they initially built a bunch of proprietary infrastructure to host their website, um, to deal with Black Friday. And they realized that for the rest of the year, they had all of this additional resource that they weren't really using. So they sort of started to rent it out and that's how their business model began.

Holly:

Thanks, Jeff. So we've talked a little bit about Netflix, the Simian Army, Chaos Monkey, Chaos Gorilla, those kinds of things, and there's more to that project of course. But I presume it's not just Netflix that are doing this, right? There's other companies.

Morgan:

Absolutely not. And also, um, the other thing to note as well is that, although we talk about these in an Amazon Web Services context, those tools can actually be cloud vendor agnostic if you're running, say, Azure or something on an orchestration platform. Um, so it plugs into that instead, and will have the same sorts of impact, just on the Azure equivalent, whatever that might be. So the other project, which is more recent and I guess looks at more granular experiments of what you were kind of asking about earlier, would be LinkedIn. Um, I actually came across their engineering blog a couple of weeks ago, and there are some really interesting blogs on there about how they kind of began that Chaos Engineering journey. So theirs is actually called Project Waterbear, which is named after the tardigrade, thus named, um, because it can survive in like deep space and deep water, super hostile environments. Um, so, you know, kind of interesting, that one. LinkedIn actually split down their approach into both application and infrastructure experiments, which is really cool, but then they also kind of sub-categorize the different sorts of experiments that they can do. So for app failure, you can enable, um, timeout, exception or delay failure modes. And then for infrastructure failure, they classify it into kind of half host failure, host offline, rack failure or data centre failure. And then they've written a little bit about a few of the different kinds of experiments that they've done with host level failures, with an intention in the future to kind of improve maturity and start to conduct larger scale experiments. So that concept of keeping the blast radius of your experiment, the potential impact of it, really small. And I think that blog is from like 2017, so it's kind of grown and matured since then, but yeah, really interesting to see how they kind of split that out.
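
To make the "timeout, exception or delay" failure modes a bit more concrete, here's a hedged, generic sketch of application-level fault injection; it isn't LinkedIn's implementation, just a decorator that randomly injects latency or an error into a hypothetical downstream call, with made-up rates:

```python
# Sketch: generic application-level fault injection, loosely in the spirit
# of delay/exception failure modes. Rates and delays are invented.
import random
import time
from functools import wraps

def inject_faults(delay_rate=0.1, error_rate=0.05, delay_seconds=2.0):
    """Wrap a downstream call and occasionally slow it down or fail it."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise RuntimeError("injected dependency failure")
            if roll < error_rate + delay_rate:
                time.sleep(delay_seconds)   # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(delay_rate=0.2, error_rate=0.1)
def fetch_recommendations(user_id):
    # Hypothetical downstream call; the page should degrade gracefully
    # (e.g. drop this widget) when it is slow or failing.
    return ["placeholder recommendation"]

if __name__ == "__main__":
    for _ in range(5):
        try:
            print(fetch_recommendations("user-123"))
        except RuntimeError as exc:
            print(f"degraded gracefully: {exc}")
```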

Holly:

You know, we started this conversation with, uh, a very simple example of a failure: a system level outage. And when you dig into these things, what failure actually is, or could be, it could be a huge number of things. I can give you an example from something that I've been working on recently, and that is taking payments through payment cards. And it sounds like a weird thing, like, what is the link here? But the link was working through, in what ways can that payment system fail. And not just from a resilience point of view, but just like, you know, the working system would be: a user presents their card, payment goes through, nothing to worry about. A user presents their card, the user doesn't have enough funds in the account, the payment doesn't go through. The user presents the card, there's some missing information, you know, it doesn't have a CVV or something like that, card's expired... And then you start getting into things like, um, the payment goes through, but then later on a chargeback occurs, or the payment goes through, but a fraud alert is raised, or a fraud alert could be raised later on. And then you can start getting into the real weeds of that kind of stuff. Like, I'm looking at applying machine learning to those transactions to try and, um, preemptively detect, it's like, hey, this might be fraudulent, but we don't know that it is, it's just our detection systems that pick that up. And I think this is one of the big things when it comes to Chaos Engineering: we're talking about failure as if it's like a yes-no problem, and it really doesn't have to be, does it? Like, failure could be any number of things.

Morgan:

Yeah. There's definitely so many, kind of, levels beyond that. And you can have like sub-categories and partial failure, and also you need to define what failure is as well. Like there are so many different definitions of what that could be. So if you look at like your traditional threat modeling, it could be, oh, what's the Microsoft one, STRIDE?

Holly:

Microsoft? I have no idea.

Morgan:

Oh my God. Okay. So let's not use that, um, denial of service...

Holly:

I'm running AWS on a MacBook. I was like, what, what do you want from me?

Morgan:

Well that's a windows machine! Right.

Holly:

It's an iPad!

Morgan:

Yeah. Um, yeah. So I think there's tampering, non-repudiation, loss of integrity of systems, denial of service. There's so many different definitions of what failure could be.

Holly:

Security vulnerabilities are a great way of just looking at things, breaking in weird ways. So I guess we've talked in theory about Chaos Engineering now, and I think we've maybe implied some of the benefits that that can, can come about from this, but can we, can we summarize, like how does Chaos Engineering benefit an organization?

Morgan:

Okay. Besides being really cool and probably vastly improving the quality of life of your engineers, not just because it's fun, but also because it promotes building self-healing infrastructure, it allows you to test your business continuity and disaster recovery responses under controlled conditions that you are kind of in the driving seat of. It acts as training for your crisis management and operations teams too. So a lot of organizations will run a table-top disaster recovery exercise annually, if they're regulated, um, just to kind of see how their incident management team would respond in that situation. This gives you the benefit that you can actually test your systems as well, so that in addition to kind of testing the people side of things and the actual process, you can also test the technology.

Holly:

There's also, it's not only testing that, but actively building muscle memory, right? So it's like, hey, when I blow a database up, why am I so chill about it? Chaos Engineering reduces anxiety because I can just think, like, we've been through this before. It's the same as, like, a lot of organizations might look at something like tabletop testing for their business continuity plan as quite a naive way of assessing those things. You know, sit down: what would we do if a ransomware attack hit? What would we do if the datacentre had a power outage? What would we do if we hosted in OVH Strasbourg and the building's on fire? By walking through those exercises, you build a little muscle memory. That's a very simple tabletop exercise, the very simple way of doing that. But, um, yeah, I just find it massively reduces anxiety in terms of the team being able to say, we've been through this before.

Morgan:

Yeah. I've also seen circumstances where exec-level members of an organization don't take the tabletop exercise seriously, or they'll drop out of it, or they'll prioritize another call or something else they've got on that day, because they know it's a theoretical exercise and therefore it's not their priority. If you run a chaos engineering experiment alongside that sort of tabletop exercise, it's actually happening, it's really failing that system. So they do need to be there, and they need to be dealing with it. And I think it can drive engagement.

Holly:

It's not only business leaders not caring, it's that it's so easy to lie to yourself. So we used to think, hmm, how long would it take me to rebuild a web server? I'd probably do that in half an hour. There's a big difference between that and actively trying to do it, and I'll give you an example that I think a lot of people might identify with. If you're revising for something like an exam and you have flashcards, and you look at a question and then turn it over, you can very often go, oh yeah, I knew that. But if you look at the flashcard and try to say the answer out loud before you turn it over, you very quickly realize that you've been lying to yourself and you didn't know those answers. And I think that's the thing that we're bringing in here with Chaos Engineering: it's not in theory, it's not "if this occurred, what would we do next?" It's, okay, let's break this thing and then actually see. So not only does it build the muscle memory, but it gives you the accuracy of those results, I think.

Morgan:

Yeah, absolutely. I think it's beneficial to run some disaster recovery testing in advance, especially if you're cloud native, so that you actually have those figures already, like stats for how long it'll take you to rebuild a web server. If you use infrastructure as code, say CloudFormation or Terraform templates or something, you can run those, deploy your environment again, and see how long it takes to come back up in a worst-case scenario like a region outage.
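
A rough sketch of timing that kind of rebuild with boto3 and CloudFormation; the stack name, region, and template file are placeholders:

```python
import time
import boto3

cfn = boto3.client("cloudformation", region_name="eu-west-1")

def time_stack_rebuild(stack_name: str, template_body: str) -> float:
    """Create a stack from an infrastructure-as-code template and time how long
    it takes to reach CREATE_COMPLETE, i.e. a worst-case rebuild from nothing."""
    start = time.monotonic()
    cfn.create_stack(StackName=stack_name, TemplateBody=template_body,
                     Capabilities=["CAPABILITY_NAMED_IAM"])
    waiter = cfn.get_waiter("stack_create_complete")
    waiter.wait(StackName=stack_name)
    return time.monotonic() - start

with open("web-tier.yaml") as f:   # hypothetical template file
    template = f.read()

print(f"Rebuild took {time_stack_rebuild('dr-test-web-tier', template):.0f}s")
```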

Holly:

When you say infrastructure as code, for people who've never heard that term before, what do you mean by that? Is that the dirty bash script that I've got on my desktop that's called build-webserver.sh, where I just run that and then a web server appears?

Morgan:

Unfortunately yes, it is.

Holly:

So this is an actual thing that I've got. Where, if a website goes down, the system should build itself again, right? The auto scaling group should just bring it back up from the AMI. It shouldn't be a problem, and things have gone really badly if I am ever building a web server manually. But it's still a thing that we test, because what if there's something wrong with the AMI? You know, what if we've pushed a code change that has updated the AMI, and now the auto-scaling is rebuilding with a bad image? So we still test that, and we do have build.sh, which will build a web server. And yeah, I guess technically that is infrastructure as code.
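
A sketch of that kind of experiment with boto3: terminate one instance behind an auto scaling group and time how long a healthy replacement takes to come into service. The group name is a placeholder, and this assumes you have permission to terminate the instance:

```python
import time
import boto3

ec2 = boto3.client("ec2")
asg = boto3.client("autoscaling")

def healthy_instances(group_name: str) -> set[str]:
    """Instance IDs that are in service and healthy in the auto scaling group."""
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name])["AutoScalingGroups"][0]
    return {i["InstanceId"] for i in group["Instances"]
            if i["LifecycleState"] == "InService" and i["HealthStatus"] == "Healthy"}

def kill_one_and_time_recovery(group_name: str) -> float:
    before = healthy_instances(group_name)
    victim = next(iter(before))                      # small blast radius: one instance
    ec2.terminate_instances(InstanceIds=[victim])
    start = time.monotonic()
    while True:
        now = healthy_instances(group_name)
        if victim not in now and len(now) >= len(before):
            return time.monotonic() - start          # replacement is in service
        time.sleep(10)

print(f"Recovered in {kill_one_and_time_recovery('web-asg'):.0f}s")
```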

Morgan:

Technically in the loosest sense.

Holly:

So what should it be? What are we talking about from a higher level of maturity? You mentioned CloudFormation and things like that. What is that, for those who've never come across it?

Morgan:

Um, CloudFormation is the AWS-specific infrastructure as code service. I guess, in a roundabout fashion, infrastructure as code is where you predefine the particular AMI, OS version, and things like that that your infrastructure should be using, maybe the size of the server that you want to provision, the number of servers in auto scaling groups. You can define everything down to security groups and access, IAM roles and things like that attached to these resources, in a fashion that makes it really easy to redeploy exactly the same environment time and again. So you won't accidentally have somebody who is manually deploying an EC2 instance use the wrong AMI that then isn't compatible with something else that you've got, like a process that already exists, if you've got dependencies. Infrastructure as code is really good for that reason. It allows you to develop a gold build of your entire estate.
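
As a small sketch of catching that "wrong AMI" drift with boto3; the AMI ID and the tag used to scope the check are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")
EXPECTED_AMI = "ami-0123456789abcdef0"   # the image your templates define (placeholder)

def instances_off_gold_build(tag_key="Environment", tag_value="production"):
    """List running instances whose AMI doesn't match the one defined in code."""
    paginator = ec2.get_paginator("describe_instances")
    drifted = []
    for page in paginator.paginate(Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]}]):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["ImageId"] != EXPECTED_AMI:
                    drifted.append((instance["InstanceId"], instance["ImageId"]))
    return drifted

for instance_id, image_id in instances_off_gold_build():
    print(f"{instance_id} was launched from {image_id}, not the gold build")
```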

Holly:

It's really good as well if you want to duplicate an environment for some reason, so not necessarily just chaos experiments. If you just want to try something, and you want the environment that you're trying it on to exactly reflect production, CloudFormation, or infrastructure as code more broadly, just simplifies that deployment.

Morgan:

Yes. And Terraform would be the vendor-agnostic solution, if you use something else, but it works within AWS too, so you can do the same sorts of things. Yeah, back to the benefits of Chaos Engineering: generally, it's really cool. It will make your engineers' lives so much easier. It makes on-call easier as well; it makes them less likely to be woken up at three o'clock in the morning to rebuild something. It highlights issues with your monitoring and alerting, if that exists. So say your auto-scaling broke and you didn't have something that would account for that, then these sorts of experiments would highlight that for you, and you could set up an alert that says, you know, this hasn't scaled and it should have done, or it's dropped below the minimum set out by your auto scaling. And it just generally improves your understanding of your systems, how they're architected, and how your assets interact, which is super beneficial as well. I think a lot of the issues that companies have with legacy infrastructure and systems come from knowledge loss, from staff turnover, from people that built those systems 20 or 30 years ago who aren't around anymore, so people might not understand how those systems work. Whereas with these sorts of experiments, and the benefit of being cloud native, with Amazon pushing updates all the time and it being so easy to gamify and automate everything, you can continually refresh that knowledge. So it doesn't build up to a point where you have all of that technical debt and you're dealing with outages all the time and trying to support things that are struggling.
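
One way to catch the "it should have scaled and didn't" case is a CloudWatch alarm on the group's in-service count. A sketch with boto3, with the group name, threshold, and SNS topic as placeholders; group metrics collection has to be enabled on the auto scaling group for this metric to be published:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def alert_when_below_minimum(group_name: str, minimum: int, sns_topic_arn: str):
    """Alarm if the number of in-service instances drops below the ASG minimum."""
    cloudwatch.put_metric_alarm(
        AlarmName=f"{group_name}-below-minimum",
        Namespace="AWS/AutoScaling",
        MetricName="GroupInServiceInstances",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": group_name}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,                 # three minutes below minimum
        Threshold=minimum,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[sns_topic_arn],        # notify whoever is on call
    )

alert_when_below_minimum("web-asg", 2, "arn:aws:sns:eu-west-1:111122223333:on-call")
```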

Holly:

Yeah. You said people built these systems 20 or 30 years ago. To be clear, if you built a system and six months pass, you've probably forgotten a lot of the reasons that you did things, and a lot of the documentation is probably outdated. So yeah, I take the point: if it was a long time ago, there's a benefit here. Even if it wasn't that long ago, there's still probably a benefit here. You know, for a lot of people, if they deploy a system and then don't think about it for a long time, you just kind of forget how to use that skill, right?

Morgan:

Yeah. It might not even be the same person. It might just be that a particular function or process was built by a different team. It could be that simple.

Holly:

So if somebody has been listening in and they're exactly where I think they should be at this point, which is, that sounds really fun, how do I convince my boss to let me do that? How can people get started with this stuff if they want to individually understand more, but also if an organization thinks that this is something for them, where do you begin?

Morgan:

So there are a bunch of really cool resources on this. Somebody that I would recommend checking out is Adrian Hornsby, who has written quite a lot about Chaos Engineering on Medium and has given a few talks. I actually went to a talk at the AWS Summit in 2019, which is where I first heard of Chaos Engineering. It was called Creating Resiliency through Destruction, and that's on YouTube if anybody wants to watch it for a little bit more information. It gives you a bit more background about Netflix and Chaos Monkey and how that began, and it actually has a live demo of some chaos engineering in it too. And then there are other conferences and things that Adrian's spoken at, and there are other speakers that you can find that way. There are also lots of open source tools available. So check out the Netflix GitHub repositories for Simian Army. I don't think Simian Army has been updated since about 2018, when they split it out into Chaos Monkey and Security Monkey and a couple of other tools, but it still functions and works if you want to get started with that. Also, Amazon took notice of these sorts of experiments and of companies starting to do this on their infrastructure, and in March this year announced the AWS Fault Injection Simulator, which can run these sorts of experiments for you too. So that's a really cool thing that you should check out.
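
If you go the Fault Injection Simulator route, a sketch of kicking off an experiment from an existing template with boto3; the template ID is a placeholder, and this assumes you've already defined the experiment template and its stop conditions in FIS:

```python
import uuid
import boto3

fis = boto3.client("fis")

def run_experiment(template_id: str) -> str:
    """Start an AWS FIS experiment from a pre-built experiment template."""
    response = fis.start_experiment(
        clientToken=str(uuid.uuid4()),       # idempotency token
        experimentTemplateId=template_id,
    )
    return response["experiment"]["id"]

experiment_id = run_experiment("EXTxxxxxxxxxxxxxx")  # placeholder template ID
status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
print(experiment_id, status)
```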

Holly:

I think the thing for me as well is, whilst we talk in theory about these huge things like tearing systems down, or, hey, let's tear a region down and see what happens, you can just start really small as well, can't you? You can do some really small experiments. We've talked throughout this about testing in production; you of course don't have to start there. Certainly if you've got infrastructure as code, or if you have a test environment that you think is pretty close to production, you can do some experiments there and see how those things happen. Or you can just spin up some test infrastructure specifically for experiments. I don't think people need to jump into the deep end; small experiments are just fine.

Morgan:

Absolutely. Do not go into your place of work and destroy a bunch of servers and tell them that Holly and Morgan told you to do that. We accept no responsibility for any service outages that are caused as a result of you listening to this podcast.

Holly:

Next time you accidentally break a server, just tell your boss: it's Chaos Engineering. Do we not do this here? Yeah. One of the things I wanted to point out as well is that organizations can do this in a very controlled way. There's also the idea of Game Days. AWS uses this term in a couple of different ways: AWS themselves run game days, but your organization can also run a game day. If you're not at the point where experiments are happening in real time frequently, what you can do is just take a day where you say, okay, today we're going to do some experiments, in the same way that we talked about a second ago where you might say, okay, today we're going to do a tabletop exercise of our disaster recovery plan, and prepare what scenarios you're going to run, where you're going to test, those kinds of things. You can do game days where you get the team together and say, okay, we're going to run some experiments, and you can document that stuff ahead of time: we're going to test these things in these ways, just as a way of getting started with it. So yeah, don't think you have to jump into the deep end, and don't think you have to do this every single day. You can just set some time aside, give it a go, and see if it benefits the business. And I think the big thing from my experience with playing around with this stuff is that a lot of stuff is going to break, and that's great, right? There are going to be some problems. You're going to run some experiments and you'll say, okay, we'll break a web server and see how quickly the system recovers, and you break a web server and then something else that you thought was unrelated breaks in a way that you didn't expect. And it's like, hey, yeah, that's fine. That's also Chaos Engineering. That's still meaningful: we didn't think these two systems interacted in this way, but it turns out they do, and now we can do something about it.
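
A sketch of what documenting a game day ahead of time could look like, just as plain data; every field and value here is made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    hypothesis: str          # what you expect to happen
    method: str              # what you will actually break, and where
    blast_radius: str        # how you are limiting the impact
    abort_condition: str     # when you stop and roll back
    observations: list[str] = field(default_factory=list)  # filled in on the day

game_day = [
    Experiment(
        hypothesis="Terminating one web server is invisible to users",
        method="Terminate one instance in web-asg in the test account",
        blast_radius="One instance, test environment only",
        abort_condition="Error rate above 1% for more than 5 minutes",
    ),
    Experiment(
        hypothesis="A replacement instance is in service within 10 minutes",
        method="Same termination, timed with the recovery script",
        blast_radius="One instance",
        abort_condition="No replacement after 30 minutes",
    ),
]
```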

Morgan:

Yeah, absolutely. Related to that, if you plan to conduct any of these experiments at your place of work, as an example, or as part of a team, you should always have a retro afterwards. Hold an incident retro as you would with a standard outage or failure, and discuss the impact of it, what your key takeaways were in terms of understanding your architecture, how you would respond differently in future, whether there's a missing process or a particular escalation point or some alerting that needs to be implemented so that you can improve your response times and minimize the amount of downtime that users notice, or the time it takes you to restore systems. So yeah, don't just break some infrastructure and then fix it on your own, because that sounds unnecessarily stressful. Think about it first, come up with some intentions, plan it a little bit, and then hold a retro.

Holly:

Yeah, and some of it as well can just be, oh, this person, we've discovered, is a bigger single point of failure than we realized. If you have to keep going to the same team member to deal with some systems, maybe they're the only person who has a certain privilege, for example. Or it might just be that we can do this thing, but it takes us longer than we realized, because maybe there's some missing documentation or something. When I was looking through AWS's Game Days for the show notes, they had a few examples of silly things that had gone wrong during game days that were, now that you point them out, good learning moments. I wrote one in the show notes, which was spending 30 minutes debugging why you can't RDP into an instance, going through all sorts of things, networking tricks, firewall and security group settings, all of those kinds of things, to find out it's a Linux instance. And that is a fair takeaway, if you have processes that can lead to that kind of thing, or if you have poor documentation where that kind of thing isn't written down. I mentioned earlier that you're trying to build the team's muscle memory, but it can impact a whole world of things as well: processes, people and documentation.

Morgan:

Yeah, absolutely. Just small things as well, like labeling your instances or tagging them properly. I had an issue once, when I was working in the cloud team, where the Lambda script that auto-tagged our instances, so that we were down as the owners of those instances, broke. I deployed an instance and then it wouldn't let me terminate it, because I hadn't been tagged as the instance owner, and that was the permission you needed in order to be able to terminate it. So we found that problem and fixed it, and then knew to look out for it in the future. But again, it's just such a small thing that you don't really expect to happen.
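
For context, that sort of auto-tagging Lambda might look something like this sketch, assuming an EventBridge rule on the RunInstances CloudTrail event; it's an illustration of the pattern, not the script in question:

```python
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    """Tag newly launched EC2 instances with the identity that launched them."""
    detail = event["detail"]                              # CloudTrail event detail
    owner = detail["userIdentity"]["arn"]                 # who made the RunInstances call
    instance_ids = [item["instanceId"]
                    for item in detail["responseElements"]["instancesSet"]["items"]]
    ec2.create_tags(
        Resources=instance_ids,
        Tags=[{"Key": "Owner", "Value": owner}],
    )
    return {"tagged": instance_ids, "owner": owner}
```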

Holly:

Yeah. And you know, as we've been saying throughout, building up the monitoring so that if a service stops, you will know how long it takes for you to realize that that service has stopped. That's the end, by the way.

Morgan:

This has been Artificial Chaos podcast. Thank you for listening.

Holly:

It's not the, is it not The Artificial Chaos podcast?

Morgan:

No, just how do you not know what it's called? You registered the domain.

Holly:

I registered a domain. Not the domain. We done here.