Preparing to Fail Fast So You Can Recover Faster

The principle of fail fast is either the best thing since the transistor or nothing but hot air. It depends on the size of your organization and the cohesiveness of your teams. If your team members have a strong working relationship, and dev is well integrated with everyday work company-wide, you already have a good foundation for this particular agile thinking.

Most companies that have grown beyond startup size, and even some startups, may find this idea a bit jarring. It is hard not to interfere when a test comes with risk, and nearly every test comes with some form of risk. Then again, so does every deployment.

That’s why today we want to look at methodologies that help us fail fast, recover faster, and learn fastest.

What is Fail Fast?

Fail fast is an agile principle that encourages experimentation in order to reach a desired outcome. What is experimentation? Well, it’s not throwing stuff at the server to see what causes a short; it is a thoughtful examination of a problem, with a solution that leans more on experimentation than proven methodologies.

Fail fast relies on near-immediate reporting of errors, and so requires some level of testing and oversight built into the process.
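As a rough illustration of near-immediate error reporting at the code level, the sketch below validates its input up front and raises on the first problem instead of carrying a bad value deeper into the system. The function name `load_config` and the `host` and `port` settings are hypothetical, chosen only for this example.

```python
def load_config(raw: dict) -> dict:
    """Validate settings up front and raise on the first problem found.

    The setting names here are hypothetical examples.
    """
    for key in ("host", "port"):
        if key not in raw:
            # Report the missing setting right away instead of failing later.
            raise KeyError(f"missing required setting: {key}")

    port = raw["port"]
    if not isinstance(port, int) or not (1 <= port <= 65535):
        raise ValueError(f"port must be an integer in 1..65535, got {port!r}")

    return {"host": raw["host"], "port": port}
```

The point of the design is that a bad configuration stops the process at startup, where the error message names the cause, rather than surfacing later as a confusing connection failure.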

Fail fast works because of backups, monitoring, and general awareness. When you deploy, you load test and adjust. When you’re small, failing fast is a very attractive policy. Facebook, Reddit, and lots of other tech companies have hacked their way to the top of their respective industries on fail fast methodologies.

Fail Fast Shortcomings

The challenges of fail fast tend to arise as you scale. Infrastructure is expensive, and so are people and resources. Mistakes become very costly, and the benefits of learning and education tend to feel less important by comparison.

Fail fast also requires that teams can get a decision maker’s attention and work with other team members as needed. That can be a big ask.

As teams and management grow, coordination can become a difficult task. It’s hard to get the attention of a decision maker from another department, and even more challenging to ask for the resources you need to accomplish your goals. The ability to reach out to others and get what you need is necessary for fail fast to work.

Failing Fast and Visibly

So what needs to happen in your organization to make fail fast work for you?

Thoughtful Implementation

CI/CD provides a methodology with some of the same principles that fail fast thrives on:

  • Rapid deployment of iterative changes
  • Centralized code bases
  • Ample testing

One adjustment to make to your pipeline is to improve your efforts at unit testing. I can hear you groaning from here, but the fact is unit testing can improve development time without incurring too much in cost.

Check it out: This blog goes into great detail about how to write unit testing into your development. 

Good unit testing tests for a specific case, not the entire application. If the behavior changes, the test changes with it, but the behavior under test stays independent of the rest of the application. Good unit testing is therefore as much a principle of development as it is of design. That is fine for simple stuff, but what about more complex applications? That’s a different question, because what you are really asking about is risk: how expensive is testing versus not testing, or testing inadequately?
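To make "a specific case, not the entire application" concrete, here is a minimal sketch. The function `cart_total` and its discount rule are hypothetical and stand in for any small unit of behavior. Each test checks exactly one case, and the tests are written as plain functions so a runner such as pytest can collect them.

```python
def cart_total(prices, discount_rate=0.0):
    """Sum item prices and apply a fractional discount (0.0 to 1.0)."""
    if not 0.0 <= discount_rate <= 1.0:
        raise ValueError("discount_rate must be between 0.0 and 1.0")
    return round(sum(prices) * (1.0 - discount_rate), 2)


def test_total_without_discount():
    # One specific case: no discount applied.
    assert cart_total([10.00, 5.50]) == 15.50


def test_total_with_discount():
    # One specific case: a 25% discount on a single item.
    assert cart_total([100.00], discount_rate=0.25) == 75.00


def test_rejects_invalid_discount():
    # One specific case: an out-of-range discount fails immediately.
    try:
        cart_total([10.00], discount_rate=1.5)
    except ValueError:
        return
    raise AssertionError("expected ValueError for discount_rate=1.5")
```

None of these tests needs a database, a web server, or the rest of the shop to run, which is what keeps them cheap enough to run on every change.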

Working Across Teams

As the saying goes, you give and you get. Your team has to rely on the work of others to make fail fast work. The more bureaucracy in place, the less effective fail fast becomes. The temptation is to utilize Scrum, because it sounds great to just have meetings and break through those roadblocks.

True devops is focused on just getting the job done. Meetings aren’t the only way to organize; sometimes a more ad hoc approach is useful as you establish a fail fast system that works. Communication needs to be streamlined, with accountability in place.

Multi-Tier Web and Application Monitoring

It’s very easy to have several dozen points of failure built into a growing application. Third-party services, anomalies between you and the end user, server outages, and even bugs all present risks to your stability.

We’ve already written that 100% uptime is a myth, so having some oversight on these points of failure is a must.

You probably already monitor items like system resources, but what about connectivity from outside your home domain? If one of your product pages went down, would you know about it, or would a customer need to inform you? If your shopping cart provider failed, do you have monitoring in place to alert you of that downtime so you can communicate with your customers?
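A very small version of an outside-in check can be sketched with Python's standard library. This is only an illustration of the idea, not a substitute for a monitoring service: the function name `check_url` is hypothetical, and the `opener` parameter is an assumption added so the check can be exercised without a real network call.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (is_up, detail) for a single HTTP GET against url."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return 200 <= status < 400, f"HTTP {status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # DNS failure, refused connection, timeout, and similar problems.
        return False, f"unreachable: {err}"
```

For the result to say anything about connectivity from outside your home domain, a check like this has to run from a machine outside your own network, on a schedule, with its failures routed to someone who can act on them.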

These questions form a solid foundation for approaching your search for web monitoring providers.

Cost Benefitting Failure

Alongside fail fast is “recover faster”. We would argue “fail cheap” is important as well, because let’s face it: failure is tough to pitch to your boss. Even a slight “this might not work” could cause enough apprehension to stall a project or force a different approach.

Fortunately, you are in the devops Wild West, so you probably have some free rein that traditional development doesn’t enjoy. Your organization likely already understands some of the value of agile development. So how can you adequately judge failure and its cost?

A good way to start is with a mockup or an outline of your project. Partner with someone in design, learn Balsamiq or Google Drawings, or learn flowchart software. Use anything that allows you to quickly translate your ideas into a form of action, so you can approach every project cognizant of potential technical pitfalls and with a better visualization of what can go right and wrong.

The Hypothesis of Failing Fast

What do you need to do to get to where you want to go? To start, you need a hypothesis that is fully fleshed out and expresses the intent behind your fail fast experiment.

Let’s flip that around a bit: at the outset, knowing why is more important than knowing what works.

With devops tending towards an ownership-oriented culture, where you build and own your systems, it’s easy to extrapolate application ownership into a hyper-specialized environment. Fail fast isn’t about building your application better. It’s about learning its inner workings, understanding what it can and cannot do, and expressing those ideas to your team.

Failing fast is meaningless if you are not documenting the how of your application’s existence.

When to Fail Fast

Just like agile doesn’t work for every scenario, failing fast doesn’t either. You need to consider your code’s complexity, and the ability to test it. Automated testing can help here, because you can test all the other moving parts while you work on a more selective approach to unit testing.

Yes, the automated tests also take time. This is cost benefit at work. Which is more expensive to you and your organization: testing, or waiting for failure?

You won’t know if you don’t test and fail a little!

Richard Bashara is Uptime.com's lead content marketer, working on technical documentation, blog management, content editing and writing. His focus is on building engagement and community among Uptime.com users. Richard brings almost a decade of experience in technology, blogging, and project management to help Uptime.com remain the industry leading monitoring solution for both SMBs and enterprise brands. He resides in California, enjoys collecting and restoring arcade machines, and photography.