What Makes SSL Fail, and What Can SREs Do About It?
TLS (and the previously used SSL) protocols make the web go round. They are fundamental when establishing a link between two computers, creating a very special mathematical relationship signified by the all-encompassing gesture of friendship: the handshake.
So fundamental, in fact, that we probably take them for granted when we shouldn’t. The user relies on TLS encryption every day to protect data and the integrity of a session.
Today, we’ll look at the consequences of TLS failure, as well as some of its causes and what SREs can do about them.
Why does TLS matter?
Most people would probably be surprised to learn that their internet activity is essentially wide open and visible to anybody who wants to snoop on it. Data packets travel from your computer to their destination, starting with your local coffee shop’s Wi-Fi, and going via many essentially random network servers, routers, and links. Along the way, unless you take some special precautions (like TLS), basically anyone can see and read those packets, modify them, or even inject extra packets of their own.
That should give us all pause for thought. Sure, you don’t have anything particular to hide, except the Christmas present you bought your partner from Amazon, or your 3am drunk texts to your ex, or what you Googled in an unguarded moment last Friday night. But it would be impossible to do any kind of commerce or business on the net without secure, end-to-end encryption, and that’s why we have TLS. So it had better work, hadn’t it?
How does TLS work? The short short version
A TLS certificate is a piece of digital data that proves the identity of its owner, the way your driving licence proves who you are. A driving licence is effectively a certificate issued by the state, proving that (among other things) the holder’s identity has been verified. There are special ways you can check a driving licence to ensure that it’s real, and hasn’t been modified so that the holder can masquerade as ‘McLovin’, the 25-year-old Hawaiian organ donor.
If the certificate fails those checks, or has simply expired, you shouldn’t trust the holder (or, at least, not sell them booze). If you visit a website with an expired or invalid TLS certificate, you shouldn’t trust that the site is what it says it is. It could be anybody; even McLovin.
But a TLS certificate does more than prove identity. Its key pair is also used to set up the encrypted conversation between your browser and the server. Over a properly validated TLS connection, nobody on the network path can read or modify your web traffic, and a malicious server can’t impersonate Google, or Amazon, or the DMV.
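As a rough illustration of those checks, here is a minimal sketch using Python's standard `ssl` module. The function names are our own, and the hostname you pass in is up to you; the default context is what performs the chain, expiry, and hostname validation.

```python
import socket
import ssl


def make_verifying_context() -> ssl.SSLContext:
    """Return a client context that checks the chain and the hostname."""
    # create_default_context() loads the system trust store and, for client
    # use, requires a valid certificate and enables hostname checking.
    return ssl.create_default_context()


def fetch_peer_certificate(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect to host:port and return the validated certificate as a dict.

    Raises ssl.SSLCertVerificationError if the certificate is expired,
    untrusted, or issued for a different name.
    """
    context = make_verifying_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

Calling `fetch_peer_certificate("example.com")` (a placeholder host) either returns the certificate fields or raises, which is roughly the decision a browser makes before showing a warning page.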
What Makes TLS Fail, And What Does it Mean to the End User?
A web admin is going to feel the effects of a failing SSL certificate pretty rapidly. Even a simple configuration change can take services down until the new settings take effect.
Modern web browsers will protect users from sites with expired or misconfigured certificates by blocking the site completely, or showing an alarming warning message. But the user has more ominous concerns lurking, such as man-in-the-middle attacks that steal personal information. A TLS failure leaves the end user at risk of fraud and identity theft.
All of this pressure comes crashing down on the company, which suffers a blow to user trust as it dons the “insecure” label. Doubly so for companies that become the subject of media coverage surrounding the outage. With everyone looking to point fingers, and your site failing the entire time, you’re losing sales and revenue as users abandon it.
At best, you’ve suffered from an entirely preventable issue. At worst, you’re in the midst of a cascade of failures and fielding tens of thousands of user reports about the issue.
Getting and renewing TLS certificates used to be difficult. You had to install special software tools, run the tools, paste the output into a web form, fill out more paperwork, enter your credit card, pay a modest to outrageous amount of money, get an email, copy some text from the email onto your server, run more commands, restart your web server, and so on (we’ve omitted some of the more boring parts). And every year, or however long the certificate lasted, you had to do it all over again. Make a mistake at any point in the process and your site could be down or, worse, insecure.
But things have changed… significantly. Let’s Encrypt is an independent, nonprofit certificate issuer which supplies commerce-grade TLS certificates to anyone who wants them, for free, forever. Even better, they provide a service which makes it nearly effortless to get and renew TLS certificates automatically. Let’s Encrypt provides the certificates for 190 million websites today, and they just issued their one billionth certificate.
There are tools to manage Let’s Encrypt certificates for all major operating systems, web servers, and so on. Whatever your platform, you can set it up to get TLS certificates for all your sites on all your domains, install them automatically to your web server, and renew them as necessary, without you lifting a finger or paying a penny. Who says we’re not making progress as a species?
On the other hand, it means that a secure, valid TLS certificate is now the baseline expectation for all websites. There’s essentially no longer any excuse for not having one. So how and why do TLS failures still happen in 2020?
Meet the CAB
Let’s turn to the CA/Browser Forum (short for Certification Authority Browser Forum), which is the de facto regulatory body that serves as the foundation of the SSL/TLS industry. It creates baseline requirements that anyone who issues TLS certificates must follow.
Among the various requirements that govern TLS certificates is one stating that a certificate’s lifespan must not exceed 825 days (about 27 months).
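To make the lifespan rule concrete, here is a small sketch that measures a certificate's validity period from its `notBefore` and `notAfter` fields, in the format returned by Python's `ssl.getpeercert()`. The 825-day constant is our assumption for how the 27-month limit is expressed in days, and the function names are illustrative.

```python
import ssl

# Assumption: the 27-month lifespan limit expressed in days.
MAX_VALIDITY_DAYS = 825
SECONDS_PER_DAY = 86400


def validity_days(cert: dict) -> float:
    """Return the certificate's total validity period in days."""
    not_before = ssl.cert_time_to_seconds(cert["notBefore"])
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return (not_after - not_before) / SECONDS_PER_DAY


def exceeds_lifespan_limit(cert: dict, max_days: int = MAX_VALIDITY_DAYS) -> bool:
    """True if the certificate is valid for longer than max_days."""
    return validity_days(cert) > max_days
```

Passing a different `max_days` lets the same check follow whatever limit browsers and the Forum settle on.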
If a certificate authority violates these requirements, as Symantec once did, browsers can stop trusting the certificates it issues. This is catastrophic, “Bring out the President for a rousing speech to the troops before they charge into a fight they can’t win” level bad.
Symantec’s certificates have run afoul of Google more than once, but the failures in 2015 and again in 2016 were fairly severe. The number of bad certificates measured in the hundreds of thousands, and Google essentially declared Symantec unfit to issue more certificates.
Thankfully, Google does not unilaterally decide the outcome in instances like this, but one major company’s decision can reshape the whole landscape.
How TLS/SSL Certificates Fail
There are garden-variety failures and the more interesting ones. Symantec falls under the category of interesting, as in “may you live in interesting times”. Standard failures can be attributed to:
- Expiry
- Mixed Content
- Name Changes
Expiry was the root cause of the recent failure with Microsoft Teams. The failure is being painted as a rookie mistake, but was it? Think about the complexity in your organization, with multiple engineers plugging away at various systems. It’s not so hard to imagine someone installing a certificate and, like any human, forgetting to document its renewal date.
Apple plans to have Safari enforce a shorter maximum lifespan of 13 months, which could be either a blessing or a curse. In theory it would keep us all a little more vigilant about how we secure our sites. In practice? It could lead to some adjustment pains like the Teams incident.
By default, the Uptime.com SSL check alerts you when a certificate is within 20 days of expiry. This once-per-day check does more than measure expiry, though. It can also alert you when your certificate is not secure, or if it fails a standards test.
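The expiry part of such a check is simple arithmetic. The sketch below is not Uptime.com's implementation; it is a minimal illustration, with our own function names, of how a 20-day alert threshold could be applied to a certificate's `notAfter` field as returned by Python's `ssl.getpeercert()`.

```python
import ssl
import time
from typing import Optional

ALERT_THRESHOLD_DAYS = 20  # mirrors the 20-day default mentioned above
SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the certificate expires (negative if expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def should_alert(not_after: str,
                 threshold_days: float = ALERT_THRESHOLD_DAYS,
                 now: Optional[float] = None) -> bool:
    """True when the certificate expires within threshold_days."""
    return days_until_expiry(not_after, now) <= threshold_days
```

The optional `now` argument exists so the check can be tested against a fixed clock; in a scheduled job you would leave it unset.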
Then there’s failure of the unknown knowns. That site that your company debuted five years before you were hired? The one that used Flash and had all those sweet animations? Someone is probably still paying to keep that site up and running, and it’s ticking away just waiting to break.
Someone in your organization could change providers without giving you and your team a heads up. Think it couldn’t happen to you? No matter how large or small, we all face the same challenge with communication and transparency. Documenting everything is a monumental necessity.
Even if it’s outside your knowledge, maybe especially so, it can fail and hurt your organization.
Taking SSL Failure Seriously
With a greater understanding of the problem, we can consider ways to attack it. SSL/TLS monitoring removes much of the risk inherent in relying on human intervention. At scale there are just too many ways to miss something so crucial.
Get specific with your monitoring. Is it enough to monitor TLS/SSL for your root domain? What about applications you host? Email? All of these endpoints utilize encryption that you need to monitor.
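One way to get specific is to keep an explicit inventory of TLS endpoints and probe each one. The sketch below assumes a hypothetical inventory (the `example.com` names and labels are placeholders) and only covers services that speak TLS from the first byte, such as HTTPS on 443, SMTP on 465, and IMAP on 993. Services that upgrade with STARTTLS would need a protocol-aware probe instead.

```python
import socket
import ssl
from typing import Callable, Dict, Iterable, NamedTuple


class Endpoint(NamedTuple):
    host: str
    port: int
    label: str


# Hypothetical inventory; replace with your own hosts.
ENDPOINTS = [
    Endpoint("example.com", 443, "root domain"),
    Endpoint("app.example.com", 443, "hosted application"),
    Endpoint("mail.example.com", 465, "SMTP over TLS"),
    Endpoint("mail.example.com", 993, "IMAP over TLS"),
]


def tls_probe(host: str, port: int, timeout: float = 5.0) -> None:
    """Complete a verified TLS handshake, raising OSError on any failure."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass  # handshake succeeded, so the certificate validated


def check_all(endpoints: Iterable[Endpoint],
              probe: Callable[[str, int], None] = tls_probe) -> Dict[Endpoint, str]:
    """Probe every endpoint and return a map of failures to error messages."""
    failures: Dict[Endpoint, str] = {}
    for endpoint in endpoints:
        try:
            probe(endpoint.host, endpoint.port)
        except OSError as exc:  # covers ssl.SSLError and connection errors
            failures[endpoint] = str(exc)
    return failures
```

Taking the probe as a parameter keeps the inventory logic separate from the network call, so the same loop can be exercised with a stub in tests.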
Here at Uptime.com, we usually recommend a Domain Health check. You can create 10+ checks related to your server infrastructure, which include checks for both SSL and DNS (the other hidden killer).
Find a timeframe for alerting that works for you and your organization. What’s the turnaround at your company for getting a certificate renewed? What about 50 or 100 certificates? Figure out that timeframe and set your alerting to report a certificate’s expiry within it. If the margin for failure is too narrow, you’ll find yourself arguing with finance for a necessary cost in the hopes that something doesn’t crash before you get what you need.
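As a back-of-the-envelope aid, the sketch below turns those questions into a number. The formula and every parameter in it (days per renewal, how many renewals can run in parallel, a safety factor) are our own assumptions for illustration, not an established rule.

```python
import math


def alert_window_days(cert_count: int,
                      days_per_renewal: float,
                      parallel_renewals: int = 1,
                      safety_factor: float = 2.0) -> int:
    """Estimate how many days before expiry alerts should start.

    Assumes renewals are handled in batches of parallel_renewals, each batch
    taking days_per_renewal, and pads the total by safety_factor.
    """
    batches = math.ceil(cert_count / parallel_renewals)
    return math.ceil(batches * days_per_renewal * safety_factor)
```

For example, under these assumptions a single certificate with a three-day turnaround suggests a 6-day window, while 50 certificates renewed ten at a time suggests 30 days.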
Finally, send the alerts to those who need them most. Who is in charge of SSL for your organization? Who keeps track of the servers and handles their upkeep? If that’s you, set yourself up with a central dashboard or push notifications so you’re never caught off guard.
SSL failure is preventable, but you need to know the points of failure.