Cloudflare Apologizes After Outage Takes Out Large Chunks of the Web

Cloudflare recently suffered an outage that took large chunks of the web offline. The company has since apologized.

Cloudflare is currently on the defensive after suffering an outage. Cloudflare is, of course, a DDoS (Distributed Denial of Service) protection service that helps websites stay online. If a site gets hit with a DDoS attack, the service serves the website’s pages, ensuring that the site doesn’t actually suffer from downtime. The company also offers other services to help ensure overall uptime for websites.

Unfortunately for Cloudflare, this time it was the company’s own network that went down. On the surface, this is little more than an ironic moment for the company. However, the outage had real consequences: large chunks of the web were taken offline as a result. Heavy reports that, in the process, Discord went offline:

Discord, the popular instant messaging and communication app, has gone down Friday, the service confirmed.

According to the service, users are having trouble connecting to Discord because of an “upstream internet issue.” They are currently working on a fix.

Downdetector saw a spike in reports of Discord users having a problem around 5:08 p.m. EST.

According to the status website for Discord, the problem lies with a major outage for Cloudflare, the service’s distributed denial of service (DDoS) protection provider which proxies all their traffic throughout their network.

In addition to that, the website for Discord is also down.

Discord wasn’t the only service affected by the outage. Australian publication CRN reports that cloud workload services were also hit:

Many customers use Cloudflare as a proxy to protect workloads running on public cloud infrastructure and reduce their latency, and Friday’s outage appears to have impacted services hosted on all the major cloud providers.

Downdetector.com, which tracks in real time website outages, showed a spike in problems reported for Amazon Web Services, Microsoft Azure and Google Cloud Platform during the Cloudflare outage.

TechCrunch is pointing out that other services were affected by the outage:

Discord, Feedly, Politico, Shopify and League of Legends were all affected, giving an idea of the breadth of the issue. Not only were websites down but also some status pages meant to provide warnings and track outages. In at least one case, even the status page for the status page was down.

Porter Medium on Twitter is offering a screenshot showing several other sites suffering from downtime:

https://twitter.com/PorterMedium/status/1284242231558377474

Cloudflare has since resolved the issue. It turns out that the outage was the result of a technical issue. Cloudflare co-founder and CEO Matthew Prince issued some comments on the outage:

https://twitter.com/eastdakota/status/1284253034596331520

We had an issue that impacted some portions of the @Cloudflare network. It appears that a router in Atlanta had an error that caused bad routes across our backbone. That resulted in misrouted traffic to PoPs that connect to our backbone. 1/2

We isolated the Atlanta router and shut down our backbone, routing traffic across transit providers instead. There was some congestion that caused slow performance on some links as the logging caught up. Everything is restored now and we're looking into the root cause. 2/2

— Matthew Prince (@eastdakota) July 17, 2020

On the official website, Cloudflare offered a bit more detail as to what happened. In short, it was a configuration error:

Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network. Because of the architecture of our backbone this outage didn’t affect the entire Cloudflare network and was localized to certain geographies.

The outage occurred because, while working on an unrelated issue with a segment of the backbone from Newark to Chicago, our network engineering team updated the configuration on a router in Atlanta to alleviate congestion. This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta. This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.

The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre. Other locations continued to operate normally.

For the avoidance of doubt: this was not caused by an attack or breach of any kind.

We are sorry for this outage and have already made a global change to the backbone configuration that will prevent it from being able to occur again.

Observers watching the events unfold are raising some interesting questions. With so many services relying on Cloudflare, a single configuration error can take much of the web offline. Does this mean too many services are relying on Cloudflare? Do we depend too much on one company to maintain the health of the web in the first place? After all, relying on a single provider for so much is generally a bad idea when it comes to uptime and security.

Of course, there is the other side of the issue: things can get pretty bad without any kind of DDoS protection service. Any random nitwit with knowledge of DDoS-for-hire operations can take out a website with a few mouse clicks. Thwarting such operations isn’t easy from a technical standpoint. So far, the best bet seems to be to rely on a Content Distribution Network (CDN) and hope for the best. That’s really the best solution websites in general have for the time being. Things could be worse, but they could always be better too.

Drew Wilson on Twitter: @icecube85 and Facebook.