“Cloud has changed the nature of outages.” – Miles Ward

There are hundreds of cloud and network provider outages every week, far more than most people realize. Weekly roundups typically cover only the most notable ones, usually three or four. How does this affect you, your organization and your customers? Were you aware of the magnitude?

Let’s start with the top 10 global outages of 2021

Slack

Perhaps in another year, a work messaging system going down wouldn’t have been much of a big deal. But in 2021, with millions working from home, it’s a disaster.

On January 4, messaging service Slack suffered a major outage in the U.S., the U.K., Japan, India and Germany, on the first day back to work and school for millions.

Slack attributed the outage to network scaling problems with the AWS Transit Gateway, which didn’t scale fast enough to accommodate the spike in demand for Slack’s services as millions returned to work and school after the holidays.

LinkedIn

The Microsoft-owned social network experienced issues at the beginning of this year.

The company didn’t give a specific reason for the outage, though it did jokingly suggest the issues may have been due to a WandaVision character.

It’s rare for LinkedIn to experience outages. LinkedIn last experienced widespread issues in January last year when users couldn’t post to the service or make new connections. Their last outage affected the entire website, and it wasn’t loading for most users at all.

Amazon

A major outage disrupted Amazon’s cloud services on December 7, temporarily knocking out streaming platforms Netflix and Disney+, Robinhood, a wide range of apps and Amazon.com Inc’s e-commerce website as consumers shopped ahead of Christmas.

“Many services have already recovered; however, we are working towards full recovery across services,” Amazon said on its status dashboard.

Amazon said the outage was related to network devices and linked to an application programming interface (API), a set of protocols for building and integrating application software.

Microsoft Teams

February saw two separate issues with the Microsoft Teams collaboration app. First, on Feb. 4, an issue prevented some North American users from joining meetings. By the afternoon that same day, Microsoft issued a statement to say, “we resolved the short interruption that a subset of customers in North America may have experienced connecting to meetings or live events.”

On Feb. 17, Teams was hit by a possible networking issue that delayed chat messages for some U.S. users. The issue was resolved after roughly five hours.

Russia blocking Twitter

On March 10, Russia’s agency for regulating the country’s communications (Roskomnadzor) attempted to slow traffic to Twitter, but it inadvertently disturbed much of the country’s mobile internet service instead.

In a mistake that IT professionals around the world can relate to, the agency was stung by a bad substring match from a poorly formed regular expression. Intending to block Twitter’s link shortener t.co, Russia blocked traffic associated with all domains containing t.co, for example, Microsoft.com and Reddit.com.
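The failure mode is easy to reproduce. The sketch below is only an illustration: the actual filtering rules were never published, so both patterns and the helper name `blocked_by` are assumptions. It contrasts a naive substring match with a pattern anchored to the whole hostname.

```python
import re

# Naive rule (assumed): matches "t.co" anywhere inside the hostname.
NAIVE = re.compile(r"t\.co")

# Anchored rule (assumed): matches only the host t.co or one of its subdomains.
ANCHORED = re.compile(r"^(?:[a-z0-9-]+\.)*t\.co$")


def blocked_by(pattern: re.Pattern, hosts: list[str]) -> list[str]:
    """Return the hosts that the given pattern would block."""
    return [h for h in hosts if pattern.search(h.lower())]


hosts = ["t.co", "microsoft.com", "reddit.com", "example.org"]

print(blocked_by(NAIVE, hosts))     # ['t.co', 'microsoft.com', 'reddit.com']
print(blocked_by(ANCHORED, hosts))  # ['t.co']
```

Anchoring the pattern at both ends of the hostname is what keeps the rule from spilling over onto unrelated domains.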

Akamai

On July 22, sites like Amazon, UPS, Airbnb, the PlayStation Network, Steam and FedEx went down, all thanks to an outage with the Akamai Edge domain name system (DNS) service.

People trying to access these sites, plus others including American Express, Delta Airlines and Home Depot, were met with a DNS error message.

Akamai said the outage, which lasted about an hour, was caused by a bug triggered by a software update.

A statement read: “Upon rolling back the software configuration update, the services resumed normal operations. Akamai can confirm this was not a cyberattack against Akamai’s platform.”

Facebook

On October 4, the world’s largest social media platform suffered a global outage of all of its services for nearly six hours, during which time Facebook and its subsidiaries, including WhatsApp, Instagram and Oculus, were unavailable.

With a claimed 3.5 billion users of its combined services, Facebook’s downtime of at least five and a half hours comes to roughly 1.2 trillion person-minutes of service unavailability, a so-called “1.2 tera-lapse,” or the largest communications outage in history.
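The person-minutes figure can be sanity-checked with simple arithmetic. The sketch below assumes, as the estimate itself does, that every one of the claimed 3.5 billion users was affected for the full duration. The function name `person_minutes` is illustrative.

```python
USERS = 3.5e9  # claimed combined users, taken from the text


def person_minutes(users: float, hours: float) -> float:
    """Users multiplied by outage duration in minutes."""
    return users * hours * 60


low = person_minutes(USERS, 5.5)   # "at least five and a half hours"
high = person_minutes(USERS, 6.0)  # "nearly six hours"

print(f"{low:.3e}")   # 1.155e+12
print(f"{high:.3e}")  # 1.260e+12
```

Under these assumptions, the quoted 1.2 trillion falls between the 5.5-hour and 6-hour bounds.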

According to Facebook’s official explanation, a routine maintenance job took down the entire platform by issuing a command to “assess the availability of the global backbone capacity,” which “unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”

GitHub

Microsoft-owned GitHub experienced an outage of more than two hours in November, affecting thousands or potentially millions of developers who rely on its many services. Git operations, API requests, webhooks, pull requests, GitHub Actions, GitHub Packages, and GitHub Pages were all down.

GitHub also went down for two hours last year after errors hit the service and knocked it briefly offline. This latest outage comes just weeks after former GitHub CEO Nat Friedman stepped down, and the company continues to operate as an independent Microsoft-owned business.

OVHcloud

French colo and cloud provider OVHcloud also experienced a global outage recently.

Earlier this year, one of OVH’s data centers burned down, causing significant outages and data loss. The company plans a 4.7 billion IPO later this year.

After an outage lasting more than an hour, services appear to be slowly returning. The outage began at around 7:00 UTC, with a maintenance reconfiguration error.

“Following a human error during the reconfiguration of the network on our DC to VH (US-EST), we have a problem on the whole backbone,” OVH founder Octave Klaba said.

The fire is expected to cost the company more than €105 million ($122m).

Google

In April there was a partial outage for Google Drive and cloud-based apps such as Google Docs for about three hours, leading to high latency and other issues for some users.

The Google Drive cloud storage service—and associated cloud apps including Google Docs and Google Sheets—suffered multiple service issues during the partial outage on April 12. Other Google services were not affected, including Gmail, Google Calendar and Google Meet.

While users could still access Google Drive, affected users could not create new documents and were “seeing error messages, high latency, and/or other unexpected behavior,” according to the company.

Common Causes

Careful analysis of the outage data leads to a few simple facts about downtime and service degradation:

Human error

Various studies over the last several years have ranked human error as either the most frequent or second most frequent cause of server downtime. Whether through accident or negligence, many of the highest-profile service outages of the last few years can be traced directly back to human error.

While it’s impossible to guard against human error completely, data centers and other organizations can take significant steps toward reducing the likelihood of error and increasing accountability to deal with problems when they do occur, such as:

  • Accurate documentation of routine tasks
  • Stricter policies on device usage, plus continuing education to reinforce processes and policies
  • Automation through artificial intelligence and predictive analytics
  • DevOps and SRE practices that use automation and collaboration between development and operations teams to avoid downtime and shorten it when it happens

Cyberattack

Network vulnerabilities create opportunities for hackers to infiltrate systems, allowing them to steal data, shut down applications, and lock down users with ransomware. Even if a system is relatively secure, it may still be vulnerable to a distributed denial of service (DDoS) attack.

Network services and infrastructure

Sometimes equipment just breaks. Outdated hardware is particularly vulnerable to failure, leading many companies to blame service outages on “old servers.” It’s an unpleasant truth, but data center physical infrastructure is always vulnerable to failure of some kind, making it one of the leading causes of downtime. Whether it’s a server going down, a UPS battery failure or a data center cooling system malfunction, hardware presents a wide range of potential problems for IT departments and data center personnel. Part of the challenge here is that many failures can’t be predicted.

Although software failures are less common than hardware failures, network systems are only as effective as the software they’re running. When operating systems are updated with patches that haven’t gone through proper testing, entire applications can become corrupted and bring networks screeching to a halt. In any case, software remains one of the more pervasive causes of downtime. Network infrastructure is a key part of smooth operation, yet it seems to be regularly neglected.

Lessons learned for 2022

Can a cloud-agnostic approach prevent server outages?

The biggest advantage of running a cloud-agnostic application is that you’re assured of consistent, standard performance whatever platform the application is deployed on. Portability between platforms also means you can more easily migrate on-premises applications to the cloud, avoid vendor lock-in, and maximize redundancy, all of which are important considerations for many organizations.

Multi-vendor concept

Nowadays, the Multi-AZ concept is giving way to Multi-region or, to be even safer, Multi-vendor. Our last client wanted to be multi-regional from the beginning, in both the USA and Europe. In that case the infrastructure and services must be designed to allow it (it is not just a matter of money). The same goes for communication platforms: for example, we keep a Slack chat in case something goes wrong with Microsoft Teams.
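The fallback idea can be sketched in a few lines. This is a minimal illustration, not a real integration: `teams_send` and `slack_send` are hypothetical stand-ins that simulate an outage and a working backup, and no vendor API is called.

```python
from typing import Callable, Sequence

Channel = tuple[str, Callable[[str], None]]


def send_with_failover(message: str, channels: Sequence[Channel]) -> str:
    """Try each channel in order; return the name of the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure moves on to the next vendor
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))


def teams_send(message: str) -> None:
    # Hypothetical primary channel, simulated as being down.
    raise ConnectionError("Teams unavailable")


def slack_send(message: str) -> None:
    # Hypothetical backup channel, simulated as working.
    print(f"[slack] {message}")


used = send_with_failover(
    "Incident declared", [("teams", teams_send), ("slack", slack_send)]
)
print(used)  # slack
```

The same ordered-fallback pattern applies to regions or cloud vendors, provided the services were designed from the start so that any of the targets can take the traffic.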

We are not giving up our reliance on and enjoyment of these cloud and SaaS solutions. Their complexity is sure not only to persist but to keep growing. Such complexity affects not just the ability of businesses to operate, but even touches the average consumer. All of this speaks to the need for independent visibility and verification. Full visibility, end-to-end and at every layer, provides the insight needed to successfully operate the sites and services that we all need and enjoy.