The most significant network and service outages of 2022 had far-reaching consequences. Flights were grounded, virtual meetings cut off, and communications hindered.
The culprits that took down major infrastructure and services providers were varied, too, according to analysis from ThousandEyes, a Cisco-owned network intelligence company that tracks internet and cloud traffic. Maintenance-related errors were cited more than once: Canadian carrier Rogers Communications experienced a massive nationwide outage that was traced to a maintenance update, and a maintenance script error caused problems for software maker Atlassian.
BGP misconfiguration also showed up in the top outage reports. Border Gateway Protocol (BGP) tells internet traffic which route to take; if the routing information is incorrect, traffic can be diverted to an improper route, which is what happened to Twitter.
Here are the top 10 outages of the year, organized chronologically.
British Airways lost online systems: Feb. 25
British Airways’ online services were inaccessible for hours on Feb. 25, causing hundreds of flight cancellations and interrupting airline operations. Flights couldn’t be booked, and travelers couldn’t check in to flights electronically. The airline was reportedly forced to return to paper-based processes when its online systems became inaccessible, and the impact was felt globally. “Our monitoring showed that the network paths to the airline’s online services (and servers) were reachable, but that the server and site responses were timing out,” ThousandEyes said in its outage analysis, which blamed unresponsive application servers – rather than a network issue – for the outage.
“The nature of the issue, and the airline’s response to it, suggests the root cause is likely to be with a central backend repository that multiple front-facing services rely on. If that is the case, this incident may be a catalyst for British Airways to re-architect or deconstruct their backend to avoid single points of failure and reduce the likelihood of a recurrence. Equally possible, however, is that the chain of events that led to the outage is a rare occurrence and can be mostly controlled in future. Time will tell,” ThousandEyes said.
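The distinction ThousandEyes draws here, between a reachable network path and an unresponsive application, can be illustrated with a simple probe. The Python sketch below is an assumption-laden illustration, not ThousandEyes' methodology: the function names, the port (443), and the five-second timeout are all hypothetical choices. It first attempts a TCP connection, then an HTTPS request, and labels the failure by the layer at which it occurred.

```python
import http.client
import socket
from typing import Optional


def classify(tcp_ok: bool, http_status: Optional[int]) -> str:
    """Label a probe result by the layer at which it failed (hypothetical labels)."""
    if not tcp_ok:
        return "network: host unreachable"
    if http_status is None:
        return "application: connected, but no HTTP response"
    if 500 <= http_status <= 599:
        return "application: server error"
    # Any other status means the server answered the request.
    return "ok"


def probe(host: str, timeout: float = 5.0) -> str:
    """Check TCP reachability first, then try a single HTTPS GET."""
    try:
        socket.create_connection((host, 443), timeout=timeout).close()
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return classify(False, None)

    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("GET", "/")
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        status = None  # path is up, but the server never produced a response
    finally:
        conn.close()
    return classify(True, status)
```

A result in the "application" category with a healthy TCP connection corresponds to the pattern described above: the path to the servers works, but the servers do not answer in time.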
Twitter felled by BGP hijack: March 28
Twitter was unavailable for some users for about 45 minutes on March 28 after JSC RTComm.RU, a Russian internet and satellite communications provider, improperly announced one of Twitter’s prefixes (104.244.42.0/24). As a result, traffic destined for Twitter was rerouted for some users and failed. Access to Twitter’s service was restored for impacted users after RTComm’s BGP announcement was withdrawn. ThousandEyes notes that BGP misconfigurations can be used to block traffic in a targeted way; however, it’s not always easy to tell whether a given incident is accidental or intentional.
“We know that the March 28th Twitter event was caused by RTComm announcing themselves as the origin for Twitter’s prefix, then withdrawing it. While we don’t know what led to the announcement, it’s important to understand that accidental misconfiguration of BGP is not uncommon, and given the ISP’s withdrawal of the route, it’s likely that RTComm did not intend to cause a globally impacting disruption to Twitter’s service. That said, localized manipulation of BGP has been used by ISPs in certain regions to block traffic based on local access policies,” ThousandEyes said in its outage analysis.
Organizations can deal with route leaks and hijacks by monitoring for rapid detection and by safeguarding BGP with security mechanisms such as resource public key infrastructure (RPKI), a cryptographic mechanism for performing route-origin authorization. RPKI is effective against BGP hijackings and leaks, but adoption isn’t widespread. “Though your company might have RPKI implemented to fend off BGP threats, it’s possible that your telco won’t. Something to consider when selecting ISPs,” ThousandEyes said.
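As a rough illustration of what route-origin validation checks, the Python sketch below classifies a BGP announcement against a list of route origin authorizations (ROAs), using the valid / invalid / not-found outcomes defined in RFC 6811. The AS numbers (64500 and 64501, taken from the range reserved for documentation) and the ROA entry are hypothetical; only the prefix comes from the incident described above. Real deployments rely on validator software and router support rather than hand-written code like this.

```python
from __future__ import annotations

from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    prefix: str      # prefix the holder has authorized
    max_length: int  # longest prefix length allowed for announcements
    asn: int         # AS number authorized to originate the prefix


# Hypothetical ROA: documentation ASN 64500 may originate the /24.
ROAS = [Roa("104.244.42.0/24", 24, 64500)]


def validate_origin(prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Return 'valid', 'invalid', or 'not-found' for an announcement."""
    announced = ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # A ROA covers the route if its prefix contains the announced prefix.
        if announced.version != roa_net.version or not announced.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches if the origin AS and length limit both agree.
        if roa.asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"
```

With this hypothetical ROA in place, an announcement of 104.244.42.0/24 from any other AS is classified as invalid, and a network that drops invalid routes would not follow it. A prefix with no covering ROA is only "not-found", which is why origin validation helps most when both the prefix holder and the upstream providers participate.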
Atlassian overstated outage impact: April 5
Atlassian reported problems with several of its biggest development tools, including Jira, Confluence and OpsGenie, on the morning of April 5. A maintenance script error led to a days-long outage for these services, but it affected only roughly 400 of Atlassian’s customers.
ThousandEyes in its analysis of the outage emphasized the importance of a vendor’s status page when reporting problems: Atlassian’s status page had “a sea of orange and red indicators” suggesting a significant outage, and the company said it would mobilize hundreds of engineers to rectify the incident, but for most customers, there were no problems.
A status page often under-emphasizes the extent of an outage, but it’s also possible for a status page to overstate the impact, ThousandEyes warned: “It’s a really difficult balance to strike: say too little or too late, and customers will be upset at the responsiveness; say too much, be overly transparent, and risk unnecessarily worrying a large number of unaffected customers, as well as stakeholders more broadly.”
Rogers outage cut services across Canada: July 8
A botched maintenance update caused a prolonged, nationwide outage on Canadian operator Rogers Communications’ network. The outage affected phone and internet service for about 12 million customers and hampered many critical services across the country, including banking transactions, government services, and emergency response capabilities.
According to ThousandEyes, Rogers withdrew its prefixes due to an internal routing issue, which made the Tier 1 provider unreachable across the internet for nearly 24 hours. “The incident appeared to be triggered by the withdrawal of a large number of Rogers’ prefixes, rendering their network unreachable across the global Internet. However, behavior observed in their network around this time suggests that the withdrawal of external BGP routes may have been precipitated by internal routing issues,” ThousandEyes said in its outage analysis.
The Rogers outage is an important reminder of the need for redundancy for critical services: have more than one network provider in place or at the ready, have a backup plan for when outages happen, and maintain proactive visibility, ThousandEyes suggests. “No provider is immune to outages, no matter how large. So, for crucial services like hospitals and banking, plan for a backup network provider that can alleviate the length and scope of an outage,” ThousandEyes wrote.
Power failure downed AWS eastern US zone: July 28
A power failure on July 28 disrupted services within Amazon Web Services (AWS) Availability Zone 1 (AZ1) in the US-East-2 Region. “The outage affected connectivity to and from the region and brought down Amazon’s EC2 instances, which impacted applications such as Webex, Okta, Splunk, BambooHR, and others,” ThousandEyes reported in its outage analysis. Not all users or services were affected equally; Webex components located in Cisco data centers remained operational, for example. AWS reported that the power outage lasted only about 20 minutes, but some of its customers’ services and applications took up to three hours to recover.
It’s important to design some level of physical redundancy for cloud-delivered applications and services, ThousandEyes wrote: “There’s no soft landing for a data center power outage—when the power stops, reliant systems are hard down. Whether it’s an electric-grid outage or a failure of one of the related systems, such as UPS batteries, it’s times like this where the architected resiliency and redundancy of your digital services is critical.”
Google Search, Google Maps knocked out: Aug. 9
A brief outage left Google Search and Google Maps, two widely used Google services, unavailable to users around the world for about an hour. “Attempts to reach these services resulted in error messages from Google’s edge servers, including HTTP 500 and 502 server responses that generally indicate internal server or application issues,” ThousandEyes reported.
The root cause was reportedly a software update gone wrong. Not only were end users unable to access Google Search and Google Maps, but applications that depend on Google’s software functions also stopped working during the outage.
The outage is interesting to IT professionals for a couple of reasons, ThousandEyes notes. “First, it highlights the fact that even the most stable of services, such as Google Search, a service for which we rarely experience issues or hear of outages, is still subject to the same forces that can bring down any complex digital system. Secondly, the event revealed how ubiquitous some software systems can be, woven through the many digital services we consume on a daily basis and yet unaware of these software dependencies.”
Zoom outage scuttles virtual meetings: Sept. 15
Users were unable to log in or join Zoom meetings for about an hour during a Sept. 15 outage that yielded bad gateway (502) errors for users globally. In some cases, users already in meetings were kicked out of them.
The root cause wasn’t confirmed, “but it appeared to be in Zoom’s backend systems, around their ability to resolve, route, or redistribute traffic,” ThousandEyes said in its outage analysis.
Zscaler proxies suffered 100% packet loss: Oct. 25
On Oct. 25, traffic destined to a subset of Zscaler proxy endpoints experienced 100% packet loss, impacting customers who use Zscaler Internet Access (ZIA) services on their Zscaler Cloud network 2. The most significant packet loss lasted approximately 30 minutes, although some reachability issues and packet-loss spikes persisted intermittently for some user locations over the next three hours, according to ThousandEyes’ outage analysis.
Zscaler referred to the problem on their status page as a “traffic-forwarding issue.” When the virtual IP of the proxy device became unreachable, it resulted in an inability to forward traffic.
ThousandEyes explained how this scenario could have made critical business tools and SaaS apps unreachable for some customers that use Zscaler’s security services: “This may have affected a variety of applications for enterprise customers using Zscaler’s service, as it’s typical in Secure Service Edge (SSE) implementations to proxy not just web traffic but also other critical business tools and SaaS services such as Salesforce, ServiceNow, and Microsoft Office 365. The proxy is therefore in the user’s data path and, when the proxy isn’t reachable, the access to these tools is impacted and remediation often requires manual interventions to route affected users to alternate gateways.”
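The remediation step mentioned in the quote, routing users to an alternate gateway when the primary proxy is unreachable, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how Zscaler or any other vendor handles failover: the gateway names and the 5% loss threshold are hypothetical, and the loss figures would come from whatever probing an organization already runs.

```python
from typing import Dict, Optional


def packet_loss_pct(sent: int, received: int) -> float:
    """Percentage of probe packets that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def pick_gateway(loss_by_gateway: Dict[str, float],
                 max_loss_pct: float = 5.0) -> Optional[str]:
    """Return the first gateway, in preference order, with acceptable loss.

    Dict insertion order is treated as the preference order.
    Returns None if every gateway exceeds the threshold.
    """
    for name, loss in loss_by_gateway.items():
        if loss <= max_loss_pct:
            return name
    return None


# Hypothetical measurements: the primary proxy drops every probe packet.
measurements = {
    "primary-proxy.example": packet_loss_pct(sent=10, received=0),
    "backup-proxy.example": packet_loss_pct(sent=10, received=10),
}
chosen = pick_gateway(measurements)
```

In this example the primary proxy shows 100% loss, so the selection falls through to the backup. Automating a check like this is one way to shorten the window in which users depend on manual intervention.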
WhatsApp outage halted messaging: Oct. 25
A two-hour outage on Oct. 25 left WhatsApp users unable to send or receive messages on the platform. The Meta-owned freeware is the world’s most popular messaging app – 31% of the global population uses WhatsApp, according to 2022 data from digital intelligence platform Similarweb.
The outage was related to backend application service failures rather than a network failure, according to ThousandEyes’ outage analysis. It occurred during peak hours in India, where the app has a user base in the hundreds of millions.
AWS eastern US zone hit again: Dec. 5
Amazon Web Services (AWS) suffered a second outage at its US-East-2 region in early December. The outage, which according to AWS lasted about 75 minutes, resulted in internet connectivity issues to and from the US-East-2 region.
ThousandEyes observed significant packet loss between two global locations and AWS’ US-East-2 Region for more than an hour. The event affected end users connecting to AWS services through ISPs. “The loss was seen only between end users connecting via ISPs, and didn’t appear to impact connectivity between instances within the region, or in between regions,” ThousandEyes said in its outage analysis.
Later in the day AWS posted a blog saying that the issue was resolved. “Connectivity between instances within the region, in between regions, and Direct Connect connectivity were not impacted by this issue. The issue has been resolved and connectivity has been fully restored,” the post said.
Copyright © 2023 IDG Communications, Inc.