Multiple Oracle Cloud Infrastructure (OCI) outages have hit users around the world this week. Coming after interruptions in Microsoft’s cloud services last month, they are a reminder of the importance of site reliability engineering for systems administrators whose businesses rely on cloud-based, mission-critical applications.
The biggest OCI outage this week began at 17:30 GMT Monday and stretched until 22:30 GMT Wednesday, impacting customers across North and South America, Australia, Asia Pacific, the Middle East, Europe and Africa.
“Oracle engineers identified a performance issue within the back-end infrastructure supporting the OCI Public DNS API, which prevented some incoming service requests from being processed as expected during the impact window,” the company said on its cloud infrastructure website.
In an update, the company said it implemented “an adaptive mitigation approach using real-time backend optimizations and fine-tuning of DNS Load Management to handle current requests.”
Oracle outages affect multiple cloud services
Oracle said that the outage caused a variety of problems for customers. Those using OCI Vault, API Gateway, Oracle Digital Assistant, and OCI Search with OpenSearch, for example, may have received 5xx-type errors or failures (which are associated with server problems), the company said. Identity customers may have experienced issues when creating and modifying new domains.
In addition, Oracle Management Cloud customers may have been unable to create new instances or delete existing instances, Oracle said. Oracle Analytics Cloud, Oracle Integration Cloud, Oracle Visual Builder Studio, and Oracle Content Management customers may have encountered failures when creating new instances.
In an apparently unrelated incident, Oracle’s NetSuite ERP suite suffered an outage at its data center in Boston on Tuesday, leading to downtime that stretched from 12:15 p.m. ET Tuesday till services were restored around 11:46 a.m. ET Wednesday.
Oracle did not detail reasons for the Boston data center outage, but the Register reported in a tweet that “smoke was reported at a data center site used by Oracle NetSuite, coming from electrical equipment in a power room.” Firefighters turned off power to the site and evacuated it, the Register reported.
NetSuite users report unrecovered data
Customers reported on Reddit that they were unable to recover data that had been recorded in the half hour before the outage began, with one user posting a statement said to have been sent by NetSuite, confirming that the “restoration point was about 30 minutes prior to the outage.” The statement noted that in such cases, NetSuite typically provides users with a report or list of transactions that were created during the period for which data could not be retrieved.
The user who posted the NetSuite statement said that “based on this, we’re assuming we’ll have to manually slog through the missing data and then selectively import it into our ‘new’ NetSuite instance (which is now hosted in Santa Clara, not Boston).”
In yet another separate incident, on Monday, Oracle’s US Ashburn 2 data center experienced an outage for about an hour.
Oracle claims that NetSuite had 99.96% availability over the past 12 months. The outages this week come just months after Oracle CEO Larry Ellison, in the company’s second-quarter earnings call in December, took an indirect dig at Amazon Web Services, which suffered a major outage that month. Ellison said that a major telecom company told him that Oracle is different from other clouds as it “never ever goes down.”
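To put the 99.96% figure in context, the arithmetic can be sketched in a few lines of Python. This is an illustrative calculation only, not Oracle's method: it assumes a 365-day measurement window and that all downtime counts equally toward the figure, and the function names are hypothetical. How Oracle actually measures availability is not specified.

```python
from datetime import timedelta

# Assumption: the 12-month window is a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24


def downtime_budget_hours(availability_pct: float,
                          window_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an availability percentage."""
    return window_hours * (1 - availability_pct / 100)


def availability_pct(downtime: timedelta,
                     window_hours: float = HOURS_PER_YEAR) -> float:
    """Availability percentage implied by a total downtime in the window."""
    downtime_hours = downtime.total_seconds() / 3600
    return 100 * (1 - downtime_hours / window_hours)


# Downtime implied by the claimed 99.96% availability.
budget = downtime_budget_hours(99.96)

# Reported Boston outage: 12:15 p.m. ET Tuesday to about 11:46 a.m. ET
# Wednesday, expressed as offsets from midnight Tuesday.
start = timedelta(hours=12, minutes=15)
end = timedelta(days=1, hours=11, minutes=46)
outage = end - start

# Availability if this were the only downtime in the 12-month window.
implied = availability_pct(outage)

print(f"Downtime implied by 99.96%: {budget:.2f} hours")
print(f"Reported Boston outage: {outage}")
print(f"Availability with that outage alone: {implied:.2f}%")
```

Under these assumptions, 99.96% availability allows roughly 3.5 hours of downtime in a year, while the reported Boston outage ran for about 23 hours and 31 minutes.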
Microsoft outages affect users globally
Over the last few months there have been other major cloud outages. Most recently, on February 7, Microsoft Outlook and Teams suffered a global outage. That outage came two weeks after a Microsoft outage in January that affected not only Outlook and Teams, but services including Exchange Online, SharePoint Online and OneDrive for Business. The outages impacted users around the world.
Although the cloud giants have redundant data centers and servers in almost every region, data loss has been a common consequence of outages.
Cloud system architecture is key
“Cloud based solutions, like their on-premises equivalents, need to be architected for true high availability and continuity,” said Sam Higgins, an analyst at market research firm Forrester. “Having a cloud foundation and a global footprint does not immediately give you 100% uptime for an application. Especially for applications with a long on-premises history and heritage.”
Higgins added that client choices are another factor that can lead to outages. Data residency configurations, for example, may constrain how much data replication and backup a cloud provider can implement across its data center network.
“Add this to increasingly global network complexity, the risk of multiple factors — some human error — and you have a perfect storm in terms of an outage with real data loss potential. It’s this risk that has driven uptake of site reliability engineering,” Higgins said.
Copyright © 2023 IDG Communications, Inc.