What was initially reported as a potential performance issue at content delivery network (CDN) provider Fastly quickly escalated into a major service outage for leading media firms, including video-on-demand players.
The company says that at 10:58 British Summer Time it experienced a global outage due to an undiscovered software bug that surfaced when triggered by a valid customer configuration change. The incident originated on 12 May, when the company began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances. When those circumstances arose, 85% of the Fastly network returned errors, leading to a widespread loss of access to clients’ websites.
Fastly says that it detected the disruption within one minute, then identified and isolated the cause and disabled the configuration. Its engineering team was said to have identified the customer configuration at 11:27, and sites began to recover at 11:36. According to Fastly technical data, the fix was officially applied at 11:57, at which time Fastly warned that customers might experience increased origin load and a lower cache hit ratio (CHR) as global services returned.
The company says that within 49 minutes of the outage, 95% of its network was operating as normal. Once the immediate effects were mitigated, Fastly said it turned its attention to fixing the bug and communicating with customers. It created a permanent fix for the bug and began deploying it at 18:25.
The outage affected Fastly clients around the globe, including Amazon, Twitch, Hulu, HBO Max, Vimeo, A&E and CNN. Territories included Australia, the United Arab Emirates, Japan, Singapore, Chile, Argentina, Peru, Brazil, the UK, Ireland, Denmark, the Netherlands, Germany, Finland, Spain, Norway, Italy, Sweden, the US, Canada, France, Austria, South Africa and India.
Commenting on the outage, Fastly senior vice president of engineering and infrastructure Nick Rockwell conceded that the outage was “broad and severe” and that Fastly was “truly sorry for the impact to our customers and everyone who relies on them.”
Other voices within the network industry were forthcoming with their opinions. Toby Stephenson, CTO at IT and cyber security expert Neuways, said the incident highlighted the reliance of many of the world’s biggest websites on CDNs such as Fastly. “As there are so few of these CDN services, these outages can occur from time to time,” he remarked. “The technical backends of these big websites are probably fine, but it is the frontends that can’t be accessed and content cannot be pushed as the network is down.”
Gaz Jones, technical director of digital agency Think3, added: “This is what happens when half of the internet relies on Goliaths like Amazon, Google and Fastly for all of its servers and web services. The entire internet has become dangerously geared on just a few players.”