Just as human evolution fueled the need to express ourselves, giving rise to language and the ability to communicate and transmit ideas to one another, so did the digital revolution of the 21st century fuel the need for computers, via internal networks and the Internet, to do the same.

Instead of words, information systems use a binary code of ones and zeros, along with specific ports and protocols, to ensure communication takes place. Thousands of years have elapsed since the dawn of civilization, and humans have dramatically increased the speed of communication by leveraging information systems to transmit information, safeguard it, and keep connections to specific Internet-based platforms available.

This was all on display on October 4, 2021, when Facebook, the social media monolith, and its sister networks, Instagram and WhatsApp, experienced widespread operational disruptions. The event serves as a much-needed reminder of the precarious nature of the platforms that interconnect users, and of the multiple layers of redundancy vendors must provide to ensure availability of resources. In this article, we will review Facebook's explanation of the outage while also assessing external observations made while it was underway.

Facebook’s recounting of what the company calls the “October 4th outage” cites “configuration changes on the backbone routers” as the culprit for its platform, as well as Instagram and WhatsApp, going offline for several hours. A backbone router interconnects networks, ensuring information is exchanged between local area networks (LANs) and other networks. As the name implies, this vital resource must be fully operational for inbound and outbound connections to exist. A misconfiguration within this resource is akin to a denial-of-service attack launched by a threat actor (TA): the end result is the same, with business operations and the availability of the resource gravely impacted.

Reviewing the timeline of events from an external perspective, as a Facebook consumer would, we must again look at events from a granular yet simple point of view. Networks need to communicate; how do they do so? For a request for information to be sent out, the client (the user’s device) must know where to send the request. Networks require a way, a route, for communication to occur. To test connectivity, a user can simply type the command ping 8.8.8.8 into their command prompt or terminal, which sends packets to Google’s public DNS server at that address and reports whether replies come back and how long the round trip takes; a companion tool, traceroute (tracert on Windows), reveals the number of hops (the distance) a request takes from the user’s device to its destination. A domain name such as google.com exists for human understanding and usage; the Domain Name System (DNS) translates it into an IP address, the logical identifier information systems use to locate a resource. Together, these tools test whether there are any communication issues and also visualize the route a request takes to get there. The same was done via the Linux command ‘dig’ to test Facebook’s name resolution during the outage, as depicted in the graphic below.
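For readers who want to reproduce the name-resolution side of this check programmatically, the following is a minimal Python sketch of what ‘dig’ was doing: asking the system resolver to translate a hostname into an IP address, and observing what happens when that lookup fails (the hostnames shown are purely illustrative):

```python
import socket

def resolve(hostname):
    """Ask the system resolver (DNS) to translate a hostname into an IPv4 address."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        # DNS failure: the name cannot be translated into an address.
        # This is essentially what users saw during the outage.
        return None

# "localhost" resolves locally, with no network round trip required.
print(resolve("localhost"))  # typically 127.0.0.1

# A name that cannot be resolved comes back as None instead of an address.
print(resolve("no-such-host.invalid"))
```

During the outage, a lookup of facebook.com behaved like the second case: the name simply could not be translated into an address, so the request had nowhere to go.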

Multiple outside sources speculated that Border Gateway Protocol (BGP), the mechanism networks use to exchange routing information with one another, was the point of failure. BGP itself was not down; it is only a means to an end, and in this case the end was Facebook keeping its routes available so connections could be made. Users’ devices had no way of finding their way to the social media platforms’ respective websites, which correlates with the internal misconfiguration Facebook reported. If the route to a resource is unavailable, a request will simply time out. The larger issue is what happens when millions of users simultaneously retry that route and the servers do not respond with the requested information: the flood of repeated requests compounds the latency and connection-timeout issues. It is only human nature to compulsively hit “refresh” on the website or mobile app you want to access.
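That compulsive “refresh” behavior is exactly what well-behaved client software tries to avoid. A common defensive pattern, sketched below in Python under our own illustrative names (this is not Facebook’s client code), is exponential backoff with jitter: each failed attempt waits roughly twice as long as the last, plus a random offset, so that millions of clients do not all retry at the same instant:

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0):
    """Retry a failing request with exponential backoff plus random jitter.

    `fetch` is any callable that raises ConnectionError on failure.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            # Wait base_delay, 2x, 4x, ... plus jitter before retrying,
            # spreading retries out instead of hammering a dead route.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

During an outage like Facebook’s, clients written this way back off progressively rather than adding a self-inflicted flood of traffic on top of the original failure.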

The resolution eventually came from fixing the internal misconfiguration and allowing Facebook’s assets to “rejoin” the Internet: once they were client (user) facing again, requests could be made, and those requests would reach the respective Facebook platform. As of 9:30 p.m. EST on October 4th, Facebook’s resources appeared to be accessible once again, with BGP routes restored and, most importantly, DNS able to connect users to their desired platform.

The misconfiguration reported by Facebook, and the broader uptick in business-interruption incidents impacting clients worldwide, demand layers of redundancy to ensure that the confidentiality, integrity, and availability of resources are always in place.

Please contact ECC IT Solutions to find out how our team is ensuring resource availability is provided to each of our clients!