Facebook’s Outage Post-Mortem
And what we learned from the incident
By Sergey Bachinskiy, Nancy Zhu
On October 4th, 2021, a mass outage left Facebook and its family of apps, including Instagram, WhatsApp, and Messenger, inaccessible for over five hours. The outage was the worst for Facebook since 2019, and it was caused by changes to the company’s underlying internet infrastructure that coordinates the traffic between its data centers, according to an update issued by Facebook.
Facebook has a network of data centers around the world, as well as network equipment that connects them to the wider internet. According to Facebook, during routine maintenance a command meant to assess the capacity of the global backbone unintentionally took down all backbone connections between data centers, and a bug in the audit tool that should have blocked the command let it through. Facebook’s DNS servers, unable to reach the data centers, then withdrew their BGP route advertisements, which made them unreachable from the rest of the internet.
During the downtime, which felt like a century to those of us who rely heavily on Facebook’s communication and social platforms, the company’s engineers had to travel to data centers in person to revive the network, because the internal tools they would normally use were also disabled by the outage. The extended outage caused not only anxiety and frustration among Facebook’s end users but also millions of dollars in losses for the company.
Bravado’s Prevention & Progress
Our engineering team also took the opportunity to reflect and improve. Do we have the infrastructure needed to prevent outages like this? What can we learn from this unfortunate incident?
We build our infrastructure on one of the leading cloud providers. Major cloud providers typically provide products that help companies avoid outages. Here are the main features, processes, and systems our DevOps team implemented as the building blocks of a solid infrastructure:
- Load Balancer: Cloud load balancing receives all application traffic and distributes it across our containers. Because all traffic passes through it, the load balancer would be a single point of failure if it became unavailable. Our cloud provider offers load balancers that can handle application traffic across three or more zones, which are logical groups of data centers. If one data center is down, another data center can continue supporting the service.
- Kubernetes: The Kubernetes system provides managed services for containerized applications. It orchestrates the starting, stopping, and maintaining of application containers. The control plane (i.e., the master node) is located in our cloud provider and is highly available.
- Database + Backups: We use PostgreSQL as our database. We also implemented a feature provided by our cloud provider that automatically creates backups of our database twice a day. This enables us to recover lost data. Additionally, we use Redis for caching solutions.
- (Database feature) Hot Standby: The hot standby feature we implemented on our cloud provider improves the reliability of the database. If the master instance fails, traffic is automatically switched to the hot standby, which then becomes the master instance.
- (Database feature) Read Replicas: Read replicas are copies of a source database instance that provide increased availability when the source instance fails. Unlike the hot standby, which is located close to the master database in the same zone, read replicas can be located in another zone. So if one zone is down, we can promote a read replica to be the master instance and continue our services.
- Storage Infrastructure: We implemented a highly available storage solution provided by our cloud provider. We store a duplicate copy of our data and its metadata in another zone in case any information gets deleted, and we secure access to it.
- Incident Management System: We have monitoring systems based on Prometheus, which check our infrastructure and send alerts. They also create incidents in Splunk On-Call (formerly VictorOps), which alerts our on-call developer in real time. Our DevOps team also employs tools including Grafana and New Relic to monitor the status of our database, Kubernetes, etc.
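The failover order described in the hot standby and read replica bullets above can be sketched in a few lines of Python. This is only an illustrative model, not the cloud provider's actual mechanism: the `DbInstance` class, the `choose_new_primary` function, and the instance and zone names are all hypothetical, and the sketch assumes (as described above) that the hot standby shares a zone with the master while the read replica lives in a different zone.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class DbInstance:
    name: str
    zone: str
    role: str  # "primary", "hot_standby", or "read_replica" (hypothetical labels)
    healthy: bool = True


def choose_new_primary(
    instances: List[DbInstance], failed_zones: Set[str]
) -> Optional[DbInstance]:
    """Return the instance that should take writes, or None if nothing is usable.

    Preference order: the current primary, then the hot standby,
    then a read replica (which may sit in another zone).
    """
    for role in ("primary", "hot_standby", "read_replica"):
        for db in instances:
            if db.role == role and db.healthy and db.zone not in failed_zones:
                return db
    return None


# Hypothetical cluster: primary and standby share zone-a, replica is in zone-b.
cluster = [
    DbInstance("db-main", "zone-a", "primary"),
    DbInstance("db-standby", "zone-a", "hot_standby"),
    DbInstance("db-replica", "zone-b", "read_replica"),
]

# Case 1: only the primary instance fails, so the hot standby takes over.
cluster[0].healthy = False
print(choose_new_primary(cluster, failed_zones=set()).name)

# Case 2: all of zone-a is down, so the replica in zone-b is promoted.
print(choose_new_primary(cluster, failed_zones={"zone-a"}).name)
```

The point of the two cases is that the hot standby covers an instance failure, while only an instance in a different zone covers the loss of a whole zone.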
In addition to implementing layers of backups of everything to prevent outages on an infrastructure level, we also started training our entire engineering team for infrastructure maintenance and improvement.
Our DevOps engineer, Sergey, leads a monthly session called “Game Days” for gamified infrastructure training. He creates a few failures (e.g., deleting containers or breaking the load balancer) in a staging environment, and the rest of the team then tries to fix them. Our developers love a good challenge and can effectively learn to fix infrastructure problems through these hands-on sessions. Sergey also provides training and leads discussions to bolster the learning experience. Through activities like Game Days, we not only train all engineers to understand our infrastructure and make fixes in urgent situations but also strengthen our engineering culture and capabilities.
Even though the Facebook outage does not completely apply to us due to the differences in our company infrastructures and sizes, we stay vigilant about areas for improvement in our infrastructure. Here’s what we learned from the Facebook incident:
- Multi-Cloud Architecture: Currently all is well as long as our cloud provider is well. However, we learned that we shouldn’t put all our eggs in one basket by relying on one cloud provider.
- Tool Improvement: We need to continue improving our tools to prevent outages. Currently, in our deployment process, we have mechanisms that prevent changes that’d break production. For example, if a container doesn’t start due to an error, then deployment fails; we have a script for rollback during deployment if something goes wrong with the logic. We are implementing a new mechanism that forbids deployment to production if the code is not on the master branch.
- Team training: We need to continuously train our team on new infrastructure solutions and problems. The visibility and knowledge-sharing will enable more engineers to contribute to the building and maintenance of Bravado’s infrastructure, so we can effectively prevent outages like Facebook’s.
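To make the deployment safeguards from the Tool Improvement point more concrete, here is a minimal Python sketch of a branch check and a rollback-on-failure wrapper. It is an illustration under stated assumptions, not Bravado's actual deployment script: the function names, the `"production"` environment label, and the `deploy` and `rollback` callables are hypothetical. The only external command used is `git rev-parse --abbrev-ref HEAD`, which prints the name of the checked-out branch.

```python
import subprocess
from typing import Callable


def current_git_branch() -> str:
    """Return the name of the currently checked-out branch via the git CLI."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def may_deploy(branch: str, environment: str, release_branch: str = "master") -> bool:
    """Production only accepts the release branch; other environments accept any branch."""
    return environment != "production" or branch == release_branch


def deploy_with_rollback(deploy: Callable[[], None], rollback: Callable[[], None]) -> bool:
    """Run deploy(); if it raises (e.g. a container fails to start), run rollback().

    Returns True on success and False if a rollback was performed.
    """
    try:
        deploy()
        return True
    except Exception:
        rollback()
        return False
```

A pipeline could call `may_deploy(current_git_branch(), "production")` before invoking `deploy_with_rollback`, so that a deployment from a feature branch is refused before anything reaches production.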
Bravado is hiring! To view and apply to open roles, visit our Careers page.