A Sample Incident Report on Website Outage
In this era of ever-evolving technology and intricately connected software systems, unforeseen failures occasionally occur and cause temporary disruptions. Yet it’s in these moments of malfunction that real opportunities for growth and learning emerge. As software engineers, we understand that perfection is an elusive goal and that failures are an inherent part of the development process. Becoming a remarkable engineer means being able to acknowledge errors, address them, and learn from them. And nothing prepares a software engineer for that like a system failure.
System failures can arise from an array of factors, from coding errors and unexpected spikes in traffic to security vulnerabilities and hardware glitches. After a system failure, the technical team is often expected to write a postmortem or incident report that details the cause of the failure, the duration of the resulting outage, the components of the system that were affected, the resolution, and the corrective and preventative measures that will keep the same failure from recurring. A well-written postmortem is useful to other employees and to users because it explains what happened and how it affects their work or experience. In this article, I share an incident report that I wrote in response to a task in the ALX SE program.
Issue Summary
On 8th August 2023, between 2:32 PM and 3:47 PM East African Time (EAT), our WordPress website experienced a service outage that returned 500 error responses to user requests. During this period, every request for the website’s landing page and other pages returned a 500 error, affecting 100% of the user base. The root cause was a misconfiguration introduced by an update to a settings file, which caused an incorrect PHP file extension to be used.
Timeline (all times East African Time)
- 2:27 PM: An update to the wp-settings.php file is saved and the Apache server is restarted.
- 2:32 PM: Outage begins.
- 2:34 PM: A customer complains of receiving a 500 error response.
- 2:38 PM: Operations team begins investigating the problem. The initial assumption is that the issue might be related to server load or a database problem.
- 2:43 PM: Debugging efforts mistakenly directed towards optimizing the server and database performance, which did not yield any improvement.
- 3:31 PM: Review of the website’s code base reveals that the root cause is a misconfiguration in the wp-settings.php file, where ‘phpp’ was used instead of ‘php’ as a PHP file extension.
- 3:36 PM: Misconfiguration fixed using a Puppet manifest that runs the sed command to replace ‘phpp’ with ‘php’ in the wp-settings.php file.
- 3:45 PM: Server restarted.
- 3:47 PM: Website 100% back online.
Root Cause and Resolution
At 2:27 PM EAT, the website’s main settings file (wp-settings.php) was updated and saved, and the server was restarted. The update introduced an invalid file extension into the file. This misconfiguration caused the web server to interpret PHP files incorrectly, resulting in 500 error responses for users from 2:32 PM EAT.
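The 500 responses described above are the kind of symptom a simple status probe can surface right after a restart. The following is a minimal sketch in Python, not part of the original incident tooling. The function names status_of and is_healthy are hypothetical, and the choice to treat any status below 500 as healthy is an assumption made for illustration.

```python
import urllib.error
import urllib.request


def status_of(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code that a GET request to url produces."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses;
        # the status code is still available on the exception.
        return err.code


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Treat any status below 500 as healthy (an assumption of this sketch)."""
    return status_of(url, timeout) < 500
```

Connection failures such as a refused connection are not caught here and would propagate as urllib.error.URLError, which a caller could also treat as unhealthy.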
At 2:34 PM EAT, a user complained of receiving a 500 error response when trying to access the website. At 2:38 PM EAT, the team began investigating and initially assumed that the problem was caused by a load too heavy for the server to handle. At 2:43 PM EAT, engineers directed their efforts to optimizing server and database performance. However, the issue persisted.
At 3:31 PM EAT, a review of all the PHP files traced the error to a wrong PHP extension in the wp-settings.php file. At 3:36 PM EAT, a Puppet manifest using the sed command was run to replace the instance of ‘phpp’ with ‘php’ in the wp-settings.php file. The server was restarted at 3:45 PM EAT, and the website’s functionality returned to normal at 3:47 PM EAT.
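The actual fix was applied with a Puppet manifest and sed, which are not reproduced here. As an illustration of the same substitution, here is a rough Python equivalent. The function name fix_extension_typo and the sample line in the demo are hypothetical, and the pattern is narrowed to ‘.phpp’ at a word boundary, which is an assumption about how the typo appeared in the file.

```python
import re
import tempfile
from pathlib import Path

# Matches the mistyped extension ".phpp" when it ends at a word boundary.
TYPO = re.compile(r"\.phpp\b")


def fix_extension_typo(path: Path) -> int:
    """Replace '.phpp' with '.php' in the file and return the number of fixes."""
    text = path.read_text(encoding="utf-8")
    fixed, count = TYPO.subn(".php", text)
    if count:
        # Only rewrite the file when something actually changed.
        path.write_text(fixed, encoding="utf-8")
    return count


# Demo on a temporary file with a hypothetical line, not the real site file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "wp-settings.php"
    sample.write_text("require_once 'example.phpp';\n", encoding="utf-8")
    print(fix_extension_typo(sample))
    print(sample.read_text(encoding="utf-8"), end="")
```

Returning the replacement count makes it easy to confirm that a second run changes nothing, which is the same idempotence a configuration management tool aims for.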
Corrective and Preventative Measures
To prevent similar incidents in the future, the following measures will be implemented:
- Automated testing will be enhanced to include checks for common configuration errors, ensuring that PHP extensions and other critical components are correctly set.
- A test script will be added that recursively checks file extensions and logs errors when unexpected extensions are found.
- Documentation will be updated to include troubleshooting guides for common website issues, facilitating quicker resolution in case of future incidents.
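The recursive extension check in the list above could look something like the following Python sketch. It is one possible interpretation, not the script the team will actually adopt: it walks a directory, reads every .php file, and logs an error for any quoted reference whose extension starts with ‘php’ but has extra characters, such as ‘phpp’. The function name find_suspicious_extensions, the logger name, and the detection pattern are all assumptions made for illustration, and the pattern would also flag extensions such as ‘php5’.

```python
import logging
import re
from pathlib import Path

logger = logging.getLogger("extension_check")

# A quoted reference ending in ".php" plus extra word characters, e.g. '.phpp'.
SUSPICIOUS = re.compile(r"""\.(php\w+)['"]""")


def find_suspicious_extensions(root: Path) -> list[tuple[Path, int, str]]:
    """Recursively scan .php files under root and log unexpected extensions."""
    findings = []
    for path in sorted(root.rglob("*.php")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in SUSPICIOUS.finditer(line):
                extension = match.group(1)
                logger.error("%s:%d: unexpected extension .%s",
                             path, lineno, extension)
                findings.append((path, lineno, extension))
    return findings
```

Run against the site's document root as part of automated testing, a non-empty result could be used to fail the check before a change reaches the live server.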