Incident Information
Between Mon Jul 3 03:05:59 2023 and Tue Jul 4 01:10:15 2023 (roughly 22 hours), the home server named BSD.am (also known as pingvinashen.am) was completely down.
The event was triggered by a battery issue caused by high temperature in the apartment where the home server resides.
The battery swelled and pushed more heat than normal into the system, which caused the computer to shut down.
The event was detected by the monitoring system at mon.bsd.am which notified the operators using email and chat systems (XMPP).
This incident affected 100% of the users of the following services:
- jabber.am public XMPP server
- conference.jabber.am public XMPP MUC server
- օրագիր.հայ public WriteFreely instance
- սարեան.ցանցառներ.հայ public Lobste.rs instance
- BIND.am public DNS server and its zones
- Multiple hosted blogs, including this one you’re reading.
- A private ZNC server for the Armenian Hackers Community
- git.bsd.am public Gitea server
- A matterbridge instance connecting multiple communities
- A Huginn instance automating tasks (such as RSS to Telegram, RSS to newsletter) for Armenian Hackers Communities
- A newsletter instance running listmonk.app
- A private Miniflux.app server for the Armenian Hackers Community
- FreeBSD Jail users’ meetup website
Multiple community members contacted the operator (yours truly) asking for an ETA.
Response
After receiving an email at Mon Jul 3 03:06:49 2023, the Chief Debugging Officer (yours truly) started analyzing the possible issue. According to Monit (mon.bsd.am) all the services were unavailable and the server was not reachable by IP (based on ICMP).
The usual possibility, network failure at the ISP level, was ruled out, as the second home server (arnet.am) was functioning properly.
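The triage step above can be sketched in a few lines of Python. This is only an illustration, not the check Monit actually performs: the function names `is_reachable` and `classify_outage` are hypothetical, and the sketch assumes that both machines sit behind the same ISP and that a `ping` binary accepting `-c` is available on the monitoring host.

```python
import subprocess


def is_reachable(host: str, timeout_s: float = 5.0) -> bool:
    """Return True if a single ICMP echo request to `host` is answered.

    Assumes a `ping` binary that accepts `-c 1` (hypothetical setup;
    the overall timeout is enforced by subprocess, not by ping flags).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No reply in time, or ping could not be started at all.
        return False
    return result.returncode == 0


def classify_outage(primary_up: bool, reference_up: bool) -> str:
    """Compare the failing host against a reference host on the same link."""
    if primary_up:
        return "primary reachable"
    if reference_up:
        # The shared network path works, so the fault is on the host itself.
        return "host-level failure"
    # Both are down: the network (or the monitoring vantage point) is suspect.
    return "possible network failure"
```

Calling `classify_outage(is_reachable("bsd.am"), is_reachable("arnet.am"))` during this incident would have returned `"host-level failure"`, which matches the conclusion reached by hand.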
The person physically closest to the server was the operator’s sibling (lucy.vartanian.am); however, she had no background in Unix system administration or hardware maintenance. Also, she was asleep.
Hours later the siblings (yours truly) organized a FaceTime call to debug the issues remotely.
The system did boot the kernel properly; however, it would shut down before the services could complete their startup.
Clearly, the machine needed to be shipped to the operator (yours truly) to be debugged on the spot.
So that’s what the team did.
Precise addresses are removed for privacy
Recovery
At the operator’s (yours truly) location, the BIOS logs showed that the system had suffered an ASF2 Force Off. This usually means a thermal problem.
The operator (yours truly) disassembled the laptop, hoping the system only needed a little dust clean-up and fresh thermal paste.
Turns out the problem was actually a swollen battery.
After removing the battery, the system booted fine. Just to be sure that the swollen battery was the root cause, a complete system stress test was run. No issues were detected (well, except “Missing Battery”).
The system was returned to its residence and connected to the internet, and all services were accessible again.
Precise addresses are removed for privacy
Next Steps
- Install a new battery in the future, as the laptop is not connected to a UPS
- Make sure to test the hardware during environmental changes (too cold, too hot, etc)
- Run a simple status page with an RSS feed in a separate environment and notify users
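As a rough idea of what the last item could look like, here is a minimal sketch that renders incident notes as an RSS 2.0 feed using only the Python standard library. Nothing here is an existing setup: the function name `build_status_feed`, the feed title and the `status.example.org` URL are placeholders, and the generated file would still have to be published from a separate environment.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime


def build_status_feed(title, link, incidents):
    """Render (when, headline, details) tuples as an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Service status updates"
    for when, headline, details in incidents:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = headline
        ET.SubElement(item, "description").text = details
        # RFC 2822 date; a naive datetime is rendered with "-0000".
        ET.SubElement(item, "pubDate").text = format_datetime(when)
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Placeholder title and URL; the timestamp is the start of this incident.
    print(build_status_feed(
        "Example status",
        "https://status.example.org",
        [(datetime(2023, 7, 3, 3, 5, 59),
          "Server unreachable",
          "All services are down.")],
    ))
```

The timestamp is deliberately left without a timezone, since none is given above; `format_datetime` marks that case with `-0000`.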
If you’re new here, then first of all I’d like to thank you for reading this IR Postmortem article.
Yes, this was an IR Postmortem of a home server of a tiny community in a tiny country. This was not about Amazon, Google, Netflix, etc.
I wrote this for two reasons.
First, I wanted to show you how awesome the actual internet is. You see, when Amazon dies, everything dies with it. Your startup infra, your website, your hobby projects, everything.
When my server dies, only my server dies. And that’s the beauty of the internet. If you can, please, keep that beauty going.
Second, I run a small security company, illuria, Inc., where we help companies harden their environment and recover from incidents. It’s been years since I wrote an IR postmortem personally (my team members who do that are way smarter than me!), and I thought it would be a nice exercise to write it all by myself 🙂
I hope you liked this.
That’s all folks…
@antranigv Enjoyed this one, thanks for sharing. I’m always curious how others are dealing with things like these. Have you now got anything in place to monitor the temperature of that box?