🧑‍🚒 Migrating to a Modern Alerting, On-Call and Incident Platform

At Skutopia we recently migrated off our legacy enterprise platform for managing alerts, on-call and incidents. We now have a modern platform thanks to incident.io, and the benefits have been immense.

Incident.io: One Platform to Rule Them All

What are the benefits of having all of our alerts, on-call and incidents in one platform? Quite simply, our engineers only need to look in one place to have a complete understanding of the current state of our platform. Should something go wrong, everything within the platform interacts natively, which makes it easy for engineers to get on with their job.

It doesn't hurt that incident.io is easy on the eyes, with very straightforward configuration. It's almost impossible to get it wrong.

Alerts

All of our alerts flow into one tab in the web app, so it's very clear when we have active alerts. We have full control over which fields we pass in with alerts thanks to the metadata object that we can map in the alert source. This means that, at a glance, our responders are presented with runbooks, alert priorities, links to dashboards, and the affected part of the business.
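To make the idea of a metadata object concrete, here is a minimal sketch of an alert event that carries the fields mentioned above. The field names, the `build_alert` helper, and the example.com URLs are assumptions for illustration only; they are not incident.io's actual schema or our production values.

```python
import json


def build_alert(title, priority, product, runbook_url, dashboard_url):
    """Build an alert event with a metadata object.

    All field names here are hypothetical; the real shape depends on how
    the alert source is configured and mapped.
    """
    return {
        "title": title,
        "status": "firing",
        "metadata": {
            "priority": priority,        # alert priority shown to the responder
            "product": product,          # affected part of the business
            "runbook": runbook_url,      # link to the runbook
            "dashboard": dashboard_url,  # link to the relevant dashboard
        },
    }


alert = build_alert(
    title="Order queue backlog growing",
    priority="high",
    product="fulfilment",
    runbook_url="https://example.com/runbooks/order-queue",
    dashboard_url="https://example.com/dashboards/order-queue",
)
print(json.dumps(alert, indent=2))
```

Keeping everything the responder needs inside one metadata object means the alert source mapping is the only place that has to change when a new field is added.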

On-Call

Once an alert goes off, it flows straight into an escalation path. At the moment we organise on-call at the product vertical level. In the escalation paths we simply do a lookup on the affected product, and off the alert goes down the correct path. The current active responder receives a ping via the Slack bot as well as via the phone app. The on-call schedules sync into a Slack user group, so it's as simple as tagging the @on-call Slack handle should something need to be made visible.
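The product lookup can be pictured as a simple mapping from product to escalation path. This is only a sketch of the idea: in practice the routing is configured inside the platform rather than written as code, and the product names, path names, and the fallback path below are all hypothetical.

```python
# Hypothetical mapping of product vertical -> escalation path name.
ESCALATION_PATHS = {
    "fulfilment": "fulfilment-on-call",
    "shipping": "shipping-on-call",
}

# Assumed fallback for alerts with a missing or unknown product.
DEFAULT_PATH = "platform-on-call"


def route_alert(alert):
    """Return the escalation path for an alert based on its affected product."""
    product = alert.get("metadata", {}).get("product")
    return ESCALATION_PATHS.get(product, DEFAULT_PATH)


print(route_alert({"metadata": {"product": "shipping"}}))
print(route_alert({"metadata": {}}))
```

Having a fallback path is a design choice worth making explicitly, so that an alert with an unmapped product still reaches someone.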

Incidents

When an alert is raised, a draft incident is prepared in the background. This allows us to move quickly into full incident response mode should there be a major issue. Should the business need to raise an incident manually, it's as simple as running the Slack command /inc. The user is presented with a short form to fill out to give the responder context.

When the incident is created, the platform prepares a dedicated Slack channel that acts as the single source of information regarding the incident. The troubleshooting that takes place in the channel syncs back to the platform, which allows us to run effective post incident reviews.

Updates are sent to a stakeholder Slack channel, so stakeholders don't need to be in the weeds to know what's happening. They can follow the concise updates that the incident lead posts roughly every 30 minutes.

Post Incident Reviews

One of the big benefits of the platform is being able to run effective post incident reviews. These are our incident retrospectives, where we agree on:

What went wrong?
How did we troubleshoot it?
What was the root cause?
What was the impact on the business?
- Cost and hours of engineering
- Cost in terms of Ops
- Impact on customers in terms of number of affected orders and complaints
What can we do to prevent this happening again?

Post Incident Review Tasks

On the back of the meeting we have a clear list of post incident review tasks to complete. These are logged in incident.io and synced across to our ticketing platform. Each task is assigned a priority, and each priority has a clear definition:

Urgent: Drop everything; this must be fixed now
High: We can finish our current task, but this is the next priority
Low: This is a backlog item; there are reasonable workarounds
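As a rough illustration of how these three priorities order a list of review tasks, here is a small sketch. The `Priority` enum, the `order_tasks` helper, and the sample task titles are hypothetical and are not taken from incident.io or our ticketing platform.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means the task should be picked up sooner."""
    URGENT = 0  # drop everything, fix now
    HIGH = 1    # next task after the current one
    LOW = 2     # backlog item with a reasonable workaround


def order_tasks(tasks):
    """Sort (title, priority) pairs so the most pressing tasks come first.

    sorted() is stable, so tasks with equal priority keep their input order.
    """
    return sorted(tasks, key=lambda task: task[1])


tasks = [
    ("Document the workaround", Priority.LOW),
    ("Add alert on queue depth", Priority.HIGH),
    ("Patch the failing consumer", Priority.URGENT),
]
for title, priority in order_tasks(tasks):
    print(f"{priority.name}: {title}")
```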

Wrapping Up

Recently we have moved to a modern platform for alerts, on-call and incidents. It gives us a single source of truth that is easy to configure, easy on the eyes, and easy for our engineers to use. It takes us through the full life cycle and keeps us accountable for running reviews and completing tasks.
