Every failing monitor in Metaplane automatically triggers a new incident. An incident can have one or more monitors associated with it: when your data stack is suffering from several data quality issues at once, the individual anomalies are often related. In these cases, examining the related anomalies together gives you a higher-level perspective on what’s going on.
The Incidents feature automatically groups related failing monitors for you and presents easy-to-digest, interactive summaries that include aggregate downstream impact, usage, and more.
There are a couple of benefits to Incidents:
- Viewing groups of failing monitors through the lens of an Incident can reveal insight that is not obvious when sifting through individual failing monitors.
- Incidents help combat alert fatigue when multiple data quality issues arise all at once. Instead of getting alerted for every individual failure, Metaplane will alert you about related groups of failing monitors.
Incidents introduce a slight adjustment in workflow. Previously, users would get an individual alert per failing monitor. Now, a single alert may summarize multiple failing monitors, with the option to navigate into the app for more context.
From the Incident page, users may perform the following actions:
- Mute the Incident - If you mute an Incident, Metaplane won’t distract you with new alerts. If you’ve connected Slack, we’ll silently update the original Incident Slack message and post updates as thread replies without broadcasting them to the main channel.
- Drill into specific failing monitors - If you need to take a closer look at a monitor's chart or its details, you can click through to the test page from the Incident detail page.
- Mark all or individual monitors as normal - While individual Incident monitors can be marked as normal, you can also choose to mark all of the related tests as normal in a single click.
As we learn more about your data stack and its interdependencies, Incidents will get better at associating related failures.
Today, Incidents will group failing monitors together if the monitors meet two conditions:
- The failing monitor types match
- The failing monitors are associated with the same part of your stack, e.g., columns of the same table, or tables that are part of the same schema.
For example, if multiple table freshness monitors are failing, Incidents will group them together if the tables are part of the same schema. However, if the tables belong to two different schemas, then two separate Incidents will be opened.
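The two grouping conditions can be illustrated with a small sketch. This is not Metaplane's implementation: `FailingMonitor`, `grouping_key`, and `group_into_incidents` are hypothetical names, and the rule that column-level monitors group by table while table-level monitors group by schema is an assumption drawn from the examples above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailingMonitor:
    """Hypothetical record describing one failing monitor."""
    monitor_type: str              # e.g. "freshness", "row_count", "nullness"
    schema: str
    table: str
    column: Optional[str] = None   # set only for column-level monitors


def grouping_key(monitor: FailingMonitor) -> tuple:
    # Condition 1: the monitor type is always part of the key.
    # Condition 2 (assumed): column-level monitors group by their parent
    # table, table-level monitors group by their parent schema.
    if monitor.column is not None:
        return (monitor.monitor_type, "table", f"{monitor.schema}.{monitor.table}")
    return (monitor.monitor_type, "schema", monitor.schema)


def group_into_incidents(monitors: list) -> dict:
    """Bucket failing monitors so that each bucket corresponds to one incident."""
    incidents = defaultdict(list)
    for monitor in monitors:
        incidents[grouping_key(monitor)].append(monitor)
    return dict(incidents)
```

With this sketch, two freshness monitors on tables in the same schema land in one bucket, while a third freshness monitor on a table in another schema opens a second bucket, mirroring the example above.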
Incidents are designed to only send you notifications when there is important information to look at. We will send a notification in the following situations:
Slack (setup instructions)
- We always send the incident alert when a new incident is opened.
- We will send at most one threaded Slack update per hour if any new monitors are linked to the incident. This update is broadcast to the channel unless the incident is muted.
- We will send a daily reminder notification for incidents that are still open.
Note: If an incident is muted, we will not broadcast any incident updates or send the daily reminder notification.
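The Slack rules above amount to a simple rate limit plus a mute check. The following is a minimal sketch of that decision logic under stated assumptions, not Metaplane's actual code: `IncidentState`, `next_slack_update`, and `should_send_daily_reminder` are hypothetical names, and the return values `"broadcast"` and `"thread_only"` are illustrative labels.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# At most one threaded update per hour, as described above.
UPDATE_INTERVAL = timedelta(hours=1)


@dataclass
class IncidentState:
    """Hypothetical per-incident notification state."""
    muted: bool = False
    last_update_sent: Optional[datetime] = None
    pending_new_monitors: int = 0   # monitors linked since the last update


def next_slack_update(state: IncidentState, now: datetime) -> Optional[str]:
    """Decide whether a threaded update should be sent right now.

    Returns None when nothing should be sent, "broadcast" when the thread
    reply is also broadcast to the channel, and "thread_only" when the
    incident is muted and the reply stays in the thread.
    """
    if state.pending_new_monitors == 0:
        return None   # nothing new to report
    recently_sent = (
        state.last_update_sent is not None
        and now - state.last_update_sent < UPDATE_INTERVAL
    )
    if recently_sent:
        return None   # rate limit: wait until the hour has passed
    return "thread_only" if state.muted else "broadcast"


def should_send_daily_reminder(is_open: bool, muted: bool) -> bool:
    # Daily reminders go out only for open incidents that are not muted.
    return is_open and not muted
```

Keeping the mute check separate from the rate limit reflects the behavior described above: muting suppresses broadcasts and reminders, but thread replies still record new activity.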
PagerDuty
- We will create a PagerDuty incident when a Metaplane incident is opened.
- We will update the PagerDuty incident title when new monitors are linked to the Metaplane incident.
- We will resolve the PagerDuty incident once the Metaplane incident has been resolved.
Note: Muting an incident in Metaplane will not affect the PagerDuty incident. If you'd like to mute the PagerDuty incident, please do so through the PagerDuty application.