Incidents

When your data stack suffers from multiple data quality issues at once, the individual anomalies are often related. In these cases, examining the related anomalies together can give you a higher-level perspective on what’s going on.

Incidents is a feature that automatically groups related failing tests and presents them in easy-to-digest, interactive summaries, complete with aggregate downstream impact, usage, and more.

Incidents offer two benefits:

  1. Viewing groups of failing tests through the lens of an Incident can surface insights that are not obvious when sifting through individual failing tests.
  2. Incidents help combat alert fatigue when multiple data quality issues arise all at once. Instead of alerting you for every individual failure, Metaplane alerts you about groups of related failing tests.

Interacting with Incidents

Incidents introduce a slight adjustment in workflow. Previously, users would get an individual alert per failing test. Now, an alert may summarize multiple failing tests, with the option to navigate into the app for more context.

From the Incident page, users may perform the following actions:

  • Mute the Incident - If you mute an Incident, Metaplane won’t distract you with new alerts. If you’ve connected Slack, we’ll silently update the original Incident Slack message and post updates as thread replies without broadcasting them to the main channel.
  • Drill into specific failing tests - If you need to take a closer look at a test’s chart or its details, you can click through to the test page from the Incident detail page.
  • Mark all or individual tests as normal - While individual Incident tests can be marked as normal, you can also choose to mark all of the related tests as normal in a single click.

Incident test grouping

As we learn more about your data stack and its interdependencies, Incidents will get better at associating related failures.

Today, Incidents will group failing tests together if the tests meet two conditions:

  1. The failing test types match
  2. The failing tests are associated with the same part of your stack, e.g., failing tests associated with columns of the same table, or with tables that are part of the same schema, etc.

For example, if multiple table freshness tests are failing, Incidents will group them together if the tables are part of the same schema. However, if the tables are part of two different schemas, then two separate Incidents will be opened.
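
The two grouping conditions can be illustrated with a minimal sketch. This is not Metaplane's implementation: the FailingTest class, its field names, and the group_into_incidents function are hypothetical, and the sketch assumes the "same part of your stack" is always the schema, which is only one of the cases described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailingTest:
    # All field names are hypothetical, chosen for illustration only.
    test_type: str  # e.g. "freshness" or "row_count"
    schema: str     # schema that contains the tested table
    table: str      # table the test is attached to


def group_into_incidents(tests):
    """Group failing tests that share both a test type and a schema."""
    incidents = defaultdict(list)
    for test in tests:
        # Condition 1: same test type. Condition 2: same schema (assumed scope).
        incidents[(test.test_type, test.schema)].append(test)
    return dict(incidents)


failing = [
    FailingTest("freshness", "analytics", "orders"),
    FailingTest("freshness", "analytics", "customers"),
    FailingTest("freshness", "staging", "raw_events"),
]
incidents = group_into_incidents(failing)
# The two "analytics" tests share one group; the "staging" test gets its own.
print(len(incidents))  # prints 2
```

Here the three failing freshness tests produce two groups because the tables span two schemas, mirroring the example above.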

Incident notifications

Incidents are designed to only send you notifications when there is important information to look at. We will send a notification in the following situations:

Slack

  1. We always send the Incident alert when a new Incident is opened.
  2. We will send at most one threaded Slack update per hour if any new tests are linked to the Incident. This update is broadcast to the channel unless the Incident is muted.
  3. We will send a daily reminder notification for Incidents that are still open.

Note: If an Incident is muted, we will not broadcast any Incident updates or send the daily reminder notification.
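
The Slack rules above can be read as three small decisions. The sketch below is one possible interpretation, not Metaplane's code: every function name is hypothetical, and it assumes "at most one update an hour" means a rolling one-hour window measured from the previous threaded update.

```python
from datetime import datetime, timedelta

# Assumed reading of "at most one threaded update an hour".
UPDATE_INTERVAL = timedelta(hours=1)


def should_post_thread_update(now, last_update_at, has_new_tests):
    """Post a threaded update only if new tests were linked and an hour has passed."""
    if not has_new_tests:
        return False
    return last_update_at is None or now - last_update_at >= UPDATE_INTERVAL


def should_broadcast(muted):
    """Threaded updates are broadcast to the channel unless the Incident is muted."""
    return not muted


def should_send_daily_reminder(is_open, muted):
    """Daily reminders go out only for open Incidents that are not muted."""
    return is_open and not muted


now = datetime(2024, 1, 1, 12, 0)
recent = now - timedelta(minutes=30)
print(should_post_thread_update(now, recent, has_new_tests=True))  # prints False
print(should_broadcast(muted=True))                                # prints False
```

Note that in this reading muting does not stop the threaded update itself. It only keeps the update from being broadcast to the main channel, which matches the mute behavior described earlier.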

PagerDuty

  1. We will create a PagerDuty incident when a Metaplane Incident is opened.
  2. We will update the PagerDuty incident title when new tests are linked to the Metaplane Incident.
  3. We will resolve the PagerDuty incident once the Metaplane Incident has been resolved.

Note: Muting an Incident in Metaplane will not affect the PagerDuty incident. If you'd like to mute the PagerDuty incident, please do so through the PagerDuty application.
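
As a rough mental model, the lifecycle above maps each Metaplane Incident event to one PagerDuty action, with muting deliberately ignored. The event and action names below are hypothetical labels for illustration. They are not Metaplane or PagerDuty API identifiers.

```python
# Hypothetical event names (keys) and action labels (values).
PAGERDUTY_ACTIONS = {
    "opened": "create_incident",
    "tests_linked": "update_title",
    "resolved": "resolve_incident",
}


def pagerduty_action(event, muted=False):
    """Return the PagerDuty action for a Metaplane Incident event.

    The muted flag is accepted but never consulted, because muting in
    Metaplane does not affect the PagerDuty incident.
    """
    return PAGERDUTY_ACTIONS.get(event)


print(pagerduty_action("opened", muted=True))  # prints create_incident
```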