Model Troubleshooting Guide

Trying to figure out how to get better results out of the machine learning models that power your monitors? Start with the steps below.

Issues with failures and incidents

1. The monitor is failing but the current value is OK.

You have a few options to get your monitor back to a normal state:

  • Click a recent failing data point and choose “Mark as Normal” in the table. This pops up two options to tell the machine learning model what you want:
    • “Until this happens again” tells the model to adjust the ranges to include the current value but doesn’t affect the model in the long term.
    • “Until there’s a bigger anomaly” teaches the model to expect similar jumps, drops, or flatlines in the future.
  • Decrease the sensitivity. This increases the size of the ranges. For reference, dropping sensitivity to the minimum value will roughly double the size of the ranges.
  • Exclude history with outdated patterns. Update the “Include data since” field to the date and time when the new normal pattern started. The model may re-enter training, so it may not send alerts for up to a week while it learns the new normal.

Issues with predictions

You have a few options to fine-tune your predictions:

1. The ranges are wider than I want them to be.

  • Increase the sensitivity. This shrinks the ranges. For reference, raising sensitivity to the maximum value will shrink the ranges by roughly two-thirds.
  • Use the “manual” anomaly detection method. This is only advised if you have a clear idea of what values you consider to be normal and what values you’d want Metaplane to create an incident for.
  • Use the “stationary” machine learning model. This is only advised if your monitors oscillate up and down around a fixed point. It does not work well for monitors that:
    • Trend upward or downward over time
    • Have a value that never changes (i.e. a flatline)
    • Have a “sawtooth” pattern, like freshness
    • Have seasonality cycles that last longer than 30 days
  • Remove “Mark as Normal - Until there’s a bigger anomaly” annotations in the history by clicking them and using “Mark as Anomaly”. These annotations expand the ranges for as long as they remain in the history; removing them undoes the expansion.
  • Exclude history with outdated patterns. If your metric is less volatile now than in the past, update the “Include data since” field to the date and time when the new, calmer pattern started.
  • Create a new monitor to complement the existing one. For example:
    • Pairing a freshness monitor with a row count monitor can help catch issues that a row count monitor might not catch alone.
    • Pairing the existing monitor with another on a different column or table can also help. If the second monitor is of a different monitor type, or uses a different rolling time window or WHERE clause to look at a different subset of the data, this combination can surface issues that might not be visible in the original monitor.
    • Adding a GROUP BY monitor that monitors subsets of the data can surface issues that may be harder to detect on the scale of a table as a whole. For example, a GROUP BY monitor that monitors the row count for individual data sources can more easily catch when one source stops sending data.
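
For illustration, here is a sketch of the aggregation a GROUP BY row count monitor effectively performs, written as Snowflake-style SQL. The events table and source column are hypothetical; Metaplane builds this query from the monitor settings rather than requiring you to write it yourself:

    SELECT source, COUNT(*) AS row_count
    FROM events
    GROUP BY source;

If one source stops sending data, its per-source count drops to zero even while the table’s total row count still looks normal.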

2. The predictions are fluctuating in ways that don’t make sense.

  • Use “Mark as Normal” on failing data points in the history that were actually OK. This helps the model learn what you consider normal.
  • Use a rolling time window to feed the monitor a more predictable subset of the data. With a rolling time window, the monitor only looks at the most recent data (last day, last week, last month, etc.). This can also make the monitor a better fit for the Stationary model. Be sure to update the “Include data since” field to exclude any observations from before you made this change, as they will throw off the predictions. (See the example query after this list.)
  • Use a WHERE clause to feed the monitor a more predictable subset of the data. Excluding less predictable subsets can make the metric easier for the machine learning models to predict. As with time windows, be sure to update the “Include data since” field to exclude any observations from before you made this change.
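
For illustration, here is what both filters look like as a Snowflake-style row count query. The orders table and its created_at and environment columns are hypothetical; in Metaplane you would apply these filters from the monitor settings rather than writing the query yourself:

    SELECT COUNT(*) AS row_count
    FROM orders
    -- Rolling time window: only measure the most recent day of data
    WHERE created_at >= DATEADD('day', -1, CURRENT_TIMESTAMP)
      -- WHERE clause: exclude a less predictable subset, such as test traffic
      AND environment <> 'test';

Both filters trade coverage for stability: the model sees a smaller, steadier signal, so its predictions fluctuate less.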

3. My custom SQL monitor isn’t behaving how I expect it to.

  • Use a different machine learning model type. Models that align with the aggregation in your query can often be more effective than the default model. For example, use the “Row Count” model type if your query returns a count.
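
For example, a custom SQL monitor built on a query like the hypothetical one below returns a single count, so the “Row Count” model type is likely a better fit than the default:

    SELECT COUNT(*) AS failed_payments
    FROM payments
    WHERE status = 'failed';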

Still having issues? Send us a message. We're constantly working to improve the behavior of the ML models that power your anomaly detection, and we'd love to hear more.