About anomaly detection in DPA
DPA uses an anomaly detection algorithm to determine if the wait times for a database instance are significantly higher than usual. In some cases, high wait times are normal and expected. With anomaly detection, DPA can alert you to unexpected increases in wait times, and help you investigate these anomalies.
How does DPA's anomaly detection work?
A machine learning algorithm uses wait time data that DPA collects to predict future wait times. DPA uses these predictions to detect wait times that are significantly higher than expected.
Step 1: Data collection
DPA gathers the data that the algorithm will use to learn what normal is and to predict future wait times. Up to 90 days of historical hourly data is used for learning. Anomaly detection requires a minimum of three days of learning data. DPA does not show any information about anomalies until it has collected at least three days of data. Predictions improve as more data is collected.
Step 2: Data analysis and predictions
Based on the learning data, the algorithm calculates:
When enough data is available, predictions include daily and weekly seasonality (patterns of predictable fluctuations):
Step 3: Anomaly detection
For each hour, DPA compares the actual amount of wait time during that hour to the predicted value. If the actual amount of wait time is above the warning or critical threshold, DPA:
How DPA determines the status of an incomplete hour
To determine if the wait time meter and hourly Anomaly Detection chart should show a warning or critical status for an incomplete hour, DPA uses the last 6 completed 10-minute intervals (a rolling one-hour interval). The status is updated every 10 minutes. For example, to determine the status of the 2:00 hour:
- From 2:00 to 2:09, DPA uses data from 1:00 to 1:59.
- From 2:10 to 2:19, DPA uses data from 1:10 to 2:09.
- From 2:20 to 2:29, DPA uses data from 1:20 to 2:19 (and so on).
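The rolling one-hour interval described above can be sketched in Python. This is only an illustration of the arithmetic, not DPA's implementation; the function name rolling_window is hypothetical, and the interval is expressed as half-open (the end time itself is excluded, so an end of 2:10 means data through 2:09).

```python
from datetime import datetime, timedelta
from typing import Tuple


def rolling_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return [start, end) covering the last six completed 10-minute intervals.

    Hypothetical helper: 'end' is 'now' rounded down to the most recent
    10-minute boundary, and 'start' is one hour before that.
    """
    end = now.replace(minute=now.minute - now.minute % 10, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    return start, end


# At 2:15, the status of the 2:00 hour is based on data from 1:10 up to (not including) 2:10.
start, end = rolling_window(datetime(2024, 1, 1, 2, 15))
print(start.strftime("%H:%M"), end.strftime("%H:%M"))  # 01:10 02:10
```

Because the end of the interval only moves at each 10-minute boundary, any time from 2:10 to 2:19 yields the same interval, which matches the 10-minute update cycle.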
SQL statements excluded from the trend charts
The anomaly detection algorithm uses the total wait time for the database instance, including wait time from any SQL statements that you have excluded from the trend charts. In most cases, a statement is excluded from the trend charts because it always has high wait times and the large bar dominates the charts. If the statement runs on a regular schedule with the expected amount of wait time, no anomaly would be detected during that time period, because high wait times are normal during that period. An anomaly would be detected only if wait times during that period were significantly higher than normal, in which case you might want to investigate the change.
Does anomaly detection work well for all database instances?
DPA's anomaly detection algorithm, like most algorithms associated with workloads, works best when:
- The monitored database instances have a consistent workload executing against them.
- Daily and weekly seasonality is consistent. For example, database wait times are similar each Monday at 10 AM.
- DPA monitoring is always on (not shut down for hours or days at a time).
The algorithm might not work well when:
- The workload for a database instance is sporadic (for example, QA or reporting instances with inconsistent wait times).
- Daily and weekly seasonality is not consistent. For example, the workload on Monday at 10 AM varies from one week to the next, with no predictable pattern.
- DPA is not monitoring the instance consistently, and so it cannot get a good understanding of what normal is.
If anomaly detection does not work well for any of your monitored instances, SolarWinds recommends disabling anomaly detection for those instances.
Large gaps in the learning data
If monitoring stops for more than 30 days, the anomaly detection algorithm does not make predictions based on the stale learning data collected before the 30-day gap. DPA collects new learning data and, after three days, begins to make predictions based on the current data.
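The following Python sketch shows one way the learning-data rules could be combined: the 90-day history limit, the 30-day gap rule, and the three-day minimum. It is a simplified illustration under stated assumptions, not DPA's actual logic. The function name usable_learning_data is hypothetical, and counting 72 hourly samples as "three days of data" is an assumption.

```python
from datetime import datetime, timedelta


def usable_learning_data(timestamps, now,
                         max_history=timedelta(days=90),
                         max_gap=timedelta(days=30),
                         min_samples=3 * 24):
    """Return (usable hourly timestamps, whether predictions can be made).

    Hypothetical helper. Assumes one timestamp per collected hour.
    """
    # Keep only samples from the last 90 days.
    cutoff = now - max_history
    recent = sorted(t for t in timestamps if cutoff <= t <= now)

    # Discard everything collected before the most recent gap of more than 30 days.
    start_index = 0
    for i in range(1, len(recent)):
        if recent[i] - recent[i - 1] > max_gap:
            start_index = i
    usable = recent[start_index:]

    # Assumption: three days of learning data means at least 72 hourly samples.
    return usable, len(usable) >= min_samples
```

With this sketch, an instance that resumes monitoring after a long outage returns only the post-gap samples, and the second return value stays False until enough new hourly samples accumulate.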
Anomaly thresholds
Anomalies are classified as either warning or critical. The threshold for each classification is based on the standard deviation of the wait times for the associated time period.
Standard deviation is a measure of how dispersed the values in a data set typically are.
The default values for the thresholds are listed below. You can edit the associated advanced option to change the default values.
Classification | Default threshold | Advanced option |
---|---|---|
Warning | The predicted wait time for the hour + 2 standard deviations | ANOMALY_DETECTION_THRESHOLD_WARNING |
Critical | The predicted wait time for the hour + 3 standard deviations | ANOMALY_DETECTION_THRESHOLD_CRITICAL |
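As a rough illustration of the default thresholds, the Python sketch below classifies one hour of wait time. The function name classify_wait_time and its parameters are hypothetical, and treating the advanced option values as standard deviation multipliers is an assumption. How DPA computes the predicted value and the standard deviation is not shown here.

```python
def classify_wait_time(actual, predicted, std_dev,
                       warning_multiplier=2.0, critical_multiplier=3.0):
    """Classify an hour as 'critical', 'warning', or 'normal'.

    Hypothetical helper. The default multipliers mirror the default
    thresholds: predicted + 2 and predicted + 3 standard deviations.
    """
    # Check the higher threshold first so critical takes precedence.
    if actual > predicted + critical_multiplier * std_dev:
        return "critical"
    if actual > predicted + warning_multiplier * std_dev:
        return "warning"
    return "normal"


# Predicted 100 s of wait time with a standard deviation of 10 s:
# the warning threshold is 120 s and the critical threshold is 130 s.
print(classify_wait_time(actual=125, predicted=100, std_dev=10))  # warning
```

Raising a multiplier in this sketch makes the corresponding status less sensitive, which is the effect you would expect from increasing the associated threshold.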
Specify the learning date after the load on a database instance changes
If the load on a database instance changes significantly (for example, because of changes in the network environment), the previously collected learning data is no longer accurate. To prevent this data from being used for anomaly detection, set the advanced Support option ANOMALY_DETECTION_FORCE_LEARNING_DATE to the date when the load change occurred. Wait time data collected before this date will not be used to predict future wait times.
Disable anomaly detection for a database instance
By default, anomaly detection is enabled for all database instances. To disable anomaly detection for a database instance that has an inconsistent workload or sporadic monitoring, set the advanced option ANOMALY_DETECTION_ENABLED to False for that instance.