How failovers work
This topic applies only to the following products:
SolarWinds Observability Self-Hosted
DPAIM — EOC — IPAM — LA — NAM — NCM — NPM — NTA — SAM — SCM — SRM — UDT — VMAN — VNQM — WPM
After SolarWinds Platform High Availability (HA) is enabled and you have set up a pool, each pool monitors itself for failover conditions such as:
- Inability to connect to the network
- Stopped SolarWinds services
  Note: Stopped Agent services are not a failover condition.
- Power loss
- Network connection loss to the primary server
When a monitored service goes down, the SolarWinds Platform server first gives the service a chance to recover before failing over to the secondary server. If the same service fails again within the default self-recovery period, a failover occurs.
When a failover condition is met and a failover occurs in a pool, a failover event is logged. You can view the event in the Event Summary resource or the Events view. An email is also sent to your default recipients.
For example, if the job engine service is down, the HA software attempts to start it. If the job engine fails again within 1 hour, a failover occurs and the event is logged. If the job engine does not fail again until 61 minutes later, a failover does not occur.
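The self-recovery rule in the job engine example can be sketched as a small decision function. This is only an illustration, not the HA software's actual implementation. The class name, method name, and return values are hypothetical. The 1-hour period is taken from the example above. Two behaviors are assumptions: a failure at exactly 60 minutes counts as "within" the period, and a failure outside the period leads to another restart attempt.

```python
from datetime import datetime, timedelta

# Assumption: the default self-recovery period is 1 hour, as in the example above.
SELF_RECOVERY_PERIOD = timedelta(hours=1)


class ServiceRecoveryTracker:
    """Hypothetical model of the recover-first, then fail-over decision."""

    def __init__(self, recovery_period=SELF_RECOVERY_PERIOD):
        self.recovery_period = recovery_period
        self._last_failure = {}  # service name -> time of its most recent failure

    def on_service_down(self, service, now):
        """Return "failover" or "restart" for a failure of `service` at `now`."""
        previous = self._last_failure.get(service)
        self._last_failure[service] = now
        # Assumption: exactly 60 minutes is treated as inside the period.
        if previous is not None and now - previous <= self.recovery_period:
            return "failover"  # same service failed again within the period
        return "restart"  # first failure, or the previous one is too old


tracker = ServiceRecoveryTracker()
start = datetime(2024, 1, 1, 12, 0)
print(tracker.on_service_down("job engine", start))  # restart
print(tracker.on_service_down("job engine", start + timedelta(minutes=30)))  # failover
```

Failures are tracked per service, so a second failure of a different service inside the same hour does not trigger a failover in this sketch.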
Failovers with virtual hostnames
When your HA pool uses a virtual hostname, a failover may appear not to work because of DNS caching. The client DNS cache can take up to one minute to redirect traffic to the new active pool member.
However, your browser's DNS cache does not respect the DNS Time to Live (TTL) value, and how long entries are retained varies between browsers, from 60 seconds to 24 hours. You must flush your browser's cache to be successfully redirected to the new active pool member.
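To tell a stale browser cache apart from a stale operating system cache, it can help to check what the client's system resolver currently returns for the virtual hostname. The following is a minimal sketch using Python's standard socket module. The function names are hypothetical, and the `resolver` parameter exists only so the lookup can be replaced in tests. The check reflects the operating system resolver only and says nothing about the browser's own cache.

```python
import socket


def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses the resolver gives for hostname."""
    results = resolver(hostname, None)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})


def points_to(hostname, expected_ip, resolver=socket.getaddrinfo):
    """Return True if hostname currently resolves to expected_ip on this client."""
    return expected_ip in resolved_addresses(hostname, resolver)
```

For example, `points_to("orion-ha.example.com", "10.0.0.2")`, with a placeholder hostname and address, would return True once the system resolver has picked up the new active pool member. If it returns True but the browser still reaches the old member, the browser cache is the likely cause.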