Policy: Internal Service Level Agreement

Device recovery actions

The following matrix shows the sequence of WhatsUp Gold managed controls (thresholds and states coupled with actions) rolled into an example action policy called "Internal Service Level Agreement."

Event | Actions | Notification

Event: Device crash and reboot, or forced reboot after n seconds.

Actions: Run recovery action scripts.

  • Verify network (NIC) connectivity.
  • Verify application service connectivity.
  • If application service verification fails, trigger the failed web service role policy.

Notification:

  • Data center team notification.
  • On-call services engineers notification.
  • Notification attachments:
    • Event log report.
    • Follow up with availability reports.
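
The recovery sequence for this event can be sketched roughly as follows. This is only an illustration, not WhatsUp Gold code: `run_recovery`, `RecoveryResult`, and the two probe callables are hypothetical names, and the sketch assumes both checks always run, since the matrix does not say whether a failed NIC check stops the sequence.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryResult:
    """Outcome of one pass through the recovery sequence (hypothetical type)."""
    steps: List[str] = field(default_factory=list)
    escalated: bool = False


def run_recovery(check_nic: Callable[[], bool],
                 check_app_service: Callable[[], bool]) -> RecoveryResult:
    """Run the post-reboot checks in the order listed in the matrix.

    Both callables are hypothetical probes supplied by the caller;
    each returns True when its check passes.
    """
    result = RecoveryResult()

    nic_ok = check_nic()
    result.steps.append(f"nic_connectivity={'ok' if nic_ok else 'failed'}")

    app_ok = check_app_service()
    result.steps.append(f"app_service={'ok' if app_ok else 'failed'}")

    # Only a failed application service check escalates to the
    # failed web service role policy, as the matrix states.
    if not app_ok:
        result.escalated = True
        result.steps.append("trigger=failed_web_service_role_policy")

    return result


if __name__ == "__main__":
    outcome = run_recovery(lambda: True, lambda: False)
    print(outcome.steps)
    print(outcome.escalated)
```

Passing the probes in as callables keeps the ordering logic separate from however connectivity is actually verified in a given environment.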

Event: Failed web service role policy. (No response, or a response slower than n milliseconds, to application service active monitors run from WhatsUp Gold pollers.)

Actions:

  • Remove the web service role endpoint from the "load balancer" configuration.
  • Add the device to WhatsUp Gold maintenance mode.

Notification:

  • Data center team notification.
  • On-call services engineers notification.
  • Product owner notification with availability reports.
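
A minimal sketch of this policy's trigger and actions, under stated assumptions: `monitor_failed` and `apply_failed_role_policy` are hypothetical helpers, the load balancer configuration and maintenance mode are modeled as a plain list and set rather than any real API, and "after n milliseconds" is read as a response time strictly greater than the threshold.

```python
from typing import List, Optional, Set, Tuple


def monitor_failed(response_ms: Optional[float], threshold_ms: float) -> bool:
    """Return True when a poll counts as failed.

    response_ms is None when the poller received no response at all;
    threshold_ms stands in for the "n milliseconds" in the matrix.
    """
    return response_ms is None or response_ms > threshold_ms


def apply_failed_role_policy(endpoint: str,
                             pool: List[str],
                             maintenance: Set[str]) -> Tuple[List[str], Set[str]]:
    """Apply the two actions for a failed web service role.

    pool is an in-memory stand-in for the load balancer configuration and
    maintenance is a stand-in for maintenance mode (both hypothetical).
    Returns new collections; the inputs are left unchanged.
    """
    new_pool = [member for member in pool if member != endpoint]
    new_maintenance = set(maintenance) | {endpoint}
    return new_pool, new_maintenance


if __name__ == "__main__":
    pool = ["web-01", "web-02", "web-03"]
    in_maintenance: Set[str] = set()
    if monitor_failed(response_ms=None, threshold_ms=500):
        pool, in_maintenance = apply_failed_role_policy("web-02", pool, in_maintenance)
    print(pool)
    print(in_maintenance)
```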

Event: Failed node. Symptoms: external NIC or management NIC not responding, remote execution failure, remote write failure, kernel panic.

Actions: Run failed node recovery action.

Notification:

  • Data center team notification.
  • Notify closest data center engineer.
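
The failed node row could be modeled as a simple symptom match, sketched below. The function name `failed_node_plan`, the symptom strings, and the returned dictionary layout are all illustrative assumptions; the sketch also assumes that any single listed symptom is enough to classify the node as failed, which the matrix does not state explicitly.

```python
from typing import Dict, Iterable, List, Optional, Union

# Symptoms taken from the matrix; the exact strings are illustrative.
FAILED_NODE_SYMPTOMS = frozenset({
    "external NIC not responding",
    "management NIC not responding",
    "remote execution failure",
    "remote write failure",
    "kernel panic",
})


def failed_node_plan(
    observed: Iterable[str],
) -> Optional[Dict[str, Union[str, List[str]]]]:
    """Return the action and notifications if any failed node symptom is seen.

    Returns None when none of the observed conditions is a listed symptom.
    """
    matched = sorted(set(observed) & FAILED_NODE_SYMPTOMS)
    if not matched:
        return None
    return {
        "action": "run failed node recovery action",
        "notify": ["data center team", "closest data center engineer"],
        "symptoms": matched,
    }


if __name__ == "__main__":
    print(failed_node_plan(["kernel panic", "disk full"]))
    print(failed_node_plan(["disk full"]))
```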