This chapter is from the book

Dynamic Failure Detection and Recovery

How can the notification and recovery of IT resource failure be automated?

Problem

When cloud-based IT resources fail, manual intervention may be unacceptably inefficient.

Solution

A watchdog system is established to monitor IT resource status and perform notifications and/or recovery attempts during failure conditions.

Application

Different intelligent monitoring and recovery technologies can be used to establish the automation of failure detection and recovery tasks with a focus on watching, deciding upon, acting upon, reporting, and escalating IT resource failure conditions.

Mechanisms

Audit Monitor, Cloud Usage Monitor, Failover System, SLA Management System, SLA Monitor

Problem

Cloud environments can comprise vast quantities of IT resources that are accessed by numerous cloud consumers. Any of those IT resources can experience predictable failure conditions that require intervention to resolve. Manually administering and resolving standard IT resource failures in cloud environments is generally inefficient and impractical.

Solution

A resilient watchdog system is established to monitor and respond to a wide range of pre-defined failure scenarios. This system can also notify and escalate failure conditions that it cannot resolve automatically.

Application

The resilient watchdog system relies on a specialized cloud usage monitor (also referred to as the intelligent watchdog monitor) to actively monitor IT resources and take pre-defined actions in response to pre-defined events (Figures 4.16 and 4.17).

Figure 4.16 The intelligent watchdog monitor keeps track of cloud consumer requests (1) and detects that a cloud service has failed (2).

Figure 4.17 The intelligent watchdog monitor notifies the watchdog system (3), which restores the cloud service based on predefined policies (4).

The resilient watchdog system, together with the intelligent watchdog monitor, performs the following five core functions:

  • watching
  • deciding upon an event
  • acting upon an event
  • reporting
  • escalating
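
The five functions can be pictured as a single pass over one IT resource. The following is only a minimal sketch: the `Watchdog` class, its `probe` and `recover` callables, and the in-memory `log` and `escalations` lists are hypothetical names introduced for illustration and do not come from any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Watchdog:
    """Hypothetical watchdog; every name here is illustrative."""
    probe: Callable[[], bool]    # watching: returns True when the resource is healthy
    recover: Callable[[], bool]  # acting: returns True when recovery succeeded
    log: List[str] = field(default_factory=list)          # reporting
    escalations: List[str] = field(default_factory=list)  # escalating

    def run_once(self, resource: str) -> str:
        # Watching: check the current status of the resource.
        if self.probe():
            return "healthy"
        # Deciding: a failed probe is treated as a failure event.
        self.log.append(f"{resource}: failure detected")
        # Acting: attempt the pre-defined recovery.
        if self.recover():
            # Reporting: record the successful outcome.
            self.log.append(f"{resource}: recovered automatically")
            return "recovered"
        # Reporting and escalating: hand over what could not be resolved.
        self.log.append(f"{resource}: recovery failed")
        self.escalations.append(resource)
        return "escalated"
```

In a real deployment `probe` and `recover` would wrap actual health checks and failover calls; here they are plain callables so the control flow stays visible.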

Sequential recovery policies can be defined for each IT resource to determine how the intelligent watchdog monitor should behave when encountering a failure condition (Figure 4.18). For example, a recovery policy may state that before issuing a notification, one recovery attempt should be carried out automatically.

Figure 4.18 In the event of a failure, the intelligent watchdog monitor refers to its predefined policies to recover the service step by step, escalating as the problem proves to be deeper than expected.
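
A sequential recovery policy of this kind can be sketched as an ordered list of steps that are tried until one succeeds. The `Policy` type, the `apply_policy` function, and the step names below are assumptions made for illustration, not part of the pattern's definition.

```python
from typing import Callable, List, Tuple

# A policy is an ordered list of (step name, action) pairs. Each action
# returns True if the resource is healthy after the step has run.
Policy = List[Tuple[str, Callable[[], bool]]]


def apply_policy(policy: Policy) -> Tuple[bool, List[str]]:
    """Run the steps in order and stop at the first one that succeeds."""
    attempted: List[str] = []
    for name, action in policy:
        attempted.append(name)
        if action():
            return True, attempted
    return False, attempted


# Hypothetical policy: one automatic recovery attempt, then a notification.
# Both actions are stubs; the restart is assumed to fail in this example.
example_policy: Policy = [
    ("restart_service", lambda: False),
    ("notify_administrator", lambda: False),  # notifying does not repair anything
]
```

Calling `apply_policy(example_policy)` runs the restart first and reaches the notification step only because the restart did not restore the resource, which mirrors the "one recovery attempt before issuing a notification" example.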

When the intelligent watchdog monitor escalates an issue, it commonly takes actions such as:

  • running a batch file
  • sending a console message
  • sending a text message
  • sending an email message
  • sending an SNMP trap
  • logging a ticket in a ticketing and event monitoring system
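
One way to organize such actions is a registry that maps an action name to a handler. In the sketch below every handler is a stub that records the message in a list, standing in for a real batch runner, mail server, SNMP agent, or ticketing system; the `ACTIONS` registry, the action names, and the `escalate` function are all hypothetical.

```python
from typing import Callable, Dict, List

# Stands in for real delivery channels so the sketch has no side effects.
sent: List[str] = []


def make_stub(channel: str) -> Callable[[str], None]:
    """Build a placeholder handler that only records what it would send."""
    def handler(message: str) -> None:
        sent.append(f"{channel}: {message}")
    return handler


# Hypothetical registry, one entry per escalation action listed above.
ACTIONS: Dict[str, Callable[[str], None]] = {
    name: make_stub(name)
    for name in ("batch_file", "console", "text_message",
                 "email", "snmp_trap", "ticket")
}


def escalate(action_names: List[str], message: str) -> None:
    """Invoke each configured escalation action in the given order."""
    for name in action_names:
        handler = ACTIONS.get(name)
        if handler is None:
            raise ValueError(f"unknown escalation action: {name}")
        handler(message)
```

Keeping the handlers behind a registry means a policy can name the actions it wants as plain strings, and a real handler can replace a stub without changing the escalation logic.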

A variety of programs and products can act as an intelligent watchdog monitor. Most can be integrated with standard ticketing and event management systems.

Mechanisms

  • Audit Monitor – This mechanism may be required to ensure that the way this pattern is carried out at runtime complies with any related legal or policy requirements.
  • Cloud Usage Monitor – Various specialized cloud usage monitors may be involved with monitoring and collecting IT resource usage data as part of failure conditions and recovery, notification, and escalation activity.
  • Failover System – Failover is fundamental to the application of this pattern, as the failover system mechanism is generally utilized during the initial attempts to recover failed IT resources.
  • SLA Management System and SLA Monitor – The functionality introduced by the application of the Dynamic Failure Detection and Recovery pattern is closely associated with SLA guarantees and therefore commonly relies on the information managed and processed by these mechanisms.