High Availability

What is High Availability?
Inherent HA
Standard or custom "hot plugging"
PCI Hot Plug support
Custom hardware support
Client library
High Availability Manager
HAM and the Guardian
HAM hierarchy
HAM as a "filesystem"
Multistage recovery
HAM API

What is High Availability?

The term High Availability (HA) is commonly used in telecommunications and other industries to describe a system's ability to remain up and running without interruption for extended periods of time. The celebrated "five nines" availability metric refers to the percentage of uptime a system can sustain in a year -- 99.999% uptime amounts to about five minutes of downtime per year.
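The "five nines" arithmetic is easy to check. The following is a minimal sketch, assuming a 365-day year; the function name downtime_minutes_per_year() is purely illustrative and isn't part of any QNX library.

```c
#include <assert.h>

/* Minutes of permitted downtime per year for a given availability
 * percentage, assuming a 365-day year (525,600 minutes). */
double downtime_minutes_per_year(double availability_percent)
{
    const double minutes_per_year = 365.0 * 24.0 * 60.0;

    assert(availability_percent >= 0.0 && availability_percent <= 100.0);
    return minutes_per_year * (100.0 - availability_percent) / 100.0;
}
```

Calling downtime_minutes_per_year(99.999) yields roughly 5.26 minutes, while "three nines" (99.9) allows roughly 525.6 minutes, or close to nine hours.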

An effective HA solution involves hardware and software components that work together to form a stable, working system. Assuming reliable hardware components with sufficient redundancy, how can an OS best remain stable and responsive when a particular component or application program fails? And in cases where redundant hardware may not be an option (e.g. consumer appliances), how can the OS itself support HA?

An OS for HA

If you had to design an HA-capable OS from the ground up, would you start with a single executable environment? In this simple, high-performance design, all OS components, device drivers, and applications would run without memory protection in kernel mode.

On second thought, maybe such an OS wouldn't be suited for HA, simply because if a single software component were to fail, the entire system would crash. And if you wanted to add a software component or otherwise modify the HA system, you'd have to take the system out of service to do so. In other words, the conventional realtime executive architecture wasn't built with HA in mind.

Suppose, then, that you base your HA-enabled OS on a separation of kernel space and user space, so that all applications would run in user mode and enjoy memory protection. You'd even be able to upgrade an application without incurring any downtime.

So far so good, but what would happen if a device driver, filesystem manager, or other essential OS component were to crash? Or what if you needed to add a new driver to a live system? You'd have to rebuild and restart the kernel. Based on such a monolithic kernel architecture, your HA system wouldn't be as available as it should be.

Inherent HA

A true microkernel that provides full memory protection is inherently the most stable OS architecture. Very little code runs in kernel mode, so very little can cause the kernel itself to fail. And individual processes, whether applications or OS services, can be started and stopped dynamically, without jeopardizing system uptime.

QNX inherently provides several key features that are well-suited for HA systems:

System stability through full memory protection for all OS and user processes.
Dynamic loading and unloading of system components (device drivers, filesystem managers, etc.).
Separation of all software components for simpler development and maintenance.

While any claims regarding "five nines" availability on the part of an OS must be viewed only in the context of the entire hardware/software HA system, one can always ask whether an OS truly has the appropriate underlying architecture capable of supporting HA.

HA-specific modules

Apart from its inherently robust architecture, QNX also provides several components to help developers simplify the task of building and maintaining effective HA systems:

HA client-side library -- cover functions that allow for automatic and transparent recovery mechanisms for failed server connections.
HA Manager -- a "smart watchdog" that can perform multistage recovery whenever system services or processes fail.
Hot-plug Manager -- supports CompactPCI as well as custom hardware.

Standard or custom "hot plugging"

In an HA environment, you may need to be able to replace or add certain hardware components dynamically, without suspending a server or your running applications. QNX not only supports the standard hardware-specific hot-plugging method (i.e. PCI Hot Plug), but also delivers the inherent flexibility you'll need if you're using a non-PCI chassis in your HA system.

PCI Hot Plug support

PCI Hot Plug is an industry-standard technology for removing, replacing, or adding PCI adapters on a live system. To benefit from PCI Hot Plug, you need the following:

A PCI Hot Plug-capable server.
A PCI Hot Plug-enabled OS.
A PCI Hot Plug-capable driver for each adapter that needs to be hot-plugged.

Custom hardware support

While many operating systems provide HA support in a hardware-specific way (e.g. via PCI Hot Plug), QNX isn't tied to PCI. Your particular HA system may be built on a custom chassis, in which case an OS that offers a PCI-based HA "solution" may not address your needs at all.

Client library

The HA client-side library provides drop-in enhancements for many standard C library I/O operations. Its cover functions recover automatically and transparently from failed connections, wherever recovery is possible in an HA scenario. Note that the HA library is both thread-safe and cancellation-safe.

The main principle of the client library is to provide drop-in replacements for all the message-delivery functions (i.e. MsgSend*). A client can select which particular connections it would like to make highly available, thereby allowing all other connections to operate as ordinary connections (i.e. in a non-HA environment).

Normally, when a server that the client is talking to fails, or if there's a transient network fault, the MsgSend* functions return an error indicating that the connection ID (or file descriptor) is stale or invalid (EBADF). But in an HA-aware scenario, these transient faults are recovered from almost immediately, thus making the services available again.
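The idea of a cover function can be sketched with ordinary POSIX calls. The following is only a model of the technique, not the HA library's implementation: the cover_conn type, the recover_fn callback, cover_read(), and reopen_path() are all hypothetical names introduced for illustration, and the sketch assumes that a single recovery attempt per failed read is enough.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical recovery callback: given the stale descriptor and a
 * caller-supplied handle, return a fresh descriptor or -1. */
typedef int (*recover_fn)(int stale_fd, void *handle);

/* Hypothetical per-connection record (not the HA library's own type). */
typedef struct {
    int        fd;       /* current descriptor */
    recover_fn recover;  /* called when fd turns out to be stale */
    void      *handle;   /* opaque state passed to recover */
} cover_conn;

/* Example recovery callback: treat the handle as a path and reopen it. */
int reopen_path(int stale_fd, void *handle)
{
    (void)stale_fd;  /* already invalid, so there is nothing to close */
    return open((const char *)handle, O_RDONLY);
}

/* Cover for read(): on EBADF, run the recovery callback once and retry. */
ssize_t cover_read(cover_conn *c, void *buf, size_t len)
{
    ssize_t n;

    assert(c != NULL && c->recover != NULL);
    n = read(c->fd, buf, len);
    if (n < 0 && errno == EBADF) {
        int newfd = c->recover(c->fd, c->handle);
        if (newfd < 0) {
            errno = EBADF;          /* recovery failed: report the stale fd */
            return -1;
        }
        c->fd = newfd;
        n = read(c->fd, buf, len);  /* retry on the recovered connection */
    }
    return n;
}
```

The caller keeps using cover_read() as if it were read(); a stale descriptor is replaced behind the scenes, and an error surfaces only when recovery itself fails.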

Recovery example

The following example demonstrates a simple recovery scenario, where a client opens a file across a network file system. If the NFS server were to die, the HA Manager would restart it and remount the filesystem. Normally, any clients that previously had files open across the old connection would now have a stale connection handle. But if the client uses the HA_attach functions, it can recover from the lost connection.

The HA_attach functions allow the client to provide a custom recovery function that's automatically invoked by the cover-function library. This recovery function could simply reopen the connection (thereby getting a connection to the new server), or it could perform a more complex recovery (e.g. adjusting the file position offsets and reconstructing its state with respect to the connection). This mechanism thus lets you develop arbitrarily complex recovery scenarios, while the cover-function library takes care of the details (detecting a failure, invoking recovery functions, and retransmitting state information).

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <halibc.h>

#define TESTFILE "/net/machine99/home/test/testfile"

typedef struct handle {
  int nr;
  int curr_offset;
} Handle ;

int recover_conn(int oldfd, void *hdl)
{
  int newfd;
  Handle *thdl;
  thdl = (Handle *)hdl;
  newfd = HA_reopen(oldfd, TESTFILE, O_RDONLY);
  if (newfd >= 0) {
    // adjust file offset to previously known point
    lseek(newfd, thdl->curr_offset, SEEK_SET); 
    // increment our count of successful recoveries
    (thdl->nr)++;
  }
  return(newfd);
}

int main(int argc, char *argv[])
{
  int status;
  int fd;
  int fd2;
  Handle hdl;
  char buf[80];

  hdl.nr = 0;
  hdl.curr_offset = 0;
  // open a connection
  // recovery will be using "recover_conn", and "hdl" will
  // be passed to it as a parameter
  fd = HA_open(TESTFILE, O_RDONLY, recover_conn, (void *)&hdl, 0);
  if (fd < 0) {
    printf("could not open file\n");
    exit(-1);
  }
  status = read(fd,buf,15);
  if (status < 0) {
    printf("error: %s\n",strerror(errno));
    exit(-1);
  }
  else {
    hdl.curr_offset += status;
  }
  fd2 = HA_dup(fd);
  // fs-nfs2 fails and is restarted; the network mounts
  // are reinstated at this point.
  // Our previous "fd" to the file (and its dup, "fd2") is now stale
  sleep(18);
  // reading from the stale fd fails at first; the library
  // detects the failure and recovers via recover_conn
  status = read(fd,buf,15);
  if (status < 0) {
    printf("error: %s\n",strerror(errno));
    exit(-1);
  }
  else {
    hdl.curr_offset += status;
  }
  printf("total recoveries, %d\n",hdl.nr);
  HA_close(fd);
  HA_close(fd2);
  exit(0);
}

Since the cover-function library takes over the lowest-level MsgSend*() calls, most standard library functions (read(), write(), printf(), scanf(), etc.) are also automatically HA-aware. The library also provides an HA_dup() function, which is semantically equivalent to the standard dup() function in the context of HA-aware connections. You can replace recovery functions during the lifetime of a connection, which greatly simplifies the task of developing highly customized recovery mechanisms.

High Availability Manager

The QNX High Availability Manager (HAM) provides a mechanism for monitoring system services and processes. The goal is to establish a highly resilient manager (or "smart watchdog") process that can perform multistage recovery whenever system services or processes fail or no longer respond.

HAM and the Guardian

As a self-monitoring manager, HAM is resilient to internal failures. If, for whatever reason, HAM itself is stopped abnormally, it can immediately and completely reconstruct its own state. A mirror process called the Guardian perpetually stands ready and waiting to take over HAM's role. Since all state information is maintained in shared memory, the Guardian can assume the exact same state that the original HAM was in before the failure.

But what happens if the Guardian terminates abnormally? When the Guardian takes over (becoming the new HAM), it creates a new Guardian for itself before taking the place of the original HAM, so a standby is always in place. Practically speaking, therefore, one can't exist without the other.

Since the HAM/Guardian pair monitor each other, the failure of either one can be completely recovered from. The only way to stop HAM is to explicitly instruct it to terminate the Guardian and then to terminate itself.

HAM hierarchy

HAM consists of three main components:

Entities
Conditions
Actions

Entities

Entities are the fundamental units of observation/monitoring in the system. Essentially, an entity is a process, uniquely identifiable by its process ID (pid). Each entity also has a symbolic name that can be used to refer to it. These names must be unique across the system; because managers are currently associated with a node, the uniqueness rule applies per node. As we'll see later, this uniqueness requirement is very similar to the naming scheme used in a hierarchical filesystem.

There are two fundamental entity types:

Self-attached entities -- processes that choose to send heartbeats to the HAM, which will then monitor them for failure. Self-attached entities can, on their own, decide at exactly what point in their lifespan they want to be monitored, what conditions they want acted upon, and when they want to stop the monitoring. In other words, this is a situation where a process says, "Do the following if I die."
Externally attached entities -- generic processes in the system that are being monitored. These could be arbitrary daemons/service providers whose health is deemed important. This method is useful for the case where Process A says, "Tell me when Process B dies" but Process B needn't know about this at all.

Conditions

Conditions are associated with entities. These conditions represent the state of the entity. Here are some examples of conditions:

Condition	Description
ConditionDeath	The entity has died.
ConditionUnresponsive	The entity is no longer responding.
ConditionHeartbeatMissed	The entity was supposed to send "heartbeat" messages at specific intervals, but has missed sending one or more heartbeats.
ConditionResourceHog	The entity is consuming too many specific resources.
ConditionDetach	The entity that was being monitored is detaching. This ends HAM's monitoring of that entity.
ConditionRestart	The entity was restarted. This condition is true after the entity is successfully restarted.

Each condition also has a symbolic name, which must be unique within its entity.
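The heartbeat-missed condition can be pictured as a counter of consecutive intervals that pass without a heartbeat. The following is a toy model, not HAM's implementation: the hb_monitor type, both function names, and the idea of separate "low" and "high" thresholds expressed as miss counts are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum hb_state { HB_OK, HB_MISSED_LOW, HB_MISSED_HIGH };

typedef struct {
    unsigned missed;          /* consecutive intervals without a heartbeat */
    unsigned low_threshold;   /* misses that raise the "low" condition */
    unsigned high_threshold;  /* misses that raise the "high" condition */
} hb_monitor;

/* Called when the monitored process sends a heartbeat:
 * the count of consecutive misses starts over. */
void hb_beat(hb_monitor *m)
{
    assert(m != NULL);
    m->missed = 0;
}

/* Called by the monitor each time an interval ends with no heartbeat.
 * Returns the most severe condition that now applies. */
enum hb_state hb_interval_missed(hb_monitor *m)
{
    assert(m != NULL && m->low_threshold <= m->high_threshold);
    m->missed++;
    if (m->missed >= m->high_threshold)
        return HB_MISSED_HIGH;
    if (m->missed >= m->low_threshold)
        return HB_MISSED_LOW;
    return HB_OK;
}
```

With thresholds of 2 and 4, the first missed interval reports HB_OK, the second and third report HB_MISSED_LOW, and the fourth reports HB_MISSED_HIGH; any heartbeat resets the sequence.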

Actions

Actions are associated with conditions. Actions are executed when the appropriate conditions are true with respect to a specific entity. There are several different kinds of actions:

Action	Description
ActionRestart	This action restarts the entity.
ActionExecute	Executes an arbitrary command (e.g. to start a process).
ActionNotifyPulse	Notifies some process that this condition has occurred. This notification is sent using a specific pulse with a value specified by the process that wished to receive this notify message.
ActionNotifySignal	Notifies some process that this condition has occurred. This notification is sent using a specific realtime signal with a value specified by the process that wished to receive this notify message.
ActionWaitfor	This action lets you insert delays between consecutive actions in a sequence. You can also wait for certain names to appear in the namespace.
ActionLog	Report this condition to a logging mechanism.

Actions are also associated with symbolic names, which are unique within a specific condition.

HAM as a "filesystem"

Effectively, HAM's internal state is like a hierarchical filesystem, where entities are like directories, conditions associated with those entities are like subdirectories, and actions inside those conditions are like leaf nodes of this tree structure.

HAM also presents this state as a read-only filesystem under /proc/ham. As a result, arbitrary processes can also view the current state (e.g. you can do ls /proc/ham).

The /proc/ham filesystem presents a lot of information about the current state of the system's entities. It also provides useful statistics on heartbeats, restarts, and deaths, giving you a snapshot in time of the system's various entities, conditions, and actions.
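Because this state is exposed as a filesystem, a program can inspect it with the standard POSIX directory calls, just as ls does. The following is a minimal sketch; list_dir() is an illustrative helper, and the layout of the entries beneath /proc/ham is not assumed here.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Print the entries of a directory, skipping "." and "..".
 * Returns the number of entries printed, or -1 if the directory
 * can't be opened. */
int list_dir(const char *path)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(path != NULL);
    dir = opendir(path);
    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        printf("%s/%s\n", path, ent->d_name);
        count++;
    }
    closedir(dir);
    return count;
}
```

On a node where HAM is running, list_dir("/proc/ham") would print one line per top-level entry; on a system without HAM it simply returns -1.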

Multistage recovery

HAM can perform a multistage recovery, executing several actions in a certain order. This technique is useful whenever strict dependencies exist between various actions in a sequence. In most cases, recovery requires more than a single restart mechanism in order to properly restore the system's state to what it was before a failure.

For example, suppose you've started fs-nfs2 (the NFS filesystem) and then mounted a few directories from multiple sources. You can instruct HAM to restart fs-nfs2 upon failure, and also to remount the appropriate directories as required after restarting the NFS process.

As another example, suppose io-net (network I/O manager) were to die. We can tell HAM to restart it and also to load the appropriate network drivers (and maybe a few more services that essentially depend on network services in order to function).
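The ordering requirement can be modeled as a list of actions that run strictly in sequence. The following is a self-contained sketch that doesn't use the HAM library: the action type, run_recovery(), and the two example steps are hypothetical, and stopping at the first failed action is an assumption of this model rather than documented HAM behavior.

```c
#include <assert.h>
#include <stddef.h>

/* One recovery step; returns 0 on success, nonzero on failure. */
typedef int (*action_fn)(void *arg);

typedef struct {
    const char *name;  /* symbolic name, unique within the condition */
    action_fn   run;
    void       *arg;
} action;

/* Example step that succeeds and counts how often it ran. */
int step_ok(void *arg)
{
    ++*(int *)arg;
    return 0;
}

/* Example step that always fails. */
int step_fail(void *arg)
{
    (void)arg;
    return 1;
}

/* Run the actions in order, stopping at the first failure.
 * Returns the number of actions that completed successfully. */
size_t run_recovery(const action *actions, size_t n)
{
    size_t i;

    assert(actions != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (actions[i].run(actions[i].arg) != 0)
            break;
    }
    return i;
}
```

In the fs-nfs2 example, the first action would restart the filesystem process and the second would remount the directories; in this model the remount never runs if the restart fails.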

HAM API

Programs talk to HAM through its API, which is implemented as a library that you can link against. The library is thread-safe as well as cancellation-safe.

To control exactly what you monitor and how, the HAM API provides a collection of functions, including:

Function	Description
HamConnect()	Open a connection to HAM.
HamDisconnect()	Close a connection to HAM.
HamAttachSelf()	Attach to HAM (used by an entity that wants to be monitored).
HamDetachSelf()	Detach from HAM (used when a self-attached entity wants to stop being monitored).
HamHeartbeat()	Send a heartbeat event to HAM (used by self-attached entities).
HamAttach()	Attach to HAM (used by an entity to monitor a different entity).
HamDetach()	Detach from HAM (used to stop monitoring an entity).
HamConditionDeath()	Perform certain actions when a monitored entity terminates abnormally.
HamConditionDetach()	Perform certain actions when a monitored entity properly detaches from HAM.
HamConditionHBeatMissedHigh()	Perform certain actions when a heartbeating entity misses a predefined number of heartbeats specified for a condition of "high" severity.
HamConditionHBeatMissedLow()	Perform certain actions when a heartbeating entity misses a predefined number of heartbeats specified for a condition of "low" severity.
HamConditionRestart()	Perform certain actions when an entity has died and been restarted.
HamActionRestart()	Restart an entity when it dies.
HamActionExecute()	Execute a command line upon a certain condition.
HamActionNotifyPulse()	Send a pulse to a process upon a certain condition.
HamActionNotifySignal()	Send a realtime signal to a process upon a certain condition.
HamActionWaitfor()	Insert arbitrary delays into a sequence of actions.
HamStop()	Instruct HAM to stop. HAM first terminates the Guardian and then terminates itself.