Primary Monitor

Purpose

The purpose of the primary monitor is to:

  • Continually ping the notifier connections to the primary

  • Process events from the module monitor

  • Verify that the module monitor is running

  • Provide diagnostics of the remote notifier connections

  • Determine the network connection status of the backup

  • Determine the network connection status to the primary

Responsibilities

The primary monitor continually monitors the main and auxiliary remote notifier connections to the primary by sending pings to the notifier on the primary. If any ping times out or is otherwise not acknowledged, the primary monitor starts to diagnose the state of the network and connections. It posts events to the switchover state machine if there are issues with the connections or the network.

It also receives notifications from the module monitor that routinely pings the module being monitored on the primary. The module monitor will monitor the TsServer module or the IP module depending on the type of installation. The TsServer module is typically the module being monitored in a PureConnect Cloud environment.

Note:
In this document, the term module refers to either of the modules.

The primary monitor tracks the time between events received from the module monitor. It does this in order to provide verification that the module monitor has not stopped for an unintended reason. When any event is received from the module monitor, the primary monitor resets the internal value that tracks the time of the last event. The state machine will poll the primary monitor at regular intervals to get the status of the module monitor. When the primary monitor receives this poll request from the state machine, it calculates the amount of time since the last module monitor event and determines if the duration is past the maximum allowed. The calculation of the maximum event interval is internal and based on the frequency of the pings sent by the module monitor to the primary. The primary monitor suspends tracking the intervals between module monitor events when reconnecting and resumes it when the connections are re-established.
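The event-interval tracking described above can be sketched as a small watchdog. This is a minimal illustration, not the product's implementation: the class name, method names, and the `max_event_interval` argument are hypothetical, the maximum interval is assumed to be supplied already computed, and restarting the interval on resume is an assumption.

```python
import time


class ModuleMonitorWatchdog:
    """Tracks the time since the last module monitor event (hypothetical names)."""

    def __init__(self, max_event_interval, clock=time.monotonic):
        # max_event_interval is in seconds and assumed to be precomputed.
        self._max_event_interval = max_event_interval
        self._clock = clock
        self._last_event = clock()
        self._suspended = False

    def on_module_monitor_event(self):
        # Any event from the module monitor resets the tracked time.
        self._last_event = self._clock()

    def suspend(self):
        # Tracking is suspended while reconnecting.
        self._suspended = True

    def resume(self):
        # Tracking resumes once the connections are re-established.
        # Restarting the interval here is an assumption of this sketch.
        self._suspended = False
        self._last_event = self._clock()

    def is_module_monitor_alive(self):
        # Called when the state machine polls for the module monitor status.
        if self._suspended:
            return True
        elapsed = self._clock() - self._last_event
        return elapsed <= self._max_event_interval
```

Passing the clock in as a parameter keeps the elapsed-time check easy to exercise without waiting for real time to pass.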

If the module monitor indicates that the module is down, meaning that the module monitor exhausted its number of retries, the primary monitor will evaluate the network and connections.

When the primary monitor detects that there are issues with the network or connections, it will stop the module monitor in order to avoid receiving continual notifications from the module monitor and to prevent the module monitor from causing a switchover event before the primary monitor has been able to analyze the situation.

The primary monitor is only used on the backup.

Startup

The primary monitor is started by the system monitor (SysMonitor2), which passes the service parameters queried during switchover's initialization as a backup. The parameters include:

  • Timeouts, delays, and ping counts

  • The name of the module being monitored

  • The name of the primary

  • A callback into the system monitor

  • The list of NetTest addresses, if configured

During startup, the primary monitor queries the system to obtain the local gateway addresses of the backup. It uses standard Windows APIs to query information about all network adapters. Each adapter query returns the interface and gateway addresses. An adapter may have no gateway address, a single gateway address, or multiple gateway addresses. The primary monitor will store all gateway addresses returned and ping each one when determining network connectivity of the backup. Local host or empty addresses are excluded from the list.
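The filtering step can be illustrated with a short sketch. It assumes the adapter query results have already been collected into a list of dictionaries (a hypothetical shape, not the Windows API's own structures), and it treats the unspecified address `0.0.0.0` and unparseable entries like empty ones, which is an assumption.

```python
import ipaddress


def collect_gateway_addresses(adapters):
    """Return the gateway addresses to ping, skipping empty and local host entries.

    `adapters` is a hypothetical list of dicts such as
    {"interface": "10.0.0.5", "gateways": ["10.0.0.1"]}.
    """
    gateways = []
    for adapter in adapters:
        # An adapter may report zero, one, or several gateway addresses.
        for address in adapter.get("gateways", []):
            address = address.strip()
            if not address:
                continue  # empty address
            try:
                parsed = ipaddress.ip_address(address)
            except ValueError:
                continue  # not a valid IP address; skipped in this sketch
            if parsed.is_loopback or parsed.is_unspecified:
                continue  # local host, or 0.0.0.0 treated as empty (assumption)
            gateways.append(address)
    return gateways
```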

The primary monitor will also attempt to get the address of the server that the primary is running on. It uses this address to test if the primary's server is reachable. If the address cannot be retrieved during startup, the primary monitor will try to get it when checking the network connectivity.

The following diagram shows the startup process.

How the Primary Monitor Determines the Status of the Network and the Connection

When the primary monitor does not receive a ping reply from the notifier on the primary, it checks the following to detect the location of the problem:

  • Status of the remote notifier connection

  • Network status of the backup

  • Network status of the primary

Detecting Status of the Remote Notifier Connections

The primary monitor first examines both of the remote notifier connections. It determines the status of each connection by checking its connected state directly. Notifier on the backup maintains state information on each connection, and the primary monitor (as well as any other switchover component) can check the current status of any connection. The connected state value is used to evaluate whether each connection is up or down.

If either connection is up, the primary monitor attempts to use it to check the status of notifier on the primary. If a closed connection is not re-established or if a connection was up and then goes down, the primary monitor then examines the network.

Detecting Network Connection Status of the Backup and Primary

To identify where there is a network issue, the primary monitor attempts the following steps:

  1. First, the primary monitor pings NetTest addresses (if configured).

  2. Next, the primary monitor pings the backup's local gateway (if enabled in the configuration).

  3. Throughout the entire process, the primary monitor pings the server that the primary is running on (not any CIC products).

The primary monitor pings the local gateway to determine if the backup is still connected to the network. There can be multiple gateways configured for each adapter on the backup. During startup, the primary monitor attempts to enumerate all of the gateways for the adapters on the backup and stores all gateway addresses to use when pinging.

Pinging the backup's gateway is optional and is disabled by default. To enable the gateway ping, set the Switchover Disable Gateway Ping server parameter to No or 0. To disable the gateway ping, set the parameter to Yes or 1. If you do not set the parameter, the gateway ping is disabled because some gateways block responses to pings (ICMP echoes) as a matter of security.

If gateway pings are enabled and the primary monitor attempts to ping the primary's server but the gateway blocks ping replies, the primary monitor receives a false indication that the backup is not connected to the network. Because the primary monitor has no way to know how the gateway is configured, it is important to disable gateway pinging if the gateway does not allow ping responses.

However, gateway pinging is extremely valuable in determining the network connection state of the backup. If the backup's gateway responds to pings on the local network, it is strongly suggested that the backup be configured to enable gateway pinging. It is not unreasonable for the backup's gateway to respond to local pings since the backup should be on a private network.
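Because the parameter is phrased as a "disable" switch, the mapping from its value to the resulting behavior is easy to misread. The sketch below spells it out. The function name is hypothetical, `None` stands for "parameter not set", and the case-insensitive comparison and the handling of unrecognized values are assumptions.

```python
def gateway_ping_enabled(disable_gateway_ping):
    """Interpret the Switchover Disable Gateway Ping value; None means not set."""
    if disable_gateway_ping is None:
        return False  # not set: the gateway ping stays disabled
    value = str(disable_gateway_ping).strip().lower()
    if value in ("no", "0"):
        return True  # "do not disable" means the gateway ping is enabled
    # "Yes", "1", or anything unrecognized (assumption) keeps it disabled.
    return False
```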

The steps to check the network connections are taken in the following order based on the configuration of switchover on the backup. Once any step is taken, detection of the backup's network connection status is halted and no further steps are taken. The steps are:

  • Gateway pings are enabled: The primary monitor determines the backup's network connection status based on the result of pinging each gateway address. The primary monitor stops pinging the gateway addresses once a response is received or all addresses have been pinged without a response. At this point, the primary monitor stops checking the network condition and sets the network status.

    • Ping each gateway address until:

      • A ping response is received from the gateway, the network connection status is set to Gateway Reached, and the backup's network connection is good.

      • All gateway addresses have been pinged and no responses were received, the network connection status is set to Gateway Unreachable and the backup's network connection is not good.

    • Network connection evaluation is stopped by the primary monitor.

  • NetTest addresses are configured: The primary monitor makes a determination of the backup's network connection status based on the result of pinging each NetTest address. The primary monitor stops pinging the NetTest addresses once a response is received or all addresses have been pinged without a response. At this point, the primary monitor stops checking the network condition and sets the network status.

    • Ping each NetTest address until:

      • A ping response is received from a NetTest address, the network connection status is set to NetTest Reached, and the backup's network connection is good.

      • All NetTest addresses have been pinged and no responses were received, the network connection status is set to NetTest Unreachable, and the backup's network connection is not good.

    • Network connection evaluation is stopped by the primary monitor.

  • Ping the address of the primary's server: The primary monitor stores the address of the primary's server during startup. If it was not able to get the primary server's address, it attempts to do so in this step. This is a standard ICMP echo sent to the IP address of the primary and is independent of any CIC application. This ping is used by the primary monitor to detect if the server running the primary is reachable.

    • If the primary server's address is not stored and the primary monitor is able to get it, ping the primary's server address and if:

      • A ping response is received from the primary's server address, the network connection status is set to Primary Reached and the backup's network connection is good. This indicates that there is an issue with the primary (CIC).

      • A ping response is not received from the primary's server address, the network connection status is set to Primary Unreachable and the backup's network connection is not good.

        • This indicates one or both of the following conditions:

          • There is a connection issue between the backup's network and the primary's network (when they're on separate networks such as in the case of a WAN configuration).

          • The primary's server is down.

        • When this happens, the primary monitor cannot determine if any of the primary's components have failed and will do one of two things:

          • Wait for a period of time or indefinitely (depending on the configuration parameters) for the primary to become reachable again.

          • Immediately switchover to become the primary.

        • The action taken when the primary is unreachable can be selected using server parameters. The default is to switch over immediately. For more information on the Unreachable Primary Ping Count and Switchover Unreachable Primary Ping Delay parameters, see Switchover Server parameters. (The Switchover Unreachable Primary Ping Delay parameter is used to select immediate switchover, timed switchover, or no switchover if the primary is not reachable.)

    • Network connection evaluation is stopped by the primary monitor.
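The evaluation order in the list above can be sketched as a single function that returns as soon as one step has been taken. This is an illustration under stated assumptions: it follows the order of the bulleted list (gateway, then NetTest, then the primary's server), the function and parameter names are hypothetical, and `ping` is a caller-supplied callable that returns True when an address replies.

```python
from enum import Enum


class NetworkStatus(Enum):
    GATEWAY_REACHED = "Gateway Reached"
    GATEWAY_UNREACHABLE = "Gateway Unreachable"
    NETTEST_REACHED = "NetTest Reached"
    NETTEST_UNREACHABLE = "NetTest Unreachable"
    PRIMARY_REACHED = "Primary Reached"
    PRIMARY_UNREACHABLE = "Primary Unreachable"


def any_reply(addresses, ping):
    # any() short-circuits, so pinging stops at the first address that answers.
    return any(ping(address) for address in addresses)


def check_network(ping, gateway_enabled, gateways, nettest_addresses, primary_address):
    """Return the network status from the first applicable step only."""
    if gateway_enabled:
        if any_reply(gateways, ping):
            return NetworkStatus.GATEWAY_REACHED
        return NetworkStatus.GATEWAY_UNREACHABLE
    if nettest_addresses:
        if any_reply(nettest_addresses, ping):
            return NetworkStatus.NETTEST_REACHED
        return NetworkStatus.NETTEST_UNREACHABLE
    # primary_address is None when the address could not be obtained.
    if primary_address is not None and ping(primary_address):
        return NetworkStatus.PRIMARY_REACHED
    return NetworkStatus.PRIMARY_UNREACHABLE
```

Only the three "Reached" values correspond to a good network connection for the backup.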

If there is a connection issue and any of the following conditions exist:

  • The primary is reachable

  • The server parameters aren't configured to switch over immediately when the primary is unreachable

  • Pinging the gateway is disabled

The following actions are taken:

  1. The primary monitor stops the module monitor.

  2. The switchover state machine enters the reconnect state.

  3. A timer is scheduled that, upon execution, posts an event to the switchover state machine to initiate a switchover. This timer interval is the maximum amount of time that the Switchover system tries to reconnect to the primary before initiating a switchover. This value is set with the Switchover Reconnect Timeout server parameter. For more information, see Switchover Server parameters.

  4. The switchover state machine attempts to restore the connections to the primary.

If the module monitor has notified the primary monitor that it cannot communicate with the monitored module, the primary monitor takes steps to examine the network and remote notifier connections as outlined above. If they are good, it indicates that the monitored module is down. In this case, the primary monitor posts an event to the switchover state machine, which initiates a switchover.

Network Status versus Connection Status

Network and connection status are determined by the primary monitor, as described previously. They are distinct and defined, in terms of switchover, as:

  • Network Status: The status of the physical network connection for the server running switchover in the backup state. It is independent of any switchover or CIC applications.

  • Connection Status: The status of both of the notifier connections between the backup and the primary. It is reflective of the network status since it depends on the physical network connection.

Whenever the primary monitor attempts to assess whether communications can occur between the backup and primary, the physical network connection is the determining factor; without the network connection, the notifier connections are unavailable. If the network connection is available, then the connection status provides further information about the state of the notifier connections. The connection status is one of the following:

  • Both connections are up.

  • The main connection is up (auxiliary connection is down).

  • The auxiliary connection is up (main connection is down).

  • Both connections are down.
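The four connection statuses follow directly from the connected state of the two notifier connections. A minimal sketch, with hypothetical enum and function names:

```python
from enum import Enum


class ConnectionStatus(Enum):
    BOTH_UP = "Both connections are up"
    MAIN_UP = "The main connection is up"
    AUXILIARY_UP = "The auxiliary connection is up"
    BOTH_DOWN = "Both connections are down"


def connection_status(main_up, auxiliary_up):
    """Combine the connected state of the main and auxiliary connections."""
    if main_up and auxiliary_up:
        return ConnectionStatus.BOTH_UP
    if main_up:
        return ConnectionStatus.MAIN_UP
    if auxiliary_up:
        return ConnectionStatus.AUXILIARY_UP
    return ConnectionStatus.BOTH_DOWN
```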

Detection of the connection status includes:

  • Checking the network connections

  • Checking the main and auxiliary connections

  • Determining if the primary's server is reachable

  • Checking for no-response condition from the monitored module

If the network connection is good, the connections are checked. When the connections are up, the monitor's response status is evaluated. The primary reachable status is examined if there is an issue with the network connections to determine if there's a general network problem or just the path to the primary's server. Depending on the results of all of these checks, an event may be posted to the state machine indicating a condition that requires further reconnect processing. The diagram below illustrates the logic that determines the overall connection status.

Note that the primary is reachable even if the network connections are not good. This is because the definition of a good network connection includes whether or not the primary's server is reachable. If it is not, the network connection's status is primary unreachable and this is not a good condition.

This processing occurs when the primary monitor detects that it did not receive a ping response from the notifier on the primary within the allotted amount of time or it detected a connection loss. When that happens, the primary monitor uses this logic to make a preliminary decision about what to do next. You can see in the backup state diagram how these events are handled. In the case of a primary unreachable determination, the state machine transitions to the reconnect state and tries to restore connections with the primary.

Handling Module Monitor Notifications

The primary monitor receives all notifications from the module monitor. If a notification indicates an issue with the module, network, or connections, then the primary monitor posts an event to the state machine, if required. The primary monitor can then make one of the following determinations:

  • Network NOT OK:

    • Primary Reachable: The server running the primary can be reached, so the remote notifier is down.

    • Primary Unreachable: The server running the primary cannot be pinged, so the primary is unreachable.

  • Network OK:

    • Both Connections Down: The main and auxiliary remote connections are down and the primary can be reached so the remote notifier is down.

    • Both Connections Up:

      • The Module is Down: The module cannot be pinged so the module is down.

      • The Module is Up: The module monitor indicated that the module was down but subsequent pings were answered so the module is considered to be up and there is no error.

The following flowchart illustrates this decision-making process.
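The determinations listed above can also be written as a small decision function. This is a sketch only: the names are hypothetical, and because the list covers only the "both connections down" and "both connections up" cases, the sketch models the connections with a single boolean rather than guessing how a single surviving connection is handled.

```python
from enum import Enum


class Determination(Enum):
    REMOTE_NOTIFIER_DOWN = "remote notifier is down"
    PRIMARY_UNREACHABLE = "primary is unreachable"
    MODULE_DOWN = "module is down"
    NO_ERROR = "module is up, no error"


def classify(network_ok, primary_reachable, both_connections_up, module_answers_ping):
    """Map the check results onto the determinations in the list."""
    if not network_ok:
        if primary_reachable:
            return Determination.REMOTE_NOTIFIER_DOWN
        return Determination.PRIMARY_UNREACHABLE
    if not both_connections_up:
        # Network is OK and the primary can be reached, so the notifier is down.
        return Determination.REMOTE_NOTIFIER_DOWN
    if module_answers_ping:
        # Subsequent pings were answered, so the earlier report is not an error.
        return Determination.NO_ERROR
    return Determination.MODULE_DOWN
```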

Sending Ping Requests

The primary monitor sends ping requests to the notifier on the primary at the specified ping delay interval. The primary monitor schedules a callback whose timeout is the ping delay interval. During that callback, a request is sent to the remote notifier which acts like a network ping except that the response is from the remote notifier itself.

Successful Ping Response

A successful ping response occurs when the notifier on the primary sends a response within the timeout specified in the ping request. When the response is received successfully and before the timeout, an asynchronous ping request callback of the primary monitor is called. When a successful ping response is received, the retry count is reset and the next ping is scheduled.

After a ping has been received successfully, another ping request callback is scheduled to send the next ping after the required delay.

The primary monitor method that schedules the next ping also acts as a receiver for requests to asynchronously restart the module monitor. There are conditions where the primary monitor cannot synchronously restart the module monitor. In those cases, a request is scheduled and this method will identify those requests and restart the module monitor. The method performs no other processing after the restart.

Non-Successful Ping Responses

Other responses are considered to be non-successful and appropriate action is taken if applicable. There are callbacks for each of the unsuccessful responses:

  • Timeout: This response represents the potential loss of a remote notifier connection or an issue with the notifier on the primary. It triggers the retry logic that begins the process of sending retry pings.

  • Rejected: Similar to the timeout response. The diagram below represents both the timeout and rejected response handling.

  • Connection Loss: This response represents the potential loss of a connection to the notifier on the primary or an issue with the notifier on the primary.

    • If the main connection is up, this response could indicate that the auxiliary connection went down or there was a momentary drop of the main connection but it was recovered. In this case, a standard ping is sent.

    • If the main connection is down, the primary monitor treats this as a potential reconnect condition and sends a retry ping.

    The following diagram illustrates how this response is handled.

  • Cancelled: No operations are performed upon receipt of this response because the primary monitor will cancel any outstanding pings when it is stopping (which generates this response).

A retry ping is sent for an unsuccessful response. If the retry count has reached the maximum, no retry ping is sent and error processing starts to diagnose the state of the local network and remote connections to the primary.
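The response handling and retry counting can be summarized in a short sketch. It is illustrative only: the class, the response strings, and the returned action names are hypothetical, and it assumes that the standard ping sent after a connection loss with the main connection still up does not consume a retry.

```python
class PingRetryTracker:
    """Chooses the next action for each ping response (hypothetical names)."""

    def __init__(self, max_retries):
        self.max_retries = max_retries
        self.retry_count = 0

    def handle(self, response, main_connection_up=True):
        if response == "success":
            self.retry_count = 0  # a successful reply resets the retry count
            return "schedule_next_ping"
        if response == "cancelled":
            return "ignore"  # generated when the primary monitor is stopping
        if response == "connection_loss" and main_connection_up:
            return "send_standard_ping"  # assumed not to consume a retry
        # Timeout, rejected, or connection loss with the main connection down.
        if self.retry_count >= self.max_retries:
            return "diagnose_network_and_connections"
        self.retry_count += 1
        return "send_retry_ping"
```

For example, with `max_retries` set to 2, two consecutive timeouts each produce a retry ping, and a third starts the diagnosis of the network and connections.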