##################################################
#
# NodeWatch Node Database
#
#
#
# The basic format for an entry is
#
# node class count timeout action[:period[(d)]][[,action[:period[(d)]]] ...]
#
# The node field is the name of the node to monitor.
#
# The class field gives information about the node. The class field must
# be "r" for members of a redundant set, "c" for a core device, and "h" for
# everything else. Class "c" indicates that if that node goes down, many
# other nodes will become unreachable. When a core device is down, then
# only core device status changes are noticed. NodeWatch automatically
# checks for the presence of redundant partners when a core device changes
# state and escalates to the 'special' action if so instructed.
#
# The count field is an attribute of the status algorithm used by NodeWatch.
# A node is up if it has count consecutive reports of being alive; a node is
# down if it has count consecutive reports of being unreachable. A node is
# in flux if it has fewer than count consecutive reports of being either alive
# or unreachable. This algorithm takes into account the possibility of a node
# being reported as unreachable when it is in fact alive, but the query got
# lost somewhere on the network. That is not the whole algorithm, but enough
# of it to make sense of the count field.
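#
# As a rough illustration only (it follows the description above, not the
# full algorithm), with count set to 3 a run of query results would be
# classified like this:
#
#   query result    consecutive      status
#   alive           1 alive          in flux
#   alive           2 alive          in flux
#   alive           3 alive          up
#   unreachable     1 unreachable    in flux
#   unreachable     2 unreachable    in flux
#   unreachable     3 unreachable    down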
#
# The timeout field is how long to wait for a reply on a node status query.
# It is measured in seconds.
#
# The action with valid time period field tells NodeWatch which command to
# execute, and when it may execute it, whenever it has a message to deliver
# about that node. An action is an identifier from the action database. A
# time period is an identifier from the time period database. A single
# entry in this field is an action identifier, optionally followed by a
# colon (':') and a time period identifier. If no time period identifier is
# specified, then any time is assumed. The time period identifier can have
# a '(d)' appended to it. In that case the time period is a scheduled down
# time, and the action is the one to execute if the node violates the
# scheduled down time. Entries are separated by a comma.
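#
# For illustration only: assuming the identifiers used in the examples
# further down ('techs', 'admins', 'workday', 'offhour', 'maint') exist in
# the action and time period databases, the field
#
#   techs:workday,admins:offhour,techs:maint(d)
#
# would presumably be read as three entries:
#
#   techs:workday     execute 'techs' during the 'workday' period
#   admins:offhour    execute 'admins' during the 'offhour' period
#   techs:maint(d)    treat 'maint' as a scheduled down time and execute
#                     'techs' if the node violates it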
#
# A scheduled down time is a window in which a node can change its status
# (go up, go down) without the action being triggered. If, however, the
# node is down when it emerges from the scheduled down time, then the
# action is triggered. There are two events which, in all likelihood,
# are independent of a scheduled down time. The first is when a node is
# down on entering a scheduled down time and stays down throughout it.
# Scheduled down time actions are ignored in such cases. The second is
# when a node is down on entering a scheduled down time but comes up
# during it. Since a node action was probably executed when the node went
# down, one should probably be executed when it comes up. The recovery was
# probably coincidental, so the normal (non-scheduled down time) node
# actions are processed if conditions allow.
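#
# The cases above can be summarized as follows (a reading of this
# description, not a statement of the full implementation):
#
#   at entry   during the window   result
#   up         changes, ends up    no action
#   up         ends down           scheduled down time action triggered
#   down       stays down          scheduled down time actions ignored
#   down       comes up            normal node actions, if conditions allow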
#
##################################################
##################################################
#
# Node Class Count Timeout Action with Valid Time Period
#------------- ----- ----- ------- -----------------------------
# Example of standard nodes:
#
# Notify techs if node0 changes state, 7x24
node0 h 3 1 techs
# Notify techs and admin if node1 changes state, 7x24
node1 h 3 1 techs,admin
# Notify techs if node2 changes state 7x24 ... except during
# the 'maint' period. During the 'maint' period, don't notify
# techs ... but when the 'maint' period ends, if node2's state
# is different from when the 'maint' period started, then notify
# techs of the change
node2 h 3 1 techs:maint(d)
# Notify techs if node3 changes state during the 'workday' period, notify
# admins if node3 changes state during the 'offhour' period.
node3 h 3 1 techs:workday,admins:offhour
# These nodes are reachable only via a WAN link with tons of delay,
# so wait three seconds before declaring a ping lost. Only
# notify techs during the workday, for node5 and node6. For node7
# notify techs 7x24 ... but ... if the box changes state during the
# 'offhour' period, don't notify techs ... unless the box's state
# at the end of the 'offhour' period is different from its state
# when the box entered the 'offhour' period
node5 h 3 3 techs:workday
node6 h 3 3 techs:workday
node7 h 3 3 techs:offhour(d)
# These three nodes are members of a redundant set; notify
# techs of state changes in all three ... and if all three
# are down simultaneously, then also perform the 'special'
# action.
server-a-dns r 3 1 techs
server-b-dns r 3 1 techs
server-c-dns r 3 1 techs
# This is a core router -- a whole building depends on it. If
# it goes down, NodeWatch will enter Partition mode, in which
# NodeWatch will only perform actions for other Class "c"
# devices
router1 c 3 1 techs
# This is a pair of redundant routers; notify techs of state changes
# to both ... if one goes down, don't enter Partition mode ... only enter
# Partition mode if both are down. And if both go down, in addition
# to performing the "techs" action, also perform the "special" action
bldg-a-rtr c 2 1 techs
bldg-b-rtr c 2 1 techs
#
##################################################