NodeWatch.Nodes


 
##################################################
#
# NodeWatch Node Database
#
#
#
# The basic format for an entry is
#
#   node class count timeout action[:period[(d)]][[,action[:period[(d)]]] ...]
#
# The node field is the name of the node to monitor.
#
# The class field gives information about the node.  The class field must 
# be "r" for members of a redundant set, "c" for a core device, and "h" for 
# everything else.  Class "c" indicates that if that node goes down, many 
# other nodes will become unreachable.  When a core device is down, then 
# only core device status changes are noticed.  NodeWatch automatically 
# checks for the presence of redundant partners when a core device changes 
# state and escalates to the 'special' action if so instructed.
#
# The count field is an attribute of the status algorithm used by NodeWatch.
# A node is up if it has count consecutive reports of being alive; a node is 
# down if it has count consecutive reports of being unreachable.  A node is 
# in flux if it has fewer than count consecutive reports of being either alive 
# or unreachable.  This algorithm takes into account the possibility of a node 
# being reported as unreachable when it is in fact alive, but the query got 
# lost somewhere on the network.  That is not the whole algorithm, but enough 
# of it to make sense of the count field.
#
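# For illustration only, a trace of the rule above with count = 3, where
# A is a report of alive and U is a report of unreachable.  This sketch
# covers only the part of the algorithm described here.
#
#   report:  A     A     A     U     U     A     U     U     U
#   status:  flux  flux  up    flux  flux  flux  flux  flux  down
#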
# The timeout field is how long to wait for a reply on a node status query.
# It is measured in seconds.
#
# The action with valid time period field tells NodeWatch which command to
# execute, and when it may execute it, whenever it has a message to deliver
# about that node.  An action is an identifier from the action database.
# A time period is an identifier from the time period database.  Each
# entry in this field is an action identifier, optionally followed by a
# colon (':') and a time period identifier.  If no time period identifier
# is specified, then any time is assumed.  A time period identifier can
# have '(d)' appended to it.  In that case, the time period is a scheduled
# down time for the node, and the action is executed only if the node
# violates the scheduled down time.  Entries are separated by a comma.
#
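# For illustration, some values for this field.  The identifiers are the
# ones used in the examples below and are assumed to be defined in the
# action and time period databases.
#
#   techs                          'techs' at any time
#   techs,admin                    'techs' and 'admin' at any time
#   techs:workday,admins:offhour   'techs' during 'workday', 'admins'
#                                  during 'offhour'
#   techs:maint(d)                 'maint' is a scheduled down time
#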
# A scheduled down time is a window in which a node can change its status
# (go up, go down) without the action being triggered.  If, however, the
# node is down when it emerges from the scheduled down time, then the
# action is triggered.  There are two cases which, in all likelihood,
# are unrelated to the scheduled down time.  The first is when a node is
# down on entering the scheduled down time and stays down throughout it.
# Scheduled down time actions are ignored in such cases.  The second is
# when a node is down on entering a scheduled down time but comes up
# during it.  Since a node action was probably executed when the node went
# down, one should probably be executed when it comes up.  The outage was
# also probably coincidental, so normal node (non-scheduled down time)
# actions are processed if conditions allow.
#
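# As a rough summary of the cases above, for an entry such as
# 'techs:maint(d)' (an illustration only, not the full algorithm):
#
#   State entering 'maint'   During 'maint'     Result
#   ----------------------   ----------------   ---------------------------
#   up                       changes, ends up   no action
#   up                       ends down          scheduled down time action
#   down                     stays down         scheduled down time actions
#                                               ignored
#   down                     comes up           normal node actions, if
#                                               conditions allow
#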
##################################################

##################################################
#
# Node          Class   Count   Timeout Action with Valid Time Period
#-------------  -----   -----   ------- -----------------------------

# Example of standard nodes:
#
# Notify techs if node0 changes state, 7x24
node0		h	3	1	techs

# Notify techs and admin if node1 changes state, 7x24
node1		h	3	1	techs,admin

# Notify techs if node2 changes state 7x24 ... except during 
# the 'maint' period.  During the 'maint' period, don't notify 
# techs ... but when the 'maint' period ends, if node2's state 
# is different than when the 'maint' period started, then notify 
# techs of the change
node2		h	3	1	techs:maint(d)

# Notify techs if node3 changes state during the workday, notify 
# admins if node3 changes state offhour.
node3		h	3	1	techs:workday,admins:offhour

# These nodes are reachable only via a WAN link with tons of delay, 
# so wait three seconds before declaring a ping lost.  Only 
# notify techs during the workday, for node5 and node6.  For node7
# notify techs 7x24 ... but ... if the box changes state during the
# 'offhour' period, don't notify techs ... unless the box's state, 
# at the end of the 'offhour' period, is different than its state 
# was when the box entered the 'offhour' period
node5		h	3	3	techs:workday
node6		h	3	3	techs:workday
node7		h	3	3	techs:offhour(d)

# These three nodes are members of a redundant set; notify 
# techs of state changes in all three ... and if all three 
# are down simultaneously, then also perform the 'special' 
# action.
server-a-dns	r	3	1	techs
server-b-dns	r	3	1	techs
server-c-dns	r	3	1	techs

# This is a core router -- a whole building depends on it.  If 
# it goes down, NodeWatch will enter Partition mode, in which 
# NodeWatch will only perform actions for other Class "c" 
# devices
router1		c	3	1	techs

# This is a pair of redundant routers; notify techs of state changes 
# to both ... if one goes down, don't enter Partition mode ... only enter 
# Partition mode if both are down.  And if both go down, in addition 
# to performing the "techs" action, also perform the "special" action
bldg-a-rtr	c	2	1	techs
bldg-b-rtr	c	2	1	techs

#
##################################################