OIT VM Hosting – Monitoring and Alarm Remediation


OIT offers monitoring for any of its hosted VMs upon request. The tables below list the default monitoring that is configured when monitoring is requested during a VM’s creation in Clockworks:

Default Linux Monitoring

Monitor Type                 Threshold  Severity  Alerts  Notification  Origination  Alarm Type
CPU Load (15min average)     30         Major     page    SSI oncall    Spectrum     LoadAverage
File System Usage            90%        Minor     email   SSI email     Spectrum     DISK THRESHOLD EXCEEDED
                             94%        Major     page    SSI oncall
Virtual Memory Usage (swap)  65%        Minor     email   SSI email     Spectrum     DISK THRESHOLD EXCEEDED
                             70%        Major     page    SSI oncall
Host Availability            down       Critical  page    SSI oncall    Spectrum     DEVICE HAS STOPPED RESPONDING TO POLLS
SSH                          down       Major     page    SSI oncall    Nagios       SSH
SNMP                         down       Major     page    SSI oncall    Spectrum     MANAGEMENT AGENT LOST
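
To illustrate how the two-level thresholds in the table above escalate, here is a minimal Python sketch. It is not how Spectrum evaluates alarms: the names LINUX_DEFAULTS and classify are hypothetical, and the rule that an alarm fires when a value is greater than or equal to its threshold is an assumption.

```python
# Default Linux thresholds copied from the table above.
# Each entry is (threshold, severity, alert method), ordered lowest first.
LINUX_DEFAULTS = {
    "CPU Load (15min average)": [(30, "Major", "page")],
    "File System Usage": [(90, "Minor", "email"), (94, "Major", "page")],
    "Virtual Memory Usage (swap)": [(65, "Minor", "email"), (70, "Major", "page")],
}


def classify(monitor, value, table=LINUX_DEFAULTS):
    """Return (severity, alert) for the highest threshold reached, or None.

    Assumption: an alarm fires when value >= threshold.
    """
    result = None
    for threshold, severity, alert in table[monitor]:
        if value >= threshold:
            # Later (higher) thresholds overwrite earlier ones.
            result = (severity, alert)
    return result
```

For example, under these assumptions a file system at 92% usage maps to a Minor alarm sent by email, while the same file system at 94% maps to a Major alarm that pages the on-call.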

 

Default Windows Monitoring

Monitor Type       Threshold  Severity  Notification Method  Origination  Alarm Type
disk [c:d:]        90%        Minor     email                Spectrum     DISK THRESHOLD EXCEEDED
                   95%        Major     page
microsoft-ds       down       Minor     email                Nagios       PORT
netbios-ssn        down       Major     page                 Nagios       PORT
Host Availability  down       Critical  page                 Spectrum     DEVICE HAS STOPPED RESPONDING TO POLLS
Remote Desktop     down       Major     page                 Nagios       PORT
SNMP               down       Major     page                 Spectrum     MANAGEMENT AGENT LOST
Physical Memory    98%        Major     page                 Spectrum     HIGH MEMORY UTILIZATION
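
The PORT alarms in the table above are reachability checks on network services. As a rough sketch of what such a check amounts to, the Python below attempts a TCP connection to each service's well-known port (445 for microsoft-ds, 139 for netbios-ssn, 3389 for Remote Desktop). This is an illustration only, not the Nagios configuration OIT uses; the names WINDOWS_PORTS, port_is_up and check_windows_host are hypothetical, and the 3-second timeout is an assumption.

```python
import socket

# Well-known TCP ports for the Nagios-originated checks in the table above.
WINDOWS_PORTS = {
    "microsoft-ds": 445,
    "netbios-ssn": 139,
    "Remote Desktop": 3389,
}


def port_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_windows_host(host, timeout=3.0):
    """Map each monitored service name to 'up' or 'down'."""
    return {
        name: "up" if port_is_up(host, port, timeout) else "down"
        for name, port in WINDOWS_PORTS.items()
    }
```

Calling check_windows_host with a VM's host name returns a dictionary such as one entry per service, which can help confirm from a client machine whether a PORT alarm reflects a real outage or a firewall between the monitor and the host.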

Monitoring is set up through a combination of Nagios, SCOM, and Spectrum. Additional monitoring of specific processes and transactions can be put in place on request. To request additional monitoring (or the removal of monitoring), use the OIT – Monitoring Request form, which can be found in the Service Request Catalog in ServiceNow.

Log in to Support@Duke > On the left, under Self Service, click Service Request Catalog > Under the Hosted Computing section, click OIT – Monitoring Request > Complete the form and click Request Now at the top right.

 

Maintenance Mode

Placing a host in Maintenance Mode in Spectrum suppresses all alarms for that host.

Any host that is being worked on should be placed in Maintenance Mode. This can also be done via Enso or by contacting Operations-Service Ops Center-OIT (either through a ServiceNow request or by phone).

If a host requires a recurring maintenance schedule, submit a request to Monitoring-OIT to have it configured.

 

Alarm Remediation

The Service Operations Center (SOC) responds to alarms from monitored hosts. Notifications can be adjusted as necessary by submitting a ServiceNow request to Monitoring-OIT. If the SOC should take particular steps for a given alarm, the Service Management-OIT team will document them.

 

Cacti Graphs

Cacti offers graphing and trending data based on the information collected from each host. Read access to Cacti is granted when a VM is requested through Clockworks. The following standard information is available in Cacti:

Linux

  • Processes+ 
  • SNMP – TCP Connection Status 
  • ucd/net – CPU Usage (enhanced) 
  • ucd/net – Load Average (enhanced)
  • ucd/net – Memory Usage (enhanced) 
  • ucd/net – TCP Counters 
  • ucd/net – TCP Current Established
  • iostat – kBytes/sec 
  • iostat – Queue size
  • iostat – Request size 
  • iostat – Requests/sec 
  • iostat – Times 
  • iostat – Utilisation 
  • RFC1213 Statistics 
  • SNMP – Get Mounted Partitions 
  • SNMP – Interface Statistics 
  • ucd/net – Get Monitored Partitions 

Windows

  • Host MIB – Processes 
  • SNMP – TCP Connection Status 
  • SNMP – Get Mounted Partitions 
  • SNMP – Get Processor Information 
  • SNMP – Interface Statistics 

The following steps describe how to access this information in Cacti:

  1. Log into Cacti – https://cacti.oit.duke.edu/cacti/graph_view.php?action=list
  2. From this view, enter the host name in the search box.
  3. Click the Graph Title from the list presented to view the appropriate graphs.