ConSol* Consulting & Solutions Software GmbH Deutschland

Description

check_hpasm is a plugin for Nagios that checks the hardware health of Hewlett-Packard ProLiant servers. It requires the hpasm package to be installed. The plugin checks the health of

  • Processors
  • Power supplies
  • Memory modules
  • Fans
  • CPU and board temperatures

and alerts you if one of these components is faulty or operates outside its normal parameters.

Documentation

Usually the plugin is called without any parameters:

nagios:~> check_hpasm
OK - hardware working fine

For debugging purposes it can be called with the -v option. It will then output the detailed status of each checked component.

nagios:~> check_hpasm -v
checking hpasmd process
System        :proliant dl580 g3
Serial No.    :GB86239BST
ROM version   :P38 04/28/2006
cpu 0 is ok
cpu 1 is ok
powersupply 1 is ok
powersupply 2 is ok
fan 1 speed is normal
fan 2 speed is normal
fan 3 speed is normal
fan 4 speed is normal
fan 5 speed is normal
fan 6 speed is normal
fan 7 speed is normal
fan 8 speed is normal
1 cpu#1 temparature is 58 (85 max)
2 cpu#2 temparature is 48 (85 max)
3 i/o_zone temparature is 34 (60 max)
4 ambient temparature is 25 (40 max)
5 system_bd temparature is 38 (60 max)
dimm 1@1 is ok
dimm 2@1 is ok
dimm 3@1 is ok
dimm 4@1 is ok
dimm 1@2 is ok
dimm 2@2 is ok
dimm 3@2 is ok
dimm 4@2 is ok
dimm 1@3 is ok
dimm 2@3 is ok
dimm 3@3 is ok
dimm 4@3 is ok
dimm 1@4 is ok
dimm 2@4 is ok
dimm 3@4 is ok
dimm 4@4 is ok
logical drive 1 is ok
physical drive 2:0 is ok
physical drive 2:1 is ok
physical drive 2:2 is ok
physical drive 2:3 is ok
controller in slot 0 is ok
OK - hardware working fine
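
The verbose listing above can also be post-processed, for example to see how close each sensor is to its limit. The following is a minimal Python sketch, not part of check_hpasm: the function name temperature_headroom is hypothetical, and the pattern assumes that temperature lines look like the ones shown in the sample outputs on this page.

```python
import re

# Matches lines such as "1 cpu#1 temparature is 58 (85 max)". The pattern
# accepts both spellings of "temperature" that occur in the sample outputs.
TEMP_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+temp[ae]rature is (\d+) \((\d+) max\)"
)


def temperature_headroom(verbose_output):
    """Return {sensor name: degrees left before the limit is reached}."""
    headroom = {}
    for line in verbose_output.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            _, name, value, limit = match.groups()
            headroom[name] = int(limit) - int(value)
    return headroom
```

Lines that do not describe a temperature (cpu, fan, dimm and so on) are simply ignored by the pattern.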

To skip the checks of failed or missing components and suppress the alerts they cause, blacklist them with the -b option. The option takes a list of items separated by /, in the following format:

<type>:<no>[,<no>...][/<type>:<no>[,<no>...]]...

where <type> is one of: p for power supplies, f for fans, c for CPUs, t for temperatures, d for memory modules, l for logical drives, y for physical drives, co for RAID controllers, cc for RAID controller caches and cb for RAID controller batteries. <no> is the number of the component (or a comma-separated list of numbers). The number of a memory module is composed as <no of cartridge>-<no of dimm>.

Example: Power supplies #2 and #3 have been left out on purpose, and nobody cares about the failed memory module #3 in cartridge #0.

nagios:~> ./check_hpasm -v -b d:0-3/p:2,3
checking hpasmd process
System        :proliant dl380 g3
Serial No.    :8312LDN11121
ROM version   :P29 01/08/2003
checking cpus
 cpu 0 is ok
 cpu 1 is ok
checking power supplies
 powersupply 1 is ok
 powersupply 2 is missing (blacklisted)
 powersupply 3 is missing (blacklisted)
checking fans
 fan 1 speed is normal
 fan 2 speed is normal
 fan 3 speed is normal
 fan 4 speed is normal
 fan 5 speed is normal
 fan 6 speed is normal
 fan 7 speed is normal
 fan 8 speed is normal
checking temperatures
 1 processor_zone temperature is 36 (62 max)
 2 cpu#1 temperature is 36 (73 max)
 3 i/o_zone temperature is 47 (68 max)
 4 cpu#2 temperature is 37 (73 max)
 5 power_supply_bay temperature is 30 (55 max)
checking memory modules
 dimm 1@0 is ok
 dimm 2@0 is ok
 dimm 3@0 is dimm is degraded (blacklisted)
 dimm 4@0 is ok
OK - hardware working fine
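
To make the blacklist format more concrete, here is a minimal Python sketch that splits such a string into its types and component numbers. It only illustrates the format described above; parse_blacklist is a hypothetical helper, and the plugin's own parsing may differ.

```python
def parse_blacklist(spec):
    """Split '<type>:<no>[,<no>...][/<type>:<no>[,<no>...]]...' into a dict.

    Component numbers are kept as strings, because memory modules use
    the composite form '<no of cartridge>-<no of dimm>'.
    """
    result = {}
    for item in spec.split("/"):
        if not item:
            continue  # tolerate a stray leading or trailing slash
        kind, _, numbers = item.partition(":")
        result.setdefault(kind, []).extend(n for n in numbers.split(",") if n)
    return result
```

For the example above, parse_blacklist("d:0-3/p:2,3") yields memory module 0-3 under d and power supplies 2 and 3 under p.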

Alternatively, you can write the list of blacklisted devices into the first line of a file and pass the filename to the --blacklist option.

 

To override the system-default temperature thresholds, use the --customthres option.

--customthres=1:60/5:50

sets the limit for temperature 1 to 60 degrees and the limit for temperature 5 to 50 degrees.
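
If the thresholds are kept in a script or an inventory, the option value can be generated instead of typed by hand. The sketch below is an illustration only: build_customthres is a hypothetical helper, and the <no>:<limit> pairs joined by / are inferred from the single example above.

```python
def build_customthres(limits):
    """Format {temperature number: limit} as '<no>:<limit>[/<no>:<limit>]...'."""
    return "/".join(
        f"{number}:{limit}" for number, limit in sorted(limits.items())
    )


# Temperature 1 gets a limit of 60 degrees, temperature 5 a limit of 50.
option = "--customthres=" + build_customthres({5: 50, 1: 60})
```

The pairs are sorted by temperature number so that the generated option is stable between runs.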

 

With the -p option you can switch on the output of performance data, if it was not already set as the default during installation. If the perfdata string becomes too long, use --perfdata=short, which outputs a short form of the temperature tags.

nagios:~> check_hpasm -p --customthres=5:44
OK - hardware working fine| fan_1=8%;0;0 fan_2=8%;0;0 → ...
    → fan_3=15%;0;0 fan_4=15%;0;0 fan_5=8%;0;0 → ...
    → fan_6=8%;0;0 fan_7=20%;0;0 fan_8=20%;0;0 → ...
    → "temp_1_processor_zone"=38;62;62 → ...
    → "temp_2_cpu#1"=37;73;73 "temp_3_i/o_zone"=49;68;68 → ...
    → "temp_4_cpu#2"=40;73;73 "temp_5_power_supply_bay"=36;44;44
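
The performance data after the | character follows the usual Nagios label=value;warn;crit layout, so it can be read back with a few lines of code. The following Python sketch is a hypothetical helper (parse_perfdata is not part of the plugin) and assumes that the → ... marks in the listing above are only line-continuation marks, i.e. that the plugin prints everything on one line.

```python
import shlex


def parse_perfdata(output):
    """Return {label: (value, warn, crit)} from one line of plugin output."""
    _, _, perf = output.partition("|")
    metrics = {}
    # shlex.split removes the quotes around labels such as "temp_2_cpu#1".
    for token in shlex.split(perf):
        label, _, data = token.rpartition("=")
        fields = data.split(";")
        value = float(fields[0].rstrip("%"))
        warn = float(fields[1]) if len(fields) > 1 and fields[1] else None
        crit = float(fields[2]) if len(fields) > 2 and fields[2] else None
        metrics[label] = (value, warn, crit)
    return metrics
```

In the output above, --customthres=5:44 shows up as the warning and critical value of temp_5_power_supply_bay.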

With some BIOS releases, hpasmcli doesn't display the memory modules correctly. The command SHOW DIMM then shows only a list of modules with status n/a, which is counted as a warning. With the -i or --ignore-dimms option you can skip the memory check to avoid this warning without using a blacklist.


Installation

  • After unpacking the archive, run the ./configure command. Pay attention to the --with-noinst-level option, which defines the exit code of the plugin if no hpasm rpm is installed. With the option --with-degrees you tell the plugin whether you want temperature values displayed in Celsius or Fahrenheit. With the option --enable-perfdata you tell check_hpasm to add performance data to its output by default. If you don't want to see type, serial number and BIOS release in the output, you can switch this off with --disable-hwinfo. With --enable-hpacucli you activate checking of RAID controllers.
  • Get the hpasm package suitable for your Linux distribution and install it. See the list of links below for where to find it.
  • If you run check_hpasm as a non-root user, you will need sudo privileges that allow you to call /sbin/hpasmcli as root without providing a password.
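
The sudo requirement from the last step could be met with a sudoers entry like the one below. This is a sketch under assumptions: the user name nagios is only a guess at the account that runs the plugin, and the entry should be added with visudo and adapted to local policy.

```
# Assumption: the plugin runs as user "nagios".
# Allows exactly one command as root, without a password prompt.
nagios ALL=(root) NOPASSWD: /sbin/hpasmcli
```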

Examples

More examples for different error conditions:

Memory module failed:
nagios:~> check_hpasm 
CRITICAL - dimm module 2 @ cartridge 2 needs attention → ...
    →(dimm is degraded)
nagios:~> check_hpasm -v
checking hpasmd process
System        :proliant dl580 g3
Serial No.    :GB8632FB7V
ROM version   :P38 04/28/2006
checking cpus
 cpu 0 is ok
 cpu 1 is ok
 cpu 2 is ok
 cpu 3 is ok
checking power supplies
 powersupply 1 is ok
 powersupply 2 is ok
checking fans
checking temperatures
 1 cpu#1 temparature is 36 (80 max)
 2 cpu#2 temparature is 34 (80 max)
 3 cpu#3 temparature is 33 (80 max)
 4 cpu#4 temparature is 37 (80 max)
 5 i/o_zone temparature is 32 (60 max)
 6 ambient temparature is 23 (40 max)
 7 system_bd temparature is 34 (60 max)
checking memory modules
 dimm 1@1 is ok
 dimm 2@1 is ok
 dimm 3@1 is ok
 dimm 4@1 is ok
 dimm 1@2 is ok
 dimm 2@2 is dimm is degraded
 dimm 3@2 is ok
 dimm 4@2 is ok
CRITICAL - dimm module 2 @ cartridge 2 needs attention → ...
    → (dimm is degraded)

Power supply module failed:
nagios:~> ./check_hpasm 
CRITICAL - powersuply #2 needs attention (failed), → ...
    → powersuply #1 is not redundant
nagios:~> ./check_hpasm -v
checking hpasmd process
System        :proliant dl580 g4
Serial No.    :GB8637M8TH
ROM version   :P59 09/08/2006
checking cpus
 cpu 0 is ok
 cpu 1 is ok
 cpu 2 is ok
 cpu 3 is ok
checking power supplies
 powersupply 1 is ok
 powersupply 2 is failed
checking fans
checking temperatures
 1 cpu#1 temparature is 42 (85 max)
 2 cpu#2 temparature is 46 (85 max)
 3 cpu#3 temparature is 44 (85 max)
 4 cpu#4 temparature is 44 (85 max)
 5 i/o_zone temparature is 39 (60 max)
 6 ambient temparature is 27 (40 max)
 7 system_bd temparature is 41 (60 max)
checking memory modules
 dimm 1@1 is ok
 dimm 2@1 is ok
 dimm 3@1 is ok
 dimm 4@1 is ok
 dimm 1@2 is ok
 dimm 2@2 is ok
 dimm 3@2 is ok
 dimm 4@2 is ok
 dimm 1@3 is ok
 dimm 2@3 is ok
 dimm 3@3 is ok
 dimm 4@3 is ok
 dimm 1@4 is ok
 dimm 2@4 is ok
CRITICAL - powersuply #2 needs attention (failed), → ...
    → powersuply #1 is not redundant

Power supply module pulled:
nagios:~> ./check_hpasm 
CRITICAL - powersuply #2 is missing, powersuply #1 is not redundant
nagios:~> ./check_hpasm -v
checking hpasmd process
System        :proliant dl580 g4
Serial No.    :GB8637M8TH
ROM version   :P59 09/08/2006
checking cpus
 cpu 0 is ok
 cpu 1 is ok
 cpu 2 is ok
 cpu 3 is ok
checking power supplies
 powersupply 1 is ok
 powersupply 2 is n/a
checking fans
checking temperatures
 1 cpu#1 temparature is 42 (85 max)
 2 cpu#2 temparature is 46 (85 max)
 3 cpu#3 temparature is 44 (85 max)
 4 cpu#4 temparature is 44 (85 max)
 5 i/o_zone temparature is 39 (60 max)
 6 ambient temparature is 27 (40 max)
 7 system_bd temparature is 41 (60 max)
checking memory modules
 dimm 1@1 is ok
 dimm 2@1 is ok
 dimm 3@1 is ok
 dimm 4@1 is ok
 dimm 1@2 is ok
 dimm 2@2 is ok
 dimm 3@2 is ok
 dimm 4@2 is ok
 dimm 1@3 is ok
 dimm 2@3 is ok
 dimm 3@3 is ok
 dimm 4@3 is ok
 dimm 1@4 is ok
 dimm 2@4 is ok
CRITICAL - powersuply #2 is missing, powersuply #1 is not redundant

Hpasm daemon is not running:
nagios:~> check_hpasm 
CRITICAL - hpasmd needs to be started

Hpasm software is not installed:
nagios:~> check_hpasm   
OK - hardware working fine, at least i hope so → ...
    → because hpasm is not installed
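
The examples above show the status text, but Nagios itself evaluates the exit code of a plugin (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The Python sketch below shows how a wrapper script might run the plugin and translate that code. The names status_name and run_check are hypothetical, and the command is assumed to be found on the PATH.

```python
import subprocess

# Standard Nagios plugin exit codes.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def status_name(exit_code):
    """Translate a plugin exit code; anything unexpected counts as UNKNOWN."""
    return STATUS_NAMES.get(exit_code, "UNKNOWN")


def run_check(command=("check_hpasm",)):
    """Run the plugin and return (status name, first line of its output)."""
    completed = subprocess.run(list(command), capture_output=True, text=True)
    lines = completed.stdout.splitlines()
    return status_name(completed.returncode), lines[0] if lines else ""
```

Which exit code the plugin uses when hpasm is not installed depends on the --with-noinst-level setting chosen at configure time.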

Please run check_hpasm -v on as many different platforms as possible. You may have a rare ProLiant model whose components are not detected completely; in that case you will see instructions on how to report this to the author.

The following line appears frequently but can be considered harmless:

 #0 SYSTEM_BD - -


Download

check_hpasm-2.0.3.1.tar.gz


External links


Changelog

  • 2008-04-16 2.0.3.1 configure-Bug fixed. (--with-perl, --with-perfdata)
  • 2008-04-09 2.0.3 Blacklisting for Controllers. Dimm-Bug fixed.
  • 2008-02-11 2.0.2 empty cpu&fan sockets are now properly handled
  • 2008-02-08 2.0.1 multiline output for nagios 3.x
  • 2008-02-08 2.0 complete code redesign, integrated raid checking with hpacucli
  • 2008-01-18 1.6.2.2 Fixed misleading message under Debian 3.1
  • 2007-12-12 1.6.2.1 Bugfix. Fans were overlooked.
  • 2007-11-16 1.6.2 New option -i, output of model, biosrelease and serial number by default (Thanks Marcus Fleige).
  • 2007-11-07 1.6.1 Bugfix. Failed fans were possibly overlooked. Perfdata uses single quotes.
  • 2007-07-27 1.6 Performance data.
  • 2007-06-14 1.5 New option supports user-defined temperature thresholds.
  • 2007-05-22 1.4 Support for hpasmxld and hpasmlited.
  • 2007-04-18 1.3 Added --with-degrees to configure. Added --blacklist
  • 2007-04-16 1.2 Added --with-noinst-level option to configure.
  • 2007-04-14 1.1 First published release.

Copyright

2007 Gerhard Laußer

check_hpasm is released under the GNU General Public License (GPL).


Author

Gerhard Laußer (gerhard.lausser@consol.de) will gladly answer your questions.