/sys/net Adventures: 2014

Monday, April 14, 2014

Zabbix : Create a production network interface trigger

Following my two previous posts on how to add the interface's description to Zabbix graphs [1] and triggers [2], I will finish this series of Zabbix posts with the creation of a production interface trigger.

By default Zabbix includes the "Operational status was changed ..." trigger which is (in my opinion) a big joke :

- The trigger disappears (status "OK") after the next ifOperStatus check (60 seconds by default).
- The trigger is raised when a piece of equipment is plugged in. This is "good to know" information but I can't raise a high severity trigger each time something is plugged in !
- I can't tell if the interface was up and went down OR if the interface was down and went up.
- If I wanted a "Something was plugged in on GEX/X/X" trigger, I would make a special trigger for that purpose.
- The trigger doesn't include the interface's description (which is extremely irritating and makes me want to kill little kittens). Check my previous post [2] if you care about the kittens' survival.

This new trigger will have the following properties :

- It is raised ONLY if the interface was up (something was plugged in) and went down (equipment stopped, interface shut or somebody removed the cable).
- It disappears if the interface comes back up.
- It has a "high" severity and includes the interface's description.

Go to "Configuration -> Templates -> Template SNMP Interfaces -> Discovery -> Trigger prototypes" and click on "Create trigger prototype".

Use the following line as the trigger's name :

 Production Interface status on {HOST.HOST}: {#SNMPVALUE}, {ITEM.VALUE2} : {ITEM.VALUE3}

Use this as the trigger's expression :

 {Template SNMP Interfaces:ifOperStatus[{#SNMPVALUE}].avg(3600)}<2&{Template SNMP Interfaces:ifOperStatus[{#SNMPVALUE}].last(0)}=2&{Template SNMP Interfaces:ifAlias[{#SNMPVALUE}].str(this_does_not_exist)}=0

This expression means: raise if the interface was up "avg(3600)}<2" AND went down "last(0)}=2" (ifOperStatus is 1 when the interface is up and 2 when it is down). The 3600 value specifies how long the trigger stays raised; once the interface has been down for 3600s, "avg(3600)" equals 2 and the trigger disappears.
The .str(this_does_not_exist)}=0 expression is used to show the interface's description and is explained in my previous post [2].
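
To make the logic of the first two conditions easier to picture, here is a rough sketch that replays them outside of Zabbix. It is only an illustration: the function name trigger_fires is made up, the samples are fed by hand, and it assumes that only the values 1 (up) and 2 (down) show up in the window.

```shell
# trigger_fires: reads the ifOperStatus samples of the last hour on stdin
# (one value per line, oldest first; 1 = up, 2 = down) and prints PROBLEM
# when avg < 2 (the interface was up at some point) and last = 2 (it is down now).
trigger_fires() {
    awk '{ sum += $1; last = $1; n++ }
         END {
             if (n > 0 && sum / n < 2 && last == 2) print "PROBLEM"
             else print "OK"
         }'
}

printf '1\n1\n1\n2\n' | trigger_fires   # was up, just went down: prints PROBLEM
printf '2\n2\n2\n2\n' | trigger_fires   # down for the whole window: prints OK
printf '2\n2\n1\n1\n' | trigger_fires   # came back up: prints OK
```

The second case is why the trigger clears by itself after one hour of downtime: the average reaches 2 and the first condition stops being true.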

Use this as the trigger's description :

 Interface status went from up to down !!!  
 Interface : {#SNMPVALUE}, {ITEM.VALUE1} = {ITEM.VALUE3}

Set the severity to "high" (or whatever suits you); you can override the severity for each of your interfaces/devices.

Wait until the discovery rule is refreshed (default is 3600s) or temporarily set it to 60s. We can now disable an interface to check the results; let's do this on ev-bccsw02, interface ge-0/0/3 :

The trigger is raised as expected with the hostname, interface name and description. If you configured Zabbix actions, the alert message will look like :

"Production Interface status ev-bccsw02: ge-0/0/3, down (2) : EV-ORADB01 - BACK_PROD"

Let's re-enable the interface :

The trigger goes green as the interface comes back up; you should receive a message saying :

"Production Interface status ev-bccsw02: ge-0/0/3, up (1) : EV-ORADB01 - BACK_PROD"

Be aware that you can also use SNMP traps for that purpose.

Hope that helps !

[1] : http://sysnet-adventures.blogspot.fr/2014/02/zabbix-display-network-interface.html
[2] : http://sysnet-adventures.blogspot.fr/2014/04/zabbix-display-network-interface.html

Zabbix : Display network interface description in triggers

In a previous post [1], I explained how to solve a very frustrating thing about Zabbix : "How to add the network interface's description to your graph names."

In this post, I'll explain how to fix another very frustrating thing about Zabbix : "How to add the network interface's description to your trigger names."

Zabbix has a default interface trigger which is raised when an interface status changes.
It would have been a good thing if it didn't have the same issue we had with the graphs: the interface description appears neither in the trigger's name nor in the comment. This is very annoying, especially if you receive alerts during the night.

Below is an example of the default Zabbix trigger alert :

Seems like Ge1 operational status changed, good to know, but again what the hell is "ge1" ???
Message to Zabbix team : Do you really think I learnt all my switches port allocations by heart ???

The good news here is you can solve this stupidity with a "crafty" trick !

Trigger names/descriptions don't interpret items, so using the "Zabbix Graph" trick [1] won't work...
To get your interface's description, you'll need to insert an "interface alias" item (ifAlias) in your trigger expression and reference it in the trigger name with the Zabbix standard macro "{ITEM.VALUEX}".

Go to "Configuration -> Templates -> Template SNMP Interfaces -> Discovery -> Trigger prototypes"

You should have a trigger named "Operational status was changed on {HOST.NAME} interface {#SNMPVALUE}" which matches the screenshot above.
To get the interface description, we first add a trigger expression that checks if the interface alias (i.e. the description) contains (str() function) a string that will NEVER match, for example "this_does_not_exist" :

 {Template SNMP Interfaces:ifAlias[{#SNMPVALUE}].str(this_does_not_exist)}=0

This line means that the network interface description does NOT contain "this_does_not_exist", which is always true. Finally we add an AND operator (&) between the original expression and the string comparison, which gives us the final trigger expression :

 {Template SNMP Interfaces:ifOperStatus[{#SNMPVALUE}].diff(0)}=1&{Template SNMP Interfaces:ifAlias[{#SNMPVALUE}].str(this_does_not_exist)}=0

EDIT: According to a user's comment, newer versions of Zabbix require replacing the "&" sign with the "and" keyword.

This line means there was an interface operational status change AND the interface's alias does NOT contain "this_does_not_exist".
This alias comparison is just a trick so we can reference the interface's alias (i.e. the description) with the "{ITEM.VALUEX}" standard macro.

Now change the trigger name to the following string :

Now change the trigger name with the following string :

  Operational status was changed on {HOST.NAME} interface {#SNMPVALUE} : {ITEM.VALUE2}

As you can see, I added the macro {ITEM.VALUE2}, which returns the latest value of the second item in the trigger's expression which is, you guessed it, the interface alias !

Wait until the discovery rule is refreshed (default is 3600s) or temporarily set it to 60s and enjoy the happiness of the result :

You can also use the {ITEM.VALUE2} macro in the trigger's description, very handy if you want to include additional information for the on-call guy.

In the next post [2], I'll show how to create a real interface trigger; from my point of view this default trigger is completely useless :

- The trigger disappears after the next ifOperStatus check (60 seconds by default).
- The trigger is raised when a piece of equipment is plugged in. This is "good to know" information but I can't raise a high severity trigger each time something is plugged in !
- I can't tell if the interface was up and went down OR if the interface was down and went up.
- If I wanted a "Something was plugged in on GEX/X/X" trigger, I would make a special trigger for that purpose.

[1] http://sysnet-adventures.blogspot.fr/2014/02/zabbix-display-network-interface.html
[2] http://sysnet-adventures.blogspot.fr/2014/04/zabbix-create-production-network.html

Wednesday, February 26, 2014

Zabbix : Display network interface description in graph names

One very frustrating thing about Zabbix is its inability to display the network interface's description in graph names. All your network equipment interface graphs look like :

Nice graph, hummm wait ! What the hell is "ge-0/0/8" ?
If you only have one or two switches it's fine but otherwise it becomes very confusing.

Since Zabbix 2.2, it is possible to interpret items in graph names, which means you can add the interface's description to your graph names !

To add your interface descriptions, go to "Configuration -> Templates -> Template SNMP Interfaces -> Discovery -> Graph prototypes", choose the "Traffic graph" and use this string as the graph's name :

 Traffic on interface {#SNMPVALUE} : {{HOSTNAME}:ifAlias[{#SNMPVALUE}].last(0)}

The result is pure happiness :

You now have your interface's description in your graph names !

Hope that helps !

UPDATE : The same frustrating bug (yes, yes bug) is also present in triggers, by default you can't have the interface's description in your triggers... This post explains how to solve this issue and get happy !
http://sysnet-adventures.blogspot.fr/2014/04/zabbix-display-network-interface.html

Tuesday, February 25, 2014

Manage your backup retention policies with retdo

You have folders with all your backups but storage capacity is running low ? Need to clean up all these old files but still want to keep some of them just in case ? Then retdo is the perfect tool !

Retdo is a little script I wrote that allows administrators to clean up files on a custom retention basis.
Retdo can be used to implement retention plans for production backups.

retdo can handle requests such as :

- I want to keep only one file per week if files are older than 3 months up to 6 months.
- I want to keep only one file per month if files are older than 6 months up to 1 year.
- I want files older than 1 year to be moved to another machine.
- I want a cup of tea (feature in progress)

Code and instructions are available for free at https://github.com/gcharot/retdo

Example :

Let's say I have my January daily backups in /data/backup/db/dbname :

#  ll /data/backup/db/dbname/   
 -rw-r--r-- 1 root root 0 Jan 1 12:00 jan01.tgz  
 -rw-r--r-- 1 root root 0 Jan 2 12:00 jan02.tgz  
 -rw-r--r-- 1 root root 0 Jan 3 12:00 jan03.tgz  
 -rw-r--r-- 1 root root 0 Jan 4 12:00 jan04.tgz  
 -rw-r--r-- 1 root root 0 Jan 5 12:00 jan05.tgz  
 -rw-r--r-- 1 root root 0 Jan 6 12:00 jan06.tgz  
 -rw-r--r-- 1 root root 0 Jan 7 12:00 jan07.tgz  
 -rw-r--r-- 1 root root 0 Jan 8 12:00 jan08.tgz  
 -rw-r--r-- 1 root root 0 Jan 9 12:00 jan09.tgz  
 -rw-r--r-- 1 root root 0 Jan 10 12:00 jan10.tgz  
 -rw-r--r-- 1 root root 0 Jan 11 12:00 jan11.tgz  
 -rw-r--r-- 1 root root 0 Jan 12 12:00 jan12.tgz  
 -rw-r--r-- 1 root root 0 Jan 13 12:00 jan13.tgz  
 -rw-r--r-- 1 root root 0 Jan 14 12:00 jan14.tgz  
 -rw-r--r-- 1 root root 0 Jan 15 12:00 jan15.tgz  
 -rw-r--r-- 1 root root 0 Jan 16 12:00 jan16.tgz  
 -rw-r--r-- 1 root root 0 Jan 17 12:00 jan17.tgz  
 -rw-r--r-- 1 root root 0 Jan 18 12:00 jan18.tgz  
 -rw-r--r-- 1 root root 0 Jan 19 12:00 jan19.tgz  
 -rw-r--r-- 1 root root 0 Jan 20 12:00 jan20.tgz  
 -rw-r--r-- 1 root root 0 Jan 21 12:00 jan21.tgz  
 -rw-r--r-- 1 root root 0 Jan 22 12:00 jan22.tgz  
 -rw-r--r-- 1 root root 0 Jan 23 12:00 jan23.tgz  
 -rw-r--r-- 1 root root 0 Jan 24 12:00 jan24.tgz  
 -rw-r--r-- 1 root root 0 Jan 25 12:00 jan25.tgz  
 -rw-r--r-- 1 root root 0 Jan 26 12:00 jan26.tgz  
 -rw-r--r-- 1 root root 0 Jan 27 12:00 jan27.tgz  
 -rw-r--r-- 1 root root 0 Jan 28 12:00 jan28.tgz  
 -rw-r--r-- 1 root root 0 Jan 29 12:00 jan29.tgz  
 -rw-r--r-- 1 root root 0 Jan 30 12:00 jan30.tgz  
 -rw-r--r-- 1 root root 0 Jan 31 12:00 jan31.tgz

Now I need to free some space up so I'd like to keep only one file per week :

 # retdo -p /data/backup/db/dbname -r "*.tgz" -b 1 -e 92 -d 7  
 26 file(s) processed - 0 file(s) in error  
 # ll /data/backup/db/dbname  
 total 0  
 -rw-r--r-- 1 root root 0 Jan 5 12:00 jan05.tgz  
 -rw-r--r-- 1 root root 0 Jan 12 12:00 jan12.tgz  
 -rw-r--r-- 1 root root 0 Jan 19 12:00 jan19.tgz  
 -rw-r--r-- 1 root root 0 Jan 26 12:00 jan26.tgz  
 -rw-r--r-- 1 root root 0 Jan 31 12:00 jan31.tgz

As you can see only one file per week (7 days) has been kept, 26 files were deleted.

This command means : "find all files matching the pattern *.tgz in /data/backup/db/dbname which are between 1 day and 92 days (3 months) old and keep only one file every week (7 days)"
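
To picture the "one file per period" idea, here is a small sketch. To be clear, this is NOT retdo's actual code and it may choose different files than retdo does: it simply assumes that files are grouped into buckets of N days by age and that the first file seen in each bucket is kept. The function name keep_one_per_period and the input format ("age_in_days filename") are made up for the example.

```shell
# keep_one_per_period PERIOD: reads "age_in_days filename" lines on stdin and
# prints the files that would be deleted, keeping the first file seen in each
# bucket of PERIOD days. Nothing is removed; it only prints names.
keep_one_per_period() {
    awk -v period="$1" '{
        bucket = int($1 / period)
        if (bucket in kept) print $2     # this bucket already has its keeper
        else kept[bucket] = $2           # first file of the bucket is kept
    }'
}

# jan01 and jan02 fall in the same 7-day bucket, as do jan07 and jan08,
# so this prints jan02.tgz and jan08.tgz
printf '31 jan01.tgz\n30 jan02.tgz\n25 jan07.tgz\n24 jan08.tgz\n' | keep_one_per_period 7
```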

Hope that helps !

Monday, February 24, 2014

Postfix : Show next deferred delivery attempt

Deferred mails happen, that's the hard life of the Internet. If you want to know when these mails will be requeued, you need to look at the files in Postfix's spool directory.

First of all, get the mail's queue ID with the mailq or postqueue -p command, or from the maillog file.
Once you have your queue ID, type the following command :

 # find /var/spool/postfix/deferred/ -name QUEUE_ID -exec stat {} \;

This will find your mail in the deferred spool and show its file properties.
Postfix stamps the mail's file with access and modification times in the future. This is the time at which Postfix will requeue the mail for delivery.

Below is a concrete example with the queue ID 7A5F2580101 :

 # date  
 Mon Feb 24 18:32:27 CET 2014  
 # find /var/spool/postfix/deferred/ -name 7A5F2580101 -exec stat {} \;  
  File: `/var/spool/postfix/deferred/7/7A5F2580101'  
  Size: 28779      Blocks: 64     IO Block: 4096  regular file  
 Device: 811h/2065d   Inode: 5767425   Links: 1  
 Access: (0700/-rwx------) Uid: (  89/ postfix)  Gid: (  89/ postfix)  
 Access: 2014-02-24 19:22:30.000000000 +0100  
 Modify: 2014-02-24 19:22:30.000000000 +0100  
 Change: 2014-02-24 18:15:50.156029044 +0100

As you can see, the access/modify time is in the future, which means Postfix will requeue the mail at 19:22:30.
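
If you would rather get "how long until the next attempt" than read the stat output by eye, something like the sketch below could do it. It assumes GNU stat (coreutils) and reads the modification time, which is identical to the access time in the example above. The function name seconds_until_retry is made up for this post.

```shell
# seconds_until_retry FILE [NOW]: prints the number of seconds between NOW
# (epoch seconds, defaults to the current time) and the modification time
# stamped on the deferred queue file. A negative result means the time has passed.
seconds_until_retry() {
    file=$1
    now=${2:-$(date +%s)}
    retry_at=$(stat -c %Y "$file") || return 1   # %Y = mtime in epoch seconds (GNU stat)
    echo $(( retry_at - now ))
}

# Example (not run here), with QUEUE_ID replaced by your own queue ID:
# find /var/spool/postfix/deferred/ -name QUEUE_ID | while read -r f; do
#     seconds_until_retry "$f"
# done
```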

Note : Requeuing doesn't mean that the mail will be sent at this exact time; that depends on your active queue load.

Note 2 : You can force a requeue with the "postsuper -r QUEUE_ID" command.

Friday, February 7, 2014

Smartctl : Linux disk I/O scheduler is reset to the default CFQ

Got a weird issue recently: I'm monitoring my SSD's lifetime with smartctl + Zabbix and realized that my scheduler settings were reset each time smartctl was executed !

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sda  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
 noop anticipatory deadline [cfq]

There is no real solution, but you can work around it by specifying the generic SCSI device name, i.e. "sgX" instead of "sdX".

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sg0  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
  [noop] anticipatory deadline cfq

And voila ! Problem not really solved but that does the job !
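
Since the reset is silent, it may be worth checking the active scheduler from a script (a monitoring item, for instance). Here is a minimal sketch that extracts the bracketed name from the sysfs line shown above; the function name active_scheduler is my own invention.

```shell
# active_scheduler: reads the content of /sys/block/<dev>/queue/scheduler on
# stdin and prints the scheduler currently in use (the one between brackets).
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

echo 'noop anticipatory deadline [cfq]' | active_scheduler   # prints cfq

# Example (not run here):
# active_scheduler < /sys/block/sda/queue/scheduler
```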

You can use sg_map (part of the sg3_utils package) to check the sdX -> sgX mappings :

 # sg_map -a  
 /dev/sg0 /dev/sda  
 /dev/sg1 /dev/sdb  
 /dev/sg2 /dev/scd0
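
If you call smartctl from a script, you can look the sg device up instead of hardcoding it. This is a small sketch that assumes the two-column "sg device, block device" output shown above; the function name sg_for_disk is made up.

```shell
# sg_for_disk DISK: reads `sg_map -a` output on stdin and prints the generic
# SCSI device mapped to DISK (prints nothing if there is no match).
sg_for_disk() {
    awk -v disk="$1" '$2 == disk { print $1 }'
}

printf '/dev/sg0 /dev/sda\n/dev/sg1 /dev/sdb\n' | sg_for_disk /dev/sdb   # prints /dev/sg1

# Example (not run here):
# smartctl -A --device=sat+megaraid,0 "$(sg_map -a | sg_for_disk /dev/sda)"
```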

Wednesday, February 5, 2014

Omreport fails : object not found

If you get the following message while using omreport :

 $ omreport chassis memory  
 Memory Information  
 Error : Memory object not found  
 $ omreport chassis hwperformance  
 Error! No Hardware Peformance probes found on this system.

The first thing to do is to restart the srvadmin services :

 # srvadmin-services.sh restart  
 # service ipmi restart

Check that the services are properly started.

If that doesn't solve the problem, you might have a semaphore issue. In my case the Zabbix agent/scripts went haywire and didn't release their semaphores.

To list the current semaphore arrays, use the following command :

 # ipcs -s

To show the current system limits :

 # ipcs -sl

You can use the following command to show the current number of semaphore arrays in use :

 # ipcs -us
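
To compare the two numbers without eyeballing them, here is a rough sketch. It assumes the output format I get from util-linux ipcs, with lines such as "used arrays = 12" (ipcs -us) and "max number of arrays = 128" (ipcs -sl); adjust the labels if your version prints something else. Both function names are made up.

```shell
# sem_field LABEL: reads ipcs output on stdin and prints the number found
# after "LABEL = ".
sem_field() {
    awk -F' *= *' -v label="$1" '$1 == label { print $2 + 0 }'
}

# semaphore_arrays_exhausted USED MAX: succeeds when the limit is reached.
semaphore_arrays_exhausted() {
    [ "$1" -ge "$2" ]
}

# Example (not run here):
# used=$(ipcs -us | sem_field "used arrays")
# max=$(ipcs -sl | sem_field "max number of arrays")
# semaphore_arrays_exhausted "$used" "$max" && echo "semaphore array limit reached"
```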

If you reached the system limit, it will certainly explain the omreport issue. From there, you have two possibilities :

You've reached the limit because there is an issue on your system (semaphores not released, or any other reason). You need to clean up your semaphores with the following command :

 # ipcrm -s semaphore_id

To remove all semaphores owned by a particular user :

 # ipcs -s | awk '$3 == "username" {system("ipcrm -s " $2)}'

Important : You need to stop the attached processes before removing the semaphores.

All your semaphores are legit and you need to increase the system limits :

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-Setting_Semaphore_Parameters.html

Hope that helps !

Wednesday, January 29, 2014

Force Postfix to read/use your hosts file

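By default Postfix uses DNS to resolve names. If by any chance you need Postfix to use your hosts file, you will need to set the smtp_host_lookup option in your main.cf so that it includes "native" (for example "native" alone, or "dns, native"). The "native" mechanism goes through the system's naming service (nsswitch.conf), which normally reads the hosts file.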

Postfix : 451 Error in processing Number of messages exceeds maximum per connection

If you have deferred mails with the following reason :

 451 Error in processing Number of messages exceeds maximum per connection

You might have an issue with your default_destination_recipient_limit option which is set to 50 by default.

In my case it was a bit more complicated: I had a custom transport for this destination (@hsbc.fr) with the destination_recipient_limit set to 5, and even with a setting of 2 I still had the same error coming from their servers...

The solution was to turn off on-demand SMTP connection caching, which is enabled by default.

The postfix documentation says :
"Temporarily enable SMTP connection caching while a destination has a high volume of mail in the active queue. With SMTP connection caching, a connection is not closed immediately after completion of a mail transaction. Instead, the connection is kept open for up to $smtp_connection_cache_time_limit seconds. This allows connections to be reused for other deliveries, and can improve mail delivery performance."
http://www.postfix.org/postconf.5.html#smtp_connection_cache_on_demand

"High volume of mail" is unfortunately not specified. In my case, I had to send about 6K mails; the destination_recipient_limit was effective for this destination, but Postfix reused the same SMTP connection, which triggered the HSBC mail server's limit.

Conclusion :

You can either disable SMTP connection caching globally in main.cf :

 smtp_connection_cache_on_demand = no

Or per SMTP transport in master.cf (NOT TESTED) :

 smtpslow unix  -    -    n    -    -    smtp  
 ...  
 ...  
  -o destination_recipient_limit=5  
  -o smtp_connection_cache_on_demand=no

Hope that helps !