Central Manager in HA configuration

Central Management is one of the key features that simplify a Guardium implementation and lower TCO. The ability to patch, update, reconfigure and report across hundreds of monitored databases is a strong advantage.

Guardium implements this feature by selecting one of the aggregators as the Central Manager (CM). All other units in the Guardium infrastructure communicate with it and synchronize information. However, when the CM is inaccessible this process is disrupted and normal environment management is not possible. To address these problems, Guardium introduced the CM backup feature in version 9.

It covers two main scenarios:

  • planned CM shutdown (patching, upgrade)
  • CM failure

The backup CM configuration and the switching between primary and secondary units need to be managed correctly to avoid problems on the collector and aggregator layers.

General considerations for a backup CM:

  • the main CM (primary) and the backup CM (secondary) must be accessible by all appliances in the administration domain
  • the quick search and outlier detection configuration should be checked after changes at the CM level
  • switching between CMs sometimes requires reassigning licenses

Note: Examples in this article refer to a simple Guardium infrastructure with 4 units:

  • CM Primary (cmp, 192.168.0.80)
  • CM Backup (cmb, 192.168.0.79)
  • Collector 2 (coll2, 192.168.0.82)
  • Collector 3 (coll3, 192.168.0.83)

CM Backup registration

This procedure designates one of the aggregators belonging to the Guardium management domain as the backup CM and sends this information to all units.

Only an aggregator with the same patch level as the primary CM can be defined as the backup CM. This means that the same general, hotfix, sniffer and security patches must be installed on both machines.
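
To compare patch levels from the command line rather than from the GUI, the installed patches can be listed with the Guardium CLI on each appliance and compared side by side (a quick check; the output format may vary slightly between versions):

show system patch installed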

Screenshot: Patch list on CM primary (cmp)

Screenshot: Patch list on aggregator (cmb)

The screenshots above show that both units have exactly the same patches installed. If the patch levels are not the same, the aggregator cannot be promoted to the backup CM role.

Note: Patch level refers to the same versions of the Guardium services, MySQL, Red Hat and sniffer. If one unit was patched in the sequence 1, 4, 20, 31, 34 and the second in the sequence 20, 31, 34, they are on the same patch level, because patches 1 and 4 are included in patch 20.

To designate an aggregator as the backup CM, go to Manage->Central Management->Central Management on the primary CM and click the Designate Backup CM button.

Screenshot: Central Management view (cmp)

The pop-up window displays all aggregators that are on the same patch level as the CM. Select an aggregator and click the Apply button.

Screenshot: backup CM selection (cmp)

A simple message informs that the backup CM task has started and that the process can be monitored.

Unfortunately, the “Guardium Monitor” dashboard does not exist in version 10. A simple summary of this process can be monitored in the “Aggregation/Archive Log” report, or you can create a report without any filters to see all messages.

Here is a link to the query definition – Query Definition

The same information is stored in the turbine_backup.log file on the CM:

mysql select SQLGUARD_VERSION result is 10.0
logme   act_name= 'CM Backup' act_success='1' act_comment='Starting system backup with CM_SYNC 192.168.0.80 0'  act_day_num='now()' act_dumpfile='' act_header='1' 
****** Sun May 22 10:40:00 CEST 2016 ************
Parameters: 192.168.0.80 
function do_cm_sync
---------------
write md5 to cm_sync_file.tgz.md5
scp: /opt/IBM/Guardium/scripts/scp.exp cm_sync_file.tgz aggregator@192.168.0.80:/var/IBM/Guardium/data/importdir/cm_sync_file.tgz
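
To review turbine_backup.log directly, the log files can be downloaded from the appliance with the CLI fileserver command; the workstation address below (192.168.0.100) is only an example, and the session closes after the given number of seconds:

fileserver 192.168.0.100 3600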

Synchronization can also be monitored on the backup CM aggregator in import_user_tables.log:

Sun May 22 12:56:05 CEST 2016 - Import User Tables started
unit  is secondary CM
 move /var/IBM/Guardium/data/importdir/cm_sync_file.tgz.tmp to /var/IBM/Guardium/data/importdir/cm_sync_file.tgz 
number of table in DIST_INT and DATAMART tables = 19
calling /opt/IBM/Guardium/scripts/handle_agg_tables.sh
Sun May 22 12:56:13 CEST 2016 - Handle agg tables started
Sun May 22 12:56:14 CEST 2016 - Handle agg tables finished
Sun May 22 12:56:14 CEST 2016 - Import User Tables done

Synchronization with the backup CM is repeated according to the schedule defined under Managed Unit Portal User Synchronization.

From this perspective, it is reasonable to have the synchronization repeated every few hours. In case of a planned downtime of the CM, I suggest invoking the synchronization manually using the Run Once Now button.

If the process finishes successfully, the information about the HA configuration – the IP addresses of both CMs – will be visible in the Managed Units list on all units except the backup CM.

Important: To avoid “split brain” problems, ensure that all managed units are able to refresh the list of CMs every time the IP address pair changes.

Information about the list of managed units and their health status is available on the primary CM within the Central Management view

or inside the Managed Units report.

Promoting the backup CM to primary

Note: Switching CM functionality to the secondary server is a manual task, but it can be instrumented remotely using GRDAPI.

This task can be invoked from the portal on the backup CM under Setup->Central Management->Make Primary CM

Screenshot: Confirmation of the CM promotion to primary server

or from the CLI using the GRDAPI command:

grdapi make_primary_cm
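
Since GRDAPI commands are exposed through the cli account over SSH, the same promotion could also be triggered remotely from an administration or monitoring host (a sketch, assuming SSH access to the cli user on the backup CM is permitted in your environment):

ssh cli@192.168.0.79 "grdapi make_primary_cm"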

The output from this task is located in load_secondary_cm_sync_file.log on the backup CM:

2016-05-20 22:56:11 - Import CM sync info. started
2016-05-20 22:56:11 -- invoking last user sync. 
2016-05-20 22:56:22 -- unit  is secondary CM, continue 
2016-05-20 22:56:27 -- file md5 is good, continue
2016-05-20 22:58:33 -- file decrypted successfuly, continue 
2016-05-20 22:59:10 -- file unzipped successfuly, continue 
2016-05-20 22:59:10 -- unzipped file is from version 10 beforeFox=0  
2016-05-20 22:59:28 -- Tables loaded to turbine successfully
2016-05-20 22:59:28 -- not before fox  
2016-05-20 22:59:48 - copied custom classes and stuff 
2016-05-20 22:59:50 -- Import CM sync info done

After a while, the portal on all managed units, including the promoted aggregator, will be restarted and the new location of the primary CM becomes visible (the old CM disappears from this list).

Synchronization activity will also be visible on the new CM.

The list of units on the new CM does not contain the old CM, to avoid a “split brain”.

Warning: I occasionally noticed missing licenses on the promoted CM, although all previously licensed features remained active. If the license keys disappear, they should be re-applied immediately.

Finally, the new CM has been defined and all managed units have updated this information.
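
As a quick sanity check, the role of each appliance can also be verified from the CLI; the command below is expected to report the manager role only on the unit that currently acts as the CM (the output wording may vary between versions):

show unit type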

Reconfiguring the old primary CM into the backup CM role

If the new CM promotion was made while the primary CM was active and communicating with the appliances, the old CM will stop synchronizing and its list of managed appliances will be empty.

If the promotion is related to a CM failure, the old CM will, after a restart, communicate with the new one and refresh the information about the current status of the administration domain – after a few minutes its list of managed units will be cleared too.

Guardium does not provide automatic role replacement between CMs. It requires a sequence of steps.

To remove CM functionality from the orphaned CM, the following CLI command needs to be executed:

delete unit type manager

It changes the appliance configuration to a standalone aggregator. Then we can join it to the administration domain again, but this time the domain is managed by the new CM (below is an example of the registration from the CLI on cmp):

register management <new_CM_ip_address> 8443
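
With the example addresses used in this article (cmb, 192.168.0.79, is now the primary CM), the complete sequence executed on cmp would be:

delete unit type manager
register management 192.168.0.79 8443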

Now the old CM acts as an aggregator and can be designated for the backup CM role.

Screenshot: backup CM selection

After this task, both CMs have swapped roles.

Unit patching process

Guardium administration tasks require CM displacement only in critical situations. There is no need to switch to the backup CM for standard patching (especially when hundreds of appliances would have to switch between CMs). Even if a patch forces a system reboot or stops critical services on the updated unit for several minutes, the temporary unavailability of that unit will not stop any crucial Guardium environment functions (except temporary unavailability of the managed units' portal). So a realistic patching process should look like this:

  1. patch the CM
  2. patch the backup CM
  3. synchronize the CM and the backup CM
  4. patch the other appliances in the CM administration domain.

“Split brain” situation management

A primary CM failure is not handled automatically. However, this situation is signaled on all nodes when the portal is accessed.

I suggest using your existing IT monitoring system to check the health of the CM units via SNMP or other existing Guardium interfaces, in order to identify problems faster and invoke the new CM promotion remotely via GRDAPI.
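
As a simple illustration, a monitoring system could poll a basic SNMP value such as sysUpTime on the primary CM and treat a timeout as a trigger for further checks; the community string and polling details below are assumptions and depend on the SNMP configuration of your appliance:

snmpget -v2c -c public 192.168.0.80 1.3.6.1.2.1.1.3.0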

The standard flow for managing a CM failure is:

  1. Analyze the CM failure
  2. If the system can be restored, do that instead of switching to the backup CM (especially in large environments)

If the system cannot be restored:

  1. Promote the backup CM to the primary role
  2. Set up another aggregator as the backup CM

Despite the limited portal functionality on orphaned nodes, the backup CM can also be promoted from the GUI.

I have tested two “split brain” scenarios (in a small test environment):

  • CM failure and reassignment to the backup CM
  • starting a stopped collector after the backup CM has been promoted while the old one is still unavailable

In both cases, after a few minutes the primary CM and the collector identified the situation and correctly managed their connection to the infrastructure.

Summary:

The Central Manager HA configuration is an important feature for avoiding breaks in monitoring. Its design and implementation are good; however, some issues with license management and the new quick search features should be addressed in future releases.
