Central Management is one of the key features that simplify Guardium implementation and lower TCO. The ability to patch, update, reconfigure and report across hundreds of monitored databases is a strong advantage.
Guardium implements this feature by designating one of the aggregators as the Central Manager (CM). All other Guardium infrastructure units communicate with it and synchronize information. However, CM inaccessibility disrupts this process and prevents normal environment management. To address this, Guardium introduced the CM backup feature in version 9.
It covers two main scenarios:
- planned CM shutdown (patching, upgrade)
- CM failure
CM backup configuration and switching between the primary and secondary units must be managed correctly to avoid problems at the collector and aggregator layer.
General considerations for a backup CM:
- the primary CM and the backup CM (secondary) must be accessible to all appliances in the administration domain
- quick search and outlier detection configuration should be checked after changes at the CM level
- switching between CMs sometimes requires reassigning licenses
Note: The examples in this article refer to a simple Guardium infrastructure with 4 units:
- CM Primary (cmp, 192.168.0.80)
- CM Backup (cmb, 192.168.0.79)
- Collector 2 (coll2, 192.168.0.82)
- Collector 3 (coll3, 192.168.0.83)
CM Backup registration
This procedure sets one of the aggregators belonging to the Guardium management domain as the backup CM and sends this information to all units.
Only an aggregator at the same patch level as the primary CM can be defined as the backup CM. This means that the same general, hotfix, sniffer and security patches must be installed on both machines.
The screenshots above show that both units have exactly the same patches on board. If the patch levels do not match, the aggregator cannot be promoted to the backup CM role.
Note: Patch level refers to the same version of Guardium services, MySQL, Red Hat and sniffer. If one unit was patched in the sequence 1, 4, 20, 31, 34 and the second in the sequence 20, 31, 34, they are at the same patch level, because patches 1 and 4 are included in patch 20.
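Patch parity can also be checked from the appliance CLI rather than from screenshots; assuming a standard Guardium 10 appliance, the installed patches can be listed with:

show system patch installed

Run this on both the primary CM and the candidate aggregator and compare the lists before attempting the designation.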
To designate an aggregator as the backup CM, go to Manage->Central Management->Central Management on the primary CM and click the Designate Backup CM button
A pop-up window will display all aggregators at the same patch level as the CM. Select an aggregator and click the Apply button
A short message will confirm that the backup CM task has started and that the process can be monitored
Unfortunately, the “Guardium Monitor” dashboard does not exist in version 10. A simple summary of this process can be monitored in the “Aggregation/Archive Log”, or you can create a report without any filters to see all messages.
Here is a link to the query definition – Query Definition
The same information is stored in the turbine_backup.log file on the CM
mysql select SQLGUARD_VERSION result is 10.0
logme act_name='CM Backup' act_success='1' act_comment='Starting system backup with CM_SYNC 192.168.0.80 0' act_day_num='now()' act_dumpfile='' act_header='1'
****** Sun May 22 10:40:00 CEST 2016 ************
Parameters: 192.168.0.80
function do_cm_sync
---------------
write md5 to cm_sync_file.tgz.md5
scp: /opt/IBM/Guardium/scripts/scp.exp cm_sync_file.tgz email@example.com:/var/IBM/Guardium/data/importdir/cm_sync_file.tgz
The synchronization can also be monitored on the backup CM aggregator, in import_user_tables.log
Sun May 22 12:56:05 CEST 2016 - Import User Tables started
unit is secondary CM
move /var/IBM/Guardium/data/importdir/cm_sync_file.tgz.tmp to /var/IBM/Guardium/data/importdir/cm_sync_file.tgz
number of table in DIST_INT and DATAMART tables = 19
calling /opt/IBM/Guardium/scripts/handle_agg_tables.sh
Sun May 22 12:56:13 CEST 2016 - Handle agg tables started
Sun May 22 12:56:14 CEST 2016 - Handle agg tables finished
Sun May 22 12:56:14 CEST 2016 - Import User Tables done
Synchronization with the backup CM is repeated on the schedule defined under Managed Unit Portal User Synchronization
Given this, a synchronization interval of a few hours is a reasonable setting. In case of planned CM downtime, I suggest invoking the synchronization manually using the Run Once Now button.
If the process finishes successfully, the HA configuration – the IP addresses of both CMs – will be visible in the Managed Unit list on all units except the backup CM
Important: To avoid “split brain” problems, ensure that all managed units are able to refresh the list of CMs whenever the IP address pair changes
Information about the managed units and their health status is available on the primary CM in the Central Management view
or inside the Managed Units report
Promoting the backup CM to primary
Note: Switching CM functionality to the secondary server is a manual task, but it can be instrumented remotely using GRDAPI.
This task can be invoked from the portal on the backup CM via Setup->Central Management->Make Primary CM
or from the CLI using a GRDAPI command
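As a sketch, assuming Guardium 10, the GRDAPI function for this promotion is make_primary_cm (verify the exact function name against the GRDAPI reference for your release). It is executed from the CLI of the backup CM:

grdapi make_primary_cm

Because GRDAPI calls can be issued over an SSH session to the appliance CLI, this is the hook for triggering the promotion from an external monitoring or automation system.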
Output from this task is written to load_secondary_cm_sync_file.log on the backup CM
2016-05-20 22:56:11 - Import CM sync info. started
2016-05-20 22:56:11 -- invoking last user sync.
2016-05-20 22:56:22 -- unit is secondary CM, continue
2016-05-20 22:56:27 -- file md5 is good, continue
2016-05-20 22:58:33 -- file decrypted successfuly, continue
2016-05-20 22:59:10 -- file unzipped successfuly, continue
2016-05-20 22:59:10 -- unzipped file is from version 10 beforeFox=0
2016-05-20 22:59:28 -- Tables loaded to turbine successfully
2016-05-20 22:59:28 -- not before fox
2016-05-20 22:59:48 - copied custom classes and stuff
2016-05-20 22:59:50 -- Import CM sync info done
After a while, the portal on all managed units, including the promoted aggregator, will restart, and the new location of the primary CM will be visible (the old CM disappears from this list)
and synchronization activity will be visible on the new CM
The list of units on the new CM does not contain the old CM, to avoid “split brain”
Warning: I occasionally noticed missing license keys on the promoted CM, although all previously licensed features remained active. If the keys disappear, they should be reapplied immediately
Finally, the new CM has been defined and all managed units have been updated with this information.
Reconfiguring the old primary CM into the backup CM role
If the new CM promotion was performed while the primary CM was active and communicating with the appliances, the old CM will stop synchronizing and its list of managed appliances will be empty
If the promotion was caused by a CM failure, the old CM will, after restart, communicate with the new one and refresh the current status of the administration domain – after a few minutes its list of managed units will be cleared as well.
Guardium does not provide automatic role replacement between CMs; it requires a sequence of steps.
To remove CM functionality from the orphaned CM, the following CLI command needs to be executed
delete unit type manager
This changes the appliance configuration to a standalone aggregator. Then we can join it to the administration domain again, but this time the domain is managed by the new CM (below is an example of registration from the CLI on cmp)
register management <new_CM_ip_address> 8443
Now the old CM has the aggregation function and can be delegated to the backup CM role
After this task, both CMs have reversed roles
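The role swap can be confirmed from the CLI of each appliance. Assuming the standard Guardium CLI, the current unit type (manager, aggregator, collector, standalone) is reported by:

show unit type

Run this on both cmp and cmb to verify that the manager role has actually moved before designating the old CM as the new backup.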
Unit patching process
Guardium administration tasks require switching the CM only in critical situations. There is no need to fail over to the backup CM for standard patching (especially when hundreds of appliances would have to switch between CMs). Even if a patch forces a system reboot or stops critical services on the updated unit for several minutes, the temporary unavailability of that unit will not stop any crucial Guardium environment functions (except temporary unavailability of the portal on managed units). A realistic patching process should therefore look like this:
- patch CM
- patch CM backup
- synchronize CM and CM backup
- patch other appliances in the CM administration domain.
“Split brain” situation management
A primary CM failure is not handled automatically. However, the situation is signaled on all nodes when their portal is accessed
I suggest using your existing IT monitoring system to check the health of the CM units via SNMP or other Guardium interfaces, so that problems are identified faster and the new CM promotion can be invoked remotely through GRDAPI.
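As a minimal sketch, assuming SNMP polling is enabled on the appliance and the net-snmp tools are installed on the monitoring host (the community string guardium_ro here is a hypothetical placeholder for your own), the CM can be polled for its sysUpTime:

snmpget -v 2c -c guardium_ro 192.168.0.80 1.3.6.1.2.1.1.3.0

A timeout on this query is a reasonable trigger to alert the administrator or to start the GRDAPI-based promotion of the backup CM.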
The standard flow for managing a CM failure is:
- Analyze the CM failure
- If the system can be restored, do that instead of switching to the backup CM (especially in large environments)
If the system cannot be restored:
- Promote the backup CM to the primary role
- Set up another aggregator as the backup CM
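The failure-handling sequence above can be sketched as CLI steps for the example environment (make_primary_cm is assumed to be the GRDAPI promotion function; the reconfiguration commands are the same ones used in the previous section). On the backup CM (cmb), promote it to primary:

grdapi make_primary_cm

Later, once the failed old CM (cmp) is repaired, demote it and re-register it under the new CM:

delete unit type manager
register management 192.168.0.79 8443

Finally, designate cmp as the new backup CM from the portal on cmb.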
Despite the limited portal functionality on orphaned nodes, the backup CM can also be promoted from the GUI
I have tested two “split brain” scenarios (in a small test environment):
- CM failure and reassignment to the backup CM
- starting a stopped collector after the backup CM has been promoted while the old one is still unavailable
In both cases, after a few minutes the primary CM and the collector identified the situation and correctly managed their connection to the infrastructure.
Central Manager HA configuration is an important feature for avoiding breaks in monitoring. Its design and implementation are good; however, some issues with license management and the new quick search features should be addressed in future releases.