Toolkit for Event Analysis and Logging (TEAL)
SC11, Nov 2011. James Carey, Phil Sanders
Overview
• Common HPC Event Analysis Framework
  • Combined best aspects and lessons learned from BlueGene ELA and Federation (proprietary network) ELA
  • Addressed new p775 requirements
• Common Event Repository
  • First release priorities: CNM, Service Focal Point (HMC), PNSD, LL, GPFS
• Analysis of Events to create Alerts
  • Rules based engine
  • Flexible alert delivery, for example external program calls and e-mail
• Real-time Analysis and Historic Analysis
  • Real-time to be pro-active and react immediately to events
  • Historical allows for deeper debug on-site and off-site
• Robust framework to prevent loss of alerts and events
  • Handles event flooding
  • Checkpoint/Shutdown/Restart
• Open Source (pyteal.sourceforge.net)
  • Mainly Python but also C/C++ and Perl
  • Using ODBC
TEAL Concepts
[Diagram: components and their connectors, the Event Log (table in DB), a semaphore, the monitor, event analyzers, alert analyzers, the Alert Log (table in DB), alert filters, and alert listeners. Three teal.conf files are shown.]
DB Access
[Diagram: the TEAL Concepts pipeline with CNM as a component; the Event Log and Alert Log tables sit behind a database interface.]
• DBInterface
  • get_connection
  • gen_*
  • insert
  • select
  • …
• Databases shown: DB2, mySQL, future
Configuration and Plug-ins
[Diagram: the TEAL Concepts pipeline, annotated with the plug-in types available at each stage.]
• Alert filters
  • Duplicate
  • Analyzer producing
  • Noise
  • User defined
• Analyzers
  • GEAR
  • User defined
• Alert listeners
  • File
  • SMTP
  • RMC
  • External Call
  • User defined
• Configuration
  • Stanza-based
  • Used during startup (/etc/teal)
  • Separate files per package (teal.conf => base framework features)
  • Configures processing pipeline
  • Additional parameters for specialized function
  • Enabled in different modes
• Example stanza:
    [alert_listener.RmcAlertListener]
    class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
    enabled = false
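Because the configuration is stanza-based, an INI-style parser is enough to read it. The following is a minimal sketch using Python's standard `configparser`, not TEAL's own loader. The helper names `load_plugins` and `is_enabled` are hypothetical, and the meaning given to the `enabled` values (`all`, `false`, or a mode name such as `realtime`) is an assumption drawn only from the values that appear on the slides.

```python
import configparser

# Stanza copied from the slide.
SAMPLE_CONF = """
[alert_listener.RmcAlertListener]
class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
enabled = false
"""

def load_plugins(text):
    """Return {(kind, name): options} for each stanza (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    plugins = {}
    for section in parser.sections():
        # "alert_listener.RmcAlertListener" -> ("alert_listener", "RmcAlertListener")
        kind, _, name = section.partition(".")
        plugins[(kind, name)] = dict(parser[section])
    return plugins

def is_enabled(options, mode):
    """Assumed rule: 'all' enables in every mode, otherwise the value must equal the mode."""
    value = options.get("enabled", "false").strip().lower()
    return value == "all" or value == mode

plugins = load_plugins(SAMPLE_CONF)
rmc = plugins[("alert_listener", "RmcAlertListener")]
print(rmc["class"])
print(is_enabled(rmc, "realtime"))
```

Keeping the stanza kind and the plug-in name as separate keys makes it easy to build each stage of the pipeline (analyzers, filters, listeners) from the same file.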
Network Management Usage
[Diagram: the TEAL Concepts pipeline with CNM and SFP as components, plus Rules and Init inputs, annotated with the plug-in types at each stage.]
• Alert filters
  • Duplicate
  • Analyzer producing
  • Noise
  • External Call
  • User defined
• Analyzers
  • GEAR
  • User defined
• Alert listeners
  • File
  • SMTP
  • RMC
  • External Call
  • User defined
• CNM stanzas:
    [event_analyzer.CnmEventAnalyzer]
    class = ibm.teal.analyzer.gear.event_analyzer.GearEventAnalyzer
    enabled = all
    rules = ibm/isnm/xml/CNM_GEAR_rule.xml

    [alert_filter.CnmAlertFilter]
    class = ibm.teal.filter.alert_filter_analyzer_name.AlertFilterAnalyzerName
    enabled = all
    when = not_from_analyzer
    analyzer_names = CnmEventAnalyzer

    [alert_listener.CnmAlertListener]
    class = ibm.isnm.cnm_alert_listener.CnmAlertListener
    enabled = realtime
    filters = CnmAlertFilter
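One plausible reading of these stanzas is that the CNM listener only receives alerts produced by `CnmEventAnalyzer`, because the attached filter discards alerts that are "not from" that analyzer. The sketch below illustrates that reading only. The `Alert` class, its field names, and the functions `make_filter` and `deliver` are hypothetical and are not part of the TEAL API.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    alert_id: str
    analyzer_name: str  # analyzer that produced the alert (assumed field)

def make_filter(when, analyzer_names):
    """Build a predicate that returns True when an alert should be dropped.

    Assumed semantics: 'not_from_analyzer' drops every alert whose producing
    analyzer is not listed in analyzer_names.
    """
    names = {name.strip() for name in analyzer_names.split(",")}
    if when != "not_from_analyzer":
        raise ValueError(f"unsupported 'when' value in this sketch: {when}")
    return lambda alert: alert.analyzer_name not in names

def deliver(alerts, filters):
    """Return the alerts that survive every filter attached to a listener."""
    return [a for a in alerts if not any(f(a) for f in filters)]

cnm_filter = make_filter("not_from_analyzer", "CnmEventAnalyzer")
alerts = [Alert("A1", "CnmEventAnalyzer"), Alert("A2", "OtherAnalyzer")]
delivered = deliver(alerts, [cnm_filter])
print([a.alert_id for a in delivered])
```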
Locations
• Points to a specific event location
• Can be physical, logical or a mixture of both
• Is hierarchical in nature
  • Simple: one type of item per level
  • Complex: multiple types of items per level
• Operations
  • Scoping
  • Validation
  • Casting (platform specific)
• XML-based description
• Specified in config file:
    [location.Location]
    config = ibm/teal/xml/locations.xml
Location Code Examples
• Complex
  • Compact ID
  • Optional Instance Values
  • Example: H:FR008-CG03-SN000-DR0
  [Diagram: location hierarchy tree with the compact IDs FR, CG, SN, DR, HB, LL, OM, HF, LR, LD, RM]
• Simple
  • Hierarchy innate in description
  • Example: <node>-<program>-<pid>
    • comp01-firefox-1234
    • comp01-vncserver-4567
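The scoping operation on a complex location code can be sketched as a truncation of the hierarchy at a chosen level. This is an illustration only, not the TEAL Location class: it assumes every level starts with a two-letter compact ID followed by an instance value, and the function names `parse_location` and `scope_to` are hypothetical.

```python
def parse_location(code):
    """Split 'H:FR008-CG03-SN000-DR0' into ('H', [('FR', '008'), ('CG', '03'), ...])."""
    loc_type, _, path = code.partition(":")
    levels = []
    for part in path.strip().split("-"):
        # Assumption: the first two characters are the compact ID of the level.
        levels.append((part[:2], part[2:]))
    return loc_type.strip(), levels

def scope_to(code, level_id):
    """Return the location truncated at the given level, e.g. the cage (CG)."""
    loc_type, levels = parse_location(code)
    kept = []
    for comp_id, value in levels:
        kept.append(comp_id + value)
        if comp_id == level_id:
            return loc_type + ":" + "-".join(kept)
    raise ValueError(f"level {level_id} not present in {code}")

print(scope_to("H:FR008-CG03-SN000-DR0", "CG"))
```

Two events can then be compared "at the same cage" by comparing their scoped codes for equality.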
Events
• Event id: BD700025
  • Unique identifier of the event, usually within the scope of the source component
  • 8 characters
• Source component: CNM
  • Component that is the source of the event
  • 128 characters
• Source location: H: FR052-CG07-SN004-DR0-HB0-OM14-LD14
  • Location of the source of the event
• Report component, Report location
  • Who actually reported the event
• Time occurred
  • Time provided by the connector: when the event occurred
• Time logged
  • Time provided by TEAL: when the event was added to the event log
• Raw data: 1024 characters
  • Free form extra data
  • Can also be put into a separate table for easier access (extended data support)
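The event fields above map naturally onto a small record type. The sketch below is illustrative and is not the TEAL event class. The field names follow the `tllsevent` output shown at the end of the deck where possible, `raw_data` is an assumed name, and the character counts from the slide are treated as maximum lengths, which is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Limits taken from the slide; treating them as maximums is an assumption.
MAX_EVENT_ID = 8
MAX_SRC_COMP = 128
MAX_RAW_DATA = 1024

@dataclass
class Event:
    event_id: str
    src_comp: str
    src_loc: str
    time_occurred: datetime                 # provided by the connector
    rpt_comp: Optional[str] = None          # who actually reported the event
    rpt_loc: Optional[str] = None
    time_logged: Optional[datetime] = None  # set when added to the event log
    raw_data: str = ""                      # free form extra data

    def __post_init__(self):
        if len(self.event_id) > MAX_EVENT_ID:
            raise ValueError("event_id longer than 8 characters")
        if len(self.src_comp) > MAX_SRC_COMP:
            raise ValueError("src_comp longer than 128 characters")
        if len(self.raw_data) > MAX_RAW_DATA:
            raise ValueError("raw_data longer than 1024 characters")

event = Event(
    event_id="BD700025",
    src_comp="CNM",
    src_loc="H:FR052-CG07-SN004-DR0-HB0-OM14-LD14",
    time_occurred=datetime(2011, 8, 1, 14, 52, 14),
)
print(event.event_id, event.src_comp)
```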
Event Analyzers with the Generic Engine for Analysis Rules (GEAR)
• Location centric analysis
  • Looking for events occurring in the same or unique locations
  • Specified by scoping locations
  • For example:
    • Look at events happening in the same cage (CG)
    • H:FR008-CG03-SN000-DR0 scoped to CG is H:FR008-CG03
• if <condition> then <action>
• Condition
  • Event equals: match a specific event
  • Event occurred: a specific event occurs some number of times
  • All events: a set of events must all occur
  • Any events: a subset of events must occur
  • Evaluate: an external call
  • Logical operators (and/or)
• Action
  • Suppress events: don't use them to create alerts
  • Create alerts
    • GEAR supports plug-in to alert initialization
  • Execute: external call
Condition
    <condition>
      <any_events num="2"
                  ids="LinkDown, LinkHang, LinkFail"
                  comp="test_comp"
                  unique_id="true"
                  locations="GEAR[src_loc], GEAR[ext.neighbor_loc]"
                  location_match="identical"
                  scope="H:drawer"
                  unique_instance="true"
                  instance_scope="H:hub" />
    </condition>
• num: minimum number of unique events needed
• ids: event ids to consider
• comp: source component for event ids (can be defaulted for all rules)
• unique_id: only true if events have unique ids
• locations: locations from each event to consider
• location_match: how to partition locations. Identical = must be at same location
• scope: scope to use when considering location_match
• unique_instance: only true if unique instances (where instance is location)
• instance_scope: scope to use when considering an instance
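To make the attribute list concrete, the condition can be read with Python's standard `xml.etree.ElementTree`. This is a sketch of parsing the XML from the slide into plain Python values. It is not GEAR's rule loader, and the function name `read_any_events` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Condition copied from the slide, with straight quotes.
CONDITION_XML = """
<condition>
  <any_events num="2"
              ids="LinkDown, LinkHang, LinkFail"
              comp="test_comp"
              unique_id="true"
              locations="GEAR[src_loc], GEAR[ext.neighbor_loc]"
              location_match="identical"
              scope="H:drawer"
              unique_instance="true"
              instance_scope="H:hub" />
</condition>
"""

def split_list(value):
    """Turn a comma-separated attribute into a list of trimmed strings."""
    return [item.strip() for item in value.split(",")]

def read_any_events(xml_text):
    """Return the any_events attributes as typed Python values."""
    node = ET.fromstring(xml_text.strip()).find("any_events")
    attrs = node.attrib
    return {
        "num": int(attrs["num"]),
        "ids": split_list(attrs["ids"]),
        "comp": attrs["comp"],
        "unique_id": attrs["unique_id"] == "true",
        "locations": split_list(attrs["locations"]),
        "location_match": attrs["location_match"],
        "scope": attrs["scope"],
        "unique_instance": attrs["unique_instance"] == "true",
        "instance_scope": attrs["instance_scope"],
    }

condition = read_any_events(CONDITION_XML)
print(condition["num"], condition["ids"], condition["scope"])
```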
Example System: FRxxx-DRxx-HBxx-PTxx
[Diagram: frames (FR000, FR001, FR123) containing drawers (DR00, DR01); each drawer holds hubs HB00 to HB02, and each hub has ports 0 and 1.]
Example System: location_match and scope
[Diagram: the example system of frames, drawers, hubs and ports.]
• location_match =
  • identical: only consider events on the same drawer
  • unique: must get events from a unique drawer
  • ignore: don't consider location
• scope = H:drawer: what level to consider
Example System: Locations
[Diagram: the example system with two kinds of events marked, "Link Fails" and "Link Goes Down".]
• locations="GEAR[src_loc], GEAR[ext.neighbor_loc]"
• Each event considered twice
Example System: ids = LinkDown, LinkHang, LinkFail
[Diagram: the example system with two kinds of events marked, "Link Fails" and "Link Goes Down".]
Example System: location_match = identical and num = 2
[Diagram: the example system with two kinds of events marked, "Link Fails" and "Link Goes Down".]
• locations="GEAR[src_loc], GEAR[ext.neighbor_loc]"
• Each event considered twice
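The combination of `location_match = identical`, `scope = H:drawer` and `num = 2` can be sketched as grouping events by drawer and keeping the drawers that collect at least two events. Each event contributes both its source location and its neighbor location, which is what "each event considered twice" refers to. The code below is an illustration under stated assumptions, not GEAR's implementation: events are plain dicts with hypothetical keys, locations follow the FRxxx-DRxx-HBxx-PTxx layout of the example system, and an event whose two locations fall in the same drawer is counted once for that drawer.

```python
from collections import defaultdict

def scope_to_drawer(location):
    """Keep the levels up to and including DR (assumes FRxxx-DRxx-HBxx-PTxx)."""
    kept = []
    for part in location.split("-"):
        kept.append(part)
        if part.startswith("DR"):
            break
    return "-".join(kept)

def any_events_identical(events, ids, num):
    """Return {drawer: [event indexes]} for drawers with at least num matching events."""
    by_drawer = defaultdict(set)
    for index, event in enumerate(events):
        if event["event_id"] not in ids:
            continue
        # Each event is considered at its source and at its neighbor location.
        for loc in (event["src_loc"], event["neighbor_loc"]):
            by_drawer[scope_to_drawer(loc)].add(index)
    return {d: sorted(i) for d, i in by_drawer.items() if len(i) >= num}

events = [
    {"event_id": "LinkDown", "src_loc": "FR000-DR00-HB00-PT00",
     "neighbor_loc": "FR000-DR01-HB01-PT01"},
    {"event_id": "LinkFail", "src_loc": "FR000-DR01-HB02-PT00",
     "neighbor_loc": "FR001-DR00-HB00-PT01"},
    {"event_id": "LinkHang", "src_loc": "FR001-DR01-HB00-PT00",
     "neighbor_loc": "FR001-DR01-HB01-PT00"},
]
matches = any_events_identical(events, {"LinkDown", "LinkHang", "LinkFail"}, num=2)
print(matches)
```

In this made-up data only drawer FR000-DR01 is touched by two different events, so it is the only drawer for which the condition holds.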
Example System: unique_id = true
[Diagram: the example system with two kinds of events marked, "Link Fails" and "Link Goes Down".]
• On FR001 both events are the same event id
Example System: unique_instance = true and instance_scope = H:hub
[Diagram: the example system with two kinds of events marked, "Link Fails" and "Link Goes Down".]
• On FR000 all events are reported on the same hub
• The condition is FALSE
• If there had been another event on a different hub on FR000-DR00, the condition would have been TRUE
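The `unique_instance` check can be sketched as counting distinct hubs per drawer rather than counting events. This is an illustration only. The function names are hypothetical, the locations follow the FRxxx-DRxx-HBxx-PTxx layout of the example system, and the sample locations are made up to mirror the situation described on the slide.

```python
from collections import defaultdict

def scope_to(location, level_prefix):
    """Truncate a location after the first level starting with level_prefix."""
    kept = []
    for part in location.split("-"):
        kept.append(part)
        if part.startswith(level_prefix):
            break
    return "-".join(kept)

def unique_instance_met(locations, num):
    """True if some drawer has events on at least num different hubs."""
    hubs_by_drawer = defaultdict(set)
    for loc in locations:
        hubs_by_drawer[scope_to(loc, "DR")].add(scope_to(loc, "HB"))
    return any(len(hubs) >= num for hubs in hubs_by_drawer.values())

# All events on the same hub of FR000-DR00: the condition is not met.
same_hub = ["FR000-DR00-HB01-PT00", "FR000-DR00-HB01-PT01"]
print(unique_instance_met(same_hub, num=2))

# One more event on a different hub of the same drawer: the condition is met.
print(unique_instance_met(same_hub + ["FR000-DR00-HB02-PT00"], num=2))
```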
Actions
• Action called for each set of events that make the condition true
  • Some intelligent consolidation of permutations
  • For example, if location match is identical, the rule looks for 2 events (num=2), and the condition is fulfilled by 10 events at the same scope (drawer), the action is called once with 10 events rather than once for each permutation of 2 events. It does not combine drawers.
• Event suppression
  • Condition is checked ignoring other suppressions
  • Suppression can be for one or more events at one or more locations
  • For example, a link failed event could suppress link retry events for the same port
• Alert creation
  • Condition is checked only with events that are not suppressed
  • Alert is associated with the events that caused it to be created: Condition Events
  • Alert prioritization and duplicate check are done after this
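The two-step behavior described above, suppression first and then one consolidated action call per drawer, can be sketched as follows. This is not GEAR's implementation. Events are plain dicts with hypothetical keys, the event id `LinkRetry` is an invented name for the "link retry" event mentioned on the slide, and locations follow the FRxxx-DRxx-HBxx-PTxx layout of the example system.

```python
from collections import defaultdict

def drawer_of(location):
    """Keep the levels up to and including DR."""
    kept = []
    for part in location.split("-"):
        kept.append(part)
        if part.startswith("DR"):
            break
    return "-".join(kept)

def apply_suppression(events):
    """Drop link retry events on ports that also reported a link failure."""
    failed_ports = {e["src_loc"] for e in events if e["event_id"] == "LinkFail"}
    return [e for e in events
            if not (e["event_id"] == "LinkRetry" and e["src_loc"] in failed_ports)]

def action_calls(events, num):
    """One call per drawer holding at least num events; drawers are never combined."""
    by_drawer = defaultdict(list)
    for e in events:
        by_drawer[drawer_of(e["src_loc"])].append(e)
    return [group for group in by_drawer.values() if len(group) >= num]

events = [{"event_id": "LinkFail", "src_loc": f"FR000-DR00-HB0{i % 3}-PT0{i % 2}"}
          for i in range(10)]
events.append({"event_id": "LinkRetry", "src_loc": "FR000-DR00-HB00-PT00"})
events.append({"event_id": "LinkDown", "src_loc": "FR001-DR01-HB00-PT00"})
events.append({"event_id": "LinkDown", "src_loc": "FR001-DR01-HB01-PT00"})

remaining = apply_suppression(events)
calls = action_calls(remaining, num=2)
print([len(group) for group in calls])
```

The ten failures on FR000-DR00 produce a single call carrying all ten events, and the two events on FR001-DR01 produce a separate call.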
What's next?
• Targeting additional platforms
• SNMP event analysis
• IB monitoring and reporting
• Alert Analyzers
• GPFS connectors and analysis
• GUI
• Tooling
• State based event analysis
References
• TEAL Sourceforge Project: http://pyteal.sourceforge.net
  • Command reference
  • Install/Configuration Instructions
  • Design Overview & other goodies
  • Mailing List
  • Problem Tickets
• xCAT HPC Software Installation
  • http://sourceforge.net/apps/mediawiki/xcat/index.php?title=IBM_HPC_Stack_in_an_xCAT_Cluster
  • Loadleveler
  • GPFS
  • RSCT/RMC
• Cluster Guide
  • https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+HPC+Clustering+with+Power+775+-+Cluster+Guide
• Cluster Service Pack readme
  • https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+High+Performance+Computing+Clusters+Service+Packs
Example CNM Event
    >[c250mgrs52]>/opt/teal/bin/tllsevent -f text -q "event_id=BD700025" -e
    rec_id : 22877
    event_id : BD700025 - D Link Port Down
    time_occurred : 2011-08-01 14:52:14
    time_logged : 2011-08-01 14:52:14.369687
    src_comp : CNM
    src_loc : FR052-CG07-SN004-DR0-HB0-OM14-LD14
    src_loc_type : H
    rpt_comp : CNM
    rpt_loc : c250mgrs52##cnmd
    rpt_loc_type : A
    event_cnt : None
    elapsed_time : None
    ext.eed_loc_info : c250mgrs52:/var/opt/isnm/cnm/log
    ext.encl_mtms : 9125-F2C/028B596
    ext.global_counter : None
    ext.isnm_raw_data : REG_BEGIN ISR_GLOBAL_COUNTER_REGISTER = 0x000005347ecda480 ISR_ID_REGISTER = 0x004800d01c000000 ISR_D14D15_FIR = 0x4000000000000000 D_PORT_14_SEND_NEIGHBOR_ID = 0x000800d01ee00000 OLL_LLD14_LINK_STATUS = 0xc1d6000100000000 REG_END
    ext.local_om1 : U78A9.001.30CM002-P1-R2-R1,52Y3020,YA193P407777,ABC122,TRMD
    ext.local_om2 :
    ext.local_planar : U78A9.001.30CM002-P1,74Y0601,YH10HA0BH002,ABC122,2E00
    ext.local_port : U78A9.001.30CM002-P1-T17-T7
    ext.local_torrent : U78A9.001.30CM002-P1-R2,52Y3020,YA193P407777,ABC123,TRMD
    ext.nbr_om1 : U78A9.001.30CK001-P1-R2-R4,52Y3020,YA193P399201,ABC123,TRMD
    ext.nbr_om2 :
    ext.nbr_planar : U78A9.001.30CK001-P1,74Y0601,YH10HA0BJ003,ABC123,2E00
    ext.nbr_port : U78A9.001.30CK001-P1-T15-T8
    ext.nbr_torrent : U78A9.001.30CK001-P1-R2,52Y3020,YA193P399201,ABC123,TRMD
    ext.neighbor_loc : H: FR052-CG04-SN006-DR0-HB0-OM11-LD11
    ext.pwr_ctrl_mtms : 78AC-100BC50052
    ext.recovery_file_path : /var/opt/isnm/cnm/log
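Text output of this shape is easy to post-process. The sketch below parses a few of the fields shown above into a dictionary. It assumes the text format prints one `key : value` pair per line, which is inferred from the slide rather than from the `tllsevent` documentation, and the function name `parse_event_text` is hypothetical.

```python
from datetime import datetime

# A subset of the fields from the slide, one "key : value" pair per line (assumed layout).
SAMPLE = """\
rec_id : 22877
event_id : BD700025 - D Link Port Down
time_occurred : 2011-08-01 14:52:14
src_comp : CNM
src_loc : FR052-CG07-SN004-DR0-HB0-OM14-LD14
src_loc_type : H
ext.eed_loc_info : c250mgrs52:/var/opt/isnm/cnm/log
ext.local_om2 :
ext.neighbor_loc : H: FR052-CG04-SN006-DR0-HB0-OM11-LD11
"""

def parse_event_text(text):
    """Split each line at its first colon into a stripped key and value."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record

record = parse_event_text(SAMPLE)
# The event id is followed by a short description after " - ".
event_id = record["event_id"].split(" - ", 1)[0]
occurred = datetime.strptime(record["time_occurred"], "%Y-%m-%d %H:%M:%S")
print(event_id, occurred.isoformat(), record["ext.neighbor_loc"])
```

Splitting at the first colon keeps values that themselves contain colons, such as timestamps and `host:/path` strings, intact.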