CSM support for Blue Gene/P


Presentation Transcript


  1. CSM support for Blue Gene/P
     CSM 1.7.0, Line item 0XR
     Skills Transfer Materials
     by Marty Fullam, fullam@us.ibm.com

  2. What's a Blue Gene?
     [Diagram: File Servers; I/O Nodes (1024); 1 Gigabit Ethernet; Front End Nodes; Compute Nodes (65,536); Service Node; DB2]
     • It's IBM's flagship supercomputer offering. Its official name is the "IBM System Blue Gene Solution". See http://www-03.ibm.com/servers/deepcomputing/bluegene.html
     • It looks like this: [pictures: Blue Gene/L or Blue Gene/P]
     • The Service Node is the administration focal point of a Blue Gene. Among other things, it maintains a DB2 database of configuration, RAS, and environmental data. The Blue Gene system administrator hangs out here.
     • The Front End Nodes are used for compiling & submitting Blue Gene jobs. End users hang out here.
     • The File Servers serve files to the other systems.
     • The I/O and Compute Nodes (aka the Blue Gene core) run the user jobs.
     • All systems are POWER systems. The Service Node, Front End Nodes, and File Servers run SLES. The I/O and Compute Nodes run custom operating systems.

  3. CSM / Blue Gene Topology 1: Blue Gene with CSM to monitor the Blue Gene DB
     [Diagram: Blue Gene Service Node / CSM management server; Blue Gene Front End Nodes; Blue Gene core (I/O and Compute Nodes); Blue Gene File Servers]
     • Just install a CSM management server on your Blue Gene Service Node, and then add the CSM Blue Gene support. Define no CSM nodes.
     • Notice that there is no CSM cluster per se, just a management server. And even though the I/O and Compute Nodes of the Blue Gene core are not managed by CSM, you will still be able to monitor them.
     • This topology is new this release! We call it Stand-alone CSM Blue Gene monitoring support.

  4. CSM / Blue Gene Topology 2: Blue Gene with CSM to monitor the Blue Gene DB, and to manage the Service Node, Front End Nodes, and File Servers
     [Diagram: CSM Cluster containing the Blue Gene Service Node (CSM management server and CSM managed node), Blue Gene Front End Nodes (CSM managed nodes), and Blue Gene File Servers (CSM managed nodes); Blue Gene core (I/O and Compute Nodes)]
     • Just add CSM to the Blue Gene systems you have. Pick any system to be the management server (though it's probably most typical to use the Service Node).
     • Define your Service Node, Front End Nodes, and File Servers as CSM nodes.
     • Notice that the I/O and Compute Nodes of the Blue Gene core are not managed by CSM (mainly because they are not general-purpose Linux systems, and don't need to be burdened with CSM and RSCT software). However, as you will see, you will still be able to monitor them.
     • We call this topology Full CSM plus Blue Gene monitoring support.

  5. CSM / Blue Gene Topology 3: Blue Gene as part of a larger CSM cluster
     [Diagram: CSM Cluster containing the CSM management server, other CSM managed nodes, and the IBM eServer Blue Gene Solution: Blue Gene Service Node (CSM managed node), Blue Gene Front End Nodes (CSM managed nodes), Blue Gene File Servers (CSM managed nodes), and Blue Gene core (I/O and Compute Nodes)]
     • Here, the management server is a system outside of the Blue Gene solution.
     • And while the Blue Gene Service Node, Front End Nodes, and File Servers are configured as managed nodes in the CSM cluster, they are not the only managed nodes; there can be lots of others completely unrelated to the Blue Gene.
     • This topology, like Topology 2, is Full CSM plus Blue Gene monitoring support.

  6. CSM support for Blue Gene
     • If you use Full CSM plus Blue Gene monitoring support (Topology 2 or 3 in previous charts), use existing CSM and RSCT function to help manage the Blue Gene Service Node, Front End Nodes, and File Servers. After all, they're just SLES POWER systems. Nothing new here. Use any or all CSM function.
     • For Stand-alone CSM Blue Gene monitoring support (Topology 1) or Full CSM plus Blue Gene monitoring support (Topology 2 or 3), also use the optional rpm, csm.bluegene, which gives the system administrator the ability to monitor, effectively, the Blue Gene core using standard CSM monitoring capabilities (ERRM conditions and responses). Actually, what we provide is the ability to monitor the Blue Gene DB2 database where the Service Node is continually writing RAS, configuration, and environmental data about the Blue Gene core.

  7. csm.bluegene package
     • An optional part of CSM (only customers with a Blue Gene would care about it!)
     • Used on the CSM management server (AIX, or Linux i386 or ppc64) and, in the Full CSM plus Blue Gene monitoring support case, on the Blue Gene Service Node / CSM managed node (SLES ppc64) too, so it is present in the csm-aix-1.7.x.x and csm-linux-1.7.x.x tarballs and on the CDs.
     • If you want to use it, manual install is required on the management server (installp or geninstall on AIX, rpm -i on Linux), followed by these additional setup steps...
     • Stand-alone CSM Blue Gene monitoring support case:
         • Run bgsetupmon on the management server.
     • Full CSM plus Blue Gene monitoring support case:
         • On a non-SLES ppc64 management server, use the copycsmpkgs -n service_node command to copy the CSM SLES ppc64 packages from the CSM for Linux ppc64 CD (or from the expanded tarball) to the /csminstall directory. (This is not necessary on a SLES ppc64 management server because installms will have already copied the packages to the /csminstall directory.)
         • The Blue Gene Service Node must be configured as a CSM managed node (whether or not it is also the CSM management server), and it must have the autoupdate package installed.
         • The IBM.ManagedNode "Properties" attribute of the Blue Gene Service Node must include "BlueGeneNodeType|:|ServiceNode".
         • Then you must run bgsetupms on the management server, followed by installnode -n service_node or updatenode -n service_node. (During this install or update, SMS installs csm.bluegene on the Service Node.)

  8. What's in the csm.bluegene package? (1 of 2)
     It contains management server-specific files and Service Node-specific files (even though all files get installed on both types of systems).
     Management server files:
     • /opt/csm/bin/bgsetupmon - the end-user command used to set up Stand-alone CSM Blue Gene monitoring support on the management server.
     • /opt/csm/bin/bgsetupms - the end-user command used to set up Full CSM plus Blue Gene monitoring support on the management server.
     • /opt/csm/install/resources/bluegene.ms/IBM.Nodegroup/BlueGeneServiceNodes.pm et al - a set of predefined nodegroups created when bgsetupms calls mkresources.
     • /opt/csm/install/resources/bluegene.ms/IBM.Condition/*l - a set of predefined ERRM conditions created when bgsetupmon or bgsetupms calls mkresources.
     • /opt/csm/csmbin/bgsetupsn - a post-install customization script that sets up Blue Gene support on the Service Node (in the Full CSM plus Blue Gene monitoring support case only). It runs on the Service Node (via a mount of the management server's /csminstall directory) when installnode -n service_node or updatenode -n service_node is run. It gets called by csmfirstboot or updatenode.client, respectively. (Note: /opt/csm/csmbin is the installed location, but bgsetupms copies the script to /csminstall/csm/scripts and then creates a couple of symbolic links named 500CSM_bgsetupsn.BlueGeneServiceNodes in /csminstall/csm/scripts/update and /csminstall/csm/scripts/installpostreboot. It is one of these symbolic links that is used.)

  9. What's in the csm.bluegene package? (2 of 2)
     Service Node files:
     • /opt/csm/bin/bgmksensor - the end-user command used to create Blue Gene-specific sensors to monitor the Service Node's DB2 database for events of interest. (Keep in mind, though, that we do ship a number of predefined Blue Gene sensors, and they may be sufficient for all the monitoring the user cares to do. So this command is not necessarily used.)
     • /opt/csm/install/resources/bluegene.sn/IBM.Sensor/* - a set of predefined sensors (created when bgsetupmon or bgsetupsn calls mkresources).
     • /opt/csm/csmbin/bgmanage_trigger - an internally used command called by Blue Gene sensors to create or drop DB2 triggers and sequences as necessary.
     • /opt/csm/csmbin/bgrun_dbcmds - an internally used command called by bgmksensor and bgmanage_trigger to run db2 commands.
     • /opt/csm/lib/bgrefresh_sensor.so - an internally used shared library called by the DB2 stored procedure that bgrun_dbcmds creates. It uses RMC's runact-api to call Blue Gene sensors' SetValues() routine.
     • /opt/csm/pm/BlueGeneUtils.pm - a set of utilities used by the various scripts.

  10. bgsetupmon
     • /opt/csm/bin/bgsetupmon is run on the CSM management server in the Stand-alone CSM Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.
     bgsetupms
     • /opt/csm/bin/bgsetupms is run on the CSM management server in the Full CSM plus Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.

  11. bgmksensor
     • /opt/csm/bin/bgmksensor is run on the Blue Gene Service Node, if used at all. It is used to create custom IBM.Sensor resources used in monitoring the Blue Gene database. It is a higher level command than SensorRM’s mksensor command. A comparison of their usage statements highlights how different the commands are:
         mksensor [-n host] [-i seconds] [-c n] [-e 0 | 1 | 2] [-u user-ID] [-h] [-v | -V] sensor_name ["]sensor_command["]
         bgmksensor -t table -o {d | i | u} [-w column[,...]] [-x "event_expression"] [-p column[,...]] [-T table] [-O {d | i | u}] [-W column[,...]] [-X "rearm_expression"] [-P column[,...]] [-h] [-v | -V] sensor_name
       Think of bgmksensor as a wrapper to mksensor; both define an IBM.Sensor resource, but bgmksensor does so much more. In fact, bgmksensor hard-codes most sensor options and is more concerned with providing options related to the Blue Gene DB2 tables, operations, columns, and values that you want to monitor.
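To make the bgmksensor usage statement a little more concrete, here is a minimal Python sketch that assembles an argument list from the required options (-t, -o, sensor_name) and the optional -x and -X expressions. The helper name build_bgmksensor_argv, the sensor name MyNodeErr, and the expression text are hypothetical. Reading d, i, and u as delete, insert, and update is an inference from the monitoring slides, not something the usage statement spells out. The sketch only builds the list; it does not run the command.

```python
VALID_OPERATIONS = {"d", "i", "u"}  # assumed meaning: delete, insert, update


def build_bgmksensor_argv(sensor_name, table, operation,
                          event_expression=None, rearm_expression=None):
    """Assemble an argv list following the bgmksensor usage statement.

    Only -t, -o, -x, -X and sensor_name are covered; the other flags in
    the usage statement are left out of this sketch.
    """
    if operation not in VALID_OPERATIONS:
        raise ValueError("operation must be one of d, i, u")
    argv = ["/opt/csm/bin/bgmksensor", "-t", table, "-o", operation]
    if event_expression is not None:
        argv += ["-x", event_expression]
    if rearm_expression is not None:
        argv += ["-X", rearm_expression]
    argv.append(sensor_name)  # sensor_name comes last in the usage statement
    return argv


if __name__ == "__main__":
    # Hypothetical sensor watching inserts into TBGLNode.
    print(build_bgmksensor_argv("MyNodeErr", "TBGLNode", "i",
                                event_expression="STATUS = 'M'"))
```

Building the list separately from running it keeps quoting of the expression out of the shell's hands if the list is later passed to something like subprocess.run.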

  12. Monitoring Overview
     • The Blue Gene Service Node routinely writes to its DB2 database all types of RAS, configuration, and environmental data related to the Blue Gene core (the I/O and Compute Nodes, the midplanes, the various interconnects, power supplies, fans, etc.). And this happens whether or not CSM is in the picture.
     • The CSM support for Blue Gene gives you a way to ‘watch’ the database for inserts, updates, and deletes that you deem important, and generate RMC events for them.
     • The resulting RMC events will drive the ERRM responses you specify.
     • The charts that follow show a monitoring flow example. Step through them to see what’s involved, and what happens when...

  13. Monitoring Flow Example (1 of 7)
     [Diagram: Service Node (Blue Gene software, DB2 database); Management Server; Blue Gene core]

  14. Monitoring Flow Example (2 of 7)
     [Diagram: Service Node (Blue Gene software, DB2 database); Management Server; Blue Gene core. Steps: 1. node error; 2. insert]
     Recording of Blue Gene core events in DB2 occurs continually, and occurs whether or not CSM is present.

  15. Monitoring Flow Example (3 of 7)
     [Diagram: Service Node (Blue Gene software, DB2 database); Management Server; Blue Gene core; BGNodeErr Sensor; BGNodeErr Condition (upon BGNodeErr Sensor change, is SD.Uint32 > 0?); “E-mail root anytime” Response; bgmanage_trigger; bgrefresh_sensor.so shared library]
     When CSM and csm.bluegene are installed, there are various predefined Sensors, Conditions, Responses, and commands available.

  16. Monitoring Flow Example (4 of 7)
     [Diagram: Service Node (Blue Gene software, DB2 database); Management Server; Blue Gene core; BGNodeErr Sensor; BGNodeErr Condition; “E-mail root anytime” Response; bgmanage_trigger; bgrefresh_sensor.so shared library; BGP_COMMON_EXT DB2 Procedure; BGP_COMMON DB2 Procedure; BGNodeErr_CSM DB2 Sequence; BGNodeErrCSMe DB2 Trigger (upon new row in TBGLNode, is STATUS = ‘M’?)]
     When you start monitoring a Blue Gene-related Condition with startcondresp, a number of things are created in DB2 (via bgmanage_trigger, the Command specified in the Sensor): a Trigger or two, a Sequence, and a couple of Procedures.

  17. Monitoring Flow Example (5 of 7)
     [Diagram: same components as the previous chart. Steps: 1. node error; 2. insert; 3. evaluate]
     When the Blue Gene software updates a table in the database, DB2 evaluates the Triggers associated with that table.

  18. Monitoring Flow Example (6 of 7)
     [Diagram: same components as the previous chart; the BGNodeErrCSMe DB2 Trigger is annotated (new row in TBGLNode & STATUS = ‘M’). Steps: 4. get next; 5. call; 6. call; 7. call; 8. SetValues()]
     In this case, the BGNodeErrCSMe Trigger evaluates ‘true’ and it does its thing: get the next sequence number and call BGP_COMMON. And eventually SetValues() is called to write the new sequence number into the BGNodeErr Sensor’s SD.Uint32.

  19. Monitoring Flow Example (7 of 7)
     [Diagram: same components as the previous chart. Steps: 9. evaluate; 10. do response]
     At this point, it’s business as usual for RMC. Since BGNodeErr Sensor’s SD.Uint32 > 0, BGNodeErr Condition is satisfied and the Response occurs.
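The ten-step flow in the charts above can be imitated in miniature. The Python sketch below is only an analogy: SQLite stands in for DB2, an AUTOINCREMENT table stands in for the BGNodeErr_CSM Sequence, and a plain Python function stands in for the BGP_COMMON / bgrefresh_sensor.so path that ends in SetValues(). The Sensor class, the helper function names, and the location column are hypothetical; TBGLNode, STATUS = 'M', BGNodeErrCSMe, and the "SD.Uint32 > 0" test come from the charts. Unlike the real flow, where the stored procedure pushes the value to the sensor, this sketch pulls the newest number after the insert.

```python
import sqlite3


class Sensor:
    """Stand-in for an IBM.Sensor resource; only the Uint32 value is modeled."""

    def __init__(self, name):
        self.name = name
        self.uint32 = 0

    def set_values(self, value):
        # Plays the role of the sensor's SetValues() routine.
        self.uint32 = value


def create_database():
    """Create an in-memory table plus an insert trigger (SQLite, not DB2)."""
    db = sqlite3.connect(":memory:")
    # Column names other than STATUS are hypothetical.
    db.execute("CREATE TABLE TBGLNode (location TEXT, STATUS TEXT)")
    # AUTOINCREMENT table standing in for the BGNodeErr_CSM sequence.
    db.execute("CREATE TABLE BGNodeErr_CSM "
               "(seq INTEGER PRIMARY KEY AUTOINCREMENT, trigname TEXT)")
    db.execute("""
        CREATE TRIGGER BGNodeErrCSMe AFTER INSERT ON TBGLNode
        WHEN NEW.STATUS = 'M'
        BEGIN
            INSERT INTO BGNodeErr_CSM (trigname) VALUES ('BGNodeErrCSMe');
        END
    """)
    return db


def refresh_sensor(db, sensor):
    """Copy the newest sequence number into the sensor, if there is one."""
    (latest,) = db.execute("SELECT MAX(seq) FROM BGNodeErr_CSM").fetchone()
    if latest is not None:
        sensor.set_values(latest)


def condition_satisfied(sensor):
    """Mirror of the BGNodeErr Condition: is SD.Uint32 > 0?"""
    return sensor.uint32 > 0


if __name__ == "__main__":
    db = create_database()
    sensor = Sensor("BGNodeErr")
    db.execute("INSERT INTO TBGLNode VALUES ('node-a', 'M')")
    refresh_sensor(db, sensor)
    if condition_satisfied(sensor):
        print("E-mail root anytime")
```

Because each forwarded event writes a fresh sequence number, the sensor value changes on every event, which is what lets a condition re-evaluate on each change rather than only on the first one.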

  20. Monitoring Details (1 of 3)
     • When you use bgmksensor to define a Blue Gene-related sensor, we temporarily create in the Blue Gene DB2 database the constructs required for monitoring. (By ‘constructs’ we mean the DB2 Triggers, Sequences, and Stored Procedures.) We do this to expose any errors. If we waited until you actually tried to use the defined sensor in a real monitoring situation, it would be harder to expose the errors. If DB2 gags on any of the constructs, bgmksensor reports the error(s) and creates no sensor. Whether successful or not, it deletes all DB2 constructs it created (they were temps, remember?).
     • When you use startcondresp to start monitoring a Blue Gene-related condition, the Command stored in the associated sensor gets run. The Command is /opt/csm/csmbin/bgmanage_trigger, and it creates the same Blue Gene DB2 database constructs that bgmksensor had created, but this time they’re not temporary. They stay defined until monitoring is stopped with stopcondresp.
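The "create it temporarily to flush out errors, then always clean up" idea described for bgmksensor can be sketched as follows. This is an illustration of the pattern only, with SQLite standing in for DB2; the function name validate_event_expression is hypothetical and this is not how bgmksensor itself is implemented. Identifiers are interpolated without sanitizing, which is acceptable for a sketch but not for real use.

```python
import sqlite3


def validate_event_expression(db, table, expression, sensor_name):
    """Try to create a trigger for the expression, then always remove it.

    Returns None when the database accepts the trigger, otherwise the
    error text. SQLite stands in for DB2 here.
    """
    trigger = sensor_name + "CSMe"
    error = None
    try:
        db.execute(
            f"CREATE TRIGGER {trigger} AFTER INSERT ON {table} "
            f"WHEN {expression} BEGIN SELECT 1; END"
        )
    except sqlite3.Error as exc:
        error = str(exc)
    finally:
        # The construct was only a probe, so drop it either way.
        db.execute(f"DROP TRIGGER IF EXISTS {trigger}")
    return error


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE TBGLNode (location TEXT, STATUS TEXT)")
    for expr in ("NEW.STATUS = 'M'", "NEW.STATUS = = 'M'"):
        problem = validate_event_expression(db, "TBGLNode", expr, "MyNodeErr")
        print(expr, "->", "ok" if problem is None else "rejected: " + problem)
```

Doing the probe at definition time means a malformed expression is reported while the person who wrote it is still at the keyboard, not later when monitoring starts.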

  21. Monitoring Details (2 of 3)
     • For a given Blue Gene-related sensor, we create the following DB2 constructs:
         • A Trigger to watch for bgmksensor’s -x event expression. If the sensor name is BGFanTempHi, the event Trigger name is BGFanTempHiCSMe.
         • A Trigger to watch for bgmksensor’s -X rearm expression, if specified. If the sensor name is BGFanTempHi, the rearm Trigger name is BGFanTempHiCSMr.
         • A Sequence to give us a unique new number for each event or rearm forwarded from a Trigger to BGP_COMMON. If the sensor name is BGFanTempHi, the Sequence name is BGFanTempHi_CSM.
         • Stored Procedures named BGP_COMMON and BGP_COMMON_EXT, if we don’t already have them. (Unlike the Triggers and Sequences, these are not created on a per sensor basis; there are just the two, and they serve all sensors created.)
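The naming rules on this chart are mechanical enough to write down. The short Python sketch below derives the per-sensor construct names from a sensor name; the function name db2_construct_names and the dictionary keys are hypothetical, while the suffixes CSMe, CSMr, and _CSM and the two procedure names are taken from the chart.

```python
# Shared by all sensors, per the chart; not created per sensor.
SHARED_PROCEDURES = ("BGP_COMMON", "BGP_COMMON_EXT")


def db2_construct_names(sensor_name):
    """Return the per-sensor DB2 construct names for a sensor."""
    return {
        "event_trigger": sensor_name + "CSMe",
        # Only created when a -X rearm expression is specified.
        "rearm_trigger": sensor_name + "CSMr",
        "sequence": sensor_name + "_CSM",
    }


if __name__ == "__main__":
    for kind, name in db2_construct_names("BGFanTempHi").items():
        print(kind, name)
```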

  22. Monitoring Details (3 of 3)
     • To provide support for Blue Gene rearm monitoring, we needed (and got) a new feature in RSCT. Normally, for a Condition that has a rearm expression, RMC ‘toggles’ between evaluating the Condition’s event expression and its rearm expression. And when a Condition corresponds to a single resource, this makes perfect sense. However, in the world of Blue Gene monitoring, a Condition corresponds to a set of resources. So we can’t have RMC toggling; we must do the toggling down at the DB2 Trigger level because it is there that we’re able to distinguish one eventing or rearming resource from another. The bottom line is that when we’re monitoring a Blue Gene DB2 table for events and rearms, the Condition used must be the non-toggling type.
     • Because we assume the event/rearm toggling responsibilities, we introduced a DB2 table named TCSMEvents to keep track of when it’s proper to forward an event up to RMC, and when to forward a rearm. So be aware that we create this table, and that the DB2 Triggers we create manipulate its contents. TCSMEvents has two columns: sensor and origin. The latter uniquely identifies the event/rearm origin. If a sensor is in TCSMEvents, an event was forwarded last; otherwise a rearm was forwarded last, or no event or rearm was observed yet. TCSMEvents is created when the first Blue Gene monitoring is started. It is dropped when the csm.bluegene rpm is removed from the Service Node.
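One way to picture the per-origin toggling that TCSMEvents supports is the Python sketch below. It is an interpretation, not the actual trigger logic: it assumes a row is keyed by the (sensor, origin) pair, that an event is forwarded only when no such row exists, and that a rearm is forwarded only when one does. The class and method names are hypothetical, as are the origin values.

```python
class EventRearmToggle:
    """In-memory model of the TCSMEvents bookkeeping (assumed semantics)."""

    def __init__(self):
        # Each entry models one TCSMEvents row: (sensor, origin).
        self.tcsm_events = set()

    def on_event(self, sensor, origin):
        """Return True if this event should be forwarded up to RMC."""
        key = (sensor, origin)
        if key in self.tcsm_events:
            return False  # an event was forwarded last for this origin
        self.tcsm_events.add(key)
        return True

    def on_rearm(self, sensor, origin):
        """Return True if this rearm should be forwarded up to RMC."""
        key = (sensor, origin)
        if key not in self.tcsm_events:
            return False  # nothing to rearm for this origin
        self.tcsm_events.remove(key)
        return True


if __name__ == "__main__":
    toggle = EventRearmToggle()
    print(toggle.on_event("BGFanTempHi", "fan-1"))  # first event for fan-1
    print(toggle.on_event("BGFanTempHi", "fan-2"))  # fan-2 toggles independently
    print(toggle.on_rearm("BGFanTempHi", "fan-1"))  # rearm only affects fan-1
```

Keeping the state per origin is what a single RMC-level toggle cannot do: one fan rearming would otherwise hide the fact that another fan is still in the event state.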

  23. Debugging
     • bgsetupmon, bgsetupms and bgmksensor have a -v flag for verbose output.
     • bgrefresh_sensor.so will write some debug info to a file named /var/log/csm/csm_bg.log on the Service Node if you do the following prior to starting monitoring:
         • Create /var/log/csm/csm_bg.log with 666 permissions.
         • Temporarily modify /opt/csm/pm/BlueGeneUtils.pm in CreateTrigger where you see:
             CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '$debug');
           Change it to:
             CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '1');
     • Some helpful DB2 commands that the bglsysdb userid can run on the Service Node:
         db2 { stop | start } database manager
         db2 connect to bgdb0
         db2 "select trigname from syscat.triggers where trigname like '%CSM%'"
         db2 "drop trigger bglsysdb.xxxCSMe"
         db2 "drop trigger bglsysdb.xxxCSMr"
         db2 "select seqname from syscat.sequences where seqname like '%CSM%'"
         db2 "drop sequence bglsysdb.xxx_CSM"
         db2 "select procname from syscat.procedures where procname like '%BGP%'"
         db2 "drop procedure bglsysdb.common_bgp"
         db2 "drop procedure bglsysdb.common_bgp_ext"
         db2 disconnect bgdb0

  24. What’s changed, what’s new this release?
     • We’ve added Blue Gene/P support!
     • We’ve added Stand-alone CSM Blue Gene monitoring support (and bgsetupmon to set this up).
     • Note that in the case of Stand-alone CSM Blue Gene monitoring support, our predefined Blue Gene-related ERRM Conditions are created on the CSM management server / Service Node, and their Management Scope is set to ‘l’ (for ‘local’). Customers who create their own Blue Gene-related ERRM Conditions in the Stand-alone case must do the same!

  25. References
     • CSM Blue Gene/P Support - Component Design: /project/design/doc/clusters_6B/csm/bluegene/CSM-BGP-CompDes.pdf
     • CSM 1.7.0 Planning and Installation Guide: see section “CSM support for Blue Gene”
     • CSM 1.7.0 Administration Guide: see section “Using CSM with the IBM System Blue Gene Solution”
     • CSM 1.7.0 Command and Technical Reference: bgsetupmon, bgsetupms and bgmksensor
