EMC Smarts ControlCenter and SIA Integration #246 Lynda Em April 4, 2007
Agenda • SMARTS Technology • What is SIA? • SIA Architecture • Cross-communication between CC and SIA • Installation process • Troubleshooting techniques • Possible demo???
EMC Smarts Technology Value Proposition • Automated, actionable intelligence • Pinpoint service-affecting problems in real time • Quantify impact to prioritize action • Update automatically to adapt to infrastructure changes • Cross-domain correlation • Correlate information, applications, infrastructure, and business services across management silos • Business-centric • Understand exactly how IT problems affect services and customers
What is SIA? Overview SIA terms
Storage Insight for Availability 1.0.0.1 • Release Date 8/21/06. • Smarts Storage Insight for Availability is the first offering in the new Smarts Storage Insight family. • Storage Insight for Availability automates root cause and impact analysis of availability problems across the EMC tiered storage infrastructure, resulting in a dramatic reduction in downtime and mean time to repair for existing ControlCenter V5.2 customers. • Based on patented Smarts technology, only Storage Insight for Availability automates fault management for the Fibre Channel SAN infrastructure.
What Will SI for Availability do for you? • Automated problem diagnosis – symptoms vs. root causes • Root cause problems identified: • Symmetrix units, front-end directors, port links, devices • CLARiiON units, disks, Storage Processors and port links • Fibre Channel switch units and port links • Host Bus Adapter cards and port links • Celerra Data Movers • Impact Analysis • Impacted elements along the data path: • Host systems, host file systems, PowerPath devices, host physical devices, logical volumes • Celerra Data Movers and client shares • Cross-domain root cause and impact analysis for Celerra • Celerra gateway models • CLI/XHMP polling for client mapping to Celerra Data Movers • Cross-domain analysis in conjunction with IP AM
Storage Insight for Availability Deployment Architecture • EMC ControlCenter 5.2 • Monitors SAN Infrastructure • Storage Insight for Availability • Automated SAN fault management • Service Assurance Manager • Integration point for Smarts products • Global Console • Focal point for monitoring and analysis • Business Impact Manager • Identify business impact of problems • IP Availability Manager • Correlation between FC SAN and IP network, through Celerra Gateway • [Diagram labels: SMARTS Global Console, Business Impact Manager, SMARTS Service Assurance Manager, IP Availability Manager, Storage Insight for Availability, EMC ControlCenter 5.2 SP5, IP Network Infrastructure, SAN Infrastructure]
EMC Smarts Technology: Automating Service Management, Start to Finish • 1 ICIM Library • 2 Discovery • 3 ICIM Repository • 4 Codebook Correlation • 5 Polling/Pinging • 6 Root Cause • Business Impact • [Diagram stages: Collection, Context, Analysis]
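The codebook correlation step can be illustrated with a toy example. This is a minimal sketch of the general idea only (compare the observed symptoms with a symptom signature per candidate problem and pick the closest match), not the Smarts implementation. The problem names, symptom names, and the `diagnose` function are hypothetical.

```python
# Toy codebook: each candidate root cause maps to the set of symptoms it is
# expected to produce. All names are illustrative assumptions.
CODEBOOK = {
    "SwitchPortDown": {"PortLinkDown", "DataPathDown"},
    "HBADown": {"PortLinkDown", "DataPathDown", "HostDeviceUnreachable"},
    "ArrayDiskFailed": {"VolumeDegraded"},
}


def diagnose(observed):
    """Return the root cause whose signature is closest to the observed symptoms.

    Distance is the size of the symmetric difference between the two symptom
    sets, so a missing or spurious symptom only lowers the match quality.
    """
    observed = set(observed)
    return min(CODEBOOK, key=lambda cause: len(CODEBOOK[cause] ^ observed))


print(diagnose({"PortLinkDown", "DataPathDown", "HostDeviceUnreachable"}))
print(diagnose({"VolumeDegraded", "PortLinkDown"}))
```

The second call shows the tolerance to noise: one unexpected symptom does not prevent the closest signature from being selected.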
Storage Insight Terms • PortLink • A physical connection between two ports: • A port on the FC switch and a port on the HostSystem, OR • A port on the FC switch and a port on the StorageSystem, OR • Two ports on peer FC switches • SCSITargetInitiatorPath • A physical connection in a SAN • Between a port on the HostSystem and a port on the StorageSystem • SCSITargetInitiatorPaths are layered over PortLinks • DataPath • A logical connection in a SAN • Between a HostPhysicalDevice on the HostSystem and an ArrayStorageVolume on the StorageSystem
Storage Insight Terms • DataPathRedundancyGroup • Composed of two or more DataPaths for redundancy • Supports a logical connection between a HostFileSystem on the HostSystem and an ArrayStorageVolume on the StorageSystem. • Powerpath_Datapath • Layered over a SCSITargetInitiatorPath element • Is associated with one and only one PowerPathDevice on the HostSystem • Is always part of a PowerPathRedundancyGroup • PowerPathRedundancyGroup • Composed of two or more Powerpath_DataPaths • Supports a logical connection between a HostFileSystem on the HostSystem and an ArrayStorageVolume on the StorageSystem • Is used to model I/O paths managed by PowerPath
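The layering described in the Storage Insight terms can be sketched as a small data model. This is an illustration under assumptions, not the ICIM class definitions: the field names, the status strings, and the idea that each DataPath rides on exactly one SCSITargetInitiatorPath are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortLink:
    """Physical connection between two ports."""
    port_a: str
    port_b: str
    up: bool = True


@dataclass
class SCSITargetInitiatorPath:
    """Host port to storage port connection, layered over PortLinks."""
    port_links: List[PortLink]

    def is_up(self) -> bool:
        return all(link.up for link in self.port_links)


@dataclass
class DataPath:
    """Logical connection from a host physical device to an array volume."""
    host_device: str
    array_volume: str
    path: SCSITargetInitiatorPath  # assumption: one underlying path

    def is_up(self) -> bool:
        return self.path.is_up()


@dataclass
class DataPathRedundancyGroup:
    """Two or more DataPaths providing redundancy."""
    data_paths: List[DataPath]

    def __post_init__(self):
        if len(self.data_paths) < 2:
            raise ValueError("a redundancy group needs two or more DataPaths")

    def status(self) -> str:
        up = sum(p.is_up() for p in self.data_paths)
        if up == len(self.data_paths):
            return "ok"
        return "at_risk" if up > 0 else "down"


link_a = PortLink("host1:hba0", "sw1:p1")
link_b = PortLink("sw1:p9", "array1:fa0")
link_c = PortLink("host1:hba1", "sw2:p1")
link_d = PortLink("sw2:p9", "array1:fa1")
group = DataPathRedundancyGroup([
    DataPath("sda", "vol7", SCSITargetInitiatorPath([link_a, link_b])),
    DataPath("sdb", "vol7", SCSITargetInitiatorPath([link_c, link_d])),
])
link_a.up = False  # one PortLink fails: redundancy is reduced, not lost
print(group.status())
```

A single failed PortLink takes down one DataPath but leaves the group reachable, which is the distinction between a root cause and its impact along the data path.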
Global Console Views • Notification Log • Can be filtered to create custom logs • Summary View • Map Views • Physical Maps, SAN Maps, NAS Maps, IP Maps • Topology View • Browse the detail of specific devices and relationships
SIA Architecture Overview ControlCenter communication
ControlCenter Mediation Layer • ControlCenter Agents used • Storage Agent for Symmetrix • Storage Agent for CLARiiON • Storage Agent for NAS • Fibre Channel Connectivity Agent • Host Agents for Windows, Solaris, AIX, HP-UX and Linux
Alerts and DCP Schedules • SIA listens for a small subset of ControlCenter alerts • Imperative that alerting is working properly • Some SIA-subscribed alerts are on by default and some must be turned on • PowerPath alerts must be created from the Alert Template folder • Most alerts are agent-controlled • CLARiiON alert schedules can be user-controlled • PowerPath alerts must be changed on each host • Switch alerts are DCP-controlled • FCC Agent DCP is 1 hour by default • Set SNMP traps from switch to FCC agent for immediate events
Architectural Overview • Two Domain Managers • SIA Topology Server (STS) • SIA Analysis Server (SAS) • Both use ECC API NG 2.0 (GA3) – specifically Build Identifier “09JUN2006.1345.242” • Both establish JDBC connections to the ECC repository • Probe support for these • Symmetrix • CLARiiON • Celerra • Host • Switch/Fabric • AM + NAS needed only for Celerra RCA & cross domain • Additional Celerra probes talk directly to Celerra • NFS clients • CIFS clients
SIA Probes • Hybrid approach using ECC API NG 2.0 as well as direct DB queries • ControlCenter probes use ECC API NG 2.0 to get all Symmetrix, CLARiiON, Celerra, Host, and Switch instances and subscribe for alerts • Individual Probes launched for each instance • (STS) Access DB for detailed topology • (SAS) Process alerts • Additional Celerra probes launched to get NFS and CIFS clients of each Celerra • Use XHMP (CIFS) or SSH (NFS) to get information
SIA component dependencies • SAM 6.5.1 (RP 38) server needed for monitoring and maps • IP-AM with NAS extensions needed only for Celerra • ControlCenter 5.2SP4 + SIA specific ECC hotfix 3655 • ControlCenter API NG 2.0 Server – GA3 • ControlCenter Host Agents • 1 SIA server set per ControlCenter Server • Dedicated machine for Smarts servers • OS Support • Windows 2000 Advanced Server 2004 • Windows 2003 Enterprise Edition SP1 • Windows Server 2003 R2 Enterprise Edition
How SIA works Auto-discovery Root-cause analysis
Server Interaction & Sequencing • Topology & Analysis servers are independent • Operations need to be coordinated • Analysis Server needs to know when to import new topology • Alert processing must be suspended during import • Analysis server needs to connect to Topology Server • etc • Sequence defined for following scenarios: • Server cold-start • Discovery • Rediscovery • Server restart
Storage Insight for Availability Installation Process
Installation • ControlCenter 5.2 SP 4 with SIA specific Hot Fix 3655 • ControlCenter API 2.0 Server • Smarts Broker 6.5.1 • Smarts SAM 6.5.1 • Smarts IP AM + NAS 6.5.1 (optional) • Uninstall any previous SIA product • Smarts SIA 1.0.0.1
Post Install Steps • Start ECC & API NG 2.0 Server • Start Smarts Broker • Start Smarts SAM • Start IP-AM + NAS (if necessary) • Start SIA services (or servers) • [See Install & Config Guide for command-line options, if desired] • Launch Global Console (sm_gui) and attach to SIA-Topology • Launch Domain Manager Admin Console • Launch Polling and Thresholds • Configure ECC related credentials and Celerra related credentials (if necessary) • Back in DMAC, add Source for ControlCenter and IP-AM (if necessary)
Storage Insight for Availability P & S Guidelines Support issues
Scalability issues in 1.0 • Original single server ran out of memory for large topologies • Solution was to split into 2 servers • Topology server (STS) • Analysis server (SAS) • Servers run on same host • Server processes can be changed to access 3GB RAM on Windows for large environments • This change has to be made to the Windows OS before SIA installation • Set the /3GB and /PAE switches in the Boot.ini file on the system
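As an illustration of the Boot.ini change, the sketch below appends the two switches to each entry under `[operating systems]` in a Boot.ini text. It is only a text-manipulation example: the sample file content and the `add_boot_switches` helper are hypothetical, and the actual edit should follow the P&S Guide and Microsoft's instructions for the system in question.

```python
def add_boot_switches(boot_ini_text, switches=("/3GB", "/PAE")):
    """Append any missing switches to each entry under [operating systems]."""
    out, in_os_section = [], False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which section we are in; only OS entries take switches.
            in_os_section = stripped.lower() == "[operating systems]"
        elif in_os_section and "=" in stripped:
            present = {token.upper() for token in stripped.split()}
            for switch in switches:
                if switch.upper() not in present:
                    line += " " + switch
        out.append(line)
    return "\n".join(out)


# Hypothetical sample file for demonstration purposes.
sample = (
    "[boot loader]\n"
    "timeout=30\n"
    "default=multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS\n"
    "[operating systems]\n"
    'multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS="Windows Server 2003, Enterprise" /fastdetect\n'
)
print(add_boot_switches(sample))
```

The helper leaves the `[boot loader]` section untouched and does nothing on a second run, so it can be applied repeatedly without duplicating switches.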
Hardware specifications (from P&S Guide) • 3 GB extension may be needed. • Minimum 2 CPUs & 4GB RAM needed
Discovery timings • Initial discovery of the ControlCenter Repository can take a long time, depending on the size of the topology. • Alert processing is suspended during discovery • Queued alerts are processed later • Rediscoveries are quicker • P&S Guidelines have processing downtime examples • Need to balance how often re-discoveries are done
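The suspend-and-queue behaviour during discovery can be sketched as follows. This is a minimal model under assumptions, not the SIA Analysis Server code; the class and method names are hypothetical.

```python
from collections import deque


class AnalysisServerModel:
    """Toy model: alerts are held while a topology import is in progress."""

    def __init__(self):
        self.importing = False
        self.pending = deque()
        self.processed = []

    def on_alert(self, alert):
        if self.importing:
            self.pending.append(alert)  # processing suspended: hold the alert
        else:
            self.processed.append(alert)

    def begin_import(self):
        self.importing = True

    def end_import(self):
        self.importing = False
        while self.pending:  # drain queued alerts in arrival order
            self.processed.append(self.pending.popleft())


server = AnalysisServerModel()
server.on_alert("A1")
server.begin_import()
server.on_alert("A2")
server.on_alert("A3")
print(server.processed)  # only A1 so far; A2 and A3 are queued
server.end_import()
print(server.processed)
```

The longer an import runs, the longer queued alerts wait, which is why rediscovery frequency has to be balanced against processing downtime.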
SI-A – Cisco support • Issue found in 1.0 • SI-A currently does not support Cisco VSANs properly • The switches are discovered; however, there are inaccuracies in the topology and root cause analysis • Resolution for 1.0.0.1 • Cisco switches will not be imported into the SIA topology in 1.0.0.1 • Rolling Patch will be provided for full Cisco switch support • Expected approximately 10 weeks after 1.0.0.1 GA (late October) • What do I do if customer has Cisco switches? • Recognize sales and implementation cycles relative to 1.0.0.1 patch when targeting customers with Cisco switches
Troubleshooting SIA Log Files • SIA Servers have separate logs: • in InCharge6/SI/smarts/local/logs/<server-name>.log.<ver> • e.g.: SIA-TOPOLOGY.log (default) • e.g.: SIA-ANALYSIS.log (default) • Probe framework creates additional logs: • also in local/logs in the format <server-name>-<ClassName>_0.log* • each one has an associated .lck as well • e.g.: SIA-TOPOLOGY-ClariionProbe_0.log • e.g.: SIA-ANALYSIS-EccAlertDispatcherMgr_0.log • ECC client API log files • in smarts/ECC/client/log/server/ • *Note: on shutdown, logs stay open while the .rps file is saved. If a server is restarted too early, new probe logs appear with “.1” appended to the log filename.
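When sorting through a logs directory, the probe log naming format can be parsed with a short helper. This is a sketch under assumptions: the regular expression is inferred from the two example names on the slide, the trailing ".1" is assumed to follow ".log" directly, and `parse_probe_log` is a hypothetical name.

```python
import re

# <server-name>-<ClassName>_<n>.log, optionally followed by ".<k>"
# (assumed form of the ".1" suffix seen after an early restart).
PROBE_LOG = re.compile(
    r"^(?P<server>.+)-(?P<cls>[A-Za-z0-9]+)_(?P<index>\d+)\.log(?:\.(?P<suffix>\d+))?$"
)


def parse_probe_log(filename):
    """Split a probe log filename into its parts, or return None if it does not match."""
    match = PROBE_LOG.match(filename)
    if match is None:
        return None
    return {
        "server": match.group("server"),
        "class_name": match.group("cls"),
        "restarted_early": match.group("suffix") is not None,
    }


print(parse_probe_log("SIA-TOPOLOGY-ClariionProbe_0.log"))
print(parse_probe_log("SIA-ANALYSIS-EccAlertDispatcherMgr_0.log.1"))
print(parse_probe_log("SIA-TOPOLOGY.log"))
```

Because server names themselves contain a hyphen, the pattern splits on the last hyphen before the class name rather than the first.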
Other log files • SAM Log files • in SAM/local/logs/<server-name>.log.<ver> • \InCharge6\SAM\smarts\local\logs • \InCharge6\SI\smarts\local\logs • \InCharge6\IP\smarts\local\logs • Repository files • *.rps files for SIA, SAM and IP AM • \InCharge6\SAM\smarts\local\repos • \InCharge6\SI\smarts\local\repos • \InCharge6\IP\smarts\local\repos • ECC API NG log files • ECC\ECCAPING\server\log\server
SIA <-> ECC Connection Problems • SIA server connections can fail in these ways • ECC API connection is lost (ECCConnectionDown) • ControlCenter Connection Failed notification • DB connection is lost (DBConnectionDown) • Database Connection Failed notification • SIA Analysis Server detects these for itself and the SIA Topology Server • Two EMSAgent instances are monitored • ControlCenter // represents SIA Analysis Server • ControlCenter-Topology // represents SIA Topology Server • Notify an event on the appropriate EMSAgent instance • Events will clear when connection is re-established • Refer to the User Guide on corrective actions
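The mapping from a lost connection to a notification on the right EMSAgent instance can be sketched as a small tracker. The instance names, event names, and notification texts come from the slide; the class, its methods, and the "analysis"/"topology" keys are hypothetical.

```python
# Which EMSAgent instance represents which SIA server (from the slide).
EMS_AGENT = {"analysis": "ControlCenter", "topology": "ControlCenter-Topology"}

# Event name to notification text (from the slide).
NOTIFICATION = {
    "ECCConnectionDown": "ControlCenter Connection Failed",
    "DBConnectionDown": "Database Connection Failed",
}


class ConnectionMonitor:
    """Toy tracker: raise an event on connection loss, clear it on reconnect."""

    def __init__(self):
        self.active = set()

    def update(self, server, event, connected):
        key = (EMS_AGENT[server], NOTIFICATION[event])
        if connected:
            self.active.discard(key)  # event clears when the connection returns
        else:
            self.active.add(key)
        return sorted(self.active)


monitor = ConnectionMonitor()
print(monitor.update("topology", "DBConnectionDown", connected=False))
print(monitor.update("analysis", "ECCConnectionDown", connected=False))
print(monitor.update("topology", "DBConnectionDown", connected=True))
```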
Discovery Errors • Discovery errors in SIA should be rare, as most of the topology is pulled from ControlCenter • Objects may be deleted between the time the main probe in SIA gets the list of instances and the time an individual probe goes to ECC to get the instance data • This produces a discovery error with the database ID as the instance name • Celerra clients may not be discovered if probe parameters are wrong or the box is unreachable from SIA • AM may have discovery errors if it cannot reach a control station or Data Mover of a Celerra • A fabric may disappear from SIA after the initial discovery following a fabric split; it will reappear after a subsequent discovery
Naming Issues • SIA naming is designed to be consistent with AM with respect to NAS entities and Hosts, but may not be consistent for other devices exposed to AM • This will result in multiple representations for the same instance in SAM, e.g. FiberChannelSwitch and Switch
Timing issues • Celerra • AM may detect a Data Mover as unreachable before a failover happens • Will get notification of Data Mover down in SIA which will clear when AM detects the standby • Switch Alerts • Switch alerts require SIA to access the database for switch or port status • Database may not yet reflect the status that caused the alert • SIA will re-access the database after 10 minutes (configurable) • PowerPath Alerts • PowerPath alerts arrive based on a 30-minute DCP • May come after root cause has already been identified
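The delayed re-read for switch alerts can be sketched as follows. This is an assumption-laden illustration: the slide only states that SIA re-accesses the database after 10 minutes, so the condition for re-reading (status still "OK"), the status strings, and the `resolve_port_status` function are hypothetical.

```python
import time

RECHECK_DELAY_SECONDS = 10 * 60  # the 10-minute default from the slide; configurable


def resolve_port_status(read_status, delay=RECHECK_DELAY_SECONDS, sleep=time.sleep):
    """Read the status once; if it still looks healthy, wait and read once more.

    read_status is any callable returning the current status string from the
    database. sleep is injectable so the logic can be exercised without waiting.
    """
    status = read_status()
    if status == "OK":  # database may not yet reflect the alert's cause
        sleep(delay)
        status = read_status()
    return status


# Simulated database that lags behind the alert by one read.
readings = iter(["OK", "DOWN"])
waited = []
print(resolve_port_status(lambda: next(readings), sleep=waited.append))
print(waited)
```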