Experiences with NMI at Michigan Shawn McKee October 1, 2004 NMI/SURA Testbed Workshop
Outline • A little history: NMI at Michigan • About our environment and motivations • Comments on some middleware components • Issues for Middleware at Michigan • Outlook and Summary
History: Michigan as an “Honorary” SURA Member! • Michigan proposed to join the NMI/SURA testbed as soon as we heard about the opportunity • Michigan has a long history of work in this area: LDAP, NSFNet, AFS/IFS, KX509, CoSign, CHEF/OGCE/Sakai, NFS V4, … • We were just starting up a campus-wide initiative called “MGRID” (Michigan Grid Research and Infrastructure Development)…NMI fit perfectly into our plans and interests • We were accepted into the testbed as its northernmost member…
Campus Research and Grid Motivation • Michigan is a major research institution with a large, varied mix of researchers. • Many of our departments make extensive use of computing/storage/network resources and always need more, for the same (or less) cost… • Many of our researchers are part of larger national or international collaborations. • Grid computing and NMI middleware help us optimize our existing resources and connect us to developing national and international efforts. • This is likely the case for most universities around the country…
Some More Context • Through our MGRID initiative, Michigan has been adapting and adopting middleware to enable access to our distributed resources • NMI has been a key component of our work • Portals seem to be the key to enabling transparent access to various resources • We are building out for our future needs: tools like KX.509 and our XML Grid Accounting are being augmented with additional components like Walden and new applications like NTAP…
MGRID – www.mgrid.umich.edu • A center to develop, deploy, and sustain an institutional grid at Michigan • Many groups across the University participate in compute/data/network-intensive research grants – increasingly, Grid is the solution • ATLAS, NPACI, NEESGrid, Visible Human, NFSv4, NMI • MGRID allows work on common infrastructure instead of custom solutions • Middleware, like NMI, makes it possible
NMI Components • The NMI package consists of many components • Michigan used many of the components in our work on MGRID and with various application domains • KX.509 was central to much of our work bridging our Kerberos users to X509 (PKI) space • Grid components (Globus, Condor, etc.) were the primary means of making resources accessible • Many NMI components were included in the VDT, Grid3 and NPACI Rocks distributions
One Application Domain Perspective • I would also like to comment a bit on my biased application perspective • As a high-energy physicist I need to worry about accessing and processing LOTS of data, globally • In less than 3 years the LHC will begin to run, and our ATLAS experiment will need to make ~10 Petabytes/year of data available to ~2000 physicists worldwide. • To handle this we need all the resources we can get…middleware is the basis for making these resources accessible and usable.
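To put the ~10 Petabytes/year figure in network terms, the short Python calculation below converts an annual data volume into the sustained rate needed to move it. It assumes decimal units (1 PB = 10^15 bytes) and a perfectly even flow over a 365-day year; real traffic is burstier and is replicated to several sites, so actual link requirements are higher.

```python
# Back-of-the-envelope conversion of an annual data volume into a sustained rate.
# Assumes decimal units and a perfectly even flow over a 365-day year.
PETABYTE = 10**15                    # bytes
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 s


def sustained_rate(bytes_per_year: float) -> tuple:
    """Return (megabytes per second, gigabits per second)."""
    bytes_per_second = bytes_per_year / SECONDS_PER_YEAR
    return bytes_per_second / 10**6, bytes_per_second * 8 / 10**9


mb_s, gbit_s = sustained_rate(10 * PETABYTE)
print(f"{mb_s:.0f} MB/s, {gbit_s:.2f} Gbit/s")  # 317 MB/s, 2.54 Gbit/s
```

Even this idealized average, roughly 2.5 Gbit/s around the clock, leaves no headroom for bursts, reprocessing, or multiple replicas.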
ATLAS (www.usatlas.bnl.gov) • A Toroidal LHC ApparatuS • Collaboration • 150 institutes • 1850 physicists • Detector • Inner tracker • Calorimeter • Magnet • Muon • United States ATLAS • 29 universities, 3 national labs • 20% of ATLAS
Data Grids for High Energy Physics • [Figure: the tiered LHC data grid, from the detector down to physicists' workstations] • Online System feeding Tier 0+1: Offline Farm, CERN Computer Center (~25 TIPS) • Tier 1: France, Italy, UK and BNL centers, each with HPSS storage • Tier 2: Tier2 centers • Tier 3: institutes (~0.25 TIPS each, with a physics data cache); physicists work on analysis “channels”, and each institute has ~10 physicists working on one or more channels • Tier 4: workstations • Link rates shown, top to bottom: ~PByte/sec, ~100-400 MBytes/sec, ~10-40 Gbits/sec, ~10 Gbps, 100 - 10000 Mbits/sec • CERN/Outside resource ratio ~1:4; Tier0/(Tier1)/(Tier2) ~1:2:2
The Problem Abort, retry, fail?
Building Upon NMI • Middleware is the glue that gives applications easy access to resources, data and instruments • Portals organize the middleware while hiding its complexity
Grid Portal Work • [Figure: MGRID portal architecture] • User Workstation: browser, kinit, kx509, libpkcs11 • MGRID Portal: Apache with mod_ssl (client certificate required), mod_kct, mod_kx509 and mod_jk; Tomcat running CHEF • Kerberos V5 services: KDC, KCA, KCT • Grid Service: GSI, GateKeeper, Resource Manager • Walden: LDAP, SASL • We would like to propose these for NMI R5+!
MGRID Accounting • Step 1: Grid scheduling software (e.g. PBSPro, Condor) generates usage log files in various formats • Step 2: MGRID Accounting translates the usage log files into a common XML format (http://www.psc.edu/~lfm/Grid/UR-WG/) • Step 3: MGRID Accounting ingests the data into a MySQL database for report generation and review
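Steps 1 and 2 can be sketched in Python as below. This is not the MGRID Accounting code: it assumes a PBS-style accounting line of the form `date time;record_type;job_id;key=value ...`, and the XML element names (`JobUsageRecord`, `LocalJobId`, `WallDuration`, ...) are a simplified rendering loosely modeled on the GGF Usage Record drafts linked above, not a validated schema instance. The sample log line and host name are invented for illustration.

```python
import xml.etree.ElementTree as ET


def parse_pbs_record(line: str) -> dict:
    """Split 'date time;type;job_id;key=value ...' into a flat dict."""
    timestamp, rec_type, job_id, message = line.rstrip("\n").split(";", 3)
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return {"timestamp": timestamp, "type": rec_type, "job_id": job_id, **fields}


def hms_to_duration(hms: str) -> str:
    """Convert PBS 'HH:MM:SS' into an ISO 8601 duration such as 'PT3600S'."""
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return f"PT{hours * 3600 + minutes * 60 + seconds}S"


def to_usage_record(rec: dict) -> ET.Element:
    """Build a simplified, Usage-Record-style XML element for one job."""
    ur = ET.Element("JobUsageRecord")
    job = ET.SubElement(ur, "JobIdentity")
    ET.SubElement(job, "LocalJobId").text = rec["job_id"]
    user = ET.SubElement(ur, "UserIdentity")
    ET.SubElement(user, "LocalUserId").text = rec.get("user", "unknown")
    for pbs_key, tag in (("resources_used.walltime", "WallDuration"),
                         ("resources_used.cput", "CpuDuration")):
        if pbs_key in rec:
            ET.SubElement(ur, tag).text = hms_to_duration(rec[pbs_key])
    return ur


# Invented job-end ("E") record for illustration.
LINE = ("10/01/2004 09:15:02;E;1234.mgrid-head;user=alice group=physics "
        "queue=batch resources_used.cput=00:42:10 resources_used.walltime=01:00:00")

record = parse_pbs_record(LINE)
xml_text = ET.tostring(to_usage_record(record), encoding="unicode")
print(xml_text)
```

Keeping the per-scheduler parser separate from the XML builder means a Condor log would only need its own parser; Step 3 would then load each translated record into MySQL.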
Accounting Example on MGRID • Through our portal we can easily select and display accounting information for MGRID resources
MGRID Walden Authorization • Fine-grained authorization module built on an XACML-based policy engine • Cluster owners have complete administrative control over who uses their resources • Policy files define rules based on group membership, time of day, resource load, etc. • Local account management is unnecessary • Group membership can be assigned from one or several secure LDAP servers
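To illustrate the kind of rule the Walden slide describes, here is a hypothetical Python sketch of a policy combining group membership, time of day, and resource load. It is not XACML and not Walden's API: the `Rule` and `Request` types, the first-applicable combining order, and the default-deny fallback are assumptions chosen to keep the example small. In Walden the group set would come from the secure LDAP servers; here it is passed in directly.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Request:
    user: str
    groups: frozenset   # group names, e.g. as looked up from LDAP
    when: time          # local time of the request
    load: float         # hypothetical load metric, 0.0 (idle) to 1.0 (full)


@dataclass(frozen=True)
class Rule:
    effect: str         # "Permit" or "Deny"
    group: str
    start: time
    end: time
    max_load: float

    def applies_to(self, req: Request) -> bool:
        return (self.group in req.groups
                and self.start <= req.when <= self.end
                and req.load <= self.max_load)


def evaluate(policy: list, req: Request) -> str:
    """The first rule that applies decides; deny when no rule applies."""
    for rule in policy:
        if rule.applies_to(req):
            return rule.effect
    return "Deny"


# Hypothetical policy a cluster owner might write.
POLICY = [
    Rule("Permit", "cluster-owners", time(0, 0), time(23, 59), 1.0),
    Rule("Permit", "physics", time(18, 0), time(23, 59), 0.75),
]

print(evaluate(POLICY, Request("alice", frozenset({"physics"}), time(20, 30), 0.5)))
print(evaluate(POLICY, Request("alice", frozenset({"physics"}), time(10, 0), 0.5)))
```

With this policy a `physics` member asking at 20:30 with the cluster half busy is permitted, while the same request at 10:00 falls through to the default deny.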
MGRID NTAP Project http://www.citi.umich.edu/projects/ntap/ • NTAP: Network Testing and Performance • Purpose: provide a secure and extensible service at U-M for invoking network testing and performance tools • Service based on Globus • Has modular, fine-grained authorization • Added signed group membership(s) to reservation data • Now provides two authorization methods: • KeyNote policy engine / AFS PTS group service • PERMIS policy engine / LDAP group service • Runs on dedicated nodes attached to routers in a VLAN environment
NTAP Architecture • [Figure components: User Workstation (browser, kinit, kx509, libpkcs11); Portal Host (Apache with mod_ssl, mod_kct, mod_kx509, mod_jp, mod_php); Kerberos V5 KDC, KCA and KCT; pilot LDAP; network topology and output databases; PMP hosts, each with a GateKeeper, Resource Manager and test tools such as iperf] • 1. The user authenticates to the portal host via kx.509 and submits a network test request • 2. The portal host constructs a path between the specified endpoints, issues test reservations, and updates the output database • 3. PMPs (Performance Monitoring Platforms) on the test path run performance tests between pairs of routers • 4. The portal host displays the results
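Step 3 implies splitting the constructed path into adjacent router pairs, one test per pair. The hypothetical Python sketch below shows that bookkeeping, together with a reservation record carrying the signed group memberships mentioned on the NTAP Project slide. The `Reservation` type, its fields, the router names, and the default tool are illustrative assumptions, not NTAP's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    requester: str                # subject of the user's KX509 certificate
    groups: list                  # signed group memberships carried with the request
    segments: list = field(default_factory=list)  # (router_a, router_b) pairs to test
    tool: str = "iperf"           # test run by the PMPs at each end of a segment


def adjacent_pairs(path: list) -> list:
    """Turn an ordered router path [r1, r2, r3] into [(r1, r2), (r2, r3)]."""
    return list(zip(path, path[1:]))


def reserve(requester: str, groups: list, path: list) -> Reservation:
    """Build one reservation covering every adjacent router pair on the path."""
    if len(path) < 2:
        raise ValueError("a test path needs at least two routers")
    return Reservation(requester, list(groups), adjacent_pairs(path))


r = reserve("/CN=Jane Doe", ["netops"], ["rtr-a", "rtr-b", "rtr-c"])
print(r.segments)   # [('rtr-a', 'rtr-b'), ('rtr-b', 'rtr-c')]
```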
GridNFS (NMI Development) http://www.citi.umich.edu/projects/ • Michigan has been funded to develop GridNFS, a middleware solution that extends distributed file system technology and flexible identity management techniques to meet the needs of grid-based virtual organizations. • The foundation for data sharing in GridNFS is NFS version 4 • The authentication and authorization challenges in GridNFS are met with X.509 credentials • By tying these middleware technologies together as we propose, we provide two vital, missing capabilities: • Transparent and secure data management integrated with existing grid authentication and authorization tools. • Scalable and agile name space management for establishing and controlling identity in virtual organizations and for specifying their data resources. • GridNFS is a new approach that extends “best of breed” Internet technologies with established Grid architectures and protocols to meet these immediate needs
Some Comments about Select NMI Components • In the next few slides I want to discuss our experiences with a few specific components • Overall, the NMI components have been indispensable for our activities at Michigan • There are numerous NMI-EDIT components for information management and organization that I won’t cover in detail, though they are required to make progress on inter-institutional collaboration and resource sharing
Globus Experiences • We had already been using Globus since V1.1.3 for our work on the US ATLAS testbed • The NMI release was nice because the GPT packaging made installation trivial. • There were some issues with configuration and coexistence: • We had to create a separate NMI gatekeeper to avoid impacting our production grid users • No major issues found…Globus just worked • Our primary Globus installation was via the Grid3 package for ATLAS
Condor-G • Condor was already in use at our site and in our testbed. • Installing Condor-G over existing Condor installations produced some problems: • Part of the difficulty was not fully understanding the differences between Condor and Condor-G • A file ($LOG/.schedd_address) was owned by root rather than the condor user, and this “broke” Condor-G. This was resolved via the testbed support list • Condor-G has evolved over the life of the testbed and is an integral part of our ATLAS Data Challenge infrastructure
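The `.schedd_address` problem is easy to catch with a quick ownership check. The Python sketch below is an illustrative diagnostic, not part of Condor: it assumes you already know the directory Condor's `LOG` macro points to (for example from `condor_config_val LOG`) and that the daemons run as the `condor` user, and it needs a Unix system for the `pwd` module.

```python
import os
import pwd


def file_owner(path: str) -> str:
    """Return the user name that owns path."""
    return pwd.getpwuid(os.stat(path).st_uid).pw_name


def schedd_address_ok(log_dir: str, expected_user: str = "condor") -> bool:
    """Check that $LOG/.schedd_address exists and belongs to expected_user."""
    path = os.path.join(log_dir, ".schedd_address")
    if not os.path.exists(path):
        print(f"{path} not found")
        return False
    owner = file_owner(path)
    if owner != expected_user:
        print(f"{path} is owned by {owner!r}, expected {expected_user!r}")
        return False
    return True
```

Run against the site's log directory, the check prints a warning and returns `False` in the root-owned situation described above; one likely fix is changing the file's owner back to the condor user.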
Network Weather Service (NWS) • Installation was trivial via GPT (server/client bundles) • An interesting product for us, since we have done significant work with monitoring. • NWS advantages: • Easy to automate network testing, once you understand the configuration details • Its prediction of future resource performance is fairly unique and potentially useful for grid scheduling • NWS disadvantages: • Difficult user interface (relatively obscure syntax to access measured/predicted data) • Our REU student may take up an NWS-related project
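NWS forecasts come from running several simple predictors over each measurement history and trusting whichever has recently been most accurate. The Python sketch below illustrates that idea with three toy forecasters and cumulative squared error; it is a simplified stand-in for the technique, not NWS's code or its interface, and the bandwidth samples are invented.

```python
from statistics import mean, median


def last_value(history):
    return history[-1]


def running_mean(history):
    return mean(history)


def recent_median(history, window=5):
    return median(history[-window:])


FORECASTERS = {"last": last_value, "mean": running_mean, "median5": recent_median}


def adaptive_forecast(series):
    """Score each forecaster on the past series; predict with the most accurate one."""
    if not series:
        raise ValueError("need at least one measurement")
    errors = {name: 0.0 for name in FORECASTERS}
    for i in range(1, len(series)):
        history, actual = series[:i], series[i]
        for name, forecast in FORECASTERS.items():
            errors[name] += (forecast(history) - actual) ** 2
    best = min(errors, key=errors.get)
    return best, FORECASTERS[best](series)


# Invented bandwidth samples (Mbit/s) between two hosts.
samples = [42.0, 40.5, 43.1, 41.7, 39.9, 42.4]
print(adaptive_forecast(samples))
```

Because every forecaster is rescored on each new measurement, the chosen predictor can change as a link's behavior shifts, which is what makes such forecasts attractive for grid scheduling.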
KX509 for Enabling Access • The University of Michigan has around 200,000 active “uniqnames” in its Kerberos authentication system. It is not feasible to replicate these into other systems, so we developed KX509 to translate them into PKI space. • Our MGRID portal and gatekeepers are all configured to use KX509-generated credentials derived from our users’ normal Kerberos identities. • This makes authentication trivial for our installed user base.
GSI OpenSSH • Useful program that extends PKI (GSI) authentication to OpenSSH • Allows “automatic” interactive login for proxy holders based upon Globus mapfile entries • Simple to install: in principle a superset of OpenSSH on the server end • On a non-NMI host, the dynamic libraries it installs caused a conflict • Very convenient in conjunction with KX509
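GSI OpenSSH, like the Globus gatekeeper, decides which local account a certificate holder gets from the grid-mapfile, where each line pairs a quoted certificate subject DN with one or more comma-separated local account names. The Python sketch below parses that format and looks up an account; the DNs shown are made-up examples, not the actual form of a KX509-issued subject, and the helper names are hypothetical.

```python
import shlex


def parse_grid_mapfile(text: str) -> dict:
    """Map each certificate subject DN to its list of local account names."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        parts = shlex.split(line)         # honours the quotes around the DN
        if len(parts) < 2:
            raise ValueError(f"malformed grid-mapfile line: {raw!r}")
        dn, accounts = parts[0], parts[1]
        mapping[dn] = accounts.split(",")
    return mapping


def local_account(mapping: dict, dn: str):
    """Return the first listed local account for dn, or None if unmapped."""
    accounts = mapping.get(dn)
    return accounts[0] if accounts else None


SAMPLE = '''
# hypothetical entries
"/O=Grid/OU=umich.edu/CN=Jane Doe" jdoe
"/O=Grid/OU=umich.edu/CN=John Roe" jroe,atlas
'''
m = parse_grid_mapfile(SAMPLE)
print(local_account(m, "/O=Grid/OU=umich.edu/CN=John Roe"))   # jroe
```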
Campus Grid Implementation • Our MGRID challenge has been how to develop and enable a usable, deployable grid infrastructure across different academic/administrative divisions within the University • A key aspect of the challenge is the NMI components, which are intended to “standardize” much of the needed functionality around information flow, authentication, authorization, monitoring and resource delivery • Delivering something which is as easy to use and deploy as possible is a very important…
Distribution and Installation • As we started to integrate NMI components and extend and develop our own concepts, we ran into a major issue: others want to use and take advantage of what we are delivering. • Many of you likely realize the complexity that can surround the installation/configuration of even a single grid component, let alone a complete system involving many components. • We plan to provide PacMan distributions of our software as well as CDs for “bare metal” installs. This is a critically important (and just beginning) effort for us, especially as more users on campus start asking “How can I participate/take advantage of MGRID?”
Ease of Use and Adoption • Early on we realized that any grid solution we developed had to be easy to adopt and use. • MGRID choices have been strongly influenced by this overriding concept: • Using a portal to provide client capability • Leveraging existing authentication and information services as much as possible • Providing tools and an environment for our “virtual grid computer” similar to what a single workstation provides for its users • Thus “Ease of Use and Adoption” is not just for Users but for Administrators and Managers as well!
Authorization • Some of the hardest issues MGRID is facing are related to authorization. • We are tracking packages like PERMIS and Shibboleth to help provide solutions • Secure LDAP (Walden) can help provide a campus-wide resource, building upon existing attributes to help “feed” the authorization policy engines being developed • This is an area of intense interest for us, especially because of our work at Michigan on NFS V4, GridNFS and NTAP
Ongoing and Future Efforts • GridNFS has been funded for 3 years by the NSF NMI Development program • Development of MARS, a “Meta-scheduling” package, is now funded by NSF. • Planning how to merge NTAP into GNMI/Internet2 • Easy-to-use installation and upgrade packages are under development and are critical to our success on campus. • Continue to emphasize standards, ease of use and adoption as our guidelines for delivering functionality • Continue efforts in Authorization and Accounting to produce grid systems that deliver a range of capabilities similar to what individual systems currently provide.
Points to Conclude With… • NMI has been a key component of our efforts • Use of a portal can make access to various distributed resources safe and easy • Making it easy to distribute, deploy and configure middleware has to be a priority if we are to make a real impact. • Working with others is very important: • Learning from their experiences • Input on our directions • Collaborating for common solutions • Michigan plans to continue working with NMI and developing needed infrastructure for successful, effective grids and networks. LUNCH!