Overview

E-Science and Statistical Modelling in Social Research
Daniel Grose, Audrienne Cutajar Bezzina
CQeSS, University of Lancaster

Overview
• Introduction to the GRID
• The problems associated with using the grid
• The need for HPC grid resources. An example - SABRE
• A solution. GROWL – an overview
• Summary
• Questions and Discussion

Introduction to the GRID

The Grid: Some Definitions
• "…is distributed computing across multiple administrative domains"
  • Dave Snelling, senior architect of UNICORE
• […provides] “flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources”
  • From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”
• "…enables communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals…"

Categories of GRID Usage
1. Computational GRID for high-performance computation.
  • Low latency – MPI on an HPC system or cluster
  • High latency – distributed heterogeneous systems
2. Data GRID for sharing and administering large volumes of data.
3. Sensor GRID for real-time monitoring (for example, electronic transactions, traffic and pedestrian flows, environmental features).
4. Access GRID for collaborative visualization involving distant researchers.

Some Examples of Computationally Demanding Statistical Methods
• Financial time series models;
• GIS and spatial data analysis;
• Survival analysis with correlated risks, event history analysis, problem of initial conditions;
• Data mining and data fusion;
• Bootstrap methods;
• Simulation methods;
• Bayesian methods;
• Visualization of multivariate data;
• AI systems for statistical analysis;
• Analysis of new large data sets (e.g. credit card purchases, supermarket store data).

The Power Grid Analogy
The expression "Computational Grid" was coined by analogy with power grids.
• In power grids, you plug in your appliance and draw current, without caring where the power is generated.
• In computational grids, you plug in your application and draw cycles.
It's not yet this simple.

The problems associated with using the grid

The Problems
• A large number of software components are required by a client application to enable the Grid, e.g. security components, resource allocators, schedulers etc.
• Components are difficult to install and manage.
• No integration into existing client research applications (R, S, Stata, MATLAB etc.).
• No well-defined 'work flows'; existing methods are 'ad hoc'.
• '... making applications “Grid enabled” is seen by some as a distraction from getting real science done.' - J. M. Schopf & B. Nitzberg, “Grids: The Top Ten Questions”

Classification of Grid Users (adapted from Foster and Kesselman)

The Problems – A Missing Layer

The need for HPC grid resources. An example - SABRE.

SABRE: Software for the Analysis of Binary Recurrent Events
SABRE is designed to model recurrent events for a collection of individuals or cases, and many other types of repeated measures data with binary, ordinal or count responses. It fits both standard models and various mixture models which allow for residual heterogeneity.
It can be used to fit the following univariate statistical models:
• binary data with logit, probit or complementary log-log link
• ordinal response data using a probit link
• count response data using a log-linear Poisson model
• continuous response using identity link
SABRE employs reweighted least squares (standard homogeneous models) and Newton-Raphson maximum likelihood (random effects binary models) algorithms. Both algorithms have been parallelised using MPI.
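The split-and-sum pattern that makes a likelihood evaluation parallelisable can be sketched as follows. This is a minimal illustration under stated assumptions, not SABRE's implementation: SABRE uses MPI, whereas this sketch uses Python threads purely to show the data decomposition, and the function names and the two-parameter logit model are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def log_lik_chunk(chunk, beta0, beta1):
    """Log-likelihood contribution of one block of (x, y) cases under a logit model."""
    total = 0.0
    for x, y in chunk:
        eta = beta0 + beta1 * x
        p = 1.0 / (1.0 + math.exp(-eta))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def parallel_log_lik(cases, beta0, beta1, n_workers=4):
    """Split the cases into blocks, evaluate each block separately, then sum the partial results."""
    size = max(1, math.ceil(len(cases) / n_workers))
    chunks = [cases[i:i + size] for i in range(0, len(cases), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda c: log_lik_chunk(c, beta0, beta1), chunks)
        return sum(partials)
```

Because each case contributes an independent term to the log-likelihood, the blocks can be evaluated on separate processors and combined with a single sum, which is the role a reduction step would play in an MPI program.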

Run Time Comparisons. SABRE – Parallel SABRE - STATA
Comparisons:
• A random effects logit model fitted in STATA using the xtlogit command with 12-point quadrature.
• SABRE, logit link, 12-point quadrature.
• Parallel SABRE, logit link, 12-point quadrature.
The illustrative months-to-employment dataset contains 3,655,704 monthly observations on 199,881 individuals, with 14,716 non-zero binary outcomes. All comparisons use the same starting values. The total number of model parameters is 54.
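To show what the quadrature in these comparisons computes, here is a minimal sketch of one individual's marginal likelihood in a random-intercept logit model, approximated by Gauss-Hermite quadrature. It is not taken from SABRE or STATA: the function names are hypothetical, a normally distributed random intercept is assumed, and a 3-point rule is hard-coded only to keep the example short (the comparisons above use 12 points).

```python
import math

# 3-point Gauss-Hermite rule (physicists' convention, weight function exp(-x^2)).
NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
WEIGHTS = [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3, math.sqrt(math.pi) / 6]

def logit_prob(y, eta):
    """Probability of the binary outcome y given the linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p

def marginal_likelihood(ys, etas, sigma):
    """Likelihood of one individual's binary sequence, integrating out u ~ N(0, sigma^2)."""
    total = 0.0
    for x, w in zip(NODES, WEIGHTS):
        u = math.sqrt(2.0) * sigma * x  # change of variable to the N(0, sigma^2) scale
        cond = 1.0
        for y, eta in zip(ys, etas):
            cond *= logit_prob(y, eta + u)  # conditional likelihood given u
        total += (w / math.sqrt(math.pi)) * cond
    return total
```

Every likelihood evaluation repeats the inner product over an individual's observations once per quadrature point, which is why the run time depends on both the number of points and the size of the dataset.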

Run Time Comparisons. SABRE – Parallel SABRE - STATA

SABRE Developments
• Has been extended for bivariate analysis.
• Will be extended to trivariate analysis and greater.
• Computational time increases geometrically with the number of variates.
• These developments demand HPC resources.
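One way to see the geometric growth is to count quadrature nodes. The sketch below assumes one random effect per variate and a product quadrature rule with the same number of points in each dimension; this is an illustrative assumption, and SABRE's actual scheme may differ.

```python
def quadrature_evaluations(points_per_dimension, n_variates):
    """Number of grid nodes in a product quadrature rule with one dimension per variate."""
    return points_per_dimension ** n_variates

# With 12 points per dimension, as in the run time comparisons:
for d in (1, 2, 3):
    print(d, quadrature_evaluations(12, d))
# prints:
# 1 12
# 2 144
# 3 1728
```

Under these assumptions each extra variate multiplies the work per individual by the number of quadrature points.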

A solution. GROWL – an overview.

A Solution - GROWL: Grid Resources on a Workstation Library
Project objective: demonstrate a lightweight client/server library that provides:
• Transparent client-side handling of GRID-related issues, e.g. security, file transfer etc.
• Modules, libraries and “plug-ins” that interface with existing client software tools.
• Extensibility via a simple API with common language mappings (C++, C and Fortran).
• A persistent multi-client server linked to existing grid components (primarily the Globus toolkit) providing access to HPC resources, session management, scheduling, authentication etc.

GROWL Architecture

GROWL Architecture: End User

GROWL Architecture: GROWL Developer

GROWL Architecture: Grid Developer

GROWL Architecture: Systems Administrator

Sample R Session
> library("sabreR")
> sabre.session.0<-new.sabre.session()
> sabre.current.session(sabre.session.0)
> data<-list("pid","wave","hid","pno","ivfio","hoh","opfamb","opfamc","opfame",
+ "opfamf","opfamg","opfamh","sex","age12","race","region","jbsic","jbgold",
+ "jbstat","child","qfedhi","mastat",
+ "tenure","emp")
> sabre.data(sabre.session.0,data)
> sabre.yvariate("opfamb")
> sabre.ordinal(5,"cutpt")
> sabre.read("bhps.dat")
> variables<-list("cutpt","sex","jbgold","qfedhi","mastat","tenure","jbsic")
> factors<-list("factor.cutpt","factor.sex","factor.jbgold","factor.qfedhi",
+ "factor.mastat","factor.tenure","factor.jbsic")
> sabre.factor(variables,factors)
> sabre.lfit(model="fcutpt",list("fsex",factors))
> sabre.fit()
> results<-sabre.results()

NGS: National Grid Service
• Fully functional now (since 1st Sept 2004)
• Core comprises
  • JISC-funded nodes
  • Compute clusters at Leeds and Oxford (64 dual processor systems)
  • Data clusters at RAL and Manchester (20 dual processor systems, 18 TB)
  • Access is free at point-of-use, subject to light-weight peer review
  • National HPC services HPCx and CSAR
• Access through UK e-Science (or other recognised) certificates
• First line of support provided by Grid Support Centre
  • until Grid Operation Support Centre is established

UK E-Science Programme
• NeSC, Regional E-Science Centres and Centre of Excellence – DTI Core Programme;
• Pilot projects – EPSRC, ESRC;
• UK National Grid Service + e-Science Grid - JCSR and DTI Core Programme;
• NCeSS: National Centre for e-Social Science – ESRC;
• CQeSS: Collaboration for Quantitative e-Social Science - ESRC (+ future NCeSS nodes);
• VRE/VLE initiatives - JISC.

Summary
• The resources available on the GRID can provide significant benefit for e-social science, for example the use of parallel SABRE in longitudinal studies.
• However, although the software components necessary to provide HPC services and share data on the GRID exist, they require specialist knowledge to install, use and administer.
• Middleware libraries, such as GROWL, are being developed that will enable many quantitative social scientists to make the step to using e-Science technology to solve their problems.
• Furthermore, the middleware layer includes extensions for the applications and environments currently used by social scientists, for example Stata, R and S.

References and Resources
• “The Grid: Blueprint for a New Computing Infrastructure.” I. Foster and C. Kesselman (editors), 1998. Morgan Kaufmann Publishers.
• “The Anatomy of the Grid – Enabling Scalable Virtual Organisations.” I. Foster, C. Kesselman and S. Tuecke, 2001. Intl J. Supercomputer Applications.
• “Grids: The Top Ten Questions.” J. M. Schopf and B. Nitzberg.
• Growl - http://www.growl.org.uk
• Sabre - http://www.cas.lancs.ac.uk/software/sabre3.1/sabre.html
• JISC - http://www.jisc.ac.uk/

References and Resources
• National E-Science Centre - http://www.nesc.ac.uk
• National Centre for e-Social Science - http://www.nces.ac.uk
• National Grid Service - http://www.ngs.ac.uk
• UK Grid Support Centre - http://www.grid-support.ac.uk
• Global Grid Forum - http://www.grids.ac.uk
• Access Grid Support Centre - http://www.agsc.ja.net
• Open Middleware Infrastructure Institute - http://www.omii.ac.uk

What if?
• You could automatically access all of the Archived Data Sets and those used in every social research publication, and decide on the most appropriate data for your research needs, without having to spend days reading through coding schedules and questionnaires;
• You could automatically re-estimate all the models others have used on these data sets, and see what happens if you drop or add new variables to the analysis;
• You could quickly formulate (check the identification etc.) and estimate any new models or combinations of existing models you thought might be relevant;
• You could do this across multiple datasets;
• You could match your research questions to information held in existing digital resources and search for new explanations;
• You could integrate multiple sources of data and text to help fill in missing data and ideas.

Discussion
Your questions. Some questions for you:
• What statistical methods are you currently using? Are they limited by the computer resources available?
• What are you not currently doing that may be possible given appropriate GRID access and resources?
• What statistical applications do you use?
• If you are currently using GRID resources, what are the major problems (if any) that you encounter?
• If you recognise a need for GRID resources but are not currently employing them, why not?
