
Computing In Research



  1. Computing In Research Dr. S.N. Pradhan Professor, CSE Department

  2. Agenda • Introduction • Data analysis and visualization • Interactive Data Language (IDL) • Scilab & Scicos • Symbolic computation • Mathematica / Maxima

  3. A Data Analysis Pipeline [diagram: Raw data → (A) Cleaning / Filtering / Transforming → Processed data → (B) Statistical Analysis / Pattern Recognition / Knowledge Discovery → Hypothesis or Model → (C) Validation → Results → (D) Presentation]

  4. Where can visualization come in? • All stages can benefit from visualization • A: identify bad data, select subsets, help choose transforms (exploratory) • B: help choose computational techniques, set parameters, use vision to recognize, isolate, classify patterns (exploratory) • C: superimpose derived models on data (confirmatory) • D: present results (presentation)
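To make stage A concrete, here is a minimal Python sketch with pandas; the table, column names, and thresholds are all hypothetical, but the three steps match the pipeline: cleaning, filtering, then transforming, before any modeling starts.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table with some bad records mixed in.
raw = pd.DataFrame({
    "node_id": [1, 2, 3, 4, 5],
    "temperature": [21.5, np.nan, 19.8, 400.0, 22.1],  # NaN and an outlier
    "rssi": [-60, -71, -55, -80, -62],
})

# A1: cleaning -- drop missing values and physically impossible readings.
clean = raw.dropna(subset=["temperature"])
clean = clean[clean["temperature"].between(-40, 85)]

# A2: filtering -- select the subset of interest.
subset = clean[clean["rssi"] > -70].copy()

# A3: transforming -- derive a column on a more convenient scale.
subset["log_rssi"] = np.log1p(subset["rssi"].abs())

print(subset)  # exploratory check: a quick look helps spot remaining bad data
```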

  5. What decides how to visualize? • Characteristics of data • Types, size, structure • Semantics, completeness, accuracy • Characteristics of user • Perceptual and cognitive abilities • Knowledge of domain, data, tasks, tools • Characteristics of graphical mappings • What the possibilities are • Which convey data effectively and efficiently • Characteristics of interactions • Which support the tasks best • Which are easy to learn, use, remember

  6. Issues Regarding Data • Type may indicate which graphical mappings are appropriate • Nominal vs. ordinal • Discrete vs. continuous • Ordered vs. unordered • Univariate vs. multivariate • Scalar vs. vector vs. tensor • Static vs. dynamic • Values vs. relations • Trade-offs between size and accuracy needs • Different orders/structures can reveal different features/patterns

  7. User perceptions • What graphical attributes do we perceive accurately? • What graphical attributes do we perceive quickly? • Which combinations of attributes are separable? • Coping with change blindness • How can visuals support the development of accurate mental models of the data? • Relative vs. absolute judgements – impact on tasks

  8. Issues regarding mappings • Variables include shape, size, orientation, color, texture, opacity, position, motion…. • Some of these have an order, others don’t • Some use up significant screen space • Sensitivity to occlusion • Domain customs/expectations

  9. Issues regarding Interactions • Interaction is a critical component • Many categories of techniques • Navigation, selection, filtering, reconfiguring, encoding, connecting, and combinations of the above • Many “spaces” in which interactions can be applied • Screen/pixels, data, data structures, graphical objects, graphical attributes, visualization structures

  10. Importance of Evaluation • Easy to design bad visualizations • Many design rules exist – many conflict, many routinely violated • 5 E’s of evaluation: effective, efficient, engaging, error tolerant, easy to learn • Many styles of evaluation (qualitative and quantitative): • Use/case studies • Usability testing • User studies • Longitudinal studies • Expert evaluation • Heuristic evaluation

  11. Different Views

  12. Mappings • Based on data characteristics • Numbers, text, graphs, software, …. • Logical groupings of techniques (Keim) • Standard: bars, lines, pie charts, scatterplots • Geometrically transformed: landscapes, parallel coordinates • Icon-based: stick figures, faces, profiles • Dense pixels: recursive segments, pixel bar charts • Stacked: treemaps, dimensional stacking
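One of these geometrically transformed views, parallel coordinates, is a single call in pandas. A minimal sketch on synthetic labeled data (the column names and group offsets are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Synthetic stand-in for a labeled multivariate dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=list("abcd"))
df["class"] = np.repeat(["g1", "g2", "g3"], 20)
df.loc[df["class"] == "g2", "b"] += 3  # give one group a visible offset

# Each record becomes a polyline across the parallel axes;
# similar records form bands, outliers stand apart.
parallel_coordinates(df, class_column="class")
plt.show()
```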

  13. Mappings • Based on dimension management (Ward) • Dimension subsetting: scatterplots, pixel-oriented methods • Dimension reconfiguring: glyphs, parallel coordinates • Dimension reduction: PCA, MDS (multidimensional scaling), self-organizing maps • Dimension embedding: dimensional stacking, worlds within worlds
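A minimal dimension-reduction sketch with scikit-learn on a synthetic data matrix; both projections map high-dimensional records to 2-D positions that can then be scatterplotted:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for a real 200x10 dataset

# PCA: linear projection onto the two directions of largest variance.
xy_pca = PCA(n_components=2).fit_transform(X)

# MDS: embedding that tries to preserve pairwise distances.
xy_mds = MDS(n_components=2, random_state=0).fit_transform(X)

print(xy_pca.shape, xy_mds.shape)  # (200, 2) (200, 2)
```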

  14. Sensor Network [photo: sensor lab at Berkeley]

  15. Pairwise link quality [plot: link quality vs. distance between nodes]

  16. Glyphs

  17. Dimensional Stacking • Break each dimension range into bins • Break the screen into a grid using the number of bins for 2 dimensions • Repeat the process for 2 more dimensions within the subimages formed by first grid, recurse through all dimensions • Look for repeated patterns, outliers, trends, gaps
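The recursive grid placement can be sketched in a few lines of Python. This hypothetical helper computes the (x, y) screen cell for one record by stacking dimension pairs from outermost to innermost; the bin counts and record are illustrative:

```python
def stack_cell(binned, bins):
    """Map one record's bin indices to an (x, y) screen cell.

    binned: bin index per dimension, ordered (x0, y0, x1, y1, ...)
    bins:   number of bins per dimension, same order.
    """
    x = y = 0
    for i in range(0, len(binned), 2):
        x = x * bins[i] + binned[i]          # even dims refine x
        y = y * bins[i + 1] + binned[i + 1]  # odd dims refine y
    return x, y

# Example: 4 dimensions, 3 bins each -> a 9x9 grid of cells.
print(stack_cell([2, 0, 1, 1], [3, 3, 3, 3]))  # (7, 1)
```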

  18. Dimensional Stacking

  19. Pixel-oriented technique

  20. Methods to cope with scale • Many modern datasets contain large numbers of records (millions or billions) and/or dimensions (hundreds or thousands) • Several strategies handle scale problems • Sampling • Filtering • Clustering/aggregation • Techniques can be automated or user-controlled
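A minimal sketch of two of these strategies on synthetic data: sampling cuts the record count before plotting, while clustering/aggregation replaces many points with a few representatives:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000_000, 2))  # synthetic stand-in for a huge dataset

# Sampling: plot a random 1% instead of every record.
sample = X[rng.choice(len(X), size=10_000, replace=False)]

# Clustering/aggregation: 50 cluster centers summarize the point cloud
# (fit on a subset here just to keep the sketch fast).
centers = KMeans(n_clusters=50, n_init=10).fit(X[:100_000]).cluster_centers_

print(sample.shape, centers.shape)  # (10000, 2) (50, 2)
```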

  21. Visualization is a powerful component of the data analysis process • Each stage of analysis can be enhanced • Visualization can help guide computational analysis, and vice versa • Multiple linked views and a rich assortment of interactions are key to success

  22. Numerical Recipes in C & C++ • Numerical Recipes in C is a collection (or library) of C functions written by Press et al. • A library of mathematical functions • Useful when doing process or system modeling • Break the model down into known mathematical functions, then use the corresponding routines

  23. GNU Scientific Library • Basic mathematical functions • Complex numbers • Polynomials • Special functions • Vectors and matrices • Permutations • Combinations

  24. • Multisets • Sorting • Linear algebra • Eigensystems • Fast Fourier transforms • Numerical integration (based on QUADPACK) • Random number generation • Quasi-random sequences • Random number distributions • Statistics • Histograms
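GSL itself is a C library, but the flavor of these routine families can be shown through their Python counterparts in numpy/scipy; notably, scipy.integrate.quad wraps the same QUADPACK code that GSL's integrator is based on. A minimal sketch:

```python
import numpy as np
from scipy import integrate, special, stats

# Special functions: Bessel J0 at x = 2.5.
print(special.j0(2.5))

# Numerical integration (QUADPACK): integral of exp(-x^2) over [0, inf).
val, err = integrate.quad(lambda x: np.exp(-x**2), 0, np.inf)
print(val, err)  # ~sqrt(pi)/2, with an error estimate

# Random number generation and basic statistics.
rng = np.random.default_rng(42)
xs = rng.normal(loc=0.0, scale=1.0, size=10_000)
print(xs.mean(), xs.std(), stats.skew(xs))

# Histogram: counts per bin, as GSL's histogram module would produce.
counts, edges = np.histogram(xs, bins=20)
```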

  25. Interactive Data Language • Data manipulation and visualization • A commercially available package: IDL from ITT Visual Information Solutions • Consists of • Data analysis • Data visualization • Animation

  26. Open Source IDL (GDL) • An open-source equivalent of IDL, and more • GDL is used particularly in the geosciences • GDL is dynamically typed, vectorized, and has object-oriented programming capabilities • The library routines handle numerical calculations, data visualisation, signal/image processing, interaction with the host OS, and data input/output • GDL supports several data formats such as netCDF, HDF4, HDF5, GRIB, PNG, TIFF, DICOM, etc. • Graphical output is handled by X11, PostScript, SVG, or z-buffer terminals

  27. Part II • Analysis may, therefore, be categorized as • Descriptive analysis • Inferential analysis (often known as statistical analysis) • Correlation analysis • Causal analysis (regression analysis) • Multivariate analysis

  28. Descriptive analysis • Descriptive analysis is largely the study of distributions of one variable. It provides profiles of companies, work groups, persons, and other subjects on any of a multitude of characteristics such as size, composition, efficiency, preferences, etc. Such analysis may be in respect of one variable (unidimensional analysis), two variables (bivariate analysis), or more than two variables (multivariate analysis). In this context we work out various measures that show the size and shape of a distribution, along with measures of the relationships between two or more variables.

  29. Correlation analysis • Correlation analysis studies the joint variation of two or more variables to determine the amount of correlation between them. In most social and business research, interest lies in understanding and controlling relationships between variables, so correlation analysis is relatively more important.
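A minimal sketch with scipy on synthetic data: the Pearson coefficient quantifies the joint linear variation of two variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # y co-varies with x

r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3g}")  # strong positive correlation, small p
```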

  30. Causal Analysis • Causal analysis (regression analysis) is concerned with the study of how one or more variables affect changes in another variable. It is a study of the functional relationships existing between two or more variables. Causal analysis is considered relatively more important in experimental research.

  31. Multivariate analysis • Multivariate analysis is defined as “all statistical methods which simultaneously analyze more than two variables on a sample of observations”. With the availability of computer facilities, there has been a rapid development of this kind of analysis.

  32. Multiple regression analysis: This analysis is adopted when the researcher has one dependent variable which is presumed to be a function of two or more independent variables. The objective of this analysis is to make a prediction about the dependent variable based on its covariance with all the concerned independent variables.
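A minimal sketch of multiple regression in Python on synthetic data with two independent variables; the fitted coefficients are then used to predict the dependent variable:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=200)

# Design matrix with an intercept column; ordinary least-squares fit.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ~[1.0, 2.0, -0.5]

# Prediction for a new observation with x1 = 1, x2 = 2.
print(beta @ np.array([1.0, 1.0, 2.0]))
```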

  33. Multiple discriminant analysis: This analysis is appropriate when the researcher has a single dependent variable that cannot be measured but can be classified into two or more groups on the basis of some attribute. The object of this analysis is to predict an entity’s likelihood of belonging to a particular group based on several predictor variables.
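A minimal sketch with scikit-learn on synthetic data: linear discriminant analysis learns to assign each observation to one of two groups from two predictor variables.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
# Two groups with shifted means in a 2-D predictor space.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
group = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, group)
print(lda.predict([[0.1, 0.0], [2.1, 1.9]]))  # -> [0 1]
print(lda.predict_proba([[1.0, 1.0]]))        # group membership probabilities
```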

  34. Multivariate analysis of variance (MANOVA): This analysis is an extension of two-way ANOVA, wherein the ratio of among-group variance to within-group variance is worked out on a set of variables.
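A minimal sketch using statsmodels on synthetic data (the column names y1, y2, group are hypothetical): two response variables are tested jointly across three groups.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "y1": rng.normal(size=90),
    "y2": rng.normal(size=90),
    "group": np.repeat(["a", "b", "c"], 30),
})

# Joint test of the group effect on (y1, y2) together.
fit = MANOVA.from_formula("y1 + y2 ~ group", data=df)
print(fit.mv_test())
```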

  35. Canonical analysis: This analysis can be used in case of both measurable and non-measurable variables for the purpose of simultaneously predicting a set of dependent variables from their joint covariance with a set of independent variables.
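A minimal sketch with scikit-learn's CCA on synthetic data: canonical correlation finds paired projections of a dependent set Y and an independent set X that co-vary maximally (here both sets share one latent signal by construction).

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(11)
shared = rng.normal(size=(150, 1))  # common signal linking the two sets
X = np.hstack([shared, rng.normal(size=(150, 2))]) + 0.2 * rng.normal(size=(150, 3))
Y = np.hstack([shared, rng.normal(size=(150, 1))]) + 0.2 * rng.normal(size=(150, 2))

cca = CCA(n_components=1).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
# Correlation of the first canonical variate pair should be high.
print(np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1])
```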

  36. Analysis of variance (ANOVA) is a useful technique for research in the fields of economics, biology, education, psychology, sociology, business/industry, and several other disciplines. The technique is used when multiple sample cases are involved. The significance of the difference between the means of two samples can be judged through either the z-test or the t-test, but the difficulty arises when we need to examine the significance of the differences amongst more than two sample means at the same time. The ANOVA technique enables us to perform this simultaneous test and as such is an important tool of analysis in the hands of a researcher.
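A minimal sketch with scipy on three synthetic samples: f_oneway tests the equality of all three means simultaneously, which repeated pairwise t-tests cannot do without inflating the error rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
g1 = rng.normal(loc=5.0, size=40)
g2 = rng.normal(loc=5.0, size=40)
g3 = rng.normal(loc=6.0, size=40)  # one group with a shifted mean

f, p = stats.f_oneway(g1, g2, g3)
print(f"F = {f:.2f}, p = {p:.3g}")  # small p: at least one mean differs
```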

  37. Role of Simulation • Simulation is a method for mimicking a real system, mostly via computer. Simulation is a numerical technique for conducting experiments on a computer, involving logical and mathematical relationships that interact to describe the behavior and structure of a complex real-world system over extended periods of time.

  38. Simulation makes it possible to study and experiment with the complex internal interactions of a given system • It provides a better understanding of the system • Simulation can be used as a pedagogical device for teaching students • The experience of designing a computer simulation model may be more valuable than the resulting model itself

  39. Simulation can be used to experiment with new situations about which we have little or no information • To verify analytical solutions • Cheap: no need for costly equipment • Complex scenarios can be easily tested • Results can be obtained quickly • More ideas can be tested in less time

  40. Pitfalls of simulation • It cannot provide insight into all possible scenarios; e.g., mobile networks must be tested with different mobility models • Failure to have a well-defined set of objectives at the beginning of the simulation study • Failure to communicate with the decision-maker (or the client) on a regular basis • Lack of knowledge of simulation methodology and of probability and statistics

  41. Real systems may be too complex to model, leading to an inappropriate level of model detail • Failure to collect good system data • Belief that so-called “easy-to-use” simulation packages require a significantly lower level of technical competence • Blindly using simulation software without understanding its underlying assumptions • Replacing a probability distribution by its mean • Failure to perform a proper output-data analysis

  42. Simulation Checklist • Checks before developing the simulation • Is the goal properly specified? • Is the detail in the model appropriate for the goal? • Does the team include the right mix (leader, modeling, programming, background)? • Has sufficient time been planned? • Checks during simulation development • Is the random number generator actually random? (see the sketch below) • Is the model reviewed regularly? • Is the model documented?
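For the "is the random number generator random?" check, one quick sketch is a chi-square goodness-of-fit test against a uniform distribution over ten bins (a seeded generator is used here so the check is repeatable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)
u = rng.random(10_000)  # samples that should look like U(0, 1)

counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
chi2, p = stats.chisquare(counts)         # expected: equal counts per bin
print(f"chi2 = {chi2:.1f}, p = {p:.3f}")  # a small p would flag non-uniformity
```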

  43. Checklist cont… • Checks after simulation is running • Is simulation length appropriate? • Are initial transients removed? • Has model been verified? • Has model been validated? • Are there any surprising results? If yes, have they been validated?

  44. Terminology • State variables • Variables whose values define the current state of the system • Saving them allows the simulation to be stopped and restarted later by restoring all state variables • Event • A change in system state • Ex: three events: arrival of a job, beginning of a new execution, departure of a job
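These terms can be made concrete with a minimal discrete-event simulation sketch in Python: the state variable is the queue length, and arrival/departure events sit in a time-ordered event list. The arrival and service rates are illustrative, not from the source.

```python
import heapq
import random

random.seed(1)
events = [(random.expovariate(1.0), "arrival")]  # (time, kind) event list
queue_len = 0                                    # state variable
t_end, served = 100.0, 0

while events:
    t, kind = heapq.heappop(events)              # next event in time order
    if t > t_end:
        break
    if kind == "arrival":
        queue_len += 1
        heapq.heappush(events, (t + random.expovariate(1.0), "arrival"))
        if queue_len == 1:                       # server was idle: start service
            heapq.heappush(events, (t + random.expovariate(1.25), "departure"))
    else:                                        # departure event
        queue_len -= 1
        served += 1
        if queue_len > 0:                        # begin the next job's service
            heapq.heappush(events, (t + random.expovariate(1.25), "departure"))

print(f"jobs served by t={t_end}: {served}")
```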

  45. Continuous-time and discrete-time models • If state is defined at all times → continuous • If state is defined only at instants → discrete • Ex: a class that meets M-F 2-3 is discrete, since its state is not defined at other times • Continuous-state and discrete-state models • If uncountably infinite → continuous • Ex: time spent by students on hw • If countable → discrete • Ex: jobs in the CPU queue • Note: continuous time does not necessarily imply continuous state, and vice versa • All combinations are possible

  46. Deterministic and probabilistic models • If output is predicted with certainty → deterministic • If output differs for different repetitions → probabilistic • Ex: for proj1, dog type-1 makes the simulation deterministic but dog type-2 makes the simulation probabilistic
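One practical consequence worth noting: a probabilistic model becomes repeatable across runs (effectively deterministic for debugging purposes) once its random number generator is seeded. A tiny sketch:

```python
import random

def run(seed=None):
    """One 'replication' of a toy probabilistic model."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(5))

print(run(42) == run(42))  # True: seeded runs repeat exactly
print(run() == run())      # almost surely False: unseeded runs differ
```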

  47. Static and dynamic models • If time is not a variable → static • If the model changes with time → dynamic • Ex: a CPU scheduler is dynamic, while the matter-to-energy model E = mc² is static • Linear and nonlinear models • If output is a linear combination of input → linear • Otherwise → nonlinear

  48. Open and closed models • If input is external and independent → open • A closed model has no external input • Ex: if the same jobs leave and re-enter the queue the model is closed, while if new jobs enter the system it is open • Stable and unstable models • If model output settles down → stable • If model output always changes → unstable

  49. Selecting a Simulation Language • Four choices: simulation language, general-purpose language, extension of a general-purpose language, simulation package • Simulation language: built-in facilities for time steps, event scheduling, data collection, reporting • General-purpose language: known to the developer, available on more systems, flexible • The major difference is the cost trade-off: a simulation language requires start-up time to learn, while a general-purpose language may require more time to add simulation facilities
