Opportunities in Statistical Software: Phystat Workshop

Jim Linnemann, MSU, March 1, 2004

Preliminaries • Be sure to get a parking permit from Lorie Neuman (room 4218, x2180) • Wireless: Tom Rockwell can help if you can't get access; you should get a direct connection to the outside world • DHCP, with an address starting with 10. • If you need to print something, email it to linnemann@pa.msu.edu • Introductions

Why you? • You, the developers, can actually change things! • I would personally like a better analysis environment for HEP. • I keep hearing about R from statisticians! • I am convinced astronomers and HEP together will get something better than either has alone. • And maybe we will have some things that statisticians can use, too. • I suggested to Brad Efron using arxiv.org for statistics. • I subscribe to the “right people in a room” theory.

What Can We Accomplish? • We won’t convince anyone to drop what they do now and adopt product xxx instead! • But we might benefit from seeing different development cultures, work styles, or interesting ideas • We might find ways to make interfaces across projects, or identify common projects • If this starts to look interesting, we can spend more time on sharpening this up • The “agenda” can be revised at any time!

Sociology • HEP experiments: own data reduction software (C++) • Usually develop common tools used by the whole collaboration • Use more generic software as tools, and for final data analysis • Particle astrophysics is similar, but more Fortran/C • HEP cross-experiment software is lab-dominated: CERN, Fermilab, SLAC, DESY, KEK, Brookhaven • Some instances of cross-lab collaboration • Grid computing is one of the few major non-lab software projects • Some tools are university-based (specific simulations) • Typically free to the community, but not GNU… • Smaller packages: repositories not that well developed • Not much commercial software: Office; Mathematica/Maple; some Mathcad/MATLAB/KaleidaGraph • IDL much less used than in astronomy: HEP is not as image-oriented • LaTeX; Ghostview; gnuplot-like tools • Statistics: more distributed? • Astronomy: more large software grants?

Some Possible Goals • Repository sponsorship • Web or Python interfaces to libraries • Root user package repository? • Interfaces between R and Root • GUI for R? • R scripting in Root? R libraries in Root? • Handling of larger datasets in R?

HEP Small Packages • Example: calculation of significance or limits from observed counts, estimated background, uncertainties, efficiencies, etc. • Several competing procedures • Some are published (PHYSTAT; NIM) • Standard programs are not on public, recognized web sites: you have to know the author, or someone in the collaboration implements the method and maybe posts it or puts it in a local repository • Programs are not collected by the Particle Data Group, which publishes a generally recognized review of methods

Questions to see differences: • Goals + strengths • What would you like to add next? • User community: Who? How many? Platforms? • User interface: GUI, Scripting, Web, link library, code? • Documentation: how? Quality? • How big is developer community? • How are contributions made/tested/integrated? • Releases and bug tracking mechanisms • Implementation language(s) • Licensing/distribution

Proposed Presentations • Rene Brun: Root data mining in HEP • Eric Feigelson: VOSTATS R in astronomy? • Luke Tierney: R (and omegastats?) • Who? Frustrating Examples • Sherry Towers: TerraFerMA, classification in HEP • Adam Lyon: Using R in HEP • Scott Snyder: Alternative Root interfaces • Tim Beers: Rostat, robust legacy code • Right order? Space out or bunch? • First pass quickly to survey, then reconsider? • Discussion during presentation or after?

Other possible activities • Discussion/panel: • What do users want? • How could projects reinforce one another • Selecting achievable goals • What are options for Fermilab projects? • Technical Working Group(s) • Specifics, e.g. root/R interface (brass tacks) • Planning of joint projects? • Planning of further workshops? • Developer or user oriented? • Post Talks to web? • Semi-private (developer use)? • Or public, with publicity to users

Some projects that got away (particularly Python-based) • StatPy: Tom Loredo • Python interface to Root: Harrison Prosper • Orange and related (Python): Aleks Jakulin • Jas: Java analysis framework

Restaurant: Villegas, 6:15 pm • N. to Grand River; E. 3.2 mi., past Okemos Rd and Marsh Rd • 1735 W. Grand River, 347-2080 (on right before Dobie) • Central Park BMPS

Dessert: Jim & Ruth Linnemann, 1217 Ascot Pl, 349-6138 • Continue E. (right) on Grand River • Left at Cornell Rd (1 mi.) • Right at Ascot Place (3rd right; 2 miles or so) • 1st drive on right of Ascot

Example 1: 2 sample classification • Plot signal efficiency vs. background rejection curves (ROC) • Selection is based on a set of variables (or combinations of variables) • Click on an efficiency value to find the corresponding selection criterion in the original variables • Superimpose curves for several candidate variable selections • Data: look in a coordinated fashion at two separate data sets with related but non-identical data structures • HEP data are usually tree-structured: many instances, each including a variable number of lower-level objects, typically 2 or more levels down • I might analyze these by forming a variable number of derived variables from the low-level objects • Much of this process is algorithmic, but I wind up redoing it by hand each time I try it.
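The efficiency vs. rejection curve in Example 1 can be sketched in a few lines of Python. This is only a minimal illustration, assuming a single discriminating variable and a one-sided cut `x > threshold`; the function names `roc_points` and `cut_for_efficiency` and the toy numbers are hypothetical and come from no existing package.

```python
def roc_points(signal, background, thresholds):
    """For each cut x > threshold, return (threshold, signal efficiency,
    background rejection). Assumes one variable and a one-sided cut."""
    points = []
    for t in thresholds:
        efficiency = sum(1 for x in signal if x > t) / len(signal)
        rejection = sum(1 for x in background if x <= t) / len(background)
        points.append((t, efficiency, rejection))
    return points


def cut_for_efficiency(points, target_efficiency):
    """Stand-in for 'click on an efficiency value': return the scanned
    point whose signal efficiency is closest to the requested one."""
    return min(points, key=lambda p: abs(p[1] - target_efficiency))


# Toy values of the discriminating variable (made up for illustration).
signal = [0.9, 0.8, 0.75, 0.6, 0.4]
background = [0.1, 0.2, 0.35, 0.5, 0.7]

points = roc_points(signal, background, [0.0, 0.25, 0.5, 0.75, 1.0])
best = cut_for_efficiency(points, 0.8)
print(best)  # (threshold, efficiency, rejection) nearest 80% efficiency
```

Superimposing several candidate selections would amount to calling `roc_points` once per candidate variable and plotting the resulting curves together.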

Ex 2: No integrated repository • End of an analysis: a sample of data events, and an expected set of possible backgrounds, each with an uncertainty. • Want to calculate a statistical significance (or 90% CL limit) for these. • Usually have to extract these numbers and then find a completely separate piece of software, either in someone's private area, or on the web, or, if I'm really lucky, in a macro someone's written. • There aren't good central mechanisms (repositories or interactive web sites) for sharing such algorithms, either.
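As an illustration of the kind of small calculation Example 2 has in mind, here is a minimal Python sketch of one of the simplest possible procedures: a Poisson tail probability for the observed count, converted to a Gaussian-equivalent significance, and a classical 90% CL upper limit on the signal. It assumes the background is known exactly, so the background uncertainty mentioned in the slide is deliberately ignored. The function names are hypothetical, and this is not one of the competing published procedures.

```python
import math
from statistics import NormalDist


def poisson_cdf(n, mean):
    """P(N <= n) for N ~ Poisson(mean); returns 0 for n < 0."""
    term = math.exp(-mean)  # P(N = 0)
    total = 0.0
    for k in range(n + 1):
        total += term
        term *= mean / (k + 1)  # P(N = k + 1) from P(N = k)
    return total


def p_value(n_obs, background):
    """P(N >= n_obs) under the background-only hypothesis."""
    return max(0.0, 1.0 - poisson_cdf(n_obs - 1, background))


def z_value(p):
    """One-sided Gaussian-equivalent significance; needs 0 < p < 1."""
    return NormalDist().inv_cdf(1.0 - p)


def upper_limit(n_obs, background, cl=0.90):
    """Classical upper limit on the signal mean s, solving
    P(N <= n_obs | s + background) = 1 - cl by bisection.
    Converges to 0 if even s = 0 is already excluded."""
    lo, hi = 0.0, 10.0 + 10.0 * n_obs
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n_obs, mid + background) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p = p_value(10, 3.0)        # 10 events seen, 3.0 expected background
z = z_value(p)
s_up = upper_limit(0, 0.0)  # no events, no background
print(p, z, s_up)
```

A shared repository would mainly add what this sketch leaves out: treatment of uncertainties and efficiencies, and agreement on which procedure is being used.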

Ex 3: New Statistical Methods • While the environment I'm used to is good at exploring and fitting large data sets, the number of statistical methods that are part of that framework is limited. • I'd like to be able to apply many of the tests I might find in a textbook to comparing two distributions. • Or I'd like to perform bootstrap calculations or “ensemble tests” without writing a “toy Monte Carlo” from scratch, to estimate the statistical uncertainty of my fitting results with simulated experiments. • These tests exist in R, but my data is in Root.
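To make the bootstrap idea in Example 3 concrete, here is a minimal Python sketch using only the standard library. It assumes the "fit" is the maximum-likelihood estimate of an exponential lifetime, which is just the sample mean, and it generates its own toy data; the function name `bootstrap_std_error`, the seeds, and the sample sizes are illustrative assumptions, not part of Root or R.

```python
import random
import statistics


def bootstrap_std_error(data, estimator, n_resamples=1000, seed=12345):
    """Estimate the statistical uncertainty of estimator(data) by
    resampling the data with replacement and taking the spread of
    the re-estimated values."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # same size, with replacement
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


# Toy data: 500 decay times drawn from an exponential with lifetime 2.0.
toy_rng = random.Random(1)
lifetimes = [toy_rng.expovariate(1.0 / 2.0) for _ in range(500)]

# For an exponential, the maximum-likelihood lifetime is the sample mean.
tau_hat = statistics.fmean(lifetimes)
tau_err = bootstrap_std_error(lifetimes, statistics.fmean)
print(tau_hat, tau_err)
```

The estimator is passed in as a function, so a real fit could replace `statistics.fmean` without changing the resampling loop. An ensemble test differs in that each pseudo-experiment is generated from the fitted model instead of being resampled from the data.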

Root: key features • GUI for presentation graphics and selection (“cuts”) • I/O for tree-structured data: scales to petabytes • Histogram as base metaphor (akin to vector) • Sophisticated nonlinear fitting • C++ at command line, macros, compiled macros

R: key features • Elegant data manipulation with the S language: command prompt and macros; interpreted, heading to byte-compilation • GUI: only now building hooks; most users are satisfied with the command line • Standard tool of professional research statisticians • Sophisticated graphics: standard statistical plots not used in HEP, but histograms with error bars are missing • Links to further multidimensional graphics (GGobi) • Data in virtual memory • Data frames: vectors are a basic metaphor (cf. histogram in Root) • Interfaces to databases (PostgreSQL; MySQL) • Parallel computation under development • Broad package library, with trivial download
