
Data collection and Statistics






Presentation Transcript


  1. Data collection and Statistics Evert Jan Bakker and Gerrit Gort Biometris - Wageningen University

  2. Introduction: What is Statistics? • Probability calculus: theoretical and exact. • Descriptive statistics: just describes the data. All conclusions refer only to the sample, and are therefore 'always correct'. Can already be convincing. Includes graphical representations of the data. • Inference (test of hypothesis, estimation of a confidence interval): conclusions are drawn about a population (e.g. Wageningen students) or a general phenomenon (maize yield), using only data from a limited sample. • Experimental design / sampling design: randomisation, blocking, special designs, ... / sample size.

  3. Inference • An experiment used for inference: • Question / hypothesis • Design of the experiment ← Statistics • Carry out the experiment • Analysis of the experimental data ← Statistics • For standard designs, the data analysis follows a fixed calculation pattern, which is known before the experiment is done.

  4. 2 types of research aims 1. Exploration: generate new ideas. Measure many response variables; report any fact of interest / relationship / difference, using "any" descriptive analysis. 2. Inference (test / confidence interval): drawing conclusions about a population or a general phenomenon based on sample data. Inference has to be done according to the rules, so as not to 'Lie with Statistics'. The model of analysis should be reasonable.

  5. Qualitative vs quantitative data. Example of a qualitative observation: "green".

  6. Data collection • Primary data collection: • for observational research: sampling - how? how many? • for experimental research: design of the experiment (choice of experimental units, randomisation, measurement of the response(s), nr. of replications). • In case secondary data are used: know how the data were obtained (meta-data). Otherwise the conclusion will be about an unknown population. • Sampling: random, stratification, subsampling, ... A conclusion can be drawn about the population from which a random sample was taken.

  7. Design principles: brief overview 1. Repetition (n > 1) • required for more precision. 1-sample example: the standard deviation of the sample mean ȳ is σ/√n. • required to know the natural variation. 2-sample example: the difference ȳ1 − ȳ2 must be compared with the natural variation, which is impossible without repetition. 2. Random drawing / random allocation of treatments • no bias (systematic error) • introduction of chance into the system.
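
A minimal numerical check of the σ/√n rule mentioned on this slide, sketched in Python; σ = 10 and the sample sizes are illustrative choices, not values from the slides.

```python
import numpy as np

# Numerical check of the rule: the standard deviation of a sample mean is sigma/sqrt(n).
# sigma = 10 and the sample sizes are illustrative choices, not taken from the slides.
rng = np.random.default_rng(1)
sigma = 10.0
for n in (1, 4, 16, 64):
    # 10,000 simulated experiments, each giving one sample mean of n observations
    means = rng.normal(loc=0.0, scale=sigma, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  simulated sd of mean = {means.std(ddof=1):5.2f}  "
          f"theory sigma/sqrt(n) = {sigma / np.sqrt(n):5.2f}")
```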

  8. Design principles: brief overview (2) 3. Increase homogeneity: keep all experimental units as similar, and under conditions as similar, as possible - except for the conditions set by the treatment. 4. Measure other variables that may influence the response → use them in the analysis as covariates. 5. In case of other known possible sources of variation: blocking → create homogeneous groups (blocks). In the analysis, block effects can then be corrected for: without blocking, Total variation = Treatment effect + Error; with blocking / covariates, Total variation = Treatment effect + Block/covariate effect + Error.
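
A rough illustration of that decomposition: the Python sketch below simulates a blocked comparison of two treatments and shows how blocking moves the block-to-block variation out of the error term. All numbers (the sd's, the treatment effect, the number of blocks) are invented for the illustration and are not from the slides.

```python
import numpy as np

# Illustration of "Total variation = Treatment effect + Block effect + Error":
# within a block the block effect cancels, so blocking removes that source of
# variation from the error. All numbers here are invented for illustration; the
# large number of blocks is only there to make the numerical check clear.
rng = np.random.default_rng(2)
n_blocks, sd_block, sd_error, treat_effect = 500, 3.0, 1.0, 2.0

block = rng.normal(0.0, sd_block, size=n_blocks)
y_control = block + rng.normal(0.0, sd_error, size=n_blocks)
y_treated = block + treat_effect + rng.normal(0.0, sd_error, size=n_blocks)

diffs = y_treated - y_control                      # block effect drops out of the difference
print("estimated treatment effect      :", round(diffs.mean(), 2))                    # ~ 2.0
print("error sd after blocking         :", round(diffs.std(ddof=1) / np.sqrt(2), 2))  # ~ sd_error
resid = np.r_[y_control - y_control.mean(), y_treated - y_treated.mean()]
print("residual sd ignoring the blocks :", round(resid.std(ddof=1), 2))  # ~ sqrt(sd_block**2 + sd_error**2)
```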

  9. Lessons, also from personal experience • Own PhD experience: not believing the results led to an extra year of analyses! • Lesson: know your analysis in advance. • Real-life research experience in Mali: choice of the experimental units.

  10. Cows observed in pasture land - example During 10 days, 3 cows are observed, one per observer, during 8 hours, 12 times per hour, for 60 seconds each time. Measurement: amount of time spent walking (%) = y. Result for walking (%) between 10 and 12 a.m.: 72 observations in total; suppose the within-cow standard deviation is sE = 10. Some cows walk more than others; suppose the between-cow standard deviation of the mean time spent walking is sC = 4. Suppose ȳ = 20. What is the standard error?

  11. Cows example y = C + E, with C = mean for a (random) cow and E = deviation = measurement − C. Var(ȳ) = Var(C̄) + Var(Ē) = 4²/3 + 10²/72 = 5.33 + 1.39 = 6.72. So, using 1 cow per observer: se(ȳ) = √6.72 = 2.6. If 2 cows per observer were used (6 cows, the same 72 observations in total): Var(ȳ) = 4²/6 + 10²/72 = 4.06, so se(ȳ) = √4.06 = 2.01. If 4 cows per observer were used: se(ȳ) = 1.65.
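
The calculation above can be reproduced with a few lines of Python. One assumption is made explicit here because the slide does not state it: the observers' total effort stays fixed at 72 observations however many cows it is split over. Under that assumption the standard errors 2.6, 2.01 and 1.65 quoted on the slides come out.

```python
import math

# Sketch of the cows calculation: Var(ybar) = s_C**2 / n_cows + s_E**2 / n_obs.
# Assumption (not stated on the slide): the total number of observations stays
# at 72 while the number of cows per observer changes.
s_C, s_E, n_obs = 4.0, 10.0, 72          # between-cow sd, within-cow sd, total observations

for cows_per_observer in (1, 2, 4):
    n_cows = 3 * cows_per_observer       # 3 observers
    var_ybar = s_C**2 / n_cows + s_E**2 / n_obs
    print(f"{cows_per_observer} cow(s) per observer: se(ybar) = {math.sqrt(var_ybar):.2f}")
```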

  12. Cows example • Make sure to think about the sources of variation. Important sources of variation need to be sampled independently, and often enough. Here the repeated observations within a cow were pseudo-replications: the many within-cow observations gave a very precise estimate of the mean walking % for each of the 3 cows, but not of the overall mean. • Experimental / sampling units: the units to which a treatment is assigned / that were randomly sampled. Measured units: the units on which measurements are taken. Example: pens vs chickens within a pen.

  13. Sample size calculations: 2 treatments • 2 hypothetical populations, one for each treatment, with population means μ1 and μ2. • Parameter of interest: Δ = μ1 − μ2. • Samples: y1,1, …, y1,n1; y2,1, …, y2,n2. • Model = assumptions: the data are outcomes of n1 and n2 independent drawings from N(μ1, σ1) and N(μ2, σ2). Extra assumption: σ1 = σ2 = σ.

  14. 3 (of many) possible realities • Δ = 0 (no difference) • Δ = Δ1 (large difference) • Δ = Δ2 (small difference) Assumed: normality and σ1 = σ2.

  15. Testing: reality vs. conclusion Given a relevant Ha reality (a value for Δ), and given α (e.g. 0.05), the power of a planned experiment can be calculated.
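
To make "the power can be calculated" concrete, here is a small Python sketch using the normal (known-σ) approximation for a two-sided, two-sample comparison; the values of Δ, σ, n and α are illustrative only and are not taken from the slides.

```python
from scipy.stats import norm

# Approximate power of a two-sided, two-sample z-test (sigma treated as known).
# delta, sigma, n and alpha are illustrative values only.
def power_two_sample(delta, sigma, n, alpha=0.05):
    se = sigma * (2.0 / n) ** 0.5            # standard error of the difference in means
    z_crit = norm.ppf(1 - alpha / 2)
    shift = delta / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

print(round(power_two_sample(delta=5, sigma=10, n=64), 2))   # about 0.81
```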

  16. Simulations to mimic the test result • Excel: simulations 2 samples.xls • One experiment, with its test, is repeated 200 times. • We assume that σ is approximately known. • We can vary the "reality" Δ = μ1 − μ2: assume that Δ is so-and-so much, then see how frequently H0 is rejected (= power of the test). • We can vary the sample size n (= n1 = n2). • We can vary α. • We can then simulate the power (demonstration of the simulation program).
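
The Excel file itself is not reproduced here, but the idea behind it can be sketched in Python: repeat the two-sample experiment many times under an assumed reality and count how often H0 is rejected. The 200 repetitions and α = 0.05 follow the slides; μ1, Δ, σ and n below are illustrative values.

```python
import numpy as np
from scipy.stats import ttest_ind

# Monte Carlo sketch of the simulation idea: assume a "reality" (delta, sigma),
# repeat the experiment 200 times, test each time, and count rejections of H0.
# mu1, delta, sigma and n are illustrative values, not taken from the slides.
rng = np.random.default_rng(3)
mu1, delta, sigma, n, alpha, n_sim = 20.0, 5.0, 10.0, 30, 0.05, 200

rejections = 0
for _ in range(n_sim):
    sample1 = rng.normal(mu1, sigma, size=n)
    sample2 = rng.normal(mu1 + delta, sigma, size=n)
    _, p_value = ttest_ind(sample1, sample2, equal_var=True)   # two-sample t-test
    rejections += p_value < alpha

# The printed fraction varies from run to run around the true power (about 0.48
# for these settings); with only 200 repetitions the estimate is fairly noisy.
print(f"estimated power: {rejections / n_sim:.2f}")
```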

  17. Formula for sample size: confidence interval • Formula (n per sample) for a (1 − α) C.I. with error margin ≤ M: n ≥ 2(tα/2 σ / M)², with tα/2 ≈ 2.0-2.2. • Precision criteria that have to be specified: 1 − α = confidence level and M = maximum error margin. • Notes: 1) σ has to be estimated; 2) if α = 0.05, z = 1.96; 3) if the outcome for n is small (< 10), redo the calculation with the t-value on df = 2(n − 1); 4) in testing, instead of M we specify Δ, the minimum relevant difference, and β (= 1 − power).
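
A sketch of this recipe in Python, under the reading that the margin refers to the confidence interval for μ1 − μ2 with n per sample (hence the factor 2 and df = 2(n − 1)); σ and M in the example call are illustrative values.

```python
import math
from scipy.stats import norm, t

# Sample size per group so that the (1 - alpha) CI for mu1 - mu2 has error margin <= M,
# following the slide's notes: start from z = 1.96, and if n comes out small (< 10)
# redo the calculation with the t-value on df = 2(n - 1).
def n_per_sample(sigma, M, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    n = math.ceil(2 * (z * sigma / M) ** 2)
    if n < 10:
        t_val = t.ppf(1 - alpha / 2, df=2 * (n - 1))
        n = math.ceil(2 * (t_val * sigma / M) ** 2)
    return n

print(n_per_sample(sigma=10, M=5))   # 31 per sample (sigma and M are illustrative)
```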

  18. 2C. Power calculation with Russ Lenth • Lenth, R. V. (2006). Java Applets for Power and Sample Size [Computer software]. Retrieved March 15, 2009, from http://www.stat.uiowa.edu/~rlenth/Power. • Example: estimate p = the fraction of babies with constipation (< 0.2) with an error margin of at most 1%. Define y = 1 (yes) or 0 (no). Then Var(y) = σ² = p(1 − p) < 0.2 × 0.8 = 0.16. → formula: n ≥ …
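
The "n ≥ …" left open on the slide follows from the usual large-sample margin formula M = z√(p(1 − p)/n). A worked version in Python, assuming a 95% confidence level (the slide does not state α):

```python
from scipy.stats import norm

# Worked version of the example, using the standard large-sample formula
# n >= z**2 * p * (1 - p) / M**2, with p(1 - p) bounded by 0.2 * 0.8 = 0.16.
# The 95% confidence level (z = 1.96) is an assumption; the slide does not state it.
z = norm.ppf(0.975)
n = z**2 * 0.16 / 0.01**2
print(round(n))   # about 6147 babies
```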

  19. Conclusions • In the design phase: • Think about the relevant "sources of variation" (influential factors): which of them will you include in the design, and which of them will you keep constant? Block design? Split plot? • Measure conditions that vary (weather, ...). • Measure general conditions (even if they do not vary across treatments in your experiment). • Avoid / be aware of pseudo-replication: experimental unit vs measured unit, sampling unit vs measured unit. • Correct randomisation.

  20. Analysis • Conclusions from a statistical analysis are drawn in the context of a statistical model. The correctness and the relevance of the conclusions depend on the correctness and the relevance of the model. • Model = assumptions about the observations: • Systematic part: how the mean value of the response depends on the factor levels / factor-level combinations. • Random part: independence, normality and equal variance (independence follows from correct randomisation).

  21. Conclusions • For sample size calculations, the researcher must • know beforehand which analysis she will perform with the collected data; • specify the research goals in terms of precision requirements: minimum relevant difference Δ, power (0.8/0.9), α (5%); • know the error variation σ (guess: range/4); → then decide on the sample sizes (Russ Lenth Power). • Measure and store quantitative data, when possible, not binary data.

  22. Conclusions • Enter the data once, in a database. Use programs to derive calculated variables or partial data sets on which you do an analysis. • In case of need, contact a statistician !... beforehand.
