
ENVM 7015 a) Importance of peer-reviewed journal articles b) Application of statistics in research






Presentation Transcript


  1. ENVM 7015 a) Importance of peer-reviewed journal articles b) Application of statistics in research

  2. Types of journal papers • Review • Descriptive (i.e. survey) • Hypothesis testing (i.e. theory) • Others: commentary, letter to the editor, news etc. • Why are they so important? Why should we cite these articles rather than websites or newspapers?

  3. Process for publishing a peer-reviewed article: (1) Paper preparation (data collection, data analysis, writing in a scientific way following the journal style). (2) Submission for peer review (the editor may reject the paper at this stage or send it for review by at least 2 referees). (3) Editor's decision (the editor may reject or accept* the paper after considering the reviewers' comments; *with minor or major revision). (4) If rejected: respond to the comments, revise the paper, and submit it to another journal. If accepted with revision: respond to the comments, revise the paper, and return it to the editor. (5) Editor's decision (publish the paper or send it out for further review).

  4. Familiarise yourself with Web of Science!

  5. Journal Citation Reports indicates the rank of a journal!

  6. A new, powerful search engine: SCOPUS!

  7. How to conduct a questionnaire survey? • How to conduct a good survey: http://student.bmj.com/back_issues/0501/education/143.html • How to conduct a good interview: http://www.nonprofitstaffing.com/307.asp • Good practice in the conduct and reporting of survey research: http://intqhc.oxfordjournals.org/cgi/content/full/15/3/261 • Basic data analysis with Excel: http://www-mariachi.physics.sunysb.edu/wiki/images/2/28/Basic_Data_Analysis.ppt

  8. STATISTICS: • Derived from the Latin for “state” - governmental data collection and analysis • Study of data (branch of mathematics dealing with numerical facts i.e. data) • The analysis and interpretation of data with a view toward objective evaluation of the reliability of the conclusions based on the data

  9. Five Different Types of Statistical Analysis • Descriptive analysis – data distribution • Inferential analysis – hypothesis testing • Differences analysis – hypothesis testing • Association analysis – correlation • Predictive analysis – regression

  10. Descriptive vs. Inferential Statistics A Hypothesis: • A statement relating to an observation that may be true but for which a proof (or disproof) has not been found • The results of a well-designed experiment or data collection may lead to the proof or disproof of a hypothesis

  11. [Diagram: population, samples and sub-samples]

  12. For example: heights of males vs. females at age 25. Our observation: male H > female H; this may be linked to genetics, diet, exercise etc. Is it true that male H > female H? i.e. null hypothesis: male H ≤ female H. Scenario I: randomly select 1 person from each sex. Male: 170; Female: 175. Is female H > male H, then? Scenario II: randomly select 3 persons from each sex. Male: 171, 163, 168; Female: 160, 172, 173. What is your conclusion then?

  13. Important take-home messages here: • Sample size is very important and will affect your conclusion. • Measurement results vary among samples (or subjects) – that is "variation" or "uncertainty". • Variation can be due to measurement errors (random or systematic errors) and to inherent variation within samples. For example, at age 20, female height varies from 158 to 189 cm. Why? • Therefore, in statistics we always deal with distributions of data rather than a single measurement or event.

  14. Different levels of measurement: (1) nominal, (2) ordinal, (3) interval or ratio scale. [Figure: example scales for each level of measurement]

  15. Measurements of location: the mean. Mean = sum of values/n, i.e. x̄ = Σxᵢ/n. e.g. lengths of 8 fish larvae at day 3 after hatching: 0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm; mean length = (0.6 + 0.7 + 1.2 + 1.5 + 1.7 + 2.0 + 2.2 + 2.5)/8 = 1.55 mm. [Figure: the 8 lengths and their mean plotted on a 0–4 mm scale]

  16. Median, percentiles and quartiles. The median is the value at position (n + 1)/2 in the ordered data; for odd n this is a whole-number position, while for even n it falls midway between two values, so average them. e.g. 0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm (order 1–8); position = (8 + 1)/2 = 4.5, so median = 50th percentile = (1.5 + 1.7)/2 = 1.6 mm. Position for Q1 = 25th percentile = (8 + 1)/4 = 2.25, so Q1 = 0.7 + (1.2 - 0.7)/4 = 0.825 mm. [Figure: mean and median marked on a 0–4 mm scale]
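As a quick cross-check of the mean, median and Q1 worked out above, here is a minimal sketch (not part of the original slides; Python and NumPy are assumptions, and NumPy's 'weibull' percentile method is chosen only because it matches the slide's (n + 1)-based rule):

    import numpy as np

    lengths = np.array([0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5])  # larval lengths (mm)

    print(np.mean(lengths))      # 1.55
    print(np.median(lengths))    # 1.6
    # The slide's (n + 1)-based rule corresponds to NumPy's 'weibull' method
    # (NumPy >= 1.22); the default 'linear' method would give 1.075 instead.
    print(np.percentile(lengths, 25, method="weibull"))   # 0.825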

  17. The median is often reported alongside the mean. The mean is used much more frequently; however, the median is a better measure of central tendency for data with a skewed distribution or outliers. [Figure: mean vs. median for a symmetric and a skewed distribution on a 0–4 mm scale]

  18. Other measures of central tendency • Range midpoint (midrange) = (max value + min value)/2 • not a good estimate of the mean and seldom used • Geometric mean = ⁿ√(x₁x₂x₃…xₙ) = 10^[mean of log₁₀(xᵢ)] • Only for positive ratio-scale data • If the data are not all equal, geometric mean < arithmetic mean • Used for averaging ratios where each ratio should carry equal weight
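A minimal sketch of the two equivalent geometric-mean formulas (not part of the original slides; Python/NumPy and the two illustrative values are assumptions):

    import numpy as np

    x = np.array([2.0, 8.0])                 # hypothetical positive ratio-scale values
    geometric = np.prod(x) ** (1 / len(x))   # nth root of the product -> 4.0
    via_logs = 10 ** np.mean(np.log10(x))    # same result via the log10 form -> 4.0
    print(geometric, via_logs, np.mean(x))   # geometric mean 4.0 < arithmetic mean 5.0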

  19. Measurements of dispersion: range. e.g. lengths of 8 fish larvae at day 3 after hatching: 0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm; range = 2.5 - 0.6 = 1.9 mm (or say from 0.6 to 2.5 mm). Percentiles and quartiles also describe dispersion.

  20. Population standard deviation (σ) • an averaged measure of deviation from the mean, xᵢ - x̄ • e.g. five rainfall measurements, whose mean is 7:
Rainfall (mm) | xᵢ - x̄ | (xᵢ - x̄)²
12 | 12 - 7 = 5 | 25
0 | 0 - 7 = -7 | 49
2 | 2 - 7 = -5 | 25
5 | 5 - 7 = -2 | 4
16 | 16 - 7 = 9 | 81
Σ(xᵢ - x̄)² = 184
• Population variance: σ² = Σ(xᵢ - x̄)²/n = 184/5 = 36.8 • Population SD: σ = √[Σ(xᵢ - x̄)²/n] = 6.1

  21. Sample SD (s): s = √[Σ(xᵢ - x̄)²/(n - 1)] or, equivalently, s = √{[Σxᵢ² - (Σxᵢ)²/n]/(n - 1)} • Two modifications: • dividing Σ(xᵢ - x̄)² by (n - 1) rather than n gives a better, unbiased estimate of σ (however, as n increases, the difference between s and σ declines rapidly) • the sum of squared deviations can be calculated as Σxᵢ² - (Σxᵢ)²/n

  22. Sample SD (s) • e.g. the five rainfall measurements, whose mean is 7:
Rainfall xᵢ (mm) | xᵢ²
12 | 144
0 | 0
2 | 4
5 | 25
16 | 256
Σxᵢ = 35, (Σxᵢ)² = 1225, Σxᵢ² = 429
• s² = [Σxᵢ² - (Σxᵢ)²/n]/(n - 1) = [429 - (1225/5)]/(5 - 1) = 46.0 • s = √46.0 = 6.782
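A minimal sketch reproducing the population SD (slide 20) and sample SD (slide 22) from the same rainfall data (not part of the original slides; Python/NumPy assumed):

    import numpy as np

    rain = np.array([12, 0, 2, 5, 16])   # the five rainfall measurements (mm), mean = 7
    print(np.std(rain))                  # population SD (divisor n)     -> ~6.066 (slide rounds to 6.1)
    print(np.std(rain, ddof=1))          # sample SD     (divisor n - 1) -> ~6.782
    print(np.var(rain, ddof=1))          # sample variance               -> 46.0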

  23. Frequency distribution e.g. the particle sizes (μm) of 37 grains from a sample of sediment from an estuary: 8.2 6.3 6.8 6.4 8.1 6.3 5.3 7.0 6.8 7.2 7.2 7.1 5.2 5.3 5.4 6.3 5.5 6.0 5.5 5.1 4.5 4.2 4.3 5.1 4.3 5.8 4.3 5.7 4.4 4.1 4.2 4.8 3.8 3.8 4.1 4.0 4.0. Define convenient classes of equal width, e.g. class intervals of 1 μm.

  24. e.g. Frequency distribution for the size of particles collected from the estuary
Particle size (μm) | Frequency
3.0 to under 4.0 | 2
4.0 to under 5.0 | 12
5.0 to under 6.0 | 10
6.0 to under 7.0 | 7
7.0 to under 8.0 | 4
8.0 to under 9.0 | 2
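A minimal sketch reproducing this frequency table from the raw grain sizes on slide 23 (not part of the original slides; Python/NumPy assumed):

    import numpy as np

    sizes = np.array([8.2, 6.3, 6.8, 6.4, 8.1, 6.3, 5.3, 7.0, 6.8, 7.2,
                      7.2, 7.1, 5.2, 5.3, 5.4, 6.3, 5.5, 6.0, 5.5, 5.1,
                      4.5, 4.2, 4.3, 5.1, 4.3, 5.8, 4.3, 5.7, 4.4, 4.1,
                      4.2, 4.8, 3.8, 3.8, 4.1, 4.0, 4.0])   # 37 particle sizes (um)

    counts, edges = np.histogram(sizes, bins=[3, 4, 5, 6, 7, 8, 9])  # 1-um classes
    print(counts)   # [ 2 12 10  7  4  2]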

  25. [Figure: histogram of the frequency distribution for the size of particles collected from the estuary]

  26. e.g. Frequency distribution of the heights of the students in a class (n = 52: 30 females & 22 males). Why is it bimodal-like?

  27. Normal curve • f(x) = [1/(σ√(2π))] exp[-(x - μ)²/(2σ²)]

  28. Normal curve. The parameters μ and σ determine the position of the curve on the x-axis and its shape. The normal curve was first expressed on paper (for astronomy) by A. de Moivre in 1733; it was not applied to environmental problems until the 1950s. (P.S. non-parametric statistics were developed in the 20th century.) • f(x) = [1/(σ√(2π))] exp[-(x - μ)²/(2σ²)] [Figure: male and female height distributions]

  29. f(x) = [1/(σ√(2π))] exp[-(x - μ)²/(2σ²)] • Normal distribution N(μ, σ) • Probability density function: the area under the curve is equal to 1. [Figure: curves for N(10,1), N(20,1), N(20,2) and N(10,3)]
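A minimal sketch of the density formula above, checked against SciPy and against the area-equals-1 property (not part of the original slides; Python, NumPy and SciPy, and the choice of N(10,3) for the check, are assumptions):

    import numpy as np
    from scipy.stats import norm

    def normal_pdf(x, mu, sigma):
        # the density written on the slide
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    x = np.linspace(-10, 30, 10001)
    print(normal_pdf(10, 10, 3), norm(10, 3).pdf(10))    # both ~0.1330 at the peak
    print(np.sum(normal_pdf(x, 10, 3)) * (x[1] - x[0]))  # area under the curve ~1.0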

  30. The standard normal curve • μ = 0, σ = 1, with the total area under the curve = 1 • units along the x-axis are measured in σ units • Figures: (a) within ±1σ, area = 0.6826 (68.26%); (b) within ±2σ, area = 95.44%; (c) the shaded area = 100% - 95.44%
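The ±1σ and ±2σ areas quoted above can be recovered from the cumulative distribution function; a minimal sketch (not part of the original slides; Python and SciPy assumed):

    from scipy.stats import norm

    z = norm(0, 1)                       # the standard normal curve, mu = 0, sigma = 1
    print(z.cdf(1) - z.cdf(-1))          # ~0.6827  (the slide's 68.26%)
    print(z.cdf(2) - z.cdf(-2))          # ~0.9545  (the slide's 95.44%)
    print(1 - (z.cdf(2) - z.cdf(-2)))    # shaded area outside +/-2 sigma, ~0.0455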

  31. Application of the standard normal distribution • For example: we have a large data set (e.g. n = 200) of normally distributed suspended-solids determinations for a particular site on a river: x̄ = 18.3 ppm and s = 8.2 ppm. We are asked to find the probability of a random sample containing 30 ppm suspended solids or more.

  32. Application of the standard normal distribution • The standardised deviation (Z value): Z = (xᵢ - μ)/σ • Z = (30 - 18.3)/8.2 = 1.43 • Check the Z table (Table B2 in Zar's book): the probability of a sample having 30 ppm or more = 0.0764, or 7.64% • i.e. for n = 200, about 15 samples (200 × 0.0764) would be expected to have 30 ppm or more
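A minimal sketch of this calculation (not part of the original slides; Python and SciPy assumed). Using the unrounded Z it gives ~0.077, slightly above the 0.0764 read from the table with Z rounded to 1.43:

    from scipy.stats import norm

    xbar, s = 18.3, 8.2            # suspended solids (ppm)
    z = (30 - xbar) / s            # ~1.43
    p = norm.sf(z)                 # upper-tail probability, ~0.077
    print(z, p, 200 * p)           # roughly 15 of 200 samples at 30 ppm or more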

  33. Central Limit Theorem • As sample size (n) increases, the means of samples (i.e. subsets or replicate groups) drawn from a population of any distribution approach a normal distribution. • Taking the mean of the means smooths out the extreme values within the sets, while the mean of the means stays close to the population mean. • As the number of subsets increases, the standard deviation of the sample means is reduced and their frequency distribution comes very close to the normal distribution.
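A minimal simulation illustrating the theorem (not part of the original slides; Python/NumPy, the exponential population and the sample sizes are all assumptions made for the demonstration):

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)   # a clearly non-normal population

    for n in (2, 10, 50):                                   # increasing sample size
        means = rng.choice(population, size=(5000, n)).mean(axis=1)
        # the mean of the sample means stays near the population mean (~2.0),
        # while their spread shrinks roughly as sigma / sqrt(n)
        print(n, round(means.mean(), 3), round(means.std(), 3))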

  34. Inferential statistics - testing the null hypothesis • Inferential = "that may be inferred"; to infer = to conclude or reach an opinion • The hypothesis under test, the null hypothesis, is that Z has been chosen at random from the population represented by the curve. • Frequencies of Z values close to the mean (μ = 0) are high, while frequencies away from the mean decline. e.g. two values of Z are shown: Z = 1.96 and Z = 2.58; from Table B2, the corresponding one-tailed probabilities are 0.025 (2.5%) and 0.0049 (≈0.5%).

  35. Inferential statistics - testing the null hypothesis. As the curve is symmetrical about the mean, the probability of obtaining a value of Z < -1.96 is also 2.5%, so the total probability of obtaining a value of Z between -1.96 and +1.96 is 95%. Likewise, between Z = -2.58 and +2.58 the total probability is 99%. We can therefore state a null hypothesis that a random observation from the population will have a value between -1.96 and +1.96.

  36. Inferential statistics - testing the null hypothesis. Alternatively, a random observation of Z may lie outside the limits -1.96 and +1.96. There are then two possibilities: either we have chosen an 'unlikely' value of Z, or our hypothesis is incorrect. Conventionally, when performing a significance test, we make the rule that if the Z value lies outside the range ±1.96, the null hypothesis is rejected and the Z value is termed significant at the 5% level, i.e. α = 0.05 (or p < 0.05); ±1.96 is the critical value of the statistic. For Z = ±2.58, the value is termed significant at the 1% level.
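The critical values ±1.96 and ±2.58 come from the quantiles of the standard normal curve; a minimal sketch (not part of the original slides; Python and SciPy assumed):

    from scipy.stats import norm

    print(norm.ppf(0.975))   # ~1.96  -> two-tailed critical value for alpha = 0.05
    print(norm.ppf(0.995))   # ~2.576 -> two-tailed critical value for alpha = 0.01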

  37. Statistical Errors in Hypothesis Testing • Consider court judgements, where the accused is presumed innocent until proved guilty beyond reasonable doubt (i.e. H₀ = innocent)

  38. Statistical Errors in Hypothesis Testing • As with court judgements, when testing a null hypothesis in statistics we are exposed to similar kinds of error:

  39. Statistical Errors in Hypothesis Testing • e.g. H₀ = responses of cancer patients to a new drug and a placebo are similar • If H₀ is indeed a true statement about a statistical population, it will (erroneously) be concluded to be false 5% of the time (when α = 0.05). • Rejection of H₀ when it is in fact true is a Type I error (also called an α error). • If H₀ is indeed false, our test may occasionally fail to detect this, and we accept H₀. • Acceptance of H₀ when it is in fact false is a Type II error (also called a β error).

  40. Chi-square statistics • Widely used for the analysis of nominal-scale data • Introduced by Karl Pearson in 1900 • Its theory and application were expanded by Pearson and R. A. Fisher • This lecture covers: the chi-square test, the G test, and the Kolmogorov–Smirnov goodness of fit for continuous data

  41. The 2 test: 2 =  (observed freq. - expected freq.)2/ expected freq. • Obtain a sample of nominal scale data and to infer if the population from which it came conforms to a certain theoretical distribution. • Used to test Ho that the observations (not the variables) are independent of each other for the population. • Based on the difference between the actual observed frequencies(not %) and the expected frequencies that would be obtained if the variables were truly independent.

  42. The 2 test: 2 =  (observed freq. - expected freq.)2/ expected freq. • Used as a measure of how far a sample distribution deviates from a theoretical distribution • Ho: no difference between the observed and expected frequency (HA: they are different) • If Ho is true then both the difference and chi-square value will be SMALL • If Ho is false then both measurements will be Large, HA will be accepted

  43. Example • In a questionnaire, 259 adults were asked what they thought about cutting air pollution by increasing the tax on vehicle fuel. 113 people agreed with this idea and the rest disagreed. Perform a chi-square test to determine the probability of the results being obtained by chance.

  44.
 | Agree | Disagree
Observed | 113 | 259 - 113 = 146
Expected | 259/2 = 129.5 | 259/2 = 129.5
H₀: observed = expected
χ² = (113 - 129.5)²/129.5 + (146 - 129.5)²/129.5 = 2.102 + 2.102 = 4.204
df = k - 1 = 2 - 1 = 1
From the chi-square table (Table B1 in Zar's book): critical χ²(α = 0.05, df = 1) = 3.841 < calculated χ² = 4.204, so 0.025 < p < 0.05.
Therefore, reject H₀. The probability of the results being obtained by chance is between 0.025 and 0.05.
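A minimal sketch of this goodness-of-fit test (not part of the original slides; Python and SciPy assumed; scipy.stats.chisquare defaults to equal expected frequencies, which matches the 129.5/129.5 split above):

    from scipy.stats import chisquare

    stat, p = chisquare([113, 146])   # expected frequencies default to equal (129.5 each)
    print(stat, p)                    # ~4.205 and p ~0.040, i.e. 0.025 < p < 0.05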

  45. Cross tabulation or contingency tables: • Further examination of the data on the opinion on increasing fuel tax to cut air pollution (example 1) • H₀: the decision is independent of sex
Observed:
 | Males | Females | n
Agree | 13 (a) | 100 (b) | 113
Disagree | 116 (c) | 30 (d) | 146
n | 129 | 130 | 259
Expected frequency for cell a = (a + b)[(a + c)/n]
Expected:
Agree | 113(129/259) = 56.28 | 113(130/259) = 56.72
Disagree | 146(129/259) = 72.72 | 146(130/259) = 73.28

  46. Cross tabulation or contingency tables: • H₀: the decision is independent of sex
 | Males | Females | n
Agree | 13 (expected 56.28) | 100 (expected 56.72) | 113
Disagree | 116 (expected 72.72) | 30 (expected 73.28) | 146
n | 129 | 130 | 259
χ² = (13 - 56.28)²/56.28 + (100 - 56.72)²/56.72 + (116 - 72.72)²/72.72 + (30 - 73.28)²/73.28 = 117.63
df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
Critical χ²(α = 0.05, df = 1) = 3.841, so p < 0.001.
Therefore, reject H₀ and accept HA that the decision is dependent on sex.

  47. Quicker method for a 2 × 2 cross tabulation:
 | Class A | Class B | n
State 1 | a | b | a + b
State 2 | c | d | c + d
n | a + c | b + d | n = a + b + c + d
χ² = n(ad - bc)²/[(a + b)(c + d)(a + c)(b + d)]
 | Males | Females | n
Agree | 13 | 100 | 113
Disagree | 116 | 30 | 146
n | 129 | 130 | 259
χ² = 259(13 × 30 - 116 × 100)²/[(113)(146)(129)(130)] = 117.64
Critical χ²(α = 0.05, df = 1) = 3.841, so p < 0.001; therefore, reject H₀.
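A minimal sketch of the contingency-table test on slides 45–47 (not part of the original slides; Python and SciPy assumed). SciPy applies Yates' continuity correction to 2 × 2 tables by default, so correction=False is passed here only to reproduce the hand calculation:

    from scipy.stats import chi2_contingency

    table = [[13, 100],    # Agree:    males, females
             [116, 30]]    # Disagree: males, females

    # correction=False turns off Yates' continuity correction so the result
    # matches the hand calculation above (~117.6)
    stat, p, df, expected = chi2_contingency(table, correction=False)
    print(round(stat, 2), p, df)   # 117.63, p << 0.001, df = 1
    print(expected)                # [[56.28..., 56.72...], [72.72..., 73.28...]]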
