Introduction to biostatistics Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine
Before we start Final SMME I exam: • Entry test in Bioethics • Entry test in Biostats ______________________ • 1 case for statistical analysis and interpretation • 1 bioethical case for comment and discussion • 1 theory question from the bioethics questionnaire
Before we start http://www.raredis.work/edu/
Outline • Population vs sample • Descriptive vs inferential statistics • Sampling methods • Sample size calculation • Level of measurement • Graphical summaries
Why do we need to use statistical methods? • To draw the strongest possible conclusions from limited amounts of data; • To generalize from a particular set of data to a more general conclusion. • What do we need to pay attention to? • Bias • Probability
Definition of biostatistics The science of collecting, organizing, analyzing, interpreting and presenting data for the purpose of more effective decisions in a clinical context. “Turning data into knowledge” (Patrick Heagerty)
Population vs Sample • A population includes all objects of interest, whereas a sample is only a portion of the population. • Parameters are associated with populations and statistics with samples. • Parameters are usually denoted using Greek letters (μ, σ), while statistics are usually denoted using Roman letters (x̄, s). • There are several reasons why we do not work with populations. • They are usually large, and it is often impossible to get data for every object we're studying. • Sampling does not usually occur without cost, and the more items surveyed, the larger the cost.
Descriptive vs Inferential statistics • We compute statistics, and use them to estimate parameters. • The computation is the first part of the statistical analysis (Descriptive Statistics) and the estimation is the second part (Inferential Statistics). • Descriptive Statistics The procedure used to organize and summarize masses of data • Inferential Statistics The methods used to find out something about a population, based on a sample
Descriptive vs Inferential statistics [Diagram: Sampling goes from the population (parameters) to the sample (statistics); inferential statistics goes from the sample back to the population.]
Probability • A measure of the likelihood that a particular event will happen. • It is expressed by a value between 0 and 1. • First, note that we talk about the probability of an event, but what we measure is the rate in a group. • If we observe that 5 babies in every 1000 have congenital heart disease, we say that the probability of a (single) baby being affected is 5 in 1000 or 0.005. [Scale: 0.0 = cannot happen; 1.0 = sure to happen]
Sampling • Individuals in the population vary from one another with respect to an outcome of interest.
Sampling • When a sample is drawn, there is no certainty that it will be representative of the population. [Figure labels: Sample A, Sample B]
Error • Random error can be conceptualized as sampling variability. • Bias (systematic error) is a difference between an observed value and the true value due to all causes other than sampling variability. • Accuracy is a general term denoting the absence of error of all kinds.
Sampling • Sampling A specific principle used to select members of the population to be included in the study. • Because the target population is large, researchers have no choice but to study a number of cases or elements within the population in order to represent it and to reach conclusions about it. • Biased sample A biased sample is one in which the method used to create the sample results in samples that are systematically different from the population. • Random sample In random sampling, each item or element of the population has an equal chance of being chosen at each draw.
Sampling [Figure labels: Population, Sample A, Sample B]
Sampling • Stages of sampling: • Defining target population • Determining sample size • Selecting a sampling method • Properties of a good sample: • Random selection • Representativeness by structure • Representativeness by number of cases
Sampling • Random sampling: Sample group members are selected in a random manner • Highly effective if all subjects participate in data collection • High level of sampling error when sample size is small • Systematic: Including every Nth member of population in the study • Time efficient • Cost efficient • High sampling bias if periodicity exists
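The two methods on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names, the patient ID list and the seed are all hypothetical, and the systematic version assumes the population is already in a list with no periodic ordering.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; each member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, step, seed=None):
    """Take every step-th member, starting at a random offset in the first step."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return population[start::step]


# Hypothetical sampling frame: patient IDs 1..100.
patients = list(range(1, 101))

print(simple_random_sample(patients, 10, seed=1))
print(systematic_sample(patients, 10, seed=1))
```

If the list has a periodic pattern (for example, every 10th record comes from the same clinic day), the systematic sample inherits that pattern, which is the periodicity bias mentioned above.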
Sampling • Judgement: Sample group members are selected on the basis of judgement of researcher • Time efficiency • Samples are not highly representative • Unscientific approach • Personal bias • Convenience: Obtaining participants conveniently with no requirements whatsoever • High levels of simplicity and ease • Usefulness in pilot studies • Highest level of sampling error • Selection bias
Sampling • Snowball: Sample group members nominate additional members to participate in the study • Possibility to recruit hidden population • Over-representation of a particular network • Reluctance of sample group members to nominate additional members
Sampling • Stratified: Representation of specific subgroups or strata • Effective representation of all subgroups • Precise estimates in cases of homogeneity or heterogeneity within strata • Knowledge of strata membership is required • Complex to apply in practice • Cluster: Clusters of participants representing population are identified as sample group members • Time and cost efficient • Group-level information needs to be known • Usually higher sampling errors compared to alternative sampling methods
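As a rough sketch of stratified sampling, the code below groups units by stratum and draws the same fraction from each one (proportional allocation). The function name, the urban/rural strata and the 10% fraction are assumptions made for illustration only; other allocation rules exist.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # requires known strata membership

    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 80 urban and 20 rural residents.
population = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
chosen = stratified_sample(population, stratum_of=lambda u: u[0], fraction=0.1, seed=0)
print(sum(1 for u in chosen if u[0] == "urban"), "urban,",
      sum(1 for u in chosen if u[0] == "rural"), "rural")
```

Because every stratum contributes in proportion to its size, the small rural group cannot be missed by chance, as it could be in a simple random sample of 10.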
Sample size calculation Law of Large Numbers: As the number of trials of a random process increases, the percentage difference between the expected and actual values goes to zero. Application in biostatistics: Bigger sample size, smaller margin of error. A properly designed study will include a justification for the number of experimental units (people/animals) being examined. Sample size calculations are necessary to design experiments that are large enough to produce useful information and small enough to be practical.
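The Law of Large Numbers can be illustrated with a small simulation. The sketch below assumes a simple yes/no outcome with a true probability of 0.5 (a fair coin); the function name and the seed are hypothetical choices.

```python
import random


def observed_proportion(n_trials, p=0.5, seed=0):
    """Simulate n_trials yes/no outcomes and return the observed proportion of 'yes'."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes / n_trials


# Compare the observed proportion with the expected value for growing sample sizes.
for n in (10, 100, 1_000, 10_000, 100_000):
    observed = observed_proportion(n)
    print(f"n = {n:>7}: observed = {observed:.4f}, gap = {abs(observed - 0.5):.4f}")
```

The gap does not have to shrink at every single step, but it tends toward zero as n grows, which is why a bigger sample gives a smaller margin of error.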
Sample size calculation Generally, the sample size for any study depends on: Acceptable level of confidence; Expected effect size and absolute error of precision; Underlying scatter in the population; Power of the study.
Sample size calculation For quantitative variables: $n = \dfrac{Z^2 \cdot SD^2}{d^2}$, where Z – standard normal value for the chosen confidence level (1.96 for 95%); SD – standard deviation; d – absolute error of precision.
Sample size calculation For quantitative variables: A researcher is interested in knowing the average systolic blood pressure in the pediatric age group at a 95% level of confidence and a precision of 5 mmHg. Standard deviation, based on previous studies, is 25 mmHg. With Z = 1.96: n = (1.96² × 25²) / 5² = 96.04 => 97 (rounded up)
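The blood pressure example can be checked with a short function. This is a sketch that assumes the usual formula n = (Z × SD / d)², which reproduces the slide's result of 97; the function name is hypothetical, and Z = 1.96 is the standard normal value for 95% confidence.

```python
import math


def sample_size_mean(z, sd, d):
    """Sample size for estimating a mean: n = (z * sd / d)^2, rounded up."""
    return math.ceil((z * sd / d) ** 2)


# Slide example: 95% confidence (z = 1.96), SD = 25 mmHg, precision = 5 mmHg.
print(sample_size_mean(1.96, 25, 5))  # 97
```

The result is rounded up rather than to the nearest integer, because rounding down would give slightly less precision than requested.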
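Sample size calculation For qualitative variables: $n = \dfrac{Z^2 \cdot p(1 - p)}{d^2}$, where Z – standard normal value for the chosen confidence level (1.96 for 95%); p – expected proportion in population; d – absolute error of precision.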
Sample size calculation For qualitative variables: A researcher is interested in knowing the proportion of diabetes patients having hypertension. According to a previous study, the proportion is no more than 15%. The researcher wants to calculate the sample size with a 5% absolute precision error and a 95% confidence level. With Z = 1.96: n = (1.96² × 0.15 × 0.85) / 0.05² = 195.92 => 196 (rounded up)
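The hypertension example can be verified in the same way. The sketch assumes the usual formula n = Z² × p × (1 − p) / d², which reproduces the slide's result of 196; the function name is hypothetical.

```python
import math


def sample_size_proportion(z, p, d):
    """Sample size for estimating a proportion: n = z^2 * p * (1 - p) / d^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)


# Slide example: 95% confidence (z = 1.96), expected proportion 15%, precision 5%.
print(sample_size_proportion(1.96, 0.15, 0.05))  # 196

# With no prior estimate, p = 0.5 maximizes p * (1 - p) and gives the largest n.
print(sample_size_proportion(1.96, 0.5, 0.05))  # 385
```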
When do you need biostatistics? BEFORE you start your study! After that, it will be too late…
Planning Research programme: • Aim • Object • Units of observation • Indices of observation • Place • Time • Statistical analyses • Methodology
Planning • Aim The aim of the investigation summarizes and clearly formulates the research hypothesis. • Object The object of the investigation is the event that is going to be studied. • Units of observation • Logical unit – each studied case • Technical unit – the environment where the logical units are situated • Indices of observation – not too many, but important; measurable; additive and self-controlling. • Factorial • Resultative
Planning • Place • Time • Single – events are studied at a single moment in time, the so-called “critical moment”. • Continuous – used to characterize a long-term tendency of the events • Statistical analyses • Methodology
One vs Many • Many measurements on one subject are not the same thing as one measurement on many subjects. • With many measurements on one subject, you get to know the one subject quite well but you learn nothing about how the response varies across subjects. • With one measurement on many subjects, you learn less about each individual, but you get a good sense of how the response varies across subjects.
Paired vs Unpaired • Data are paired when two or more measurements are made on the same observational unit (subjects, couples, and so on). • Data are unpaired when only one type of measurement is made on each unit.
Data processing • Data check and correction • Data coding • Data aggregation • According to the data usage: • Primary • Secondary • According to the number of indices • Simple • Complex • It is always a good idea to summarize your data (at least for important variables) • You become familiar with the data and the characteristics of the sample that you are studying • You can also identify problems with data collection or errors in the data (data management issues) • Range checks for illogical values
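A range check for illogical values might look like the sketch below. The record layout, the field name `systolic_bp` and the plausible limits of 50 to 300 mmHg are assumptions chosen for illustration, not limits given in the lecture.

```python
def out_of_range(records, field, low, high):
    """Return the records whose value for `field` is missing or outside [low, high]."""
    flagged = []
    for record in records:
        value = record.get(field)
        if value is None or not (low <= value <= high):
            flagged.append(record)
    return flagged


# Hypothetical data set with one typing error and one missing value.
records = [
    {"id": 1, "systolic_bp": 118},
    {"id": 2, "systolic_bp": 1180},  # extra digit entered by mistake
    {"id": 3, "systolic_bp": None},  # value not recorded
    {"id": 4, "systolic_bp": 96},
]

suspect = out_of_range(records, "systolic_bp", low=50, high=300)
print([r["id"] for r in suspect])  # [2, 3]
```

Flagged records are candidates for correction against the source documents, not for automatic deletion.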
Variables vs Data • A variable is something whose value can vary. • Data are the values you get when you measure a variable.
Quantitative (metric) variables • Continuous • Measured units • Metric continuous variables can be properly measured and have units of measurement. • Continuous values on proper numeric line or scale • Data are real numbers (located on the number line). • Discrete • Integer values on proper numeric line or scale • Metric discrete variables can be properly counted and have units of measurement – ‘numbers of things’. • Counted units • Data are real numbers (located on the number line).
Qualitative (categorical) variables • Nominal • Values in arbitrary categories • Ordering of the categories is completely arbitrary. In other words, categories cannot be ordered in any meaningful way. • No units! • Data do not have any units of measurement. • Ordinal • Values in ordered categories • Ordering of the categories is not arbitrary. It is now possible to order the categories in a meaningful way. • No units! • Data do not have any units of measurement.
Levels of measurement • There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio. These go from lowest level to highest level. • Data is classified according to the highest level which it fits. Each additional level adds something the previous level didn't have. • Nominal is the lowest level. Only names are meaningful here. • Ordinal adds an order to the names. • Interval adds meaningful differences. • Ratio adds a zero so that ratios are meaningful.
Levels of measurement • Nominal scale – e.g., genotype You can code it with numbers, but the order is arbitrary and any calculations would be meaningless. • Ordinal scale – e.g., pain score from 1 to 10 The order matters but not the difference between values. • Interval scale – e.g., temperature in °C The difference between two values is meaningful. • Ratio scale – e.g., height It has a clear definition of 0. When the variable equals 0, there is none of that variable. When working with ratio variables, but not interval variables, you can look at the ratio of two measurements.
Data processing • Some visual ways to summarize data: • Tables • Graphs • Bar charts • Histograms • Box plots
Frequency table • Elements • Formal • Title • Main column • Main row • Legend • Logical
Frequency table Simple table Table 1. Anti-HBs (+) outcomes per group from a HBV screening study* [Table image not reproduced; callouts mark the title, main row, main column and legend] * Part of TPTBHB Project
Frequency table Complex table (cross tabulation) Table 2. HBV high-risk groups to be screened by residence* [Table image not reproduced; columns: residence, rows: risk group] * Part of TPTBHB Project
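A cross tabulation simply counts how many cases fall into each combination of two categorical variables. The sketch below uses made-up rows with hypothetical labels ("group A", "group B", "urban", "rural"); it does not reproduce the figures from Table 2.

```python
from collections import Counter


def cross_tabulate(rows, row_key, col_key):
    """Count cases for every combination of two categorical variables."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_labels = sorted({label for label, _ in counts})
    col_labels = sorted({label for _, label in counts})
    # Missing combinations are reported as 0 (Counter returns 0 for unseen keys).
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


# Hypothetical cases, one dictionary per person.
cases = [
    {"risk_group": "group A", "residence": "urban"},
    {"risk_group": "group A", "residence": "urban"},
    {"risk_group": "group A", "residence": "rural"},
    {"risk_group": "group B", "residence": "urban"},
    {"risk_group": "group B", "residence": "rural"},
    {"risk_group": "group B", "residence": "rural"},
]

table = cross_tabulate(cases, "risk_group", "residence")
for group, by_residence in table.items():
    print(group, by_residence)
```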
Bar chart • Bar chart is a way to visually represent qualitative data. • Data is displayed either horizontally or vertically and allows viewers to compare items, such as amounts, characteristics, and frequency. • Bars are arranged in order of frequency, so more important categories are emphasized. • Bar charts can be either single, stacked, or grouped.
Pie chart • Pie chart is helpful when graphing qualitative data, where the information describes a trait or attribute and is not numerical. • Each slice of pie represents a different category, and each trait corresponds to a different slice of the pie—with some slices usually noticeably larger than others.
Histogram • A histogram is used with quantitative data. Ranges of values, called classes, are listed at the bottom, and the classes with greater frequencies have taller bars. • A histogram often looks similar to a bar chart, but they are different because of the level of measurement of the data: • A bar chart is for categorical data, and the x-axis has no numeric scale • A histogram is for quantitative data, and the x-axis is numeric.
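To make the idea of classes concrete, the sketch below counts how many values fall into each equal-width class and prints a text histogram. The blood pressure values, the starting point of 100 and the class width of 10 are hypothetical.

```python
def histogram_counts(values, start, width, n_classes):
    """Count values per class [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_classes:  # values outside all classes are ignored
            counts[index] += 1
    return counts


# Hypothetical systolic blood pressure readings (mmHg).
readings = [102, 110, 115, 118, 121, 124, 127, 133, 138, 145]
counts = histogram_counts(readings, start=100, width=10, n_classes=5)

for i, count in enumerate(counts):
    low = 100 + i * 10
    print(f"{low}-{low + 9}: {'#' * count}")
```

Unlike the categories of a bar chart, the classes sit on a numeric axis, so their order and width carry meaning.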
Boxplot • Boxplot is a method for graphically depicting groups of numerical data through their quartiles.
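The quartiles behind a boxplot can be computed with Python's standard library. This sketch uses `statistics.quantiles` with the "inclusive" method; several quartile conventions exist, so other software may give slightly different values for the same data. The function name and the data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), the values a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical data: the integers 1 to 9.
data = list(range(1, 10))
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 9)
```

The box spans Q1 to Q3 with a line at the median; the whiskers extend toward the minimum and maximum, depending on the plotting convention used.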
Scatterplot • Scatterplot is a type of plot using Cartesian coordinates to display values for two variables for a set of data. Data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.