Preparing to analyse data

Preparing to analyse data
Assist. Prof. E. Çiğdem Kaspar, Ph.D.

No statistical technique will ever yield ‘good’ results from data of dubious quality. Buyse (1984)

Before analysing a set of data it is important to check, as far as possible, that the data seem correct.
• Errors can be made:
• when measurements are taken,
• when the data are originally recorded,
• when they are transcribed from the original source (such as from hospital notes),
• when being typed into a computer.

We cannot usually know what is correct, so we restrict our attention to making sure that the recorded values are plausible. This process is called data checking (or data cleaning).
• We cannot expect to spot all transcription and data entry errors, but we hope to find the major errors. As we will see, it is the large errors that can influence statistical analysis.
• It is also important to screen the data to identify features that may cause difficulties during the analysis. Three specific aspects are considered:
• missing data,
• outlying values,
• and the possible need for data transformation.

Data Checking
• Errors in recorded data are common.
• For example, the recorded values may be wrong because of confusion over the correct units of measurement, digits may be transposed when data are transcribed, or data may be mistyped when being entered onto a computer.
• Data checking aims to identify and, if possible, rectify errors in the data.

Data Checking
• The first step is to check that the data have been typed into the file correctly.
• For large files double entry is best, whereby the data are retyped and compared with the first version.
• For small data sets the simplest way is for one person to read aloud the data from the computer with another person checking against the original data.
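The double-entry comparison described above can be sketched in a few lines. This is only an illustration: the function name `compare_entries` and the small tables are hypothetical, and both versions are assumed to hold the same rows and columns in the same order.

```python
def compare_entries(first, second):
    """Return (row, column, first_value, second_value) for every disagreement.

    Assumes both versions have the same shape; zip() would silently
    ignore extra rows or columns, so check the sizes separately.
    """
    mismatches = []
    for row_index, (row_a, row_b) in enumerate(zip(first, second)):
        for col_index, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                mismatches.append((row_index, col_index, a, b))
    return mismatches


# Hypothetical records: subject id, age, systolic blood pressure.
first_entry = [[1, 34, 120], [2, 27, 135], [3, 41, 118]]
second_entry = [[1, 34, 120], [2, 72, 135], [3, 41, 118]]  # digits transposed

mismatches = compare_entries(first_entry, second_entry)
for row, col, a, b in mismatches:
    print(f"row {row}, column {col}: first entry {a}, second entry {b}")
```

Each disagreement still has to be resolved against the original source, since the comparison cannot tell which version is right.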

Data Checking: Categorical data
• For categorical variables it is simple to check that all recorded data values are plausible because there is a fixed number of pre-specified values. For example, if we have four codes for blood group as follows:
• 1=A 2=B 3=O 4=AB
• Then we expect to find only the values 1, 2, 3 or 4 in the data, except for any subjects with missing information. If missing values are coded as 9, then we know that any blood group coded as 0, 5, 6, 7 or 8 is clearly wrong.
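A minimal sketch of this check, using the blood group codes from the slide. The function name `invalid_codes` and the sample values are made up for illustration.

```python
VALID_CODES = {1, 2, 3, 4}  # 1=A, 2=B, 3=O, 4=AB (from the slide)
MISSING_CODE = 9            # code used for missing information


def invalid_codes(values):
    """Return (position, value) for codes that are neither valid nor missing."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v not in VALID_CODES and v != MISSING_CODE
    ]


blood_groups = [1, 3, 9, 4, 0, 2, 5]  # hypothetical recorded codes
errors = invalid_codes(blood_groups)
print(errors)
```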

Data Checking: Continuous data
• For continuous measurements we cannot usually identify precisely which values are plausible and which are not, and it is not important to do so.
• It should, however, always be possible to specify lower and upper limits on what is reasonable for the variable concerned.
• For example, in a study of pregnancy we might put limits of 14 and 45 on maternal age, or in a study of adult males we may use limits of 70 and 250 mmHg for systolic blood pressure.
• We then need to identify values outside the limits, a procedure known as range checking.
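Range checking can be sketched as follows, using the maternal age limits from the slide. The function name `range_check` and the ages are hypothetical, and treating the limits as inclusive is an assumption.

```python
def range_check(values, lower, upper):
    """Return (position, value) for every value outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]


maternal_ages = [23, 31, 13, 45, 52, 28]  # hypothetical data
flagged = range_check(maternal_ages, lower=14, upper=45)
print(flagged)
```

Flagged values are candidates for checking against the original records, not automatic errors.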

Data Checking
• Values remaining outside the prespecified range must either be left as they are, or recorded as ‘missing’ if they are felt to be impossible rather than just unlikely.
• It may therefore be advisable to have two sets of limits for each variable, denoting suspicious (or unlikely) values and impossible values.
• A common cause of error is misplacing the decimal point, perhaps because of confusion over the right units of measurement to use or a transcription error.
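The idea of two sets of limits might look like the sketch below. All limits here are assumptions chosen for illustration: the 70 to 250 mmHg range from the earlier slide is treated as the "suspicious" boundary, and the wider 30 to 350 mmHg "impossible" boundary is invented.

```python
def classify(value, suspicious, impossible):
    """Classify a value as 'ok', 'suspicious' or 'impossible'.

    Each limit argument is a (lower, upper) pair; the impossible
    limits are assumed to be wider than the suspicious ones.
    """
    if not impossible[0] <= value <= impossible[1]:
        return "impossible"
    if not suspicious[0] <= value <= suspicious[1]:
        return "suspicious"
    return "ok"


# Hypothetical systolic blood pressures (mmHg); 12.5 mimics a misplaced decimal point.
readings = [120, 12.5, 260, 400]
labels = [classify(v, suspicious=(70, 250), impossible=(30, 350)) for v in readings]
print(list(zip(readings, labels)))
```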

Logical checks
• Checking the data is more complicated when the values of a variable that are reasonable depend on the value of some other variable. We call these logical checks.
• Firstly, it is common for some information to be sought only in certain cases. For example, in a study of survival after a kidney transplant, information on the number of previous pregnancies is relevant only for women, and so for men it should be set to missing or to a different code indicating ‘not applicable’.
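A logical check of this kind can be sketched as below. The field names, the use of `None` for ‘not applicable’, and the records themselves are assumptions made for the example.

```python
def inconsistent_records(records):
    """Return ids of men with a recorded number of previous pregnancies."""
    return [
        r["id"]
        for r in records
        if r["sex"] == "M" and r["previous_pregnancies"] is not None
    ]


# Hypothetical transplant study records; None stands for 'not applicable'.
records = [
    {"id": 1, "sex": "F", "previous_pregnancies": 2},
    {"id": 2, "sex": "M", "previous_pregnancies": None},
    {"id": 3, "sex": "M", "previous_pregnancies": 1},  # fails the logical check
]
bad_ids = inconsistent_records(records)
print(bad_ids)
```

A failed check shows that one of the two variables is wrong, but not which one: here either the sex or the pregnancy count may have been mis-recorded.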

Dates
• Recorded dates are important when they are used to calculate the time between two events. For example, we can calculate a subject’s age at some event, such as surgery or death, from the date of the event and the subject’s date of birth. Other common calculations are the time between an event and the patient’s death (their survival time) or the time between the first symptom and the diagnosis of the disease.
• Dates should be checked as follows:
1. check that all dates are within a reasonable time span;
2. check that all dates are valid;
3. check that dates are correctly sequenced;
4. check derived ages and time intervals.
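The four date checks above can be sketched with Python's standard `datetime` module. The function names, the study time span, the 110-year age ceiling and the sample dates are all hypothetical, and age is approximated as days divided by 365.25.

```python
from datetime import date


def parse_date(year, month, day):
    """Return a date, or None if the combination is not a valid calendar date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def check_dates(birth, surgery, death, earliest, latest, max_age=110):
    """Return a list of problems; death may be None if the subject is alive."""
    problems = []
    # 1. All dates within a reasonable time span.
    for label, d in (("birth", birth), ("surgery", surgery), ("death", death)):
        if d is not None and not earliest <= d <= latest:
            problems.append(f"{label} outside time span")
    # 3. Dates correctly sequenced.
    if surgery < birth:
        problems.append("surgery before birth")
    if death is not None and death < surgery:
        problems.append("death before surgery")
    # 4. Derived age at surgery (approximate, in years).
    age = (surgery - birth).days / 365.25
    if not 0 <= age <= max_age:
        problems.append("implausible age at surgery")
    return problems


# 2. Validity: 30 February does not exist, so this returns None.
invalid = parse_date(2021, 2, 30)

problems = check_dates(
    birth=date(1950, 5, 1), surgery=date(2010, 3, 1), death=date(2009, 1, 1),
    earliest=date(1900, 1, 1), latest=date(2020, 12, 31),
)
print(invalid, problems)
```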

Outliers
• Checking the data for continuous variables may reveal some outlying values that are incompatible with the rest of the data.
• Typically there may be one or two outliers for a few variables, although for most variables there will not be any.
• Outliers are particularly important because they can have a considerable influence on the results of a statistical analysis.
• Because by definition they are extreme values, their inclusion or exclusion can have a marked effect on the results of an analysis.

Outliers
• Bill Gates makes $500 million a year.
• He’s in a room with 9 teachers, 4 of whom make $40k, 3 make $45k, and 2 make $55k a year. What is the mean salary of everyone in the room? What would be the mean salary if Gates wasn’t included?
Mean with Gates: $50,040,500
Mean without Gates: $45,000
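The arithmetic on this slide can be reproduced directly; the variable names are just for illustration.

```python
from statistics import mean

# Salaries from the slide: nine teachers plus one extreme value.
teachers = [40_000] * 4 + [45_000] * 3 + [55_000] * 2
gates = 500_000_000

mean_with = mean(teachers + [gates])
mean_without = mean(teachers)

print(f"Mean with Gates:    ${mean_with:,.0f}")
print(f"Mean without Gates: ${mean_without:,.0f}")
```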

Outliers
• A single outlying point can have a considerable effect on the visual impression of a plot. If we cover the suspicious value it is clear that there is no apparent relation in the rest of the data.
• To find any outliers in a set of data, we need to find the five-number summary of the data.
Step 1: Sort the numbers from lowest to highest.
Step 2: Identify the median.
Step 3: Identify the smallest and largest numbers.
Step 4: Identify the median between the smallest number and the median for the entire set of data, and between that median and the largest number in the set (the 25th and 75th percentiles).
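The four steps above can be sketched as follows. Two points are assumptions rather than something stated on the slide: for an odd number of values the overall median is left out of both halves (one of several conventions for quartiles), and outliers are flagged with the widely used rule of 1.5 times the interquartile range beyond the quartiles. The data are made up.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Needs at least two values. For odd n the median itself is
    excluded from both halves (an assumed convention).
    """
    data = sorted(values)                               # Step 1
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],                                        # Step 3: smallest
        median(lower),                                  # Step 4: 25th percentile
        median(data),                                   # Step 2: median
        median(upper),                                  # Step 4: 75th percentile
        data[-1],                                       # Step 3: largest
    )


def find_outliers(values, k=1.5):
    """Flag values more than k * IQR beyond the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical measurements
summary = five_number_summary(data)
outliers = find_outliers(data)
print(summary, outliers)
```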

Outliers
• A useful strategy to adopt when analysing data is to carry out the analysis both including and excluding the suspicious value(s). If there is little difference in the results obtained then the outlier(s) had minimal effect, but if excluding them does have an effect it may be better to find an alternative method of analysis.

Missing data
• There are a variety of reasons why data may be missing. Missing data can result from:
• the study participants,
• the study design,
• the investigator,
• the research units and
• reasons that cannot be controlled.
Missing data can affect the results of a study because standard statistical tests were developed for complete data sets.

Missing data
• There are three types of concerns that typically arise with missing data:
• (1) loss of efficiency;
• (2) complication in data handling and analysis; and
• (3) bias due to differences between the observed and unobserved data.

Missing data
• We must first determine how the process generating missing values depends on the variables in the data set.
• These mechanisms can be classified into three categories, which are known as:
• missing completely at random (MCAR),
• missing at random (MAR)
• and missing not at random (MNAR).

Missing data
• After deciding on the missing data mechanism, various solutions and imputation methods can be used for dealing with the missing data problem.
• Examples include mean imputation, the case deletion method, etc.
• The most common device is to use codes for missing values such as 9, 99, 999 or 99.9, according to the nature of the variable.
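A small sketch of how missing-value codes, case deletion and mean imputation fit together. The code 999, the function names and the weights are hypothetical. The example also shows why the codes must be recoded before analysis: left in place, they are treated as real measurements.

```python
from statistics import mean

MISSING_CODES = {999}  # assumed missing-value code for this variable


def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace missing-value codes with None."""
    return [None if v in missing_codes else v for v in values]


def delete_cases(values):
    """Case deletion: keep only the observed values."""
    return [v for v in values if v is not None]


def mean_impute(values):
    """Mean imputation: replace each missing value by the observed mean."""
    observed_mean = mean(delete_cases(values))
    return [observed_mean if v is None else v for v in values]


weights = [62, 71, 999, 80, 999, 67]  # hypothetical weights in kg

naive_mean = mean(weights)            # wrongly includes the 999 codes
recoded = recode_missing(weights)
complete_cases = delete_cases(recoded)
imputed = mean_impute(recoded)
print(round(naive_mean, 1), mean(complete_cases), imputed)
```

Which approach is appropriate depends on the missing data mechanism. Mean imputation keeps the sample size but understates the variability of the data.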

Data screening
• We have considered various aspects of checking, as far as possible, that the data are correct. The other important aspect of preliminary data examination is to see how suitable the data are for the type of analysis that is intended, a process sometimes called data screening. Data screening is concerned largely with the distribution of the continuous data.

Data screening
• Many types of statistical analysis of continuous data are based on the assumption that the data are a sample from a population with a Normal distribution.
• Alternative methods based on ranks are usually available that do not make that assumption, but they have certain disadvantages.
• It is important to know the distribution of the data before embarking on an analysis based on the assumption of Normality.
• Data that are not compatible with a Normal distribution can often be transformed to make them acceptably near to Normal.

Data screening
• In statistics, normality tests are used to determine whether a data set is well modelled by a Normal distribution. The main approaches are:
1. Graphical methods
• Histogram
• Normal probability plot
• Q-Q plot
2. Statistical tests
• Kolmogorov-Smirnov test
• Shapiro-Wilk’s W test
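As an illustration of the graphical approach, the points of a Q-Q plot can be computed with the standard library alone. This is a sketch under stated assumptions: the function name `qq_points`, the plotting positions (i + 0.5) / n (one of several conventions) and the data are all hypothetical, and the reference distribution is a Normal fitted with the sample mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (theoretical quantile, observed value) pairs for a Normal Q-Q plot."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Plotting position (i + 0.5) / n is an assumed convention.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]


data = [4.1, 4.8, 5.0, 5.2, 5.9]  # hypothetical measurements
points = qq_points(data)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the data are close to Normal, the pairs lie near the line where the observed value equals the theoretical quantile. Marked curvature suggests skewness or heavy tails.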

Data screening
• Most statistical methods for analysing continuous data incorporate assumptions about the data in the population from which the sample was drawn.
• In particular they include an assumption that the data come from a population where the values are Normally distributed. Thus we expect the data to be consistent with that assumption, which is why we need to test for Normality.

Data screening
• We often find that a transformation of the data will yield a distribution that is much nearer to a Normal distribution.
• By far the most common is the logarithmic or log transformation.
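The effect of a log transformation on a right-skewed variable can be sketched as below. The data are invented, skewness is computed here with the simple moment formula (an assumed choice of measure), and the natural logarithm requires all values to be positive.

```python
import math
from statistics import mean


def skewness(values):
    """Moment-based sample skewness: third central moment / variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed measurements (all positive, as log requires).
raw = [1, 2, 2, 3, 4, 5, 8, 13, 40, 120]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_logged:.2f}")
```

A skewness nearer zero after the transformation indicates a more symmetric distribution, though symmetry alone does not establish Normality.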
