Name: Gul Rukh Khan
• Data Mining
• Data Analysis with SPSS
• Checking whether data is fit for research or not (i.e. whether it is hooked data)
• Regression Analysis on Data
Hooked data: manipulated data
Data mining: a huge amount of data and too little information
There is a need to extract useful information from the data and to interpret it: to discover business intelligence from a mountain of accumulated data.
• DATA EXPLOSION
• Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• We are drowning in data but starving for knowledge!
Data mining definition:
• Extracting or “mining” knowledge (patterns) from large amounts of data
• Extraction of implicit, previously unknown, unexpected and potentially extremely useful information from data
Issues in data mining:
• Huge volume and complexity of data
• Data ownership
• Privacy and security
SPSS: Statistical Package for the Social Sciences
2 sources of data:
• Primary
• Secondary
3 types of data:
• CROSS-SECTIONAL DATA: observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual
• TIME SERIES DATA: a sequence of data points, typically measured at successive points in time spaced at uniform intervals. A time series database is a database specifically designed for time series data.
• PANEL OR POOLED DATA: cross-sectional + time series data
Variables:
• Qualitative OR Categorical OR Dummy Variable (e.g. Gender)
• Quantitative OR Numeric Variable
• Nominal: names without any ranking, e.g. banks (HBL, ABL, UBL etc.); it does not matter which comes first
• Religion: Islam, Hindu, Christian etc.
• Ordinal: categories that have an order, e.g.
  • Often • Very often • Daily basis
  • Slightly Agree • Agree • Extremely Agree
Basic analysis in SPSS:
• If you have a qualitative or categorical variable, go for FREQUENCIES
• If you have numeric or quantitative data, go for DESCRIPTIVES
• Quantitative data (Descriptives):
  • Standard deviation
  • Variance
  • Minimum value
  • Maximum value etc.
• Frequencies (display the data):
  • One-way data
  • Two-way data or cross tab
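As a rough illustration outside SPSS, the same split (frequencies for categorical variables, descriptive statistics for numeric ones) can be sketched in Python. The `gender` and `salary` values below are hypothetical and are not taken from the presentation.

```python
from collections import Counter
import statistics

# Hypothetical sample data (not from the slides).
gender = ["Male", "Female", "Female", "Male", "Female"]
salary = [30000, 42000, 38000, 30000, 55000]

# Categorical variable -> frequency table (counts and percentages).
freq = Counter(gender)
for category, count in freq.items():
    print(f"{category}: {count} ({100 * count / len(gender):.1f}%)")

# Numeric variable -> descriptive statistics.
desc = {
    "mean": statistics.mean(salary),
    "std_dev": statistics.stdev(salary),      # sample standard deviation
    "variance": statistics.variance(salary),  # sample variance
    "minimum": min(salary),
    "maximum": max(salary),
}
print(desc)
```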
Tests required: to check your data for randomness and normality
• Two types of tests: parametric and non-parametric
• Parametric:
  • Based on assumptions about the population distribution.
  • A parameter is an unknown value of the population, e.g. the population mean, standard deviation etc., which is estimated from the sample.
• Non-parametric:
  • Makes no assumption about the shape of the population distribution; suitable for qualitative (nominal or ordinal) data.
• Assumptions for a parametric test:
  • Data should be random.
  • Data should follow a normal distribution.
Test for randomness and normality
• To check whether the data is random or not: RUNS TEST
Interpretation:
• If Asymp. Sig. (2-tailed) = 0.913: 0.913 > 0.05, therefore the data is random. In this case the data is fit for research.
• If Asymp. Sig. (2-tailed) = 0.013: 0.013 < 0.05, therefore the data is not random. In this case the data is not fit for research.
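The decision rule can be sketched in Python with a plain Wald-Wolfowitz runs test around the median. This is a minimal sketch under stated assumptions, not SPSS's exact procedure: it uses the large-sample normal approximation without any continuity correction, the function name `runs_test` is hypothetical, and the `marks` values are made up for illustration.

```python
import math
import statistics

def runs_test(values, cut_point=None):
    """Runs test for randomness; returns (runs, z, two_tailed_p)."""
    if cut_point is None:
        cut_point = statistics.median(values)
    # Dichotomise each value: True if it is at or above the cut point.
    signs = [v >= cut_point for v in values]
    n1 = sum(signs)
    n2 = len(signs) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("all values fall on one side of the cut point")
    # A new run starts whenever the sign changes.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n = n1 + n2
    mean_runs = 2 * n1 * n2 / n + 1
    var_runs = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean_runs) / math.sqrt(var_runs)
    # Two-tailed p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p

# Hypothetical marks, for illustration only.
marks = [12, 15, 18, 11, 14, 19, 10, 9, 17, 13, 16, 20]
runs, z, p = runs_test(marks)
verdict = "random" if p > 0.05 else "not random"
print(f"runs={runs}, z={z:.3f}, p={p:.3f} -> data is {verdict}")
```

A p-value above 0.05 means the hypothesis of randomness is not rejected; a strictly alternating sequence, by contrast, produces too many runs and a small p-value.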
Test to check whether the data is normally distributed or not: the Kolmogorov-Smirnov test (named after two Russian mathematicians, Kolmogorov and Smirnov)
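Interpretation of data normality
The output of this test is largely self-explanatory. The footnote "Test distribution is Normal" names the distribution the data is tested against. If Asymp. Sig. (2-tailed) is greater than 0.05, the data does not depart significantly from a normal distribution; if it is less than 0.05, the data is not normal.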
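For readers who want to see what the test computes, here is a minimal Python sketch of a one-sample Kolmogorov-Smirnov test against a normal distribution whose mean and standard deviation are taken from the sample. The function names and the sample values are hypothetical. The p-value uses the asymptotic Kolmogorov distribution; because the parameters are estimated from the same data, this p-value tends to be conservative (the Lilliefors correction addresses that and is not implemented here).

```python
import math
import statistics

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal(mu, sigma)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_normal(values):
    """Return (D, asymptotic two-tailed p) for a K-S test against a normal."""
    xs = sorted(values)
    n = len(xs)
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # Largest gap between the empirical and the theoretical CDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    lam = math.sqrt(n) * d
    # Kolmogorov limiting distribution, truncated after 100 terms.
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical, roughly symmetric sample (illustration only).
heights = [160, 162, 165, 167, 168, 170, 170, 171, 173, 175, 178, 180]
d, p = ks_normal(heights)
verdict = "consistent with normal" if p > 0.05 else "not normal"
print(f"D={d:.3f}, p={p:.3f} -> {verdict}")
```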
What is Regression? Introduction
Regression is about prediction: how one variable can be predicted from (regressed on) another variable. It measures the relationship between variables.
• Regression analysis is a very valuable tool for a manager
• Regression can be used to
  • Understand the relationship between variables
  • Predict the value of one variable based on another variable
• Simple linear regression models have only two variables
• Multiple regression models have more variables
Dependent variable = Independent variable + Independent variable
• The variable to be predicted is called the Dependent Variable (also called the Response Variable)
• The value of this variable depends on the value of the Independent Variable (also called the Explanatory or Predictor Variable)
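Example: How to Draw the Regression Line

X    Y    X-Xbar   (X-Xbar)^2   Y-Ybar   (X-Xbar)(Y-Ybar)   (Y-Ybar)^2   Yhat      Yhat-Ybar   (Yhat-Ybar)^2
1    3    -2.5     6.25         -1        2.5               1            3.57143   -0.42857    0.183672
2    4    -1.5     2.25          0        0                 0            3.74286   -0.25714    0.066121
3    3    -0.5     0.25         -1        0.5               1            3.91429   -0.08571    0.007346
4    6     0.5     0.25          2        1                 4            4.08572    0.08572    0.007348
5    5     1.5     2.25          1        1.5               1            4.25715    0.25715    0.066126
6    3     2.5     6.25         -1       -2.5               1            4.42858    0.42858    0.183681
Sum                17.5                   3                 8                                  0.514294

Mean of X: $\bar{X} = 3.5$. Mean of Y: $\bar{Y} = 4.0$.

Slope:
$$\beta_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{3}{17.5} = 0.17143$$

Y-intercept:
$$\beta_0 = \bar{Y} - \beta_1 \bar{X} = 4 - 0.17143 \times 3.5 = 3.4$$

Regression line:
$$\hat{Y} = \beta_0 + \beta_1 X$$
• For X = 1: $\hat{Y} = 3.4 + 0.17143 \times 1 = 3.57143$
• For X = 6: $\hat{Y} = 3.4 + 0.17143 \times 6 = 4.42858$

Summary values: $\beta_0 = 3.4$; 0.0642857142857143 (this equals $R^2 = 0.514294 / 8$, the explained share of the variation in Y); 0.327014947 (this matches the standard error of $\beta_1$).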
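The worked example can be checked with a few lines of Python. This is a sketch, and the data is an assumption read off the example: X runs from 1 to 6 (deviations of -2.5 to 2.5 around a mean of 3.5), and the Y values are recovered as the mean 4.0 plus each deviation. The variable names are illustrative.

```python
# Data recovered from the worked example (assumption: X = 1..6,
# Y = mean of Y plus each listed deviation).
x = [1, 2, 3, 4, 5, 6]
y = [3, 4, 3, 6, 5, 3]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squared deviations.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

beta1 = sxy / sxx                # slope
beta0 = y_bar - beta1 * x_bar    # Y-intercept

# Fitted values and the share of variation explained by the line.
y_hat = [beta0 + beta1 * xi for xi in x]
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = ssr / sst

print(f"beta1={beta1:.5f}, beta0={beta0:.1f}, R^2={r_squared:.6f}")
```

The low R² indicates that, for this small data set, X explains only a small part of the variation in Y even though a line can be drawn.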
Thank you very much, sir and dear colleagues, for giving your valuable time to my little presentation.
SPSS stands for "Statistical Package for the Social Sciences"
SPSS windows:
• Data Editor (by default)
• Output Viewer
• Syntax window
• Script window
Variables:
• Qualitative OR Categorical Variables: e.g. Gender: Male and Female
• Quantitative Variables: e.g. current salary, beginning salary etc.
If Quantitative OR Numeric Data
If Qualitative OR Categorical Data
One-way data
Two-way data (cross tab)
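The difference between a one-way table and a two-way table (cross tab) can be sketched in Python. The records below are hypothetical; the bank names are borrowed from the nominal-variable example purely as labels.

```python
from collections import Counter

# Hypothetical records: (gender, bank).
records = [
    ("Male", "HBL"), ("Female", "UBL"), ("Male", "UBL"),
    ("Female", "HBL"), ("Male", "HBL"),
]

# One-way table: counts for a single categorical variable.
one_way = Counter(g for g, _ in records)

# Two-way table (cross tab): counts for each pair of categories.
two_way = Counter(records)

rows = sorted({g for g, _ in records})
cols = sorted({b for _, b in records})
print("".ljust(8) + "".join(c.ljust(6) for c in cols))
for r in rows:
    print(r.ljust(8) + "".join(str(two_way[(r, c)]).ljust(6) for c in cols))
```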
Data sorting:
• Data > Sort Cases
• Select the field to sort by
• Then select Ascending or Descending
• OK
• (The Output view will appear)
Note: save your original file first, because sorting reorders all the cases in the data file.
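The same idea, including the advice to keep the original data untouched, can be sketched in Python: `sorted` returns a new list and leaves the original order as it was. The `cases` records are hypothetical.

```python
# Hypothetical cases (illustration only).
cases = [
    {"name": "A", "marks": 62},
    {"name": "B", "marks": 85},
    {"name": "C", "marks": 47},
]

# Sort by the selected field; reverse=True gives descending order.
# sorted() builds a new list, so the original order is preserved.
descending = sorted(cases, key=lambda c: c["marks"], reverse=True)
ascending = sorted(cases, key=lambda c: c["marks"])

print([c["name"] for c in descending])
print([c["name"] for c in cases])
```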
Data transform:
• Transform > Compute Variable
• Target Variable: enter a name for the new variable, e.g. Total_Marks
• Numeric Expression: e.g. Marks + 20, or SQRT(marks)
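The two example expressions (adding 20 to each mark, and taking the square root of each mark) can be sketched in Python. The `marks` values and the variable names are hypothetical.

```python
import math

# Hypothetical marks (illustration only).
marks = [45, 64, 81]

# New variable: Marks + 20 for every case.
total_marks = [m + 20 for m in marks]

# New variable: SQRT(marks) for every case.
sqrt_marks = [math.sqrt(m) for m in marks]

print(total_marks)
print(sqrt_marks)
```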