
Statistics And Application

Statistics And Application. Revealing Facts From Data. What Is Statistics? Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data.


Presentation Transcript


  1. Statistics And Application Revealing Facts From Data

  2. What Is Statistics • Statistics is a mathematical science pertaining to collection, analysis, interpretation, and presentation of data. • It is applicable to a wide variety of academic disciplines from the physical and social sciences to the humanities, as well as to business, government, medicine and industry.

  3. Statistics Is … • Almost every professional needs statistical tools. • Statistical skills enable you to intelligently collect, analyze, and interpret data relevant to your decision-making. • Statistical concepts enable us to solve problems in a diversity of contexts. • Statistical thinking enables you to add substance to your decisions.

  4. Statistics is a science • It assists you in making decisions under uncertainty. The decision-making process must be based on data, not on personal opinion or belief. • It is already an accepted fact that "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." So, let us be ahead of our time. • In the US, students learn statistics starting in middle school.

  5. Types of Statistics • Descriptive statistics deals with the description problem: Can the data be summarized in a useful way, either numerically or graphically, to yield insight about the population in question? Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs. • Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining.
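A minimal sketch contrasting the two, on hypothetical exam scores (all numbers invented for illustration): the descriptive part summarizes the sample at hand, while the inferential part uses a rough normal-approximation interval to say something about the wider population.

```python
import math
import statistics

scores = [72, 85, 90, 66, 78, 95, 88, 73, 81, 69]   # hypothetical sample

# Descriptive statistics: summarize the data we actually have.
mean = statistics.mean(scores)
sd = statistics.stdev(scores)                        # sample standard deviation
print(f"mean={mean:.1f}, sd={sd:.1f}, median={statistics.median(scores)}")

# Inferential statistics: estimate the population mean with a rough 95% interval
# (normal approximation; with only n=10, a t-based interval would be a bit wider).
half_width = 1.96 * sd / math.sqrt(len(scores))
print(f"approx. 95% CI for the population mean: "
      f"({mean - half_width:.1f}, {mean + half_width:.1f})")
```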

  6. Types of Studies • There are two major types of causal statistical studies: experimental studies and observational studies. In both types, the effect of differences in an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types is in how the study is actually conducted. Each can be very effective. • An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine whether the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and the response are investigated.

  7. Types of Statistical Courses There are two types: • Greater statistics is everything related to learning from data, from the first planning or collection to the last presentation or report, grounded in a deep respect for data and truth. • Lesser statistics is the body of statistical methodology, which has no interest in data or truth and consists largely of arithmetic exercises. If a certain assumption is needed to justify a procedure, its practitioners will simply "assume the ... are normally distributed" -- no matter how unlikely that might be.

  8. Statistical Models • Statistical models are currently used in various fields of business and science. • The terminology differs from field to field. For example, the fitting of models to data (called calibration, history matching, or data assimilation, depending on the field) is synonymous with parameter estimation.
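A minimal sketch of "fitting a model to data" in this sense (simulated data, hypothetical parameter values): estimate the intercept and slope of a straight-line model by least squares and check that they land near the values used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_intercept, true_slope = 2.0, 0.5
x = np.linspace(0, 10, 50)
y = true_intercept + true_slope * x + rng.normal(0, 0.3, size=x.size)  # noisy observations

# Least-squares calibration of the two parameters; polyfit returns the
# coefficients highest degree first, i.e. [slope, intercept] for degree 1.
slope_hat, intercept_hat = np.polyfit(x, y, 1)
print(f"estimated intercept={intercept_hat:.2f}, slope={slope_hat:.2f}")
```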

  9. Data Analysis • Developments in statistical data analysis often parallel or follow advancements in other fields to which statistical methods are fruitfully applied. • The decision-making process under uncertainty is largely based on applying statistical data analysis for probabilistic risk assessment of your decisions.

  10. (cont.) • First, decision makers need to lead others to apply statistical thinking in day-to-day activities, and • Second, decision makers need to apply the concept for the purpose of continuous improvement.

  11. Is Data Information? • The database in your office contains a wealth of information. • The decision technology group members tap only a fraction of it. • Employees waste time scouring multiple sources for data. • Decision-makers are frustrated because they cannot get business-critical data exactly when they need it. • Therefore, too many decisions are based on guesswork, not facts. Many opportunities are also missed, if they are even noticed at all. • Data itself is not information, but it might generate information.

  12. Knowledge • Knowledge is what we know well. Information is the communication of knowledge. • In every knowledge exchange, the sender makes common what is private, doing the informing, the communicating. • Information can be classified into explicit and tacit forms. • Explicit information can be explained in structured form, while tacit information is inconsistent and fuzzy to explain. • Know that data are only crude information and not knowledge by themselves.

  13. Data → Knowledge (?) • Data is known to be crude information and not knowledge by itself. • The sequence from data to knowledge is: from data to information, from information to facts, and finally, from facts to knowledge. • Data becomes information when it becomes relevant to your decision problem. • Information becomes fact when the data can support it. Facts are what the data reveals. • However, decisive instrumental (i.e., applied) knowledge is expressed together with some statistical degree of confidence.

  14. Fact → Knowledge Fact becomes knowledge when it is used in the successful completion of a statistical process.

  15. Statistical Analysis • As the exactness of a statistical model increases, the level of improvement in decision-making increases: this is the reason for using statistical data analysis. • Statistical data analysis arose from the need to place knowledge on a systematic evidence base. • Statistics is a study of the laws of probability, the development of measures of data properties and relationships, and so on.

  16. Statistical Inference • Verifying a statistical hypothesis: determining whether any statistical significance can be attached to a result after due allowance is made for random variation as a source of error. • Intelligent and critical inferences cannot be made by those who do not understand the purpose, the conditions, and the applicability of the various techniques for judging significance. • Considering the uncertain environment, the chance that "good decisions" are made increases with the availability of "good information." The chance that "good information" is available increases with the level of structuring of the process of Knowledge Management.
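A minimal sketch of such a significance judgment, assuming roughly normal measurements and an invented data set: a one-sample t statistic asks whether the observed deviation from a claimed mean of 50 is larger than random variation alone would plausibly produce.

```python
import math
import statistics

data = [52.1, 49.8, 53.4, 51.0, 50.7, 54.2, 48.9, 52.8]   # hypothetical sample
claimed_mean = 50.0

n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)
t_stat = (xbar - claimed_mean) / (s / math.sqrt(n))        # t with n-1 df

print(f"t = {t_stat:.2f} with {n - 1} degrees of freedom")
# Compare |t| against the critical value (about 2.36 for 7 df at the 5% level);
# a larger |t| means the difference is unlikely to be random variation alone.
```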

  17. Knowledge Needs Wisdom • Wisdom is the power to put our time and our knowledge to the proper use. • Wisdom is the accurate application of accurate knowledge. • Wisdom is about knowing how technical staff can be best used to meet the needs of the decision-maker.

  18. History Of Statistics • The word statistics ultimately derives from the modern Latin term statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). • The birth of statistics occurred in the mid-17th century. A commoner named John Graunt, a native of London, began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings, and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized these data in the form we now call descriptive statistics, published as Natural and Political Observations Made upon the Bills of Mortality. Shortly thereafter, he was elected a member of the Royal Society. Thus, statistics has had to borrow some concepts from sociology, such as the concept of "population". It has been argued that since statistics usually involves the study of human behavior, it cannot claim the precision of the physical sciences.

  19. Statistics is for Government • The original principal purpose of Statistik was data to be used by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services. • Censuses provide regular information about the population. • During the 20th century, the creation of precise instruments for public health concerns (epidemiology, biostatistics, etc.) and for economic and social purposes (the unemployment rate, econometrics, etc.) necessitated substantial advances in statistical practice.

  20. History of Probability • Probability has a much longer history. Probability is derived from the verb to probe, meaning to "find out" what is not too easily accessible or understandable. The word "proof" has the same origin, providing the necessary details to understand what is claimed to be true. • Probability originated from the study of games of chance and gambling during the sixteenth century. Probability theory was a branch of mathematics studied by Blaise Pascal and Pierre de Fermat in the seventeenth century. • Today, in the 21st century, probabilistic modeling is used to control the flow of traffic through a highway system, a telephone interchange, or a computer processor; to find the genetic makeup of individuals or populations; and in quality control, insurance, investment, and other sectors of business and industry.

  21. Statistics Merges With Probability • Statistics eventually merged with the field of inverse probability, referring to the estimation of a parameter from experimental data in the experimental sciences (most notably astronomy). • Today the use of statistics has broadened far beyond the service of a state or government, to include such areas as business, the natural and social sciences, and medicine, among others. • Statistics emerged in part from probability theory, which can be dated to the correspondence of Pierre de Fermat and Blaise Pascal (1654). Christiaan Huygens (1657) gave the earliest known scientific treatment of the subject. Jakob Bernoulli's Ars Conjectandi (posthumous, 1713) and Abraham de Moivre's Doctrine of Chances (1718) treated the subject as a branch of mathematics.

  22. Developments in the 18th and 19th Centuries • The theory of errors may be traced back to Roger Cotes's Opera Miscellanea (posthumous, 1722), but a memoir prepared by Thomas Simpson in 1755 (printed 1756) first applied the theory to the discussion of errors of observation. • Daniel Bernoulli (1778) introduced the principle of the maximum product of the probabilities of a system of concurrent errors. • The method of least squares, used to minimize errors in data measurement, is due to Robert Adrain (1808), Carl Gauss (1809), and Adrien-Marie Legendre (1805), motivated by problems of survey measurement and the reconciliation of disparate physical measurements. • General theory in statistics: Laplace (1810, 1812), Gauss (1823), James Ivory (1825, 1826), Hagen (1837), Friedrich Bessel (1838), W. F. Donkin (1844, 1856), and Morgan Crofton (1870). Other contributors were Ellis (1844), De Morgan (1864), Glaisher (1872), and Giovanni Schiaparelli (1875).

  23. Statistics in the 20th Century • Karl Pearson (March 27, 1857 – April 27, 1936) was a major contributor to the early development of statistics. Pearson's work was all-embracing in the wide application and development of mathematical statistics, encompassing biology, epidemiology, anthropometry, medicine, and social history. His main contributions are: linear regression and correlation (the Pearson product-moment correlation coefficient was the first important effect size to be introduced into statistics); the classification of distributions, which forms the basis for much of modern statistical theory (in particular, the exponential family of distributions underlies the theory of generalized linear models); and Pearson's chi-square test. • Sir Ronald Aylmer Fisher, FRS (17 February 1890 – 29 July 1962) invented the techniques of maximum likelihood and analysis of variance, and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminant, and Fisher information. His 1924 article "On a distribution yielding the error functions of several well known statistics" presented Karl Pearson's chi-squared and Student's t in the same framework as the normal distribution and his own analysis-of-variance distribution z (more commonly used today in the form of the F distribution). These contributions easily made him a major figure in 20th-century statistics. He also began the field of non-parametric statistics; entropy as well as Fisher information were essential for developing Bayesian analysis.

  24. Statistics in the 20th Century • Gertrude Mary Cox (January 13, 1900 – 1978): experimental design • Charles Edward Spearman (September 10, 1863 – September 7, 1945): non-parametric analysis, the rank correlation coefficient • Chebyshev's inequality • Lyapunov's central limit theorem • John Wilder Tukey (June 16, 1915 – July 26, 2000): jackknife estimation, exploratory data analysis, and confirmatory data analysis • George Bernard Dantzig (8 November 1914 – 13 May 2005): developed the simplex method and furthered linear programming; advanced the fields of decomposition theory, sensitivity analysis, complementary pivot methods, large-scale optimization, nonlinear programming, and programming under uncertainty • Bayes' theorem • Sir David Roxbee Cox (born Birmingham, England, 1924) has made pioneering and important contributions to numerous areas of statistics and applied probability, of which the best known is perhaps the proportional hazards model, widely used in the analysis of survival data.

  25. Schools of Thought in Statistics • The Classical, attributed to Laplace • Relative Frequency, attributed to Fisher • Bayesian, attributed to Savage. What type of statistician are you?

  26. Classical Statistics • The problem with the Classical approach is that what constitutes an outcome is not objectively determined. One person's simple event is another person's compound event. One researcher may ask, of a newly discovered planet, "what is the probability that life exists on the new planet?" while another may ask "what is the probability that carbon-based life exists on it?" • Bruno de Finetti, in the introduction to his two-volume treatise on Bayesian ideas, clearly states that "Probabilities Do Not Exist". By this he means that probabilities are not located in coins or dice; they are not characteristics of things, like mass, density, etc.

  27. Relative Frequency Statistics • Considers probabilities as "objective" attributes of things (or situations) which are really out there (availability of data). • Uses only the data we have to make interpretations. • Even when substantial prior information is available, frequentists do not use it, while Bayesians are willing to assign probability distribution function(s) to the population's parameter(s).

  28. Bayesian Approaches • Consider probability theory as an extension of deductive logic (including dialogue logic, interrogative logic, informal logic, and artificial intelligence) to handle uncertainty. • The first principle is that the uniquely correct starting point is your belief about the state of things (the prior), which you update in the light of the evidence. • The laws of probability have the same status as the laws of logic. • Bayesian approaches are explicitly "subjective" in the sense that they deal with the plausibility which a rational agent ought to attach to the propositions he/she considers, "given his/her current state of knowledge and experience."
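A minimal sketch of this prior-update cycle, using an invented proportion problem (e.g., an unknown defect rate): because the Beta prior is conjugate to the Binomial likelihood, updating the belief in the light of the evidence is just arithmetic.

```python
# Beta-Binomial updating of a belief about an unknown proportion.
prior_a, prior_b = 1, 1          # Beta(1, 1): a flat, "know nothing" prior
successes, failures = 7, 3       # hypothetical observed evidence

post_a = prior_a + successes     # posterior is Beta(prior_a + s, prior_b + f)
post_b = prior_b + failures

post_mean = post_a / (post_a + post_b)
print(f"posterior: Beta({post_a}, {post_b}), mean = {post_mean:.2f}")
```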

  29. Discussion • From a scientist's perspective, there are good grounds to reject Bayesian reasoning. Bayesian reasoning deals not with objective but with subjective probabilities. The result is that any reasoning using a Bayesian approach cannot be checked -- something that makes it worthless to science, like non-replicable experiments. • Bayesian perspectives often shed a helpful light on classical procedures. It is necessary to go into a Bayesian framework to give confidence intervals a probabilistic interpretation. This insight is helpful in drawing attention to the point that another prior distribution would lead to a different interval. • A Bayesian may cheat by basing the prior distribution on the data; priors must be personal and fixed before the study for coherence to hold, which is more complex. • Objective Bayesians: there is a clear connection between probability and logic: both appear to tell us how we should reason. But how, exactly, are the two concepts related? Objective Bayesianism offers one answer to this question.

  30. Steps Of The Analysis • Defining the problem: an exact definition of the problem is imperative in order to obtain accurate data about it. • Collecting the data: designing ways to collect data is an important job in statistical data analysis. Population and sample are key aspects. • Analyzing the data: exploratory methods are used to discover what the data seem to be saying, using simple arithmetic and easy-to-draw pictures to summarize the data. Confirmatory methods use ideas from probability theory in the attempt to answer specific questions. • Reporting the results.
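A minimal sketch of the population/sample distinction mentioned above (entirely simulated numbers): a random sample is drawn from a large simulated population, and its mean is compared with the population mean that, in practice, we would not know.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=170, scale=8, size=100_000)   # e.g. heights in cm

sample = rng.choice(population, size=100, replace=False)  # simple random sample
print(f"population mean: {population.mean():.2f}")
print(f"sample mean    : {sample.mean():.2f}")            # close, but not identical
```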

  31. Types of Data, Levels of Measurement & Errors • Qualitative and quantitative • Discrete and continuous • Nominal, ordinal, interval, and ratio • Types of error: recording error, typing error, transcription error (incorrect copying), inversion (e.g., 123.45 is typed as 123.54), repetition (when a number is repeated), deliberate error, etc.

  32. Data Collection: Experiments • An experiment is a set of actions and observations, performed to solve a given problem or to test a hypothesis or research question concerning phenomena. It is an empirical approach to acquiring deeper knowledge about the physical world. • Design of experiments: in the "hard" sciences it tends to focus on the elimination of extraneous effects, while in the "soft" sciences it focuses more on problems of external validity, using statistical methods. Events also occur naturally from which scientific evidence can be drawn; this is the basis for natural experiments. • Controlled experiments: to demonstrate a cause-and-effect hypothesis, an experiment must often show that, for example, a phenomenon occurs after a certain treatment is given to a subject, and that the phenomenon does not occur in the absence of the treatment. A controlled experiment generally compares the results obtained from an experimental sample against a control sample, which is practically identical to the experimental sample except for the one aspect whose effect is being tested.
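A minimal sketch of analysing such a controlled comparison (both groups simulated, with an invented +5 treatment effect): a permutation test shuffles the group labels to judge whether the observed treatment-versus-control difference could plausibly be random variation.

```python
import numpy as np

rng = np.random.default_rng(1)
control = rng.normal(100, 10, size=30)        # simulated control sample
treatment = rng.normal(105, 10, size=30)      # simulated treatment sample (+5 effect)

observed = treatment.mean() - control.mean()

pooled = np.concatenate([control, treatment])
n_perm, count = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                       # relabel the groups at random
    diff = pooled[30:].mean() - pooled[:30].mean()
    if abs(diff) >= abs(observed):
        count += 1

print(f"observed difference: {observed:.2f}, permutation p ≈ {count / n_perm:.3f}")
```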

  33. Data Collection: Experiments • Natural experiments or quasi-experiments: natural experiments rely solely on observations of the variables of the system under study, rather than manipulation of just one or a few variables as occurs in controlled experiments. Much research in several important science disciplines, including geology, paleontology, ecology, meteorology, and astronomy, relies on quasi-experiments. • Observational studies: observational studies are very much like controlled experiments except that they lack probabilistic equivalency between groups. These types of studies often arise in the area of medicine where, for ethical reasons, it is not possible to create a truly controlled group. • Field experiments: named in order to draw a contrast with laboratory experiments, and often used in the social sciences, economics, etc. Field experiments suffer from the possibility of contamination: experimental conditions can be controlled with more precision and certainty in the lab.

  34. Data Analysis • It will follow different approaches!

  35. Applied Statistics

  36. Actuarial science: applies mathematical and statistical methods to finance and insurance, particularly to the assessment of risk. Actuaries are professionals who are qualified in this field.

  37. Actuarial science • Actuarial science is the discipline that applies mathematical and statistical methods to assess risk in the insurance and finance industries. Actuaries are professionals who are qualified in this field through examinations and experience. • Actuarial science includes a number of interrelating subjects, including probability and statistics, finance, and economics. Historically, actuarial science used deterministic models in the construction of tables and premiums. The science has gone through revolutionary changes during the last 30 years due to the proliferation of high speed computers and the synergy of stochastic actuarial models with modern financial theory (Frees 1990). • Many universities have undergraduate and graduate degree programs in actuarial science. In 2002, a Wall Street Journal survey on the best jobs in the United States listed “actuary” as the second best job (Lee 2002).

  38. Where Do Actuaries Work and What Do They Do? The insurance industry can't function without actuaries, and that's where most of them work. They calculate the costs to assume risk: how much to charge policyholders for life or health insurance premiums, or how much an insurance company can expect to pay in claims when the next hurricane hits Florida. Actuaries provide a financial evaluation of risk for their companies to be used for strategic management decisions. Because their judgement is heavily relied upon, actuaries' career paths often lead to upper management and executive positions. When other businesses that do not have actuaries on staff need certain financial advice, they hire actuarial consultants. A consultant can be self-employed in a one-person practice or work for a nationwide consulting firm. Consultants help companies design pension and benefit plans and evaluate assets and liabilities. By delving into the financial complexities of corporations, they help companies calculate the cost of a variety of business risks. Consulting actuaries rub elbows with chief financial officers, operating and human resource executives, and often chief executive officers. Actuaries work for the government too, helping manage such programs as the Social Security system and Medicare. Since the government regulates the insurance industry and administers laws on pensions and financial liabilities, it also needs actuaries to determine whether companies are complying with the law. Who else asks an actuary to assess risks and solve thorny statistical and financial problems? You name it: banks and investment firms, large corporations, public accounting firms, insurance rating bureaus, labor unions, and fraternal organizations.

  39. Typical actuarial projects: • Analyzing insurance rates, such as for cars, homes, or life insurance. • Estimating the money to be set aside for claims that have not yet been paid. • Participating in corporate planning, such as mergers and acquisitions. • Calculating a fair price for a new insurance product. • Forecasting the potential impact of catastrophes. • Analyzing investment programs.
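As a minimal, textbook-style sketch of the pricing task listed above (all figures hypothetical, not a real rating method): the pure premium is the expected claim frequency times the expected claim severity, which is then loaded for expenses and profit.

```python
expected_frequency = 0.05       # expected claims per policy per year (assumed)
expected_severity = 8_000.0     # expected cost per claim (assumed)
expense_loading = 0.25          # share of gross premium kept for expenses/profit

pure_premium = expected_frequency * expected_severity
gross_premium = pure_premium / (1 - expense_loading)

print(f"pure premium:  {pure_premium:,.2f}")   # 400.00
print(f"gross premium: {gross_premium:,.2f}")  # 533.33
```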

  40. VEE–Applied Statistical Methods Courses that meet this requirement may be taught in the mathematics, statistics, or economics department, or in the business school. In economics departments, this course may be called Econometrics. The material could be covered in one course or two. The mathematical sophistication of these courses will vary widely, and all levels are intended to be acceptable. Some analysis of real data should be included. Most of the topics listed below should be covered: • Probability (3 pts.) • Statistical Inference (3 pts.) • Linear Regression Models (3 pts.) • Time Series Analysis (3 pts.) • Survival Analysis (3 pts.) • Elementary Stochastic Processes (3 pts.) • Simulation (3 pts.) • Introduction to the Mathematics of Finance (3 pts.) • Statistical Inference and Time-Series Modelling (3 pts.) • Stochastic Methods in Finance (3 pts.) • Stochastic Differential Equations and Applications (3 pts.) • Advanced Data Analysis (3 pts.) • Data Mining (3 pts.) • Statistical Methods in Finance (3 pts.) • Nonparametric Statistics (3 pts.) • Stochastic Processes and Applications (3 pts.)

  41. Some Books • Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller • Stochastic Claims Reserving Methods in Insurance (The Wiley Finance Series), by Mario V. Wüthrich and Michael Merz • Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems, by Michel Denuit, Xavier Marechal, Sandra Pitrebois and Jean-Francois Walhin • Loss Models: From Data to Decisions (Wiley Series in Probability and Statistics), by Stuart A. Klugman, Harry H. Panjer and Gordon E. Willmot

  42. Biostatistics or Biometry • Biostatistics or biometry is the application of statistics to a wide range of topics in biology. Its applications include: • Public health, including epidemiology, nutrition, and environmental health • Design and analysis of clinical trials in medicine • Genomics, population genetics, and statistical genetics in populations, in order to link variation in genotype with variation in phenotype • Ecology • Biological sequence analysis

  43. Data Mining • Data mining, or Knowledge Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns. • It is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. • Data mining involves the process of analyzing data. • Data mining is a fairly recent and contemporary topic in computing. • Data mining applies many older computational techniques from statistics, machine learning, and pattern recognition.

  44. Data Mining and Business Intelligence (layered view, with increasing potential to support business decisions toward the top): • End user: decision making • Business analyst: data presentation, visualization techniques • Data analyst: data mining, information discovery • Data exploration: statistical summary, querying, and reporting • DBA: data preprocessing/integration, data warehouses • Data sources: paper, files, Web documents, scientific experiments, database systems

  45. Data Mining: Confluence of Multiple Disciplines. Data mining draws on database technology, statistics, machine learning, pattern recognition, visualization, algorithms, and other disciplines.

  46. Data Mining: On What Kinds of Data? • Database-oriented data sets and applications • Relational databases, data warehouses, transactional databases • Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data (incl. bio-sequences) • Structured data, graphs, social networks, and multi-linked data • Object-relational databases • Heterogeneous databases and legacy databases • Spatial data and spatiotemporal data • Multimedia databases • Text databases • The World-Wide Web

  47. Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I) • Classification • #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. • #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984. • #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6). • #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398. • Statistical Learning • #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag. • #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York. • Association Analysis • #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94. • #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
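To make one of these concrete, here is a minimal from-scratch sketch of k-nearest-neighbour classification (#3 above) on tiny, invented 2-D points; a real application would use an established library implementation.

```python
import math
from collections import Counter

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]   # (point, label) pairs

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to x, then vote among the k nearest.
    nearest = sorted(train, key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9)))   # two "A" neighbours out of three -> "A"
```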

  48. The 18 Identified Candidates (II) • Link Mining • #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998. • #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998. • Clustering • #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967. • #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96. • Bagging and Boosting • #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
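Similarly, a minimal from-scratch sketch of K-Means (#11 above) on invented 1-D data: the algorithm alternates between assigning points to their nearest centroid and moving each centroid to the mean of its cluster, until nothing changes.

```python
def kmeans_1d(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

print(kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1.0, 2.0]))
# -> ([2.0, 11.0], [[1, 2, 3], [10, 11, 12]])
```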

  49. The 18 Identified Candidates (III) • Sequential Patterns • #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996. • #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01. • Integrated Mining • #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98. • Rough Sets • #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992 • Graph Mining • #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.
