
Data analysis

Data analysis. Malathi Veeraraghavan Professor Charles L. Brown Dept. of Electrical and Computer Engineering University of Virginia mvee@virginia.edu Date: July 6, 2010. Outline. Increasing interest in data Data mining New course: From Data to Knowledge One example data set analysis



Presentation Transcript

  1. Data analysis Malathi Veeraraghavan Professor Charles L. Brown Dept. of Electrical and Computer Engineering University of Virginia mvee@virginia.edu Date: July 6, 2010

  2. Outline • Increasing interest in data • Data mining • New course: From Data to Knowledge • One example data set analysis • Summary

  3. “The data deluge” “Data, data everywhere” • Economist Special Issue Feb 27-Mar. 5, 2010 • Walmart databases alone are estimated at more than 2.5 petabytes (a petabyte is 1 million gigabytes). • From businesses to governments, data collection and analysis is rapidly becoming the next big thing. • The industry of information management is growing at almost 10% a year, roughly twice as fast as the software business.

  4. “The data deluge” • “A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.” • Hal Varian, Google’s chief economist notes that “Data are widely available; what is scarce is the ability to extract wisdom from them.”

  5. Business intelligence • Nestle sells > 100,000 products in 200 countries using 550,000 suppliers • Problem: not using its huge buying power effectively • Used SAP software and analyzed its data • For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M per year • Overall annual savings: $1 billion Economist special issue

  6. Medical use • Dr. McGregor from University of Ontario • Goal: spot fatal infections in premature babies • Monitors subtle changes in 7 streams of real-time data, such as heart rate, blood pressure, etc. • ECG alone takes 1000 readings/second • Infections are detected before obvious symptoms emerge • Naked eye cannot see it, but the computer can! • Who programs these? CS graduates. • Another term: Evidence Based Medicine Economist special issue

  7. Government usage • An add-on to a 1986 law required firms to disclose the harmful chemicals they release. • When the public started tracking these numbers, by 2000, American businesses had reduced their emissions by 40% Economist special issue

  8. Best-sellers • “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres • “Moneyball: The Art of Winning an Unfair Game” by Michael Lewis • “The Long Tail” by Chris Anderson • Malcolm Gladwell books - Outliers • Microtrends – Mark Penn (elections) • Freakonomics – S. Dubner and S. Levitt

  9. Moneyball example • 2002 season: Richest team, NY Yankees, had a payroll of $126 million, while the Oakland A’s had a payroll of less than a third of that, about $40 million, and yet they had reached the playoffs three years in a row, and took the Yankees close to elimination. How did they do it? • Billy Beane, general manager of Oakland A’s • Respected statistics • Hired Paul DePodesta, Harvard MBA, who applied Bill James’ formulas and selected players based on their statistics. • Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks) • Jeremy Brown – only player in the history of the SEC with 300 hits and 200 walks, but he was overweight • Scouts vs. statisticians! • The tendency of everyone to generalize wildly from his own experience. Most people think their own experience is typical!
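
As a rough illustration of the runs-created formula on this slide, the sketch below evaluates it for one batting line. The function name and the season totals are hypothetical, invented only to show the arithmetic; they are not taken from the presentation or from any real player.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic runs-created estimate: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities <= 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, made up for illustration only.
example = runs_created(hits=150, walks=50, total_bases=250, at_bats=500)
print(round(example, 1))
```

Walks appear in both the numerator and the denominator, which is one way to see why a hitter who draws many walks scores well under this formula.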

  10. Malcolm Gladwell's “Outliers” hockey players story • Why Canadian hockey players born early in the year have a big advantage; cutoff date was Jan. 1 • ESPN conducted a little study: All the NHL players from this season who were born from 1980 to 1990. • Sure enough: Many more were born early in the year than late. http://sports.espn.go.com/espn/page2/story?page=merron/081208

  11. Examples from “The Long Tail” • Rhapsody, an online music store, which in Dec. 2005 had 1.5M tracks, reported that the number of downloads/month for even the 100,000th track was in the thousands, whereas a Walmart store, the largest brick-and-mortar music retailer, stocks only 55,000 tracks. • Rhapsody reports that 40% of its total sales came from Long Tail products, i.e., those not available in retail stores. • Anderson gives several such examples, calling these businesses Long-Tail aggregators • Google as the long-tail aggregator of advertising • eBay of goods • Amazon of books • Apple of music • Netflix of movies

  12. Experts vs. intuition • Ian Ayres’ book • “The future belongs to people like Wolfers who are comfortable with both intuition and numbers” • Wolfers analyzed 44,000 college basketball games (over 16 years) • Also see Jonah Lehrer’s “How We Decide” – another bestseller Ian Ayres’ book, page 220

  13. What Wolfers did • Plot density function of number of games that beat the Las Vegas spread • Perfect normal bell curve! • Just look at games with point spreads less than or equal to 12 • Perfect normal bell curve • Look at games with point spread > 12 • 47% chance that the favored team beat the spread (53% failed to cover the spread) • more than 20% of games fell in this category of games with >12 spreads • Is it point shaving? • Look at the score five minutes before the end of the game – right on track to beat the spread 50% of the time! • Indeed a stronger case for point shaving Ian Ayres’ book, page 216
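
One way to see why a 47% cover rate is suspicious rather than just bad luck is to ask how many standard deviations it sits below the 50% expected from a fair spread. The sketch below does this with a normal approximation. The game count is an assumption: it takes exactly 20% of the 44,000 games, whereas the slide only says "more than 20%", and it treats games as independent. The function name is hypothetical.

```python
import math


def cover_rate_z_score(observed_rate, n_games, expected_rate=0.5):
    """Standard deviations between observed and expected cover rate."""
    standard_error = math.sqrt(expected_rate * (1 - expected_rate) / n_games)
    return (observed_rate - expected_rate) / standard_error


# Assumption: exactly one fifth of the 44,000 games had spreads above 12.
n_heavy_favorite_games = 44_000 // 5
z = cover_rate_z_score(0.47, n_heavy_favorite_games)
print(round(z, 2))
```

Under these assumptions the shortfall is far more than two standard deviations, so chance alone is an unlikely explanation.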

  14. 2SD Rule: To understand variability • There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean • Statistical significance – simple intuitive concept – there is less than 5% chance that a normally distributed random variable will be more than two standard deviations away from its mean. • Stanford Law school students knew that professors were required to give a 3.2 mean. They wanted to know if the professor was a “spreader” or a “clumper”, i.e., whether the SD of the grades was large or small! Ian Ayres’ book, page 213
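
The 95% figure in the 2SD rule can be checked numerically. This is a minimal sketch using only the standard library's error function; the helper name `prob_within` is hypothetical.

```python
import math


def prob_within(k_sd):
    """P(|Z| <= k_sd) for a standard normal variable Z."""
    return math.erf(k_sd / math.sqrt(2))


inside = prob_within(2)
outside = 1 - inside
print(round(inside, 4), round(outside, 4))
```

The exact value is slightly above 95%, which is why the chance of landing more than two standard deviations away is slightly below 5%.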

  15. “Margin of error” • News article says “Laverne is leading Shirley 51% to 49% with a margin of error of 2%” and so the race is a “statistical dead heat.” • This is wrong! Why? • Margin of error = 2SD • So standard deviation is 1% • Laverne’s 51% is one SD above the 50% mark • This means there is an 84% chance that Laverne really leads Ian Ayres’ book, page 224

  16. P(X ≤ 1) ≈ 0.84, where X ~ N(0, 1)
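
The 84% figure follows from treating Laverne's true share as normally distributed around the polled 51% with a standard deviation of 1 point. The sketch below reproduces that calculation; the variable names are hypothetical, and the normal model is the assumption the slides already make.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """P(X <= x) for X ~ N(mean, sd^2)."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


margin_of_error = 2.0          # percentage points, as quoted in the article
sd = margin_of_error / 2       # margin of error = 2 SD
poll_share = 51.0              # Laverne's polled share, in percent

# Chance that her true share is above the 50% needed to lead.
prob_lead = 1 - normal_cdf(50.0, mean=poll_share, sd=sd)
print(round(prob_lead, 2))
```

An 84% chance of leading is well short of certainty, but it is also far from the coin flip that "statistical dead heat" suggests.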

  17. Exercise • See if you can use the 2SD rule and just your intuition to derive a number for the standard deviation for adult male height • Estimate two things: mean and standard deviation Ian Ayres’ book, page 214

  18. An answer • Average adult male height is 5’ 9” • To estimate SD, 95% of adult males should fall between what two heights? • Say 5’ 3” and 6’3”? • Then SD = 3” – Just a guess • Can be fairly confident that SD is not 1” or 5”!
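
The guess above can be written out as arithmetic: under the 2SD rule, a range that holds 95% of values spans four standard deviations. The helper names below are hypothetical, and the heights are the guessed values from the slide, not measured data.

```python
def to_inches(feet, inches):
    """Convert a height given in feet and inches to inches."""
    return 12 * feet + inches


def sd_from_95_range(low, high):
    """2SD rule: a 95% range covers mean - 2 SD to mean + 2 SD."""
    return (high - low) / 4


low = to_inches(5, 3)    # guessed lower end of the 95% range
high = to_inches(6, 3)   # guessed upper end of the 95% range
mean = (low + high) / 2
sd = sd_from_95_range(low, high)
print(mean, sd)
```

The midpoint of the guessed range comes out at 69 inches, i.e. 5’ 9”, which agrees with the mean given on the slide.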

  19. Technology trends enabling all this data analysis • Cloud computing • Amazon, Google, Yahoo, Microsoft • Open source software • R programming language • NY Times article, Jan. 7, 2009 • Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers Economist special issue

  20. Technology or techniques? • Moore’s Law • Processing power doubles every two years • Supercrunching does need CPUs, but computing power has been available • More important: Kryder’s Law • Storage capacity of hard drives has been doubling every two years • Named after Mark Kryder, chief technology officer of hard drive manufacturer Seagate Ian Ayres’ book, page 151

  21. Examples of data storage • Yahoo! • “Captures 12 TB of data every day” • Equivalent to half the books in the Library of Congress • Costs • A 1 TB hard drive cost $400 in 2007. Now (2010) it is $65! • Usage • Allows every Hertz and UPS employee to use handheld machines and capture every transaction’s data Ian Ayres’ book, page 152
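
The two price points on this slide imply a rate of decline that can be compared with the two-year doubling quoted for Kryder's Law. The sketch below computes the implied halving time of the price, assuming a smooth exponential decline between 2007 and 2010; that assumption and the function name are illustrative, not from the presentation.

```python
import math


def halving_time_years(old_price, new_price, years_elapsed):
    """Years for the price to halve, assuming steady exponential decline."""
    return years_elapsed * math.log(2) / math.log(old_price / new_price)


# Prices from the slide: $400 per TB in 2007, $65 per TB in 2010.
t_half = halving_time_years(old_price=400, new_price=65, years_elapsed=3)
print(round(t_half, 2))
```

Under that assumption the price per terabyte halved in a little over a year, which is faster than a two-year doubling of capacity alone would suggest.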

  22. Three techniques • Regressions • error term ~ N(0, σ²) • Randomization • Run experiments by treating different samples in different ways • Neural networks • Functional form is not assumed to be linear or anything specific Ian Ayres’ book
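
To make the regression bullet concrete, here is a minimal sketch of a straight-line fit by ordinary least squares on simulated data with a normally distributed error term. Everything in it is an assumption for illustration: the true intercept and slope, the noise level, the sample size, and the function name are all made up, and the course itself uses R rather than Python.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated data: hypothetical true line plus normal noise.
rng = random.Random(0)
true_intercept, true_slope, noise_sd = 1.0, 2.0, 0.5
xs = [i / 10 for i in range(200)]
ys = [true_intercept + true_slope * x + rng.gauss(0, noise_sd) for x in xs]

intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))
```

With this much data the fitted coefficients land close to the values used to generate it; a neural network, by contrast, would not require the straight-line form to be specified in advance.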

  23. Opportunities for CS programmers • Implementing • statistical analysis techniques • machine learning techniques • neural networks • Visualization tools • Wattenberg’s idea to show a map of the market instead of graphs showing index movements • Privacy issues

  24. Data mining • Techniques • statistical analysis • machine learning • neural networks • Examples • Walmart • NBA (basketball) analysis http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm

  25. Course material • From Data to Knowledge • First offering: Fall 2010 • Focus on data sets • Less on statistical techniques • Learn R programming through class-provided R programs • http://www.ece.virginia.edu/mv/edu/D2K/index.htm

  26. Summary • Importance of data analysis • in every walk of life! • Application area • complexity in coding mathematical techniques, visualization, privacy • Importance of computer engineering advances, e.g., storage • Teaching languages/protocols with examples? Untested
