
Mining Baseball Statistics

Mining Baseball Statistics. Data Mining – CSE881. Paul Cornwell Kajal Miyan Mojtaba Solgi Project URL: http://kmp-cse881.appspot.com/. Overview of Baseball. Baseball is a team sport There are two major leagues: AL (American), NL (National)


Presentation Transcript


  1. Mining Baseball Statistics
     Data Mining – CSE881
     Paul Cornwell, Kajal Miyan, Mojtaba Solgi
     Project URL: http://kmp-cse881.appspot.com/

  2. Overview of Baseball
     • Baseball is a team sport
     • There are two major leagues: AL (American), NL (National)
     • Many statistics characterizing player performance are published yearly
     • Each league names one player MVP (Most Valuable Player) each year according to a vote
     • People place bets on who will be MVP

  3. Overview
     • Application (motivation):
        • Can we predict who will be named MVP?
        • Learn how to do data mining
        • Learn about baseball
        • Impress sabermetricians
        • Baseball: it’s not diseases, crime, or pollution
     • Baseball statistics
        • Main task: predict MVPs for a given year
        • Use SVM to rank players

  4. Overview of Data and Mining
     • Data: 5 CSV files (Batting, Fielding, Master, Awards, Salaries)
     • Data Mining:
        • Ranking (similar to classification)
        • Anomaly detection (maybe)

     Sample rows from the data:

     playerID   yearID  stint  teamID  lgID  Gbat   AB   R   H  2B  3B  HR  RBI  SB  SO
     aasedo01     1985      1  BAL     AL      54    0   0   0   0   0   0    0   0   0
     abregjo01    1985      1  CHN     NL       6    9   0   0   0   0   0    1   0   2
     ackerji01    1985      1  TOR     AL      61    0   0   0   0   0   0    0   0   0
     adamsri02    1985      1  SFN     NL      54  121  12  23   3   1   2   10   1  23
     agostju01    1985      1  CHA     AL      54    0   0   0   0   0   0    0   0   0
     aguaylu01    1985      1  PHI     NL      91  165  27  46   7   3   6   21   1  26
     aguilri01    1985      1  NYN     NL      22   36   1  10   2   0   0    2   0   5

  5. Methodology - Preprocessing
     • Initial data: ~90,000 rows in the Batting table, 1871-2007
     • One row: one player/year/stint/team
     • Cut to 1985-2007 (~28,000 rows) because salary data begins in 1985 and because of rule changes
     • Perl script to merge tables by playerID/yearID/stint
     • Batting, Fielding, Awards (MVP), Salaries, and Master merged = 48 columns
     • ~14 hours, but I got to relearn Perl!
     • Discovered: infeasible to use WEKA, need to use SVM-Light
     • Reformatted from CSV to space-delimited SVM-Light format
        • replace every “value” with “attribute:value”
        • replace commas, spaces
        • delete 131 rows without a fielding record (the three largest had 26, 21, and 16 at-bats)
        • create a (binary) rank value based on MVP status
        • replace all MM/DD/YYYY dates with YYYY
        • insert a “qid” column according to year/league (46 qids)
        • ...
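The reformatting step on slide 5 could look roughly like the sketch below. It is not the authors' Perl script: the function names, the `MVP` column name, and the choice to number qids in sorted (year, league) order are all assumptions made for illustration. It relies only on the SVM-Light line layout `<target> qid:<n> <index>:<value> ...`, with feature indices in increasing order.

```python
def assign_qids(rows):
    """Map each (yearID, lgID) pair to a 1-based query id.

    Assumption: qids are numbered in sorted (year, league) order.
    23 seasons (1985-2007) x 2 leagues would give the 46 qids on the slide.
    """
    keys = sorted({(r["yearID"], r["lgID"]) for r in rows})
    return {key: i for i, key in enumerate(keys, start=1)}


def to_svmlight(row, feature_names, qid):
    """Format one player/year/stint row as an SVM-Light ranking line."""
    # Binary rank value from MVP status ("MVP" is a hypothetical column name).
    target = 1 if row.get("MVP") else 0
    parts = [str(target), f"qid:{qid}"]
    # Feature indices start at 1 and must appear in increasing order.
    for index, name in enumerate(feature_names, start=1):
        value = float(row[name])
        if value != 0.0:  # sparse format: zero-valued features can be omitted
            parts.append(f"{index}:{value:.10g}")
    return " ".join(parts)


if __name__ == "__main__":
    rows = [
        {"yearID": 1985, "lgID": "AL", "HR": 0, "RBI": 0, "MVP": False},
        {"yearID": 1985, "lgID": "NL", "HR": 2, "RBI": 10, "MVP": False},
    ]
    qids = assign_qids(rows)
    for r in rows:
        print(to_svmlight(r, ["HR", "RBI"], qids[(r["yearID"], r["lgID"])]))
```

Grouping by qid means the ranker only compares players within the same year and league, which matches the one-MVP-per-league-per-year structure of the award.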

  6. Methodology – Data Mining
     • Classification is not apt to give good results, hence ranking with SVM-Light (Cornell University)
        • Training generates a model which can rank input
        • Training phase: leave one (year) out
        • Testing: rank the players for that year
     • Postprocessing
        • SVM-Light returns only ranks of the players as integers
        • Match ranks with corresponding players
        • Reformat data for visualization
        • Ranked the data for each attribute
     • Anomaly detection (in progress)
        • KNN on 4 attributes (Gbat, R, HR, RBI) for players in >= 10 games
        • Compute z-scores for each attribute/year
        • Rank players by distance from nearest neighbor
        • Compare ranks in various attributes for detecting anomalies
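The anomaly-detection bullets on slide 6 can be sketched as follows. This is a minimal illustration, not the authors' implementation (the slides list it as in progress): it assumes the rows passed in all belong to one year, uses Euclidean distance over the four z-scored attributes, and takes the single nearest neighbor. The function names are hypothetical, and at least two eligible players are required.

```python
import math
from statistics import mean, pstdev

ATTRS = ["Gbat", "R", "HR", "RBI"]


def zscore_rows(rows, attrs=ATTRS):
    """Standardize each attribute across the given rows (one year's players)."""
    stats = {}
    for a in attrs:
        values = [r[a] for r in rows]
        sd = pstdev(values)
        # Guard against a constant column, which would divide by zero.
        stats[a] = (mean(values), sd if sd > 0 else 1.0)
    return [[(r[a] - stats[a][0]) / stats[a][1] for a in attrs] for r in rows]


def nn_distances(vectors):
    """Euclidean distance from each vector to its nearest other vector."""
    return [
        min(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]


def rank_anomalies(rows, min_games=10):
    """Return playerIDs ordered from most to least isolated."""
    eligible = [r for r in rows if r["Gbat"] >= min_games]
    dists = nn_distances(zscore_rows(eligible))
    order = sorted(zip(dists, (r["playerID"] for r in eligible)), reverse=True)
    return [pid for _, pid in order]
```

The pairwise loop is quadratic in the number of players, which is acceptable for roughly a thousand players per year; a spatial index would be the natural upgrade for larger inputs.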

  7. Methodology - Visualization
     • Bar charts of top 20 ranked players for various attributes
        • Python
        • Google App Engine
        • Google Charts tool
     • U.S. map of player birthState density

  8. Team Roles
     • Roles of team members
        • Planning – Everyone
        • Preprocessing – Paul Cornwell
        • Data Mining – Kajal Miyan
        • Visualization – Mojtaba Solgi

  9. Related Work
     • No apparent academic work on predicting MLB MVPs
     • PECOTA
        • Baseball Prospectus
        • www.baseballprospectus.com/pecota/
        • Baseball “forecasting”
        • Makes statistical predictions about players
        • No MVP prediction evident
        • Subscription service
     • Books are available with baseball forecasts
        • Apparently for one year only

  10. Experimental Setup
      • Raw data downloaded from http://baseball1.com/content/view/58/82/
      • Preprocessing done using Perl, Nano, Excel, OOo, TextPad
      • Preprocessing yields a table with ~28K rows and 45 columns
      • Experiments were conducted on a 2 GHz P4 machine with 1 GB RAM running Kubuntu 8.04
      • Data mining and postprocessing with SVM-Light, Visual C#, Matlab
      • Visualization done using Python and Google App Engine

  11. Experimental Evaluation
      • Preliminary results
         • SVM-Light trained on 1985-2006 data, tested on 2007
         • Ranked actual MVPs #1 and #11 (out of 1242 players) (2nd NL, #2)
         • (There is one MVP for each league each year: AL, NL)
         • 2006: ranks 7, 16 (1371 players)
         • 2005: ranks 1, 4 (1322 players)
         • 2004: ranks 1, 3 (1342 players)
         • 2003: ranks 3, 32 (1341 players)
         • 2002: ranks 1, 11 (1316 players)
      • Final evaluation (pending)
         • Leave-one-out
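The per-year figures on slide 11 (for example "ranks 1, 4") amount to looking up where the actual MVPs fall in the model's ordering for a held-out year. A minimal sketch of that bookkeeping follows. It assumes each player has a numeric score where higher means a better predicted rank; the tuple layout and function names are hypothetical, and the SVM-Light training call itself is left out.

```python
def leave_one_year_out(years):
    """Yield (training_years, held_out_year) pairs, one per year."""
    for held_out in years:
        yield [y for y in years if y != held_out], held_out


def mvp_ranks(scored_players):
    """Return {playerID: rank} for the actual MVPs in one held-out year.

    scored_players: list of (playerID, score, is_mvp) tuples.
    Rank 1 is the highest-scoring player in the pool that was passed in.
    """
    ordered = sorted(scored_players, key=lambda p: p[1], reverse=True)
    return {
        pid: pos
        for pos, (pid, _, is_mvp) in enumerate(ordered, start=1)
        if is_mvp
    }
```

A loop over `leave_one_year_out(range(1985, 2008))` would train on the returned years, score the held-out year, and collect `mvp_ranks` for each one, which is the shape of the pending leave-one-out evaluation the slide mentions.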

  12. Visualization Demo
      • http://kmp-cse881.appspot.com/

  13. Conclusions
      • MVP ranking was surprisingly successful
      • Early results suggest that it is feasible to predict MVPs with some accuracy
      • Lessons learned
         • Data mining is hard work
         • Baseball statistics are actually sort of interesting
      • Future work
         • Leave-one-out validation
         • Incorporate team statistics in player evaluations (expert advice)
