Preventing Data Mining Disasters: A Comprehensive Safety Guide
This report by Mary McGlohon for the SIGBOVIK Commission discusses the various data mining disasters that threaten scientific research, including numeric overflow and power law failures. It highlights substantial financial impacts and the loss of valuable research hours. The document reviews common mining pitfalls and offers practical recommendations for prevention. Emphasizing the importance of safety, it urges researchers to maintain rigor in their methods and remain calm when facing challenges, while also promoting awareness of potential hazards in data mining.
Presentation Transcript
Data Mining Disasters • A Report • Mary McGlohon • SIGBOVIK Commission for Workplace Safety
Data Mining Safety • Data mining disasters are a hazard to the progress of scientific research. • We will review some common mining disasters and make recommendations for prevention.
Numeric Overflow • “In 2007, numeric floods were responsible for over $600 million in property damages.” - Department of Made-Up Statistics
Numeric Overflow • ERROR::NUMERICOVERFLOW • Nobody expected the breach of the levees.
Numeric Overflow • Also caused loss of several hundred nerd-hours. • 1 nerd-hour = 1 grad-student-hour = 0.25 faculty-hours = 6 undergrad-hours
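For anyone filing a damage claim, the exchange rates above can be applied mechanically. The following is a minimal sketch, not an official Commission tool: the table name `RATES_PER_NERD_HOUR`, the function `convert_hours`, and the figure of 300 nerd-hours used in the usage note are all hypothetical and chosen only for illustration.

```python
# Exchange rates as quoted on the slide, expressed per 1 nerd-hour.
# The dictionary and function names are illustrative assumptions.
RATES_PER_NERD_HOUR = {
    "nerd": 1.0,
    "grad-student": 1.0,
    "faculty": 0.25,
    "undergrad": 6.0,
}


def convert_hours(amount, from_unit, to_unit):
    """Convert an amount of hours between units via nerd-hours."""
    # First express the amount in nerd-hours, then in the target unit.
    nerd_hours = amount / RATES_PER_NERD_HOUR[from_unit]
    return nerd_hours * RATES_PER_NERD_HOUR[to_unit]
```

Under these rates, a hypothetical loss of 300 nerd-hours would come to `convert_hours(300, "nerd", "faculty")` = 75 faculty-hours, or 1800 undergrad-hours.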
Numeric Overflow • Recommendation: A drowning researcher’s best bet is to grab onto a floating log.
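Taken literally, the floating log is sound advice: working with logarithms of large quantities is a standard way to stay above a numeric flood. The sketch below shows one common instance, the log-sum-exp trick. It is an illustration rather than anything from the report itself, and the function name `log_sum_exp` is an assumption.

```python
import math


def log_sum_exp(log_values):
    """Return log(sum(exp(v) for v in log_values)) without overflowing.

    Subtracting the maximum before exponentiating keeps every exponent
    at or below zero, so math.exp never exceeds 1.
    """
    m = max(log_values)
    if m == float("-inf"):
        # Every term is exp(-inf) = 0, so the log of the sum is -inf.
        return m
    return m + math.log(sum(math.exp(v - m) for v in log_values))
```

Calling `math.exp(1000.0)` directly raises `OverflowError` in Python, whereas `log_sum_exp([1000.0, 1000.0])` returns `1000 + log(2)` without incident.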
Power Law Failures • Occur when researchers confuse heavy-tailed distributions such as: • Power Law (incl. Pareto, Zipf) • Lognormal • Weibull • Burr • Log-gamma • Log-Log-Log-Log-Mushroom-Mushroom
Power Law Failures • Many natural phenomena have heavy tails. • Magnitude of earthquakes • Size of human settlements • Degree distribution of “real” graphs • Time-to-response in CS professors' email • Your mom • However, confusing heavy-tailed distributions results in...
Power Law Failures • Related danger: Statisticians, computer scientists, and physicists wasting valuable nerd-hours in religious arguments over which heavy-tailed distribution is being followed.
Power Law Failures • Statisticians get mean when they get religious. (SIGBOVIK07) • Recommendation: Calm the hell down.
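One calm alternative to a religious argument is to fit the candidate distributions and compare how well each explains the data. The sketch below does this for two of the distributions named earlier, Pareto and lognormal, using maximum-likelihood fits. It is a simplified illustration under stated assumptions: the function names `pareto_fit` and `lognormal_fit` are hypothetical, the Pareto cutoff `x_min` is assumed known rather than estimated, and the lognormal is fitted over its full support rather than truncated at `x_min`, so the comparison is rough rather than a formal test.

```python
import math


def pareto_fit(xs, x_min):
    """Maximum-likelihood Pareto fit with a known cutoff x_min.

    Assumes every x >= x_min and density alpha * x_min**alpha / x**(alpha + 1).
    Returns (alpha, log_likelihood).
    """
    n = len(xs)
    sum_log = sum(math.log(x) for x in xs)
    alpha = n / (sum_log - n * math.log(x_min))
    log_lik = (n * math.log(alpha)
               + n * alpha * math.log(x_min)
               - (alpha + 1) * sum_log)
    return alpha, log_lik


def lognormal_fit(xs):
    """Maximum-likelihood lognormal fit. Returns (mu, sigma, log_likelihood)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    # At the MLE the quadratic term of the log-likelihood reduces to n / 2.
    log_lik = -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n
    return mu, math.sqrt(var), log_lik
```

The fit with the higher log-likelihood explains the sample better under these assumptions. A more careful analysis would also estimate `x_min` and check whether the difference is statistically significant before anyone declares victory.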
Decision Tree Forest Fires • Pruning is used to prevent overfitting. • When overpruning occurs, trees are burned to stumps. • This spreads, torching entire forests. ☹ (Aww...)
Decision Tree Forest Fires • Recommendation: Researchers should obtain a burning permit before pruning with fire. • Smoking while researching is not recommended; if you choose to do so, make sure your “butts are out”.
Voting Fraud by One-Armed Bandits • Cascading failures from other fields may cause disasters in data mining. • Fatal mistake: combining the related subfields of voting mechanisms and one-armed bandit problems.
Voting Fraud by One-Armed Bandits • One-armed bandits commit voting fraud by: • Impersonating real voting machines. • Cramming cake into voting machines. • (The cake is a lie.)
Other safety measures • Cool mining helmets
Conclusion • The Commission for Workplace Safety hopes this has raised awareness of potential data mining disasters. • When faced with data mining disasters: • Remain Calm. ☺ • Blame it on one-off errors, lack of rigor in proofs of correctness, or whatever government agency is funding the project.