
Research Bytes 2004

Research Bytes 2004. Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute. Need for Data Mining. Data are being gathered and stored extremely fast





Presentation Transcript


  1. Research Bytes 2004 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute

  2. Need for Data Mining • Data are being gathered and stored extremely fast • Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman & Varian at UC-Berkeley in 2000). • Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data

  3. What is Data Mining? Or more generally, Knowledge Discovery in Databases (KDD) • “Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996] • Raw Data → Data Mining → Patterns • Analytical and Statistical Patterns (rules, decision trees, …) • Visual Patterns • Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54. Fall 1996.

  4. Data Analysis (KDD) Process • data sources → data management (databases, data warehouses) → data • data “pre”-processing (noisy/missing data, dim. reduction) → clean data • data analysis / data mining (analytical, statistical, visual) → models • model/pattern evaluation (quantitative, qualitative) → “good” model • model/pattern deployment (prediction, decision support) on new data

  5. KDD is Interdisciplinary: techniques come from multiple fields • Machine Learning (AI): contributes (semi-)automatic induction of empirical laws from observations & experimentation • Statistics: contributes language, framework, and techniques • Pattern Recognition: contributes pattern extraction and pattern matching techniques • Databases: contributes efficient data storage, data cleansing, and data access techniques • Data Visualization: contributes visual data displays and data exploration • High Performance Comp.: contributes techniques for efficiently handling complexity • Application Domain: contributes domain knowledge

  6. What do you want to learn from your data? KDD approaches • regression • classification • clustering • change/deviation detection • summarization • dependency/assoc. analysis • [Figure: data at the center, surrounded by sample outputs of each approach, e.g. rules "IF a & b & c THEN d & k" and "IF k & a THEN e", and associations "A, B -> C 80%" and "C, D -> A 22%"]

  7. Some Current Analytical Data Mining Research Projects at WPI • Mining Complex Data: Set and Sequence Mining • Systems performance Data • Sleep Data • Financial Data • Web Data • Data Mining for Genetic Analysis • Correlating genetic information with diseases • Predicting gene expression patterns • Data Mining for Electronic Commerce • Collaborative and Content-Based Filtering • Using Association Rules and using Neural Networks

  8. Analyzing Sleep Data • Purpose: • Find associations between sleep patterns and health/pathology • Obtain patterns of different sleep stages (4 sleep + REM + Wake) • Data set: • Clinical (sequential): electro-encephalogram (EEG), electro-oculogram (EOG), electro-myogram (EMG), probe measuring flow of oxygen in blood, etc. • Diagnostic (tabular): questionnaire responses, patient’s demographic info., patient’s medical history (Source: http://www.blsc.com) • Potential rules: • (A) Association Rules • (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy, confidence = 92%, support = 13% • (B) Classification Rules • (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** • => (Race = Caucasian), confidence = 70%, support = 8% • *AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea • WPI, UMass Medical, BC
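The confidence and support figures attached to the association rule on the sleep-data slide can be read as simple ratios over the patient records: support is the fraction of all records that satisfy both sides of the rule, and confidence is the fraction of records satisfying the left-hand side that also satisfy the right-hand side. The sketch below illustrates that reading only. The field names (`sleep_latency_min`, `hereditary_disorder`, `diagnosis`) and the five toy records are hypothetical and are not taken from the study's data.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def support_and_confidence(
    records: List[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    # support: share of all records where both sides of the rule hold
    support = n_both / n_total if n_total else 0.0
    # confidence: share of antecedent-matching records where the consequent holds
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy records, invented for illustration only.
patients: List[Record] = [
    {"sleep_latency_min": 2.0, "hereditary_disorder": True, "diagnosis": "narcolepsy"},
    {"sleep_latency_min": 1.5, "hereditary_disorder": True, "diagnosis": "narcolepsy"},
    {"sleep_latency_min": 2.5, "hereditary_disorder": True, "diagnosis": "none"},
    {"sleep_latency_min": 12.0, "hereditary_disorder": False, "diagnosis": "none"},
    {"sleep_latency_min": 8.0, "hereditary_disorder": True, "diagnosis": "osa"},
]

# (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy
support, confidence = support_and_confidence(
    patients,
    antecedent=lambda r: r["sleep_latency_min"] < 3 and r["hereditary_disorder"],
    consequent=lambda r: r["diagnosis"] == "narcolepsy",
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

On this toy data, three records match the left-hand side and two of those are narcolepsy cases, so the rule covers 2 of 5 records overall and 2 of 3 antecedent matches.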

  9. Input Data • Each instance: [tabular | set | sequential]* attributes • Columns: attr1, attr2, attr3, attr4, attr5, [class] • Example columns: illnesses, heart rate, age, oxygen, gender, Epworth • Rows (patients): P1, P2, P3, …
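One way to picture an instance that mixes tabular, set-valued, and sequential attributes is a record type whose fields carry different container types. The sketch below is illustrative only: the class name `PatientInstance`, the assignment of each column to a kind (for example, treating heart rate and oxygen as sequences and Epworth as the class attribute), and the sample values are all assumptions, not details given on the slide.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass
class PatientInstance:
    """One input row; the kind of each attribute is an assumption for illustration."""
    patient_id: str
    age: int                        # tabular (single value)
    gender: str                     # tabular (single value)
    illnesses: FrozenSet[str]       # set-valued attribute
    heart_rate: List[float]         # sequential attribute (readings over time)
    oxygen: List[float]             # sequential attribute (readings over time)
    epworth: Optional[int] = None   # assumed class attribute; may be absent


def attribute_kinds(instance: PatientInstance) -> Dict[str, str]:
    """Label each attribute as 'tabular', 'set', 'sequential', or 'class'."""
    kinds: Dict[str, str] = {}
    for name, value in vars(instance).items():
        if name == "patient_id":
            continue  # identifier, not an attribute to mine
        if name == "epworth":
            kinds[name] = "class"
        elif isinstance(value, (set, frozenset)):
            kinds[name] = "set"
        elif isinstance(value, list):
            kinds[name] = "sequential"
        else:
            kinds[name] = "tabular"
    return kinds


# Hypothetical sample row.
p1 = PatientInstance(
    patient_id="P1",
    age=54,
    gender="F",
    illnesses=frozenset({"hypertension", "asthma"}),
    heart_rate=[62.0, 64.0, 61.0, 70.0],
    oxygen=[0.97, 0.96, 0.93, 0.95],
    epworth=11,
)
kinds = attribute_kinds(p1)
print(kinds)
```

Keeping the kind of each attribute explicit matters because a mining algorithm would need to treat a set or a sequence differently from a single tabular value.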

  10. Analyzing Financial Data • Sequential data: daily stock values • “Normal” (tabular/relational) data: • sector (computers, agricultural, educational, …), type of government, product releases, company awards, … • Desired rules: • If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases

  11. Events – Financial Data • Basic events: 16 or so financial templates [Little & Rhodes 78] • Difficult pattern matching: alignments and time warping • Example templates: Panic Reversal, Head & Shoulders Reversal, Rounding Top Reversal, Descending Triangle Reversal
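The slide names time warping as part of what makes template matching hard. As a generic illustration, and not necessarily the method used in the project, the sketch below implements classic dynamic time warping (DTW), which scores how well a price series matches a template while allowing either sequence to be stretched along the time axis. The template shapes and series values are hypothetical and are not taken from Little & Rhodes.

```python
from typing import Sequence


def dtw_distance(series: Sequence[float], template: Sequence[float]) -> float:
    """Classic O(n*m) dynamic time warping distance with absolute-difference cost."""
    n, m = len(series), len(template)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning series[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series[i - 1] - template[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # series point repeats a template point
                cost[i][j - 1],      # template point repeats a series point
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical normalized shapes, invented for illustration.
rounding_top = [0.0, 0.5, 0.9, 1.0, 0.9, 0.5, 0.0]
stretched_top = [0.0, 0.0, 0.5, 0.5, 0.9, 1.0, 1.0, 0.9, 0.5, 0.5, 0.0]
steady_decline = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0, 0.0]

print(dtw_distance(stretched_top, rounding_top))
print(dtw_distance(steady_decline, rounding_top))
```

The stretched series contains the same values as the template, only held for longer, so warping aligns them at no cost, while the steady decline cannot be aligned without a penalty. A point-by-point comparison would not even be defined for the first pair, since the two sequences have different lengths.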

  12. Closer Look: WPI Weka • Tool for mining complex temporal/spatial associations

  13. Data Mining for Genetic Analysis • w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC) • SNP analysis • discovering correlations between sequence variations and diseases • Gene expression • discovering patterns that cause a gene to be expressed in a particular cell

  14. Correlating Genetics with Diseases • Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research • Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.

  15. Genomic Data Resources • Wirth, B. et al. Journal of Human Molecular Genetics

  16. Our System: CAGE • To predict gene expression based on DNA sequences. • [Figure: CAGE indicating whether Gene 1, Gene 2, and Gene 3 are On or Off in muscle cells, neural cells, and seam cells]

  17. Grad. & Undergrad. Students • Ali Benamara Dharmesh Thakkar. Senthil K Palanisamy. Zachary Stoecker-Sylvia. Keith A. Pray. Jonathan Freyberger. Maged El-Sayed. Parameshvyas Laxminarayan. Aleksandar Icev. Wendy Kogel. Michael Sao Pedro. Christopher Shoemaker. Weiyang Lin. Jonathan Rudolph Eduardo Paredes Iavor N. Trifonov. Takeshi Kawato Cindy Leung and Sam Holmes. John Baird, Jay Farmer, Rebecca Gougian, Ken Monterio, Paul Young. Zachary Stoecker-Sylvia. Kristin Blitsch, Ben Lucas, Sarah Towey Wendy Kogel, Brooke LeClair, Christopher St. Yves. Brian Murphy, David Phu, Ian Pushee, Frederick Tan. Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano. Christopher Cole. Michael Ciman and John Gulbrandsen. Tara Halwes Christopher Martino. Matthew Berube. Anna Novikov. Amy Kao and Dana Rock.
