
Data Mining and Machine Learning in Population Health Studies

Data Mining and Machine Learning in Population Health Studies. Marina Sokolova, Dept of ECM and School of EECS, University of Ottawa; Institute for Big Data Analytics. Data mining: science and technology that discover new knowledge in large data sets; vast amount of accumulated data.


Presentation Transcript


  1. Data Mining and Machine Learning in Population Health Studies • Marina Sokolova • Dept of ECM and School of EECS, University of Ottawa • Institute for Big Data Analytics

  2. Data Mining • Science and technology that discover new knowledge in large data sets • Vast amount of accumulated data • XXX,XXX,XXX records from health insurance companies in the NY state alone => automated methods • Ever-changing data • New drugs, tests change the problem => adaptive methods • Beyond human processing capacities sokolova@uottawa.ca

  3. Structured Data • Databases, mostly organizational

  4. Unstructured Data • Text • He had an uncomplicated postoperative course and he was transferred. Advanced his diet on postop day #4 to a transitional diet ... • Experts fear that Ebola will mutate and become spreadable via cough or sneeze ... • Images

  5. Privacy Protection • Individuals cannot be uniquely identified from the data set • Mandatory for health data custodians and human subject studies • HIPAA, PHIPA, etc. • Privacy-preserving methods • De-identification, i.e., severing a data set from the identity of the data contributor; the data may still include identifying information that a trusted party could re-link in certain situations • Anonymization, i.e., irreversibly severing a data set from the identity of the data contributor
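
The difference between the two privacy-preserving methods on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field name `patient_id`, the record layout, and both function names are hypothetical, and dropping a direct identifier alone does not guarantee that individuals cannot be re-identified from the remaining fields.

```python
import secrets


def deidentify(records, id_field="patient_id"):
    """Replace the identifier with a random pseudonym.

    Returns the rewritten records plus a key table (pseudonym -> original id)
    that a trusted party could keep in order to re-link records later.
    """
    pseudonym_of = {}  # original id -> pseudonym, so one person keeps one pseudonym
    key_table = {}     # pseudonym -> original id, held only by the trusted party
    output = []
    for record in records:
        original = record[id_field]
        if original not in pseudonym_of:
            pseudonym = secrets.token_hex(8)
            pseudonym_of[original] = pseudonym
            key_table[pseudonym] = original
        output.append({**record, id_field: pseudonym_of[original]})
    return output, key_table


def anonymize(records, id_field="patient_id"):
    """Drop the identifier entirely; no key table exists, so there is no way back."""
    return [{k: v for k, v in record.items() if k != id_field} for record in records]


# Hypothetical toy records, not taken from any real data set.
records = [
    {"patient_id": "A1", "diagnosis": "asthma"},
    {"patient_id": "A1", "diagnosis": "flu"},
    {"patient_id": "B2", "diagnosis": "flu"},
]
deidentified, key_table = deidentify(records)
anonymized = anonymize(records)
```

In the de-identified output the two records of the same person still share a pseudonym, so longitudinal analysis remains possible; in the anonymized output that link is gone.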

  6. Data Mining Process • Step 1: Data pre-processing • Sample selection • Noise reduction • Unstructured to structured transformation • Privacy protection • Step 2: Information processing • Record classification • Clustering • Association rule mining • Step 3: Evaluation • Performance assessment • Result interpretation

  7. Machine Learning • Ability of algorithms to discover properties in previously unseen data, based on known properties found in training data • Algorithmic “muscles” of Data Mining • Common tasks: • Classification of instances • Clustering of instances

  8. More on ML tasks • Classification/supervised learning • An algorithm assigns data items to pre-defined categories (e.g., No, <30, >30) • Categories do not overlap • Binary classification is the most common • There could be more than one category for an item (multi-labelled classification): C + Female + [10-20) • Clustering/unsupervised learning • Grouping data items according to their similarities • Clusters usually do not overlap

  9. Essential Parts of ML • Learning modes • Training and test stages • Model selection (validation and testing, cross-validation, leave-one-out) • Algorithms (e.g., K-NN, Naïve Bayes, Support Vector Machines) • Performance evaluation

  10. Learning Modes • Classification/Supervised • Data items are labelled • One page of a professionally annotated text from a medical domain: $10,000 • 600 personal health records: $1,500 for de-identification and 1-2 months for an experienced Research Assistant to extract relevant information ($4,500 + overhead). Note that we usually need thousands of records! • The most accurate results • Clustering/Unsupervised • Data items are not labelled • Plenty of such data • Hard to evaluate, usually approximate results • Semi-supervised • A mixture of labelled and unlabelled data

  11. Training and Test Stages • Training and test data • Data sets are split into non-overlapping parts • Training sets are usually bigger than test sets • An algorithm is applied to the training set • Its results are verified either automatically (supervised learning) or manually (unsupervised learning) • The algorithm parameters are adjusted depending on the results • The model with the best results is applied to the test set • Errors are counted on the test set only!
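
The split into non-overlapping training and test parts can be sketched as follows. This is a minimal standard-library illustration; the function name, the 20% test fraction, and the fixed seed are assumptions made for the example, not values from the slides.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items and cut them into two non-overlapping parts.

    The training part is the larger one; the test part is set aside and
    used only once, to count errors of the final model.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# 100 hypothetical record ids.
train_ids, test_ids = train_test_split(range(100))
```

Shuffling before cutting matters when records are stored in a meaningful order (for example by admission date), because otherwise the test part would not resemble the training part.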

  12. The Model Selection • Validation and test • Divide the initial set into 3 parts (training, validation, test) • Use 1 part for training and 1 part for validation • Apply on the test part • Cross-validation • Divide the initial set into 5 (10) parts • Use 4 (9) parts for training and 1 part for test • Repeat 5 (10) times for a new set of training and test parts • Leave-one-out • Use all items but one for training • Apply the algorithm on the remaining item • Repeat for all data items
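
Cross-validation and leave-one-out differ only in how many parts the data is cut into, which the following sketch tries to make visible. The function names are hypothetical, and the sketch assumes the items have already been shuffled, since it assigns items to folds by position.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each item lands in exactly one test fold; the other k - 1 folds form
    the training set for that round.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for test_fold in range(k):
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != test_fold
            for index in fold
        ]
        yield train, folds[test_fold]


def leave_one_out_indices(n_items):
    """Leave-one-out is k-fold cross-validation with k equal to the number of items."""
    return k_fold_indices(n_items, k=n_items)


five_fold = list(k_fold_indices(10, k=5))
loo = list(leave_one_out_indices(4))
```

With 10 items and 5 folds, each round trains on 8 items and tests on 2; leave-one-out on 4 items gives 4 rounds, each testing on a single item.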

  13. Algorithms • Probability-based (Naïve Bayes) • Prototype-based (K-NN) • Optimization-based (SVM) • Decision-based (Decision Trees)

  14. Performance Measures • Accuracy = (tp + tn)/(tp + tn + fp + fn) • Precision (Pr) = tp/(tp + fp) • Recall (R) = tp/(tp + fn) • F-score = 2 × Pr × R/(Pr + R)
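
The four measures on this slide can be computed directly from the counts of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn). The sketch below is illustrative: the function name and the example counts are made up, and returning 0.0 when a denominator is zero is one possible convention, not something the slides specify.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a measure with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts: 100 test items, 45 of them truly positive.
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
```

Accuracy uses all four counts, while precision, recall, and F-score ignore tn, which is why they are often preferred when the positive class is rare.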

  15. Accuracy of Disease Diagnostics

  16. New Frontiers: Personal Health Information on the Web • Infodemiology studies the determinants and distribution of health information on the Internet (Gunther Eysenbach, 2004) • Google Trends • BioCaster • 19% to 28.5% of all Internet users participate in online health-related discussions • Growth of the Internet of Things is expected to significantly increase sharing of personal health information • Privacy protection has to be adjusted/re-developed

  17. Personal Health Information • Personal health information (PHI) is information about one’s health discussed by a patient in a clinical setting • PHI is the most vulnerable private information posted online • I have a family history of Alzheimer's disease. I have seen what it does and its sadness is a part of my life. I am already burdened with the knowledge that I am at risk. • We're going for the basic blood tests, the NT scan, and the "Ashkenazi panel" since both XX and I are Jewish from E. European descent. Privacy Protection in Big Data Analytics

  18. Research Questions • Q1. Do people talk about health? • Q2. How do people talk about health? • Q3. What emotions can be found in health discussions?

  19. Challenges of PHI Retrieval (Information Extraction) • General health information: they are promoting cancer awareness particularly lung cancer • Personal health information: I had a rare condition and half of my lung had to be removed • Irrelevant: I saw a guy chasing someone and screaming at the top of his lungs • Terminology: the transfer went well - my RE did it himself which was comforting. 2 embies (grade 1 but slow in development) so I am not holding my breath for a positive • Technical terms: Someone with 50 DB hearing aid gain with a total loss of 70 DB may not know that the place is producing 107 DB since it may not appear too loud to him since he only perceives 47 DB
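
One way to see why these examples are hard is to run a plain keyword search over the first three sentences of this slide. The sketch below is only a demonstration of the problem, not the method used in the study; the function name and the choice of the term "lung" are assumptions for the example.

```python
import re

# The three example sentences from the slide, with the category each illustrates.
SENTENCES = [
    ("general", "they are promoting cancer awareness particularly lung cancer"),
    ("personal", "I had a rare condition and half of my lung had to be removed"),
    ("irrelevant", "I saw a guy chasing someone and screaming at the top of his lungs"),
]


def mentions_term(text, term="lung"):
    """Return True if the term (singular or plural) occurs as a whole word."""
    pattern = rf"\b{re.escape(term)}s?\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


matches = [(category, mentions_term(text)) for category, text in SENTENCES]
```

All three sentences match, so the keyword alone cannot separate personal health information from general or irrelevant mentions; some use of context is needed.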

  20. Challenges of PHI Understanding (Semantic Analysis)

  21. Challenges of Medical Electronic Resources • Electronic medical dictionaries are developed to analyze scientific publications • the Medical Dictionary for Regulatory Activities (MedDRA): 8,561 unique terms/86 PHI terms • the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT): 44,802 unique terms/108 PHI terms

  22. Our Approach • Humans in the loop: manual annotation of data samples (supervised learning) • Advanced methods in data pre-processing • Sentence splitting, tokenization, part-of-speech tagging, lemmatization for nouns and verbs • PHI resource building (e.g., ontology of PHI terms, HealthAffect lexicon) • Use of robust algorithms • Naïve Bayes • Appropriate evaluation methods • fn estimation

  23. Data Sources • Online medical forums • IVF • Hearing loss • Newborn screening for rare diseases • Social networks • MySpace • Twitter • Facebook

  24. Q1. Do people talk about health? • In 1,000 randomly selected tweet threads, 15% of threads revealed personal health information • In 11,800 randomly selected MySpace posts, 6% of posts discussed personal health • On IVF forums, participants (95% women) mostly talk about health

  25. Q1: It all depends on the context • On HL forums, participants talk about health and quality of life/lifestyle • On forums about newborn screening for rare diseases, parents often discuss privacy and physical hurt; at the same time, they seldom talk about health • In a student network on Facebook, participants do NOT talk about health

  26. Q2: How people talk about health • Simple language • For me the laser treatment had unpleasant side-effects. • …got a huge bump on my forehead, fractured my nose. • Basic concepts • Concussion, thyroid, asthma, fracture, hypothermia • Cold, flu, injury, headache • Exception: Hearing Loss discussions involve more specific terms than other discussions

  27. Q3. What emotions can be found in health discussions? • Range of emotions depends on the content of health issues • Positive/negative/neutral on Twitter and HL forums • Gratitude, encouragement, endorsement, confusion on IVF forums • Strength of emotional disclosure varies • Outspoken emotional posts on newborn screening and IVF • Muted emotions on MySpace

  28. Performance Evaluation • We detect PHI: • False negatives on social networks (11,800 messages): 0.003 (baseline 0.031) • False negatives on peer-to-peer networks (2,300 documents): 0.000 (baseline 0.031) • We recognize PHI: • Precision on Twitter (1,000 threads): 0.770 (baseline 0.419) • We identify PHI-related opinions: • F-score on HL forums (3,515 sentences): 0.685 (baseline 0.584)

  29. Data Sets Used in Population Health Studies • Indian Liver Patient Dataset http://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29 • Breast Cancer Wisconsin (Diagnostic) Data Set http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 • Haberman's Survival Data Set (breast cancer, 1999) http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival • Many more http://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=life&numAtt=&numIns=&type=&sort=nameUp&view=table

  30. Useful links • Weka 3: Data Mining Software (open source) http://www.cs.waikato.ac.nz/ml/weka/ • Support Vector Machine (open source) http://svmlight.joachims.org/ • Andrew Ng’s (Stanford) web site with video lectures on ML http://www.academicearth.org/courses/machine-learning • Benchmark data sets repository http://archive.ics.uci.edu/ml/

  31. Thank you! Questions? sokolova@uottawa.ca

  32. Probability-based: Naïve Bayes • Assumes that all the informative features are independent of each other (given the class) AND identically distributed • Both assumptions are generally not true
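
The independence assumption is what lets Naïve Bayes score a class by simply multiplying (or, in log space, adding) one term per feature. The sketch below is a small multinomial variant over word tokens with add-one smoothing, chosen here for illustration; the slides do not say which variant was used, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count classes and per-class tokens from (tokens, label) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log prior plus summed log likelihoods."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in tokens:
            # Independence assumption: each token contributes its own term.
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            count = token_counts[label][token]
            score += math.log((count + 1) / (total_tokens + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, far too small for real use.
training = [
    ("i had a fracture".split(), "personal"),
    ("my asthma got worse".split(), "personal"),
    ("clinic promotes cancer awareness".split(), "general"),
    ("flu season awareness campaign".split(), "general"),
]
model = train_naive_bayes(training)
first = predict(model, "my fracture".split())
second = predict(model, "awareness campaign".split())
```

On this toy data the first query is labelled "personal" and the second "general", because each query shares tokens only with that class.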

  33. Being Optimistic Does not Hurt • Naïve Bayes can outperform sophisticated classifiers!

  34. Prototype-based: K-nearest neighbor • Uses the observations in the training set T closest in the input space to the entry x to form a conclusion Y • Y can be a predicted class label of x • Useful in practical applications

  35. A closer look at K neighbors • Labels for the test example: 2-NN: Green; 3-NN: Green; 4-NN: Ambiguous; 5-NN: Red; 6-NN: Red; 7-NN: Red
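
How the predicted label can flip as K grows is easy to reproduce with a majority vote over the K nearest training points. The sketch below uses made-up one-dimensional points arranged to give the same sequence of labels as this slide; they are not the points from the original figure, and the function name and the tie rule (report a tie as ambiguous) are assumptions.

```python
import math
from collections import Counter


def knn_label(training, x, k):
    """Majority label among the k training points closest to x.

    Returns None when the two most common labels are tied (ambiguous vote).
    """
    neighbours = sorted(training, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None
    return votes[0][0]


# Hypothetical points: two green neighbours are nearest, red ones lie further out.
training = [
    ((1.0,), "green"),
    ((2.0,), "green"),
    ((3.0,), "red"),
    ((4.0,), "red"),
    ((5.0,), "red"),
    ((6.0,), "red"),
    ((7.0,), "red"),
]
test_point = (0.0,)
labels_by_k = {k: knn_label(training, test_point, k) for k in range(2, 8)}
```

Here `labels_by_k` maps 2 and 3 to "green", 4 to `None` (a 2-2 tie), and 5 through 7 to "red", which is why the number of neighbours has to be chosen with care.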

  36. Good/bad things about KNN • Only two adjustable parameters: • Number of neighbors • Closeness (i.e., distance between neighbors) • The output is easy to understand • Highly depends on the training data, population sample

  37. Optimization-based algorithms: Support Vector Machines • Highly accurate classifiers • Extremely popular for publications • Seldom used in practice

  38. Support Vector Machines Labels for the test example: • Hyper-planes in action: • various dimensions • linear hyper-planes differ by soft margins

  39. Good/bad things about SVM • Several adjustable parameters • Dimensions of discriminative hyper-planes • Kernel functions • Soft-margin • Every parameter matters • Almost a random choice

  40. Decision-based algorithms • Decision Trees • Decision Lists • Can beat SVM when efficiency is as important as effectiveness!
