Literature Mining BMI 730

Presentation Transcript


  1. Literature Mining BMI 730 Kun Huang Department of Biomedical Informatics Ohio State University

  2. Announcement • HW #3 is cancelled. The grades will be adjusted accordingly.

  3. Acknowledgement Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  4. Acknowledgement • Dr. Hongyu Peng (Brandeis Univ.) and Dr. Hagit Shatkay (http://www.shatkay.org) provided part of the slides.

  5. Connecting the dots • Story of Thalidomide (from sedative to birth defects to anti-cancer drug)

  6. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  7. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  8. Information Retrieval (IR) • Finding the papers • IR systems aim to identify the text segments (be they full articles, abstracts, paragraphs or sentences) that pertain to a certain topic (e.g., yeast cell cycle). • E.g., PubMed, Google Scholar • Ad hoc IR • Text categorization (pre-defined set of papers) • Advanced systems integrate entity recognition

  9. Ad Hoc IR • User provides a query • Boolean model • Index-based (e.g., “gene AND CD”)

  10. Boolean Queries DB: Database of documents. Vocabulary: {t1,…,tM} (terms in DB, produced by the tokenization stage). Index structure: a term → all the documents containing it. (Figure: an index mapping terms such as “acquired immunodeficiency”, “asthma”, “blood”, “blood pressure” to the documents in the database.)
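
A minimal sketch of such an inverted index with a Boolean AND query in Python (the toy documents and terms below are illustrative, not from a real corpus):

from collections import defaultdict

# Toy corpus: document id -> text
docs = {
    1: "acquired immunodeficiency and blood pressure",
    2: "asthma and blood pressure",
    3: "gene expression in asthma",
}

# Build the inverted index: term -> set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():      # naive tokenization
        index[term].add(doc_id)

# Boolean AND: intersect the posting sets of the query terms
def boolean_and(*terms):
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("blood", "pressure"))    # {1, 2}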

  11. Ad Hoc IR • User provides a query • Boolean model • Challenges • Synonymy (AGP1, a.k.a. Amino Acid Permease 1) • Polysemy – ‘CD’ (54,745 PubMed entries) can mean cytosine deaminase, cortical dysplasia, compact disk, Crohn‘s disease, Chagas' disease, capillary density, …

  12. Ad Hoc IR • User provides a query • Vector-based model • Similarity query (vector-based) • Semantic search – TIME (Sept 5, 2005): “Search engines are good at matching words … The next step is semantic search – looking for meaning, not just matching key words. … Nervana, which analyzes language by linking word patterns contextually to answer questions in defined subject areas, such as medical-research literature.”

  13. The Vector Model DB: Database of documents. Vocabulary: {t1,…,tM} (terms in DB). Document d ∈ DB: a vector ⟨w1d,…,wMd⟩ of weights. Weighting Principles • Document frequency: Terms occurring in a few documents are more useful than terms occurring in many. • Local term frequency: Terms occurring frequently within a document are likely to be significant for the document. • Document length: A term occurring the same # of times in a long document and in a short one has less significance in the long one. • Relevance: Terms occurring in documents judged as relevant to a query are likely to be significant (w.r.t. the query). [Sparck Jones et al. 98]

  14. Some Weighting Schemes: • Binary: wid = 1 if ti ∈ d, 0 otherwise. • TF: wid = fid = # of times ti occurs in d (considers local term frequency). • TF × IDF (one version…): wid = fid / fi, where fi = # of docs containing ti (considers local term frequency and (inverse) document frequency).
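
A sketch of the three schemes in Python, reading the slide’s TF × IDF variant as wid = fid / fi (toy counts, purely illustrative):

# Toy term-frequency table: fid = # of times term ti occurs in document d
tf = {"hepatitis": {"d1": 5, "d2": 0, "d3": 1},
      "the":       {"d1": 9, "d2": 7, "d3": 8}}

def binary_w(t, d):
    return 1 if tf[t][d] > 0 else 0

def tf_w(t, d):
    return tf[t][d]

def tfidf_w(t, d):
    fi = sum(1 for f in tf[t].values() if f > 0)   # document frequency fi
    return tf[t][d] / fi   # the slide's version; tf * log(N/fi) is also common

print(tf_w("the", "d1"), tfidf_w("the", "d1"))              # 9 vs. 3.0
print(tf_w("hepatitis", "d1"), tfidf_w("hepatitis", "d1"))  # 5 vs. 2.5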

  15. Vector-Based Similarity Document d = ⟨w1d,…,wMd⟩ ∈ DB; query q = ⟨w1q,…,wMq⟩ (q could itself be a document in DB...). Sim(q, d) = cosine(q, d) = (q · d) / (|q| |d|) [Salton 89, Witten et al. 99] Introductory IR.
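
A direct implementation of the cosine score over weight vectors (a minimal sketch; the example vectors are arbitrary):

import math

def cosine(q, d):
    # Sim(q, d) = (q . d) / (|q| |d|)
    dot = sum(qi * di for qi, di in zip(q, d))
    norms = (math.sqrt(sum(qi * qi for qi in q))
             * math.sqrt(sum(di * di for di in d)))
    return dot / norms if norms else 0.0

print(cosine([1, 0, 2], [2, 1, 2]))   # ~0.894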

  16. Probabilistic Models Query q; document d. • Goal: Find all d’s such that P(relevant | d, q) is high. Maximize the log-odds: log[ P(relevant | d, q) / P(irrelevant | d, q) ] [Sparck Jones et al. 98, Sahami 98, Ponte & Croft 98, Hoffman 99]
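
The slide does not commit to a particular model; as one concrete illustration, a Naive-Bayes-style estimate of the log-odds with term independence, uniform priors, and made-up per-term probabilities:

import math

# Hypothetical per-term probabilities under the two classes
p_rel = {"gene": 0.08, "cell": 0.05, "stock": 0.001}
p_irr = {"gene": 0.001, "cell": 0.01, "stock": 0.05}

def log_odds(doc_terms):
    # log[ P(relevant | d, q) / P(irrelevant | d, q) ] under the
    # Naive Bayes assumption; unknown terms are simply skipped
    return sum(math.log(p_rel[t] / p_irr[t]) for t in doc_terms if t in p_rel)

print(log_odds(["gene", "cell"]))   # positive => rank as likely relevant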

  17. Latent Semantic Analysis [Dumais, Deerwester et al. 1988, 1990] • Motivation: • Overcoming synonymy and polysemy. • Reducing dimensionality. • Idea: • Project from the explicit “term” space to a lower-dimensional, abstract “concept” space. • Methodology: • Truncated SVD (closely related to PCA) applied to the document-term matrix. • The components with the highest singular values are used as the features for representing documents.
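
A minimal numpy sketch of the projection via truncated SVD (toy document-term matrix; real LSA pipelines add term weighting and normalization):

import numpy as np

# Toy document-term matrix: rows = documents, columns = terms
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 3, 1],
              [0, 1, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                           # keep the k largest singular values
docs_k = U[:, :k] * s[:k]       # documents in the k-dim "concept" space

print(docs_k)                   # each row: a document as a concept-space point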

  18. Information Retrieval – Details (cont.) Text Categorization (semantic): automatically place documents in the right categories (e.g., Cancer, Apoptosis, Elongation) so as to make them easy to find.

  19. Information Retrieval – Details (cont.) Rule-Based Text Classification A knowledge-engineering approach. Boolean rules (DNF), based on the presence/absence of specific terms within the document, decide its membership in the class (e.g., the CONSTRUE system [Hayes et al. 90, 92]). Example: If ((<GENE_Name> ∧ transcript) ∨ (<GENE_Name> ∧ Western Blot) ∨ (<GENE_Name> ∧ Northern Blot)) Then GeneExpressionDoc Else ¬GeneExpressionDoc
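
The slide’s DNF rule written directly as a Python predicate (term matching here is naive case-insensitive substring search; a real system would use proper tokenization and gene-name recognition):

def is_gene_expression_doc(text, gene_names):
    # (gene AND transcript) OR (gene AND western blot) OR (gene AND northern blot)
    t = text.lower()
    has_gene = any(g.lower() in t for g in gene_names)
    return has_gene and ("transcript" in t
                         or "western blot" in t
                         or "northern blot" in t)

print(is_gene_expression_doc("CDC28 transcript levels rose...", ["CDC28"]))  # True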

  20. Information Retrieval-Details(cont.) Machine Learning for Text Classification (supervised) • Take a training set of pre-classified documents • Build a model for the classes from the training examples • Assign each new document to the class that best fits it • (e.g. closest or most-probable class.) • Types of class assignment: • Hard: Each document belongs to exactly one class • Soft: Each document is assigned a “degree of membership” in several classes • Methods • Nearest neighbor • Summarizing document vectors • SVM, Bayesian, boosting
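
As one example of “summarizing document vectors” with hard class assignment, a minimal nearest-centroid classifier (toy vectors and labels):

import numpy as np

def nearest_centroid(train_vecs, train_labels, new_vec):
    # Summarize each class by the mean of its training vectors,
    # then assign the new document to the closest centroid
    labels = sorted(set(train_labels))
    centroids = {c: np.mean([v for v, l in zip(train_vecs, train_labels)
                             if l == c], axis=0) for c in labels}
    return min(labels, key=lambda c: np.linalg.norm(new_vec - centroids[c]))

X = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
y = ["cancer", "cancer", "apoptosis"]
print(nearest_centroid(X, y, np.array([0.8, 0.2])))   # "cancer"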

  21. Evaluating Extraction and Retrieval • To say how good a system is we need: • Performance metrics (numerical measures) • Benchmarks, on which performance is measured (the gold-standard).

  22. Evaluating Extraction and Retrieval (cont.) Performance Metrics N items (e.g. documents, terms or sentences) in the collection. REL: Relevant items in the collection; these SHOULD be extracted or retrieved. RETR: Items that are actually extracted/retrieved – some correctly (A = |REL ∩ RETR|), some incorrectly (B = |RETR – REL|); |RETR| = A + B.

  23. Evaluating Extraction and Retrieval (cont.) Performance Metrics (cont.) (Venn diagram over the collection:) |Collection| = N; |REL ∩ RETR| = A; |RETR – REL| = B; |NotREL – RETR| = C; |REL – RETR| = D.

  24. Performance Metrics (cont.) Precision: P = A/(A+B) – how many of the retrieved/extracted items are correct. Recall: R = A/(A+D) – how many of the items that should be retrieved are recovered. Accuracy: (A+C)/N (ratio of correctly classified items). Combination Scores: F-score: 2PR/(P+R) – harmonic mean, in the range [0,1]. Fβ-score: (1+β²)PR/(β²·P + R) – β > 1 prefers recall, β < 1 prefers precision. E-measure: 1 – Fβ-score – inversely proportional to performance (an error measure).
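
The same metrics as code, with REL and RETR as Python sets (toy item ids):

def precision(rel, retr):
    return len(rel & retr) / len(retr)        # A / (A + B)

def recall(rel, retr):
    return len(rel & retr) / len(rel)         # A / (A + D)

def f_beta(p, r, beta=1.0):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

rel, retr = {1, 2, 3, 4}, {2, 4, 5}
p, r = precision(rel, retr), recall(rel, retr)
print(p, r, f_beta(p, r))                     # 0.667 0.5 0.571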

  25. Performance Metrics (cont.) Precision-Recall Curves: 4 relevant documents in the collection; 7 retrieved and ranked. (Figure: precision-recall curve over the 7 ranks; recall steps through 25%, 50%, 75%, 100% as each relevant document is found.)

  26. Performance Metrics (cont.) Accounting for Ranks For a given rank n, Pn: precision at rank n (P@n). R-Precision: PR, where R is the number of relevant documents. Average Scores Average Precision: average of the precision over all the ranks at which a relevant document is retrieved. Mean Average Precision: mean of the Average Precision over all the queries. Micro-Average: average over individual items across queries. Macro-Average: average over queries.
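
A sketch of Average Precision, using the 4-relevant/7-retrieved setup of slide 25 (the document ids are hypothetical):

def average_precision(ranked, relevant):
    # Sum P@n at each rank n where a relevant document appears,
    # divided by the total number of relevant documents
    hits, total = 0, 0.0
    for n, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / n          # P@n at this hit
    return total / len(relevant)

ranked = ["d3", "d7", "d1", "d5", "d2", "d9", "d4"]   # 7 retrieved, ranked
relevant = {"d3", "d1", "d2", "d4"}                    # 4 relevant documents
print(average_precision(ranked, relevant))             # ~0.71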

  27. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  28. Entity Recognition (ER) • Identifying the substance(s) • Rule- and context-based approach (manual) – e.g., ‘-ase’ for enzymes • Rule- and context-based approach (machine learning) • Dictionary-based approach • How the names are written – CDC28, cdc28, cdc28p, cdc-28 • Curation of the dictionary
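
Two toy rules in this spirit: the slide’s ‘-ase’ suffix cue for enzymes, plus a hypothetical normalizer that collapses the CDC28 spelling variants (real ER systems combine many such cues with dictionaries and machine learning):

import re

def find_enzyme_mentions(text):
    # Contextual rule of thumb: tokens ending in '-ase' are often enzymes
    return re.findall(r"\b[A-Za-z]+ase\b", text)

def normalize_gene_name(token):
    # Toy normalizer: drop hyphens and a trailing 'p' (protein suffix)
    return re.sub(r"-|p$", "", token.lower())

print(find_enzyme_mentions("The kinase phosphorylates a deaminase."))
print({normalize_gene_name(t) for t in ["CDC28", "cdc28", "cdc28p", "cdc-28"]})
# ['kinase', 'deaminase'] and {'cdc28'}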

  29. Entity Recognition (ER) • Major challenge: lack of standardization of names • ‘cdc2’ refers to two completely unrelated genes in budding and fission yeast • ‘SDS’ – serine dehydratase gene vs. sodium dodecyl sulfate vs. Shwachman-Diamond syndrome • Synonymy (AGP1, a.k.a. Amino Acid Permease 1) • Polysemy

  30. Entity Recognition (ER) • A simpler version of the problem – deciding whether a symbol refers to a gene or to its product • iHOP (Information Hyperlinked Over Proteins) http://www.pdg.cnb.uam.es/UniPub/iHOP

  31. Vocabulary • Many, many vocabularies exist • SNOMED, ICD, … • ICD (International Statistical Classification of Diseases and Related Health Problems)

  32. Vocabulary • ICD
     573.3 Hepatitis, unspecified – Toxic (noninfectious) hepatitis. Use additional E code to identify cause.
     571.4 Chronic hepatitis – Excludes: viral hepatitis (acute) (chronic) (070.0-070.9)
       571.49 Other – Chronic hepatitis: active, aggressive; recurrent hepatitis
     070 Viral hepatitis – Includes: viral hepatitis (acute) (chronic). Excludes: cytomegalic inclusion virus hepatitis (078.5)

  33. Unified Medical Language System (UMLS)

  34. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  35. Information Extraction (IE) • Extract pre-defined types of facts – in particular, relationships between biological entities • Co-occurrence-based methods • Natural language processing (NLP) based methods

  36. Information Extraction Usually this requires: • Identifying the relevant sentences • Parsing them to extract specific information • Assuming “well-behaved” fact sentences • Using co-occurrence relationships alone does not require parsing or well-formed fact sentences
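
A sketch of the co-occurrence method: after entity recognition, entity pairs appearing in the same sentence are counted as candidate relationships, with no parsing involved (toy sentences and entity list):

import itertools, re
from collections import Counter

sentences = [
    "CDC28 interacts with CLN2 during G1 phase.",
    "CLN2 and CLB5 are regulated by CDC28.",
]
entities = {"CDC28", "CLN2", "CLB5"}   # assume ER has already been run

pairs = Counter()
for sent in sentences:
    found = {e for e in entities if re.search(r"\b%s\b" % e, sent)}
    for a, b in itertools.combinations(sorted(found), 2):
        pairs[(a, b)] += 1

print(pairs.most_common())   # frequent pairs => candidate relations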

  37. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  38. Text Mining (TM) • The discovery by computer of new, previously unknown information, by automatically extracting information from different written records.

  39. Text Mining • Based on transitivity of relationships in the co-occurrence graph • This idea can be used to discover new facts by co-occurrence • Classic example: fish oil reduces (and co-occurs with) blood viscosity, platelet aggregability and vascular reactivity, which are increased in (and co-occur with) Raynaud’s syndrome – hence the hypothesis that fish oil can reduce Raynaud’s syndrome • Web tool: Arrowsmith [Swanson 86, Swanson 87, Swanson 90, Swanson and Smalheiser 99, Weeber et al. 2001, Stapley & Benoit 2000, Srinivasan 2003, Srinivasan 2004]
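
A sketch of the transitivity idea on a toy co-occurrence graph encoding the fish-oil example (the edge lists are illustrative, not mined from real literature):

# A -> B and B -> C co-occurrence edges suggest a hidden A -> C link
edges = {
    "fish oil": {"blood viscosity", "platelet aggregability",
                 "vascular reactivity"},
    "blood viscosity": {"Raynaud's syndrome"},
    "platelet aggregability": {"Raynaud's syndrome"},
    "vascular reactivity": {"Raynaud's syndrome"},
}

def transitive_candidates(a):
    direct = edges.get(a, set())
    hidden = set()
    for b in direct:
        hidden |= edges.get(b, set()) - direct - {a}
    return hidden          # C terms never directly linked to A

print(transitive_candidates("fish oil"))   # {"Raynaud's syndrome"}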

  40. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  41. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  42. Integration: combining text and biological data Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768

  43. Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
