
Finding biologically relevant information using ADIOS

ThaiBinh’s final project for CBB545, April 19, 2007.


Presentation Transcript


  1. Finding biologically relevant information using ADIOS
     ThaiBinh’s final project for CBB545: April 19, 2007

  2. The current state of affairs in natural language processing
     • NLP: Converting human language into representations that are easier for computers to understand
     • Most natural language processing requires a tagged training set
     • Tagging = time consuming/costly
     Source: http://en.wikipedia.org/wiki/Natural_language_processing

  3. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42

  5. ADIOS
     • “Unsupervised learning of natural languages”
     • ADIOS: Automatic distillation of structure
     • Input: A corpus of characters (most likely, untagged sentences)
     • Output: A grammar
     Reference: “Unsupervised learning of natural languages”, Solan, et al., PNAS vol. 102, August 2005.

  6. A very quick primer on grammars
     • A set of “rules” for making a “sentence”
     • Ex. The grammar:
         S → S + S
         S → 1
         S → a
       A possible derivation:
         S ⇒ S + S ⇒ S + S + S ⇒ 1 + S + S ⇒ 1 + 1 + S ⇒ 1 + 1 + a

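  7. A very quick primer on grammars
     • We can visualize the expansion as a tree, and read the leaves
       Tree: the root S expands to S + S; the left S becomes 1; the right S expands to S + S, whose two S nodes become 1 and a. Read left to right, the leaves spell 1 + 1 + a.
     The grammar:
         S → S + S
         S → 1
         S → a
     A possible derivation:
         S ⇒ S + S ⇒ S + S + S ⇒ 1 + S + S ⇒ 1 + 1 + S ⇒ 1 + 1 + a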
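
The derivation on this slide can be replayed mechanically. The following is a minimal Python sketch, not part of the original talk; the names `RULES`, `apply_rule`, `derive`, and `STEPS` are illustrative assumptions. Each step names the position of the S to rewrite and which of the three rules to use.

```python
# Toy grammar from the slide: S -> S + S | 1 | a
RULES = {
    "S": [["S", "+", "S"], ["1"], ["a"]],
}


def apply_rule(form, position, rule_index):
    """Rewrite the nonterminal at `position` using the chosen rule."""
    symbol = form[position]
    replacement = RULES[symbol][rule_index]
    return form[:position] + replacement + form[position + 1:]


def derive(steps, start="S"):
    """Apply (position, rule_index) steps in order; return every sentential form."""
    form = [start]
    history = [" ".join(form)]
    for position, rule_index in steps:
        form = apply_rule(form, position, rule_index)
        history.append(" ".join(form))
    return history


# Steps chosen to reproduce the slide's derivation (the right-hand S is
# expanded a second time, matching the tree).
STEPS = [(0, 0), (2, 0), (0, 1), (2, 1), (4, 2)]
DERIVATION = derive(STEPS)

for line in DERIVATION:
    print(line)
```

Reading the last entry of `DERIVATION` corresponds to reading the leaves of the tree from left to right.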

  9. ADIOS
     • The system builds a graph using the first sentence
     • With each successive sentence, it tries to find overlapping “subpaths” (patterns)

  10. ADIOS
     • It also tries to generalize the path by looking for equivalence classes
     • It searches for patterns and equivalence classes until no new ones are found
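
To make "overlapping subpaths" and "equivalence classes" concrete, here is a deliberately simplified Python sketch. It is not the ADIOS algorithm (the slides do not describe how ADIOS decides that a pattern is significant); it only assumes that a subpath is a word sequence shared by more than one sentence and that an equivalence-class candidate is a set of words seen between the same two neighbours. The function names and the four sample sentences are illustrative, and `*` and `#` mark sentence start and end.

```python
from collections import defaultdict

SENTENCES = [
    "ThaiBinh has a presentation today",
    "Hugo has a presentation today",
    "Laura has a presentation next Thursday",
    "Chong had a presentation on Tuesday",
]


def shared_subpaths(sentences, length):
    """Return word sequences of `length` that occur in more than one sentence."""
    seen_in = defaultdict(set)
    for idx, sentence in enumerate(sentences):
        words = sentence.split()
        for i in range(len(words) - length + 1):
            seen_in[tuple(words[i:i + length])].add(idx)
    return {path for path, ids in seen_in.items() if len(ids) > 1}


def equivalence_candidates(sentences):
    """Group words that fill the same slot between identical neighbours."""
    slots = defaultdict(set)
    for sentence in sentences:
        words = ["*"] + sentence.split() + ["#"]
        for left, word, right in zip(words, words[1:], words[2:]):
            slots[(left, right)].add(word)
    # Keep only slots that more than one word can fill.
    return {ctx: ws for ctx, ws in slots.items() if len(ws) > 1}


print(shared_subpaths(SENTENCES, 3))
print(equivalence_candidates(SENTENCES))
```

On these four sentences the only slot with more than one filler is the one between the sentence start and "has", which holds Hugo, Laura, and ThaiBinh.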

  11. ADIOS: A quick example
     • Input a corpus of sentences
       * Chong had a presentation in CBB545 #
       * on Tuesday Chong had a presentation #
       * next Thursday Laura has a presentation #
       * ThaiBinh has a presentation in CBB545 #
       * ThaiBinh has a presentation today #
       * today ThaiBinh has a presentation #
       * Chong had a presentation #
       * Hugo has a presentation in CBB545 today #
       * ThaiBinh has a presentation in CBB545 today #
       * Laura has a presentation in CBB545 next Thursday #
       * Hugo has a presentation today #
       * Chong had a presentation on Tuesday #
       * Chong had a presentation in CBB545 on Tuesday #
       * Laura has a presentation next Thursday #
       * in CBB545 ThaiBinh has a presentation #
       * today ThaiBinh has a presentation in CBB545 #

  12. ADIOS: A quick example
     • Output is a grammar (P = pattern, an ordered sequence; E = equivalence class, a set of interchangeable items)
       P18 (a,presentation)
       P19 (E20,has,P18)
       E20 {Hugo,Laura,ThaiBinh}
       P21 (Chong,had)
       P22 (in,CBB545)
       P23 (P19,P22)
       P24 (P21,P18)

  14. P18 (a,presentation) P19 (E20,has,P18) E20 {Hugo,Laura,ThaiBinh} P21 (Chong,had) P22 (in,CBB545) P23 (P19,P22) P24 (P21,P18)
      Expanding P23:
        P23 = (P19,P22)
        P19 = (E20,has,P18) and P22 = (in,CBB545)
        E20 = {Hugo,Laura,ThaiBinh} and P18 = (a,presentation)

  15. P18 (a,presentation) P19 (E20,has,P18) E20 {Hugo,Laura,ThaiBinh} P21 (Chong,had) P22 (in,CBB545) P23 (P19,P22) P24 (P21,P18)
      Expanding P24:
        P24 = (P21,P18)
        P21 = (Chong,had) and P18 = (a,presentation)

  16. P18 (Chong,had,a)
      P19 (has,a)
      P20 (E21,P19,presentation)
      E21 {Hugo,Laura,ThaiBinh}
      P22 (in,CBB545)
      P23 (P20,P22)
      P24 (P18,presentation)

  17. Two different grammars: Same end result
      Grammar 1:
        P18 (Chong,had,a)
        P19 (has,a)
        P20 (E21,P19,presentation)
        E21 {Hugo,Laura,ThaiBinh}
        P22 (in,CBB545)
        P23 (P20,P22)
        P24 (P18,presentation)
      Grammar 2:
        P18 (a,presentation)
        P19 (E20,has,P18)
        E20 {Hugo,Laura,ThaiBinh}
        P21 (Chong,had)
        P22 (in,CBB545)
        P23 (P19,P22)
        P24 (P21,P18)
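
The "same end result" claim can be checked by expanding both grammars into the word strings they generate. The Python sketch below is an illustration under assumptions: the dictionary encoding (`"seq"` for a P pattern, `"alt"` for an E equivalence class) and the names `GRAMMAR_A`, `GRAMMAR_B`, and `expand` are invented here and are not ADIOS's own output format.

```python
from itertools import product

# Grammar with P18 = (a,presentation), as listed on the slides.
GRAMMAR_A = {
    "P18": ("seq", ["a", "presentation"]),
    "P19": ("seq", ["E20", "has", "P18"]),
    "E20": ("alt", ["Hugo", "Laura", "ThaiBinh"]),
    "P21": ("seq", ["Chong", "had"]),
    "P22": ("seq", ["in", "CBB545"]),
    "P23": ("seq", ["P19", "P22"]),
    "P24": ("seq", ["P21", "P18"]),
}

# Grammar with P18 = (Chong,had,a), as listed on the slides.
GRAMMAR_B = {
    "P18": ("seq", ["Chong", "had", "a"]),
    "P19": ("seq", ["has", "a"]),
    "P20": ("seq", ["E21", "P19", "presentation"]),
    "E21": ("alt", ["Hugo", "Laura", "ThaiBinh"]),
    "P22": ("seq", ["in", "CBB545"]),
    "P23": ("seq", ["P20", "P22"]),
    "P24": ("seq", ["P18", "presentation"]),
}


def expand(symbol, grammar):
    """Return the set of word strings that `symbol` can generate."""
    if symbol not in grammar:
        return {symbol}  # a plain word
    kind, parts = grammar[symbol]
    if kind == "alt":
        # Equivalence class: any one member may appear.
        return set().union(*(expand(p, grammar) for p in parts))
    # Pattern: concatenate one expansion of each part, in order.
    choices = [sorted(expand(p, grammar)) for p in parts]
    return {" ".join(combo) for combo in product(*choices)}


for name in ("P23", "P24"):
    print(name, expand(name, GRAMMAR_A) == expand(name, GRAMMAR_B))
```

The same `expand` function gives a crude way to generate sentences from a rule, and a membership check such as `"Hugo has a presentation in CBB545" in expand("P23", GRAMMAR_A)` tests whether a new sentence fits that rule.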

  18. ADIOS
     • Able to generate sentences using the grammar it created
     • Can test if a new sentence fits one of the grammar rules
     • Can be applied to a wide variety of domains
       • Bible in various languages
       • Classify protein function based on amino acid sequence

  19. The Project
     • Use ADIOS to create grammar rules from biomedical sentences
     • Look for gene-gene associations
     • Look for gene-disease associations
     • Infer information about a pair of genes in an unseen sentence based on its sentence structure (pattern)

  20. ABNER: Find mentions of genes

  21. MetaMap: Find mentions of diseases
      “The clinical effects of cortisone and ACTH (adrenocorticotropic hormone) in the collagen diseases: acute disseminated lupus erythematosus, periarteritis nodosa, dermatomyositis and scleroderma; interim report.”
      Phrase: "in the collagen diseases"
      Meta Candidates (6)
        1000 C0009326:Collagen Diseases [Disease or Syndrome]
      Phrase: "periarteritis nodosa,"
      Meta Candidates (4)
        1000 C0031036:Periarteritis Nodosa (Polyarteritis Nodosa) [Disease or Syndrome]
      Phrase: "dermatomyositis"
      Meta Candidates (2)
        1000 C0011633:Dermatomyositis [Disease or Syndrome]
        1000 C0221056:Dermatomyositis (Dermatomyositis, Adult Type) [Disease or Syndrome]
      Phrase: "scleroderma"
      Meta Candidates (4)
        1000 C0011644:Scleroderma [Disease or Syndrome]
        1000 C0036421:Scleroderma (Systemic Scleroderma) [Disease or Syndrome]

  22. The Project: Input
     • Replace any mention of a gene with a generic term
     • Ex.
       Smad7 antagonizes TGF-{beta} signaling in the nucleus
         → GeneOne antagonizes GeneTwo signaling in the nucleus
       PTEN negatively regulates expression of cyclin D1
         → GeneOne negatively regulates expression of GeneTwo

  23. The Project: Input
     • Replace any mention of a gene/disease with a generic term
     • Ex.
       p16 is consistently expressed in endometrial tubal metaplasia
         → GeneOne is consistently expressed in DiseaseOne
       The expression of cyclin D1 is more often correlated with prognosis in cancers of ampulla of vater
         → The expression of GeneOne is more often correlated with prognosis in DiseaseOne
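
The substitution step might look like the following Python sketch. It assumes the gene and disease mentions have already been found (on the slides, by ABNER and MetaMap) and are passed in as plain strings; it does not call either tool, and the `generalize` function and `ORDINALS` list are hypothetical names introduced for illustration.

```python
ORDINALS = ["One", "Two", "Three", "Four"]


def generalize(sentence, genes=(), diseases=()):
    """Replace each listed mention with GeneOne, GeneTwo, ... or DiseaseOne, ..."""
    mentions = [(m, "Gene") for m in genes] + [(m, "Disease") for m in diseases]
    # Number the placeholders in the order the mentions appear in the sentence.
    # A mention that is absent from the sentence raises ValueError here.
    mentions.sort(key=lambda item: sentence.index(item[0]))
    counts = {"Gene": 0, "Disease": 0}
    for mention, kind in mentions:
        placeholder = kind + ORDINALS[counts[kind]]
        counts[kind] += 1
        sentence = sentence.replace(mention, placeholder, 1)
    return sentence


print(generalize(
    "Smad7 antagonizes TGF-{beta} signaling in the nucleus",
    genes=["Smad7", "TGF-{beta}"],
))
print(generalize(
    "p16 is consistently expressed in endometrial tubal metaplasia",
    genes=["p16"],
    diseases=["endometrial tubal metaplasia"],
))
```

Plain string replacement is enough for the slide examples; overlapping or repeated mentions would need span offsets from the tagger instead.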

  24. Let ADIOS work its “magic”…

  25. Let ADIOS work its “magic”…
      Out pop patterns that describe the sentences (the grammar)

  26. “Tagging” the patterns
      GeneOne antagonizes GeneTwo
      GeneOne negatively regulates GeneTwo
      GeneOne increases transcription GeneTwo
      GeneOne positively regulates GeneTwo

  28. “Tagging” the patterns
     • Tagged “inhibits”:
       GeneOne antagonizes GeneTwo
       GeneOne negatively regulates GeneTwo
     • Tagged “activates”:
       GeneOne increases transcription GeneTwo
       GeneOne positively regulates GeneTwo

  29. Seeing a new sentence
      Ras/Erk pathway positively regulates Jak1/STAT6 activity

  30. Seeing a new sentence
      Ras/Erk pathway positively regulates Jak1/STAT6 activity
      The sentence fits the group of patterns tagged “activates”:
        GeneOne increases transcription GeneTwo
        GeneOne positively regulates GeneTwo

  31. Seeing a new sentence
      Result: Ras/Erk activates Jak1/STAT6 (from the “activates” group: increases transcription, positively regulates)
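
The tagging and lookup steps on the last few slides can be sketched as a small table plus a matcher. This Python example is a stand-in under stated assumptions: the tags are assigned by hand, the matcher uses a simple substring test on the text between the two gene mentions rather than real ADIOS pattern matching, and `PATTERN_TAGS` and `infer_relation` are hypothetical names.

```python
# Hand-assigned tags for the connecting phrases shown on the slides.
PATTERN_TAGS = {
    "antagonizes": "inhibits",
    "negatively regulates": "inhibits",
    "increases transcription": "activates",
    "positively regulates": "activates",
}


def infer_relation(sentence, gene_one, gene_two):
    """Return (gene_one, tag, gene_two) if a tagged phrase links the two mentions."""
    start = sentence.find(gene_one)
    end = sentence.find(gene_two)
    if start == -1 or end == -1 or start >= end:
        return None  # a mention is missing, or the genes are in the wrong order
    between = sentence[start + len(gene_one):end]
    for phrase, tag in PATTERN_TAGS.items():
        if phrase in between:
            return (gene_one, tag, gene_two)
    return None  # no tagged phrase between the two genes


print(infer_relation(
    "Ras/Erk pathway positively regulates Jak1/STAT6 activity",
    "Ras/Erk",
    "Jak1/STAT6",
))
```

Keeping the tags in one table means a newly learned pattern only needs one new entry before unseen sentences that use it can be labelled.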

  32. The big picture… Automatic extraction of regulation
      Smad7 antagonizes TGF-{beta} signaling in the nucleus
      PTEN negatively regulates expression of cyclin D1
      Ras/Erk pathway positively regulates Jak1/STAT6 activity
      Loss of p53 Expression Correlates with… Neck Cancer
      p16 is consistently expressed in endometrial tubal metaplasia

  33. Potential (inevitable) problems
     • The data/sentences
       • Amount
         • ADIOS’s data usually had thousands of sentences
       • Quality
         • ABNER/MetaMap (used for finding gene/disease mentions) are not always accurate
     • Is it even feasible?
       • Biologists/scientists are very creative in coming up with various ways of saying the same thing

  34. Potential (inevitable) problems
     • The data/sentences
       • Amount
         • ADIOS’s data usually had thousands of sentences
       • Quality
         • ABNER/MetaMap (used for finding gene/disease mentions) are not always accurate
     • Is it even feasible?
       • Stay tuned…
