World class IT in a world-wide market

Presentation Transcript

  1. World class IT in a world-wide market

  2. Text Mining Highlights Marten Trautwein Syllogic Research & Development

  3. RoadMap • TextHub • A parallel information retrieval tool • Text Mine • A document clustering extension • Emile • Grammar induction & clustering

  4. What is TextHub? • Intelligent Parallel Information Retrieval Tool • Intuitive Web based graphical user interface • Compression / Decompression • Indexing / Retrieval • Document clustering & categorization

  5. The star topology • Master receives requests • Master delegates tasks • Slave performs tasks • Master collects results • Master returns answer

  6. Use of parallelism • Documents outnumber processors • Divide and conquer • Distribute documents • Communication overhead minimum • Linear speed-up (1GB per hour)
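The master/slave flow and the document distribution described on slides 5 and 6 can be sketched as follows. This is a minimal illustration, not TextHub's actual code: threads stand in for separate processors, and the names `split_round_robin`, `worker_search`, and `master_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(documents, n_workers):
    """Deal documents out so each worker gets a roughly equal share."""
    return [documents[i::n_workers] for i in range(n_workers)]


def worker_search(chunk, term):
    """Worker task: ids of the documents in this chunk that contain the term."""
    return [doc_id for doc_id, text in chunk if term in text.lower().split()]


def master_search(documents, term, n_workers=3):
    """Master: delegate one chunk per worker, collect and merge the results."""
    chunks = split_round_robin(documents, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(worker_search, chunks, [term] * n_workers))
    return sorted(doc_id for hits in partial for doc_id in hits)


docs = [
    (1, "parallel retrieval"),
    (2, "text mining"),
    (3, "parallel text indexing"),
    (4, "grammar induction"),
]
print(master_search(docs, "parallel"))  # [1, 3]
```

Because each worker only touches its own chunk, the only communication is sending the request out and the hit list back, which is the property the slide credits for the near-linear speed-up.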

  7. Functionality details • Compression / Decompression • Canonical Huffman encoding • Indexing • Inverted file index with canonical terms • Retrieval • Boolean (AND, OR, MINUS) • Search modifiers (stemming, case folding, stop list, synonyms, semantic network) • Proximity (AT, FAR, NEAR) • Relevance ranking • Score documents
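Canonical Huffman encoding, named above as the compression method, can be illustrated with a short sketch. It coding characters rather than whatever symbols TextHub actually used (the slides do not say), and all function names are hypothetical. Code lengths come from an ordinary Huffman build; the canonical step then assigns codes purely from the sorted (length, symbol) pairs, so a decoder only needs the lengths.

```python
import heapq
from collections import Counter


def code_lengths(text):
    """Return {symbol: code length} from an ordinary Huffman tree build."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def canonical_codes(lengths):
    """Assign codes in (length, symbol) order, counting upward in binary."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len  # pad with zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


sample = "abracadabra"
sample_codes = canonical_codes(code_lengths(sample))
assert decode(encode(sample, sample_codes), sample_codes) == sample
```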

  8. Retrieval (Boolean)
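A small sketch of Boolean retrieval over an inverted file index, using the three operators listed on slide 7 (AND, OR, MINUS). Lower-casing stands in for the "canonical terms" mentioned there, which is an assumption; the helper names are hypothetical and not TextHub's API.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def postings(index, term):
    """Posting set for a term; empty if the term was never indexed."""
    return index.get(term.lower(), set())


def query_and(index, a, b):
    return postings(index, a) & postings(index, b)


def query_or(index, a, b):
    return postings(index, a) | postings(index, b)


def query_minus(index, a, b):
    """Documents containing a but not b."""
    return postings(index, a) - postings(index, b)


documents = {
    1: "Text mining tools",
    2: "Parallel text retrieval",
    3: "Parallel grammar induction",
}
index = build_index(documents)
print(query_and(index, "parallel", "text"))    # {2}
print(query_minus(index, "parallel", "text"))  # {3}
```

Each operator reduces to a set operation on posting lists, so a query never has to scan the document text itself.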

  9. Retrieval (Search modifiers)

  10. Retrieval (Proximity)

  11. Relevance ranking • Rate relevance of document • Score based on number of occurrences • Score compensated for large documents • TextHub marks where document is relevant
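The ranking idea above (count occurrences, compensate for document size, mark where the hits are) might look like the sketch below. The slides do not give the actual formula, so dividing the hit count by the document length in words is an assumption, and `score_document` and `rank` are hypothetical names.

```python
def score_document(text, query_terms):
    """Return (score, hit positions); the score is hits per word of document."""
    words = text.lower().split()
    if not words:
        return 0.0, []
    positions = [i for i, word in enumerate(words) if word in query_terms]
    return len(positions) / len(words), positions


def rank(documents, query_terms):
    """Sort documents by descending score, keeping the hit positions."""
    terms = {t.lower() for t in query_terms}
    scored = []
    for doc_id, text in documents.items():
        score, positions = score_document(text, terms)
        if positions:
            scored.append((doc_id, score, positions))
    return sorted(scored, key=lambda item: item[1], reverse=True)


corpus = {
    "short": "text mining",
    "long": "text " + "filler " * 9,
}
for doc_id, score, positions in rank(corpus, ["text"]):
    print(doc_id, round(score, 2), positions)
```

Both documents contain the query term once, but the short one ranks first because the single hit makes up a larger share of it.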

  12. Text Mine - Document clustering • Improve relevance feed-back • Clustering of related documents • Categorization of documents • Minimum spanning tree algorithm
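One common way to cluster with a minimum spanning tree is to build the tree with Kruskal's algorithm and stop before the longest edges are added, which leaves one connected component per cluster. The sketch below does that. The slides do not say which distance measure or cut rule Text Mine used, so the Jaccard distance on word sets and the fixed cluster count `k` are assumptions, and documents are assumed to be non-empty.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the overlap of the two documents' word sets (assumed measure)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(a & b) / len(a | b)


def mst_clusters(documents, k):
    """Kruskal's algorithm, halted once only k components remain."""
    ids = sorted(documents)
    parent = {d: d for d in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (jaccard_distance(documents[a], documents[b]), a, b)
        for a, b in combinations(ids, 2)
    )
    components = len(ids)
    for _, a, b in edges:
        if components <= k:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two components: it is in the MST
            parent[root_a] = root_b
            components -= 1

    clusters = {}
    for d in ids:
        clusters.setdefault(find(d), []).append(d)
    return sorted(clusters.values())


texts = {
    "a": "parallel text retrieval",
    "b": "parallel text indexing",
    "c": "grammar induction engine",
    "d": "grammar induction rules",
}
print(mst_clusters(texts, 2))  # [['a', 'b'], ['c', 'd']]
```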

  13. Using minimum spanning tree (figure: graph with nodes labeled V, F, T, D, U, S, A, E, B, C) • Combine different measures • Ordinary query retrieves relevant nodes • Nodes serve as entry-points • No global minimum spanning tree?

  14. Emile • In cooperation with University of Amsterdam • Engine enabling • Grammar induction • Knowledge base construction • Compound term separation • Language independent

  15. Grammar induction
      Fragment of Phaistos disk
        1 41 40 7.
        2 12 4 40 33.
        2 12 6 18 *.
        2 12 13 1.
        2 12 13 1 18.
        2 12 27 14 32 18 27.
        2 12 27 35 37 21.
        2 12 31 26.
        2 12 32 23 38.
        2 12 41 19 35.
        2 27 25 10 23 18.
        …
        16 14 18.
        16 23 18 43.
      Fragment of grammar
        [0] --> [3] .
        [3] --> [16] [47]
        [14] --> 15 [40]
        [14] --> 2 12
        [16] --> 2 [57] 25 10 23
        [16] --> [14] 13 1
        [16] --> 16 14
        [40] --> 7
        [40] --> 29
        [47] --> 18
        [47] --> 24 40
        [57] --> 27
        [57] --> 29
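To see how the grammar fragment on this slide relates to the disk fragment, the rules can be expanded mechanically. The sketch below is only a reader's aid, not Emile itself: it assumes each rule starts at a bracketed number followed by `-->`, treats bracketed symbols as non-terminals and bare numbers as terminals, and relies on this fragment being non-recursive. The names `parse_rules` and `expand` are hypothetical.

```python
from itertools import product

# Rules transcribed from the slide, one per line.
RULES = """
[0] --> [3] .
[3] --> [16] [47]
[14] --> 15 [40]
[14] --> 2 12
[16] --> 2 [57] 25 10 23
[16] --> [14] 13 1
[16] --> 16 14
[40] --> 7
[40] --> 29
[47] --> 18
[47] --> 24 40
[57] --> 27
[57] --> 29
"""


def parse_rules(text):
    """Build {non-terminal: [list of right-hand sides]}."""
    grammar = {}
    for line in text.strip().splitlines():
        head, body = line.split("-->")
        grammar.setdefault(head.strip(), []).append(body.split())
    return grammar


def expand(symbol, grammar):
    """Every terminal sequence derivable from symbol (no recursion assumed)."""
    if symbol not in grammar:  # bare number or "." is a terminal
        return [[symbol]]
    results = []
    for body in grammar[symbol]:
        for parts in product(*(expand(s, grammar) for s in body)):
            results.append([tok for part in parts for tok in part])
    return results


grammar = parse_rules(RULES)
sentences = {" ".join(seq) for seq in expand("[0]", grammar)}
print("2 12 13 1 18 ." in sentences)  # True
```

Under these assumptions the fragment derives, for example, `2 12 13 1 18 .`, `2 27 25 10 23 18 .`, and `16 14 18 .`, all of which appear in the disk fragment. It also derives sequences that are not in the fragment shown, which is expected of an induced grammar that generalizes.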

  16. Knowledge base construction
      Dictionary Type [35] K033 k033 K105 k33
      Dictionary Type [87] Vrachtgeb vrachtgeb Vrachtgebouw Vracht
      Dictionary Type [89] CGOADTP6 Printqueue
      Dictionary Type [114] is Userid Password
      Dictionary Type [138] status Error
      Dictionary Type [196] scarlos vrachtbrieven
      Dictionary Type [215] G239 g239
      Dictionary Type [237] enorm ontzettend super
      Dictionary Type [290] pingen benaderen

  17. Emile on Biomed (1)

  18. Emile on Biomed (2)

  19. Emile on Biomed (3)

  20. Emile outcome
      [16] --> School of Medicine , University of Washington , Seattle 98195 , USA
      [16] --> University of Kitasato Hospital , Sagamihara , Kanagawa , Japan
      [16] --> Heinrich-Heine-University , Dusseldorf , Germany
      [16] --> School of Medicine , Chiba University
      [5] --> Department of Urology , [16]
      [94] --> Chinese
      [94] --> Japanese
      [94] --> Polish
      [101] --> 32 : Cancer Res 1996 Oct
      [101] --> 35 : Genomics 1996 Aug
      [101] --> 44 : Cancer Res 1995 Dec
      [101] --> 50 : Cancer Res 1995 Feb
      [101] --> 54 : Eur J Biochem 1994 Sep
      [101] --> 58 : Cancer Res 1994 Mar
      [105] --> identified in 13 cases ( 72
      [105] --> detected in 9 of 87 informative cases ( 10
      [105] --> observed in 5 ( 55
      [11] --> LOH was [105] %