World class IT in a world-wide market


Presentation Transcript


  1. World class IT in a world-wide market

  2. Text Mining Highlights: Marten Trautwein, Syllogic Research & Development

  3. RoadMap
     • TextHub
       • A parallel information retrieval tool
     • Text Mine
       • A document clustering extension
     • Emile
       • Grammar induction & clustering

  4. What is TextHub?
     • Intelligent Parallel Information Retrieval Tool
     • Intuitive Web-based graphical user interface
     • Compression / Decompression
     • Indexing / Retrieval
     • Document clustering & categorization

  5. The star topology
     • Master receives requests
     • Master delegates tasks
     • Slave performs tasks
     • Master collects results
     • Master returns answer

  6. Use of parallelism
     • Documents outnumber processors
     • Divide and conquer
     • Distribute documents
     • Minimal communication overhead
     • Linear speed-up (1 GB per hour)
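The star topology and the document distribution on the two slides above can be sketched as follows. This is a minimal illustration, not TextHub's implementation: the function names `worker_search` and `master_search` are hypothetical, threads stand in for the slave processors, and the round-robin split is an assumption about how documents might be distributed.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_search(shard, term):
    """Worker (the slide's 'slave'): search only its own share of the documents."""
    return [doc_id for doc_id, text in shard if term in text.lower().split()]


def master_search(documents, term, n_workers=3):
    """Master: split the collection, delegate, collect, and return one answer."""
    items = sorted(documents.items())
    # Round-robin split (an assumption): each worker gets a disjoint shard.
    shards = [items[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker_search, shards, [term] * n_workers))
    # Only the small hit lists travel back to the master.
    return sorted(doc_id for hits in partial_results for doc_id in hits)


docs = {
    1: "parallel text retrieval",
    2: "grammar induction",
    3: "text mining",
    4: "document clustering",
}
result = master_search(docs, "text")
```

Because each worker touches only its own shard and returns only matching ids, the communication between master and workers stays small, which is the property the slide relies on for near-linear speed-up.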

  7. Functionality details
     • Compression / Decompression
       • Canonical Huffman encoding
     • Indexing
       • Inverted file index with canonical terms
     • Retrieval
       • Boolean (AND, OR, MINUS)
       • Search modifiers (stemming, case folding, stop list, synonyms, semantic network)
       • Proximity (AT, FAR, NEAR)
     • Relevance ranking
       • Score documents
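Canonical Huffman encoding, named on the slide above, can be illustrated with a short sketch. It assumes character-level symbols (the slides do not say whether TextHub codes characters or words), and the helper names are hypothetical. The idea is to compute ordinary Huffman code lengths first and then assign codes in a fixed order, so that a decoder only needs the lengths.

```python
import heapq
from collections import Counter


def code_lengths(text):
    """Return {symbol: Huffman code length} for the symbols in text."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def canonical_codes(lengths):
    """Assign canonical codes: sort by (length, symbol) and count upward."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len  # append zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    inverse = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:  # codes are prefix-free, so the first match is right
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


sample = "abracadabra"
codes = canonical_codes(code_lengths(sample))
bits = encode(sample, codes)
restored = decode(bits, codes)
```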

  8. Retrieval (Boolean)
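The slide itself carries no text, so the following is only a sketch of how Boolean retrieval over an inverted file index typically works. The operators AND, OR, and MINUS come from the functionality slide; the function names and the use of lower-casing as the "canonical term" step are assumptions.

```python
from collections import defaultdict


def build_index(documents):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def postings(index, term):
    """Posting set for one term (empty when the term is unknown)."""
    return set(index.get(term.lower(), set()))


def q_and(index, a, b):
    return postings(index, a) & postings(index, b)


def q_or(index, a, b):
    return postings(index, a) | postings(index, b)


def q_minus(index, a, b):
    """Documents that contain a but not b."""
    return postings(index, a) - postings(index, b)


docs = {
    1: "text mining tools",
    2: "text retrieval",
    3: "grammar induction tools",
}
index = build_index(docs)
```

Each operator reduces to a set operation on posting sets, which is why Boolean queries are cheap once the index exists.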

  9. Retrieval (Search modifiers)
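The search modifiers listed on the functionality slide (stemming, case folding, stop list, synonyms) can be sketched as a normalisation step applied to both documents and queries. Everything concrete here is an assumption for illustration: the stop list, the synonym table, and the deliberately crude suffix-stripping `stem` function are hypothetical and are not TextHub's.

```python
STOP_WORDS = {"the", "a", "of", "and", "in"}   # hypothetical stop list
SYNONYMS = {"car": {"automobile"}}             # hypothetical synonym table


def stem(word):
    """Crude suffix stripping; real stemmers use far more careful rules."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def canonical_terms(text):
    """Case folding, punctuation trimming, stop-word removal, then stemming."""
    terms = []
    for raw in text.split():
        word = raw.lower().strip(".,;:!?")
        if word and word not in STOP_WORDS:
            terms.append(stem(word))
    return terms


def expand_query(term):
    """Return the stemmed term together with its stemmed synonyms."""
    base = term.lower()
    return {stem(base)} | {stem(s) for s in SYNONYMS.get(base, ())}


terms = canonical_terms("The Indexing of Documents")
expanded = expand_query("Car")
```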

  10. Retrieval (Proximity)
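The slides name the proximity operators AT, FAR, and NEAR but do not define them. The sketch below therefore assumes one plausible reading: NEAR means two terms occur within a fixed word window, and FAR means they co-occur only farther apart than that window. AT is left out because its meaning cannot be recovered from the slides. The window size and function names are hypothetical.

```python
from collections import defaultdict


def positional_index(documents):
    """term -> doc_id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def min_distance(index, a, b, doc_id):
    """Smallest word distance between a and b in one document, or None."""
    pos_a = index.get(a, {}).get(doc_id, [])
    pos_b = index.get(b, {}).get(doc_id, [])
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def co_occurring(index, a, b):
    return set(index.get(a, {})) & set(index.get(b, {}))


def near(index, a, b, window=3):
    """Assumed semantics: both terms within `window` words of each other."""
    return {d for d in co_occurring(index, a, b)
            if min_distance(index, a, b, d) <= window}


def far(index, a, b, window=3):
    """Assumed semantics: both terms present, but never within `window` words."""
    return {d for d in co_occurring(index, a, b)
            if min_distance(index, a, b, d) > window}


docs = {
    1: "text mining and text retrieval",
    2: "text about many other unrelated things like retrieval",
}
index = positional_index(docs)
```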

  11. Relevance ranking
      • Rate the relevance of each document
      • Score based on number of occurrences
      • Score compensated for large documents
      • TextHub marks where a document is relevant
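The scoring idea on this slide can be sketched as below. The slide does not give TextHub's formula, so dividing the number of occurrences by the document length is only one simple way to compensate for large documents, and the function names are hypothetical.

```python
def score(text, query_terms):
    """Occurrences of query terms divided by document length in words."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in query_terms)
    return hits / len(words)


def hit_positions(text, query_terms):
    """Word positions where the document matches, for marking relevant spots."""
    return [i for i, w in enumerate(text.lower().split()) if w in query_terms]


def rank(documents, query_terms):
    """Ids of matching documents, best score first (ties broken by id)."""
    scored = [(score(text, query_terms), doc_id)
              for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


docs = {
    1: "text mining with text clustering",
    2: "a very long report that mentions text only once in passing",
    3: "grammar induction",
}
query = {"text"}
ranking = rank(docs, query)
marks = hit_positions(docs[1], query)
```

Without the length normalisation, a long document that merely mentions a term many times in passing would outrank a short, focused one.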

  12. Text Mine - Document clustering
      • Improve relevance feedback
      • Clustering of related documents
      • Categorization of documents
      • Minimum spanning tree algorithm

  13. Using minimum spanning tree
      [Figure: node labels V, F, T, D, U, S, A, E, B, C]
      • Combine different measures
      • Ordinary query retrieves relevant nodes
      • Nodes serve as entry-points
      • No global minimum spanning tree?
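Clustering with a minimum spanning tree can be sketched as follows. The slides name the algorithm but not the distance measure or how clusters are read off the tree, so this sketch assumes Jaccard distance on word sets, Kruskal's algorithm for the tree, and cutting tree edges longer than a threshold. All names and the threshold value are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on the word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


class DisjointSet:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        return True


def minimum_spanning_tree(documents):
    """Kruskal: take edges by increasing distance unless they close a cycle."""
    edges = sorted(
        (jaccard_distance(documents[a], documents[b]), a, b)
        for a, b in combinations(sorted(documents), 2)
    )
    forest = DisjointSet(documents)
    return [(w, a, b) for w, a, b in edges if forest.union(a, b)]


def clusters(documents, max_distance=0.7):
    """Cut tree edges longer than max_distance; the remaining pieces are clusters."""
    forest = DisjointSet(documents)
    for w, a, b in minimum_spanning_tree(documents):
        if w <= max_distance:
            forest.union(a, b)
    groups = {}
    for doc_id in documents:
        groups.setdefault(forest.find(doc_id), []).append(doc_id)
    return sorted(sorted(group) for group in groups.values())


docs = {
    "A": "text retrieval index",
    "B": "text retrieval ranking",
    "C": "grammar induction rules",
    "D": "grammar induction engine",
}
tree = minimum_spanning_tree(docs)
groups = clusters(docs)
```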

  14. Emile
      • In cooperation with the University of Amsterdam
      • Engine enabling
        • Grammar induction
        • Knowledge base construction
        • Compound term separation
      • Language independent

  15. Grammar induction
      Fragment of Phaistos disk:
        1 41 40 7.
        2 12 4 40 33.
        2 12 6 18 *.
        2 12 13 1.
        2 12 13 1 18.
        2 12 27 14 32 18 27.
        2 12 27 35 37 21.
        2 12 31 26.
        2 12 32 23 38.
        2 12 41 19 35.
        2 27 25 10 23 18.
        …
        16 14 18.
        16 23 18 43.
      Fragment of grammar:
        [0] --> [3] .
        [3] --> [16] [47]
        [14] --> 15 [40]
        [14] --> 2 12
        [16] --> 2 [57] 25 10 23
        [16] --> [14] 13 1
        [16] --> 16 14
        [40] --> 7
        [40] --> 29
        [47] --> 18
        [47] --> 24 40
        [57] --> 27
        [57] --> 29
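The slide shows only the outcome of grammar induction, not Emile's algorithm, so the sketch below does not attempt induction. It merely expands the grammar fragment from the slide and checks that it regenerates some of the lines in the Phaistos disk fragment. The rules are copied from the slide; the `expand` helper is hypothetical and relies on this fragment being non-recursive.

```python
from itertools import product

# Rules copied from the slide; bracketed symbols are non-terminals.
GRAMMAR = {
    "[0]": [["[3]", "."]],
    "[3]": [["[16]", "[47]"]],
    "[14]": [["15", "[40]"], ["2", "12"]],
    "[16]": [["2", "[57]", "25", "10", "23"], ["[14]", "13", "1"], ["16", "14"]],
    "[40]": [["7"], ["29"]],
    "[47]": [["18"], ["24", "40"]],
    "[57]": [["27"], ["29"]],
}


def expand(symbol):
    """Every terminal sequence derivable from symbol (non-recursive grammars only)."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal: stands for itself
    results = []
    for rhs in GRAMMAR[symbol]:
        for combo in product(*(expand(s) for s in rhs)):
            results.append([token for part in combo for token in part])
    return results


sentences = {" ".join(seq) for seq in expand("[0]")}

# Lines from the disk fragment, written with the final period as its own token.
observed = ["2 12 13 1 18 .", "2 27 25 10 23 18 .", "16 14 18 ."]
covered = [line for line in observed if line in sentences]
```

The grammar fragment also generates sequences that do not appear in the excerpt shown on the slide, which is expected: an induced grammar generalises beyond the sample it was built from.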

  16. Knowledge base construction
      Dictionary Type [35]: K033 k033 K105 k33
      Dictionary Type [87]: Vrachtgeb vrachtgeb Vrachtgebouw Vracht
      Dictionary Type [89]: CGOADTP6 Printqueue
      Dictionary Type [114]: is Userid Password
      Dictionary Type [138]: status Error
      Dictionary Type [196]: scarlos vrachtbrieven
      Dictionary Type [215]: G239 g239
      Dictionary Type [237]: enorm ontzettend super
      Dictionary Type [290]: pingen benaderen

  17. Emile on Biomed (1)

  18. Emile on Biomed (2)

  19. Emile on Biomed (3)

  20. Emile outcome
      [16] --> School of Medicine , University of Washington , Seattle 98195 , USA
      [16] --> University of Kitasato Hospital , Sagamihara , Kanagawa , Japan
      [16] --> Heinrich-Heine-University , Dusseldorf , Germany
      [16] --> School of Medicine , Chiba University
      [5] --> Department of Urology , [16]
      [94] --> Chinese
      [94] --> Japanese
      [94] --> Polish
      [101] --> 32 : Cancer Res 1996 Oct
      [101] --> 35 : Genomics 1996 Aug
      [101] --> 44 : Cancer Res 1995 Dec
      [101] --> 50 : Cancer Res 1995 Feb
      [101] --> 54 : Eur J Biochem 1994 Sep
      [101] --> 58 : Cancer Res 1994 Mar
      [105] --> identified in 13 cases ( 72
      [105] --> detected in 9 of 87 informative cases ( 10
      [105] --> observed in 5 ( 55
      [11] --> LOH was [105] %
