
CNI, 3rd April 2006 Slide 1



Presentation Transcript


  1. UK National Centre for Text Mining: Activities and Plans
     Dr. Robert Sanderson, Dept. of Computer Science, University of Liverpool
     azaroth@liv.ac.uk
     http://www.nactem.ac.uk
     CNI, 3rd April 2006

  2. Overview Text Mining? NaCTeM Consortium Components Service Infrastructure Future Work

  3. Centre for ... National Centre for ... what was that? TEXT Ticks Mining!

  4. ... Text Mining?
     • Text Mining: no canonical definition
     • Commonly used definition, based on Data Mining:
       • Data Mining: “The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.”
       • Text Mining: “The non-trivial extraction of previously unknown, interesting facts from an invariably large collection of texts.”

  5. ... Text Mining?
     • Typical Data Mining functions:
       • Classification
       • Association Rule Mining
       • Clustering
     • Useful when applied to texts, but they don't fulfill the definition, as they don't discover “facts”.
     • Information Retrieval also doesn't discover facts.

  6. ... Text Mining?
     Need to understand the meaning of the text:
     • Part of Speech tagging
     • Clauses
     • Named Entity Recognition
     • Find correlations of entities
     • Infer information from logical chains
     Result: New Knowledge
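
The last two steps on this slide (correlating entities, then inferring along logical chains) can be illustrated with a minimal Python sketch. It assumes named entity recognition has already produced one list of entity strings per sentence; the entity names, the counts, and the function names `cooccurrence_counts` and `candidate_links` are invented for illustration and are not part of any NaCTeM tool.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotated_sentences):
    """Count how often each unordered pair of entities shares a sentence."""
    counts = Counter()
    for entities in annotated_sentences:
        # sorted(set(...)) gives each pair one canonical ordering
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def candidate_links(counts):
    """Propose pairs (A, C) that never co-occur but share a neighbour B."""
    neighbours = {}
    for a, b in counts:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    proposals = set()
    for a, c in combinations(sorted(neighbours), 2):
        if c not in neighbours[a] and neighbours[a] & neighbours[c]:
            proposals.add((a, c))
    return proposals


# Hypothetical NER output: one list of entities per sentence.
sentences = [
    ["GeneA", "ProteinB"],
    ["ProteinB", "DiseaseC"],
    ["GeneA", "ProteinB"],
]
counts = cooccurrence_counts(sentences)
links = candidate_links(counts)
print(counts)
print(links)
```

In this toy input GeneA and DiseaseC never appear together, yet both co-occur with ProteinB, so the pair is proposed as a candidate link for a human to check. A real system would weigh such chains statistically rather than accept every shared neighbour.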

  7. Other Benefits
     Plus a lot more:
     • Improved document classification
     • Automatic semantic annotation of documents
     • Improved access: search by semantics and concepts
     • Improved clustering of documents by concept
     • Summarization
     • Visualization techniques

  8. Event Extraction
     • Extract events from the text along with information about the participants
     • Can be modeled as relationships between named entities
     • Extracting events allows discovery of hidden temporal correlations, e.g.: Google refuses to announce plans. Google's stock falls.
     • Improves understanding of the semantics, which in turn improves the functions built on those semantics
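
One way to picture events as relationships between named entities is the rough Python sketch below. The `Event` record, the `follow_ups` helper, the dates, and the "Acme" entity are all hypothetical; the slide gives only the two Google sentences and no dates. The sketch pairs each event with later events on the same entity inside a short time window, which is the simplest form of the temporal correlation the slide mentions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    subject: str  # named entity taking part in the event
    action: str   # normalised description of what happened
    when: date    # date attached to the event (assumed to be known)


def follow_ups(events, max_days=3):
    """Pair each event with later events on the same entity within max_days."""
    ordered = sorted(events, key=lambda e: e.when)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            gap = (second.when - first.when).days
            if gap > max_days:
                break  # events are date-ordered, so later ones are further away
            if second.subject == first.subject:
                pairs.append((first.action, second.action, gap))
    return pairs


# Invented dates; only the two Google actions come from the slide.
events = [
    Event("Google", "refuses to announce plans", date(2006, 3, 1)),
    Event("Google", "stock falls", date(2006, 3, 2)),
    Event("Acme", "stock rises", date(2006, 3, 2)),
]
pairs = follow_ups(events)
print(pairs)
```

Counting how often the same action pair recurs across many entities would be the next step toward a correlation; a single pair like this one shows only sequence, not cause.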

  9. NaCTeM
     • Hosted at University of Manchester
     • Participants: Universities of Manchester, Liverpool, Salford
     • Plus: San Diego Supercomputer Center, University of Tokyo, University of Geneva, University of California Berkeley
     • Six full-time posts for 3 years (2005-2007)
     • Plus active board of directors and experts
     • Current Director: Professor Jun'ichi Tsujii from U.Tokyo
     • Funding: JISC, BBSRC, EPSRC

  10. NaCTeM Aims
     • Provide text mining oriented services
     • Facilitate access to text mining resources
     • User support, advice, training and consultancy
     • Participate in international research
     • Formulate best practice guidelines
     • Increase awareness of text mining in all domains
     • Develop links with industrial partners involved in text mining

  11. Components
     • Liverpool: Cheshire3 (Information framework)
     • Manchester: CAFETIERE (Entity recognition, event extraction)
     • Salford: TerMine (Automatic term recognition)
     • SDSC: Storage Resource Broker (Data grid)
     • UC Berkeley: Cheshire, TM/IR expertise
     • U.Tokyo: GENIA, ENJU (Text analysis tools)
     • U.Geneva: User studies and evaluation

  12. Cheshire3: Information Processing Framework
     • Liverpool and UC Berkeley
     • Standards based: XML, SRU, Unicode, etc.
     • Scalable: Single machine to Grid (PVM, MPI, SRB)
     • Extensible: Python + C, Object Oriented with stable API
     • Work ongoing to integrate Data Mining tools and other information processing applications
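
As a small illustration of the SRU standard listed on this slide, the Python sketch below builds a `searchRetrieve` request URL using only the standard library. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) come from the SRU 1.1 specification; the endpoint `http://sru.example.org/db`, the helper name `sru_search_url`, and the assumption that the server exposes a `dc.title` index are hypothetical and do not describe any particular Cheshire3 deployment.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, maximum=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,         # CQL query string, percent-encoded below
        "startRecord": start,       # 1-based position of the first record
        "maximumRecords": maximum,  # page size requested from the server
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
url = sru_search_url("http://sru.example.org/db", 'dc.title = "text mining"')
print(url)
```

Because the request is a plain HTTP GET returning XML, any client that can fetch a URL can query such a service, which is much of the appeal of a standards-based framework.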

  13. Cheshire3 Examples
     • Integrated tools from other participants in preparation for NaCTeM service infrastructure
     • Medline: 4350 records/second using 60 concurrent processes on SDSC's TeraGrid cluster
     • 440 seconds to index 1 field from 16 million MARC records
     • Distributed network of Archival Descriptions in the UK
     • NARA ERA prototype system with SDSC
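
The two throughput figures on this slide can be put on a common footing with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide; the even split of work across the 60 processes is an assumption, and the two workloads differ, so the resulting rates are not directly comparable.

```python
# Medline figure from the slide: aggregate rate across all processes.
medline_rate = 4350   # records per second
processes = 60
per_process_rate = medline_rate / processes  # assumes an even split of work

# MARC figure from the slide: total records and total time for one field.
marc_records = 16_000_000
marc_seconds = 440
marc_rate = marc_records / marc_seconds      # records per second

print(f"Medline: {per_process_rate:.1f} records/second per process")
print(f"MARC: {marc_rate:.0f} records/second for a single field")
```

This works out to 72.5 Medline records per second per process, and roughly 36,000 MARC records per second when indexing a single field.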

  14. CAFETIERE: Entity Recognition and Annotation
     • University of Manchester
     • Discovers named entities in part-of-speech-tagged text
     • Discovers temporal events referring to those entities
     • Integration of ontologies and term processing
     • Rule-based

  15. CAFETIERE Example

  16. TerMine: Automatic Term Recognition
     • University of Salford/Manchester
     • Discovers important terms
     • Assigns 'C-value' score to rank terms
     • Interaction with terminology databases for term management
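
The slide does not define the C-value score. As commonly published, the C-value of a multi-word candidate term a is log2|a| · f(a) when a is not nested in any longer candidate, and log2|a| · (f(a) − mean of f(b) over the longer candidates b that contain a) otherwise, where |a| is the length in words and f is corpus frequency. The Python sketch below implements that formula directly; whether the TerMine service computes it exactly this way is an assumption, the helper names are hypothetical, and the frequencies are invented for illustration.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_values(freqs):
    """Score candidate terms given a dict {tuple_of_words: frequency}."""
    scores = {}
    for term, freq in freqs.items():
        # Frequencies of longer candidates in which this term is nested.
        nested_in = [f for other, f in freqs.items()
                     if len(other) > len(term) and contains(other, term)]
        if nested_in:
            freq = freq - sum(nested_in) / len(nested_in)
        # log2 of the term length rewards longer terms
        # (and gives single-word terms a score of 0).
        scores[term] = math.log2(len(term)) * freq
    return scores


# Invented frequencies, for illustration only.
freqs = {
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 8,
    ("adenoid", "cystic", "basal", "cell", "carcinoma"): 2,
}
scores = c_values(freqs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these counts "cell carcinoma" is the most frequent string yet ranks last, because much of its frequency is explained by the longer terms that contain it. That discounting of nested terms is the point of the measure.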

  17. TerMine Example

  18. U. Tokyo Tools: Natural Language Parsing
     • University of Tokyo
     • Tagger, Chunker, ENJU, GENIA
     • Necessary for any text mining application
     • Fast and accurate
     • http://www-tsujii.is.s.u-tokyo.ac.jp/hiiragi/
     • http://www-tsujii.is.s.u-tokyo.ac.jp/CytoSailing/

  19. Tokyo Tools Example

  20. Tokyo Tools Example2

  21. Service Infrastructure
     • NaCTeM will allow UK researchers to perform text mining on their own data in combination with other accessible resources (e.g. other data sets, ontologies, etc.)
     • Requirements:
       • Lots of processing power
       • Lots of storage capacity
       • Easily extensible/configurable service framework
       • Access to cutting-edge TM, DM and IR tools

  22. Service Infrastructure
     • Processing provided by UK National Grid Service
     • Data Storage via SDSC's Storage Resource Broker
     • Important to store multiple versions of each document
     • Cheshire3 provides the Grid-enabled information infrastructure
     • Plus information retrieval and data mining tools
     • Manchester and Tokyo provide the text mining tools
     • Stable tools integrated into Cheshire3 already

  23. Service Infrastructure
     • Initial NaCTeM services will be focused on the bio domain:
     • Bio-informatics is a growing field
     • Interest from both academic and corporate sectors
     • Large datasets/services available (MeSH, Medline, ...)
     • Web portal interaction
     • Then expand into other areas, such as Social Sciences and Historical text analysis.

  24. Future Work
     • Services for other domains
     • GUI Workflow configuration
     • Integration of user-developed services and applications
     • Maximizing workflow potential with 'smart' components
     • Standardizing annotation schemas
     • Conference/Workshop
     • Other?

  25. Thank You Questions? ... Reception!
