Text Mining Services to Support e-Research
Brian Rea and Sophia Ananiadou
National Centre for Text Mining, www.nactem.ac.uk
University of Manchester
UK e-Science All Hands Meeting

Outline
• Recent History
• Document Enrichment
• Information Retrieval and Text Mining
• Text Mining Applications
• Case Study: ASSERT Project
• Case Study: BBC Online News Feeds
• Future Opportunities

What is Text Mining
• Text mining discovers and extracts information hidden in unstructured texts.
• It aids the construction of hypotheses based upon associations between the extracted information.
• Because of this, it can often discover things overlooked by human readers.

What Text Mining is not
• Text mining is not based upon an understanding of document content. Instead it predicts the most likely meaning of a fragment of text based upon models of language.
• Text mining will generally not pick up on sarcasm, irony or other subtleties of language usage.
• Text mining tools must be tuned before use on different text types, styles or languages.

Recent History
• Number of MEDLINE references: 14,792,890
• Number of abstracts: 7,434,879
• Number of sentences: 70,815,480
• Number of words: 1,418,949,650
• Compressed data size: 3.2GB
• Uncompressed data size: 10GB
Suppose, for example, that it takes one second to analyse one sentence. That is 70 million seconds, which is more than 2 years.

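The "more than 2 years" figure can be checked with a few lines of arithmetic. This is a minimal sketch that takes the sentence count from the slide and keeps its one-second-per-sentence assumption; the function name is ours, chosen for illustration.

```python
SENTENCES = 70_815_480        # MEDLINE sentence count quoted on the slide
SECONDS_PER_SENTENCE = 1.0    # the slide's illustrative assumption


def analysis_time_years(sentences, seconds_per_sentence=SECONDS_PER_SENTENCE):
    """Return the sequential processing time in years."""
    seconds = sentences * seconds_per_sentence
    days = seconds / (60 * 60 * 24)
    return days / 365.25


years = analysis_time_years(SENTENCES)
print(f"{years:.2f} years on a single sequential process")
```

The result is a little over two years, which is the motivation for the parallel processing discussed later in the talk.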
• The rapid increase in the amount of literature means it is becoming impractical to read everything in many disciplines.
• Text mining systems can begin to address this by automating some of the process.
• Without any inherent understanding of language, the system must use different methods from those of a human.
• As such, it can often discover facts or patterns that a human may easily miss.

Document Enhancement
• How do we approximate an understanding of natural language?
• Levels of annotation are built up in stages.
• Tokenisation gives us words and boundaries.
• Part-Of-Speech (POS) tagging gives us a basic model with nouns, verbs, etc.
• There are many methods for predicting POS.
• Training on hand-coded documents is necessary to improve accuracy.
• Errors at this early stage are passed on to every later stage and can compound rapidly through the system.

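The first two annotation stages can be sketched as follows. This is a toy illustration only: the regular expression, the tag names and the `TOY_LEXICON` lookup table are assumptions made for the example, whereas a real tagger would be trained on hand-coded documents as the slide notes.

```python
import re

# Words (optionally joined by hyphens or apostrophes) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

# Hypothetical lookup table standing in for a trained POS model.
TOY_LEXICON = {
    "the": "DET",
    "mps": "NOUN",
    "discussed": "VERB",
    "policy": "NOUN",
    "with": "PREP",
    "ambassador": "NOUN",
}


def tokenise(text):
    """Stage 1: split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def tag(tokens):
    """Stage 2: attach a POS tag to each token (unknown words default to NOUN)."""
    tagged = []
    for token in tokens:
        if not token[0].isalnum():
            pos = "PUNCT"
        else:
            pos = TOY_LEXICON.get(token.lower(), "NOUN")
        tagged.append((token, pos))
    return tagged


print(tag(tokenise("The MPs discussed the policy.")))
```

Because the tagger only sees what the tokeniser produces, a tokenisation mistake would be carried straight into the tags, which is the error propagation the slide warns about.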
Document Enhancement
• How do we fit these words together?
• Grammars provide simple syntax rules for building up complex sentences based upon POS tag information.
• Shallow parsing: gives information about noun and verb phrases.
• Deep parsing: generates complex representations of the underlying relationships between phrases.

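Shallow parsing can be illustrated with a small chunker over POS-tagged tokens. This is a sketch under stated assumptions: the tag names and the two hand-written rules (a noun phrase is an optional determiner, any adjectives, then one or more nouns; a verb phrase is a run of verbs) are ours, not the grammar used by any NaCTeM tool.

```python
def chunk(tagged):
    """Group (token, tag) pairs into flat NP and VP chunks."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        # Try a noun phrase: DET? ADJ* NOUN+
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun was found
            chunks.append(("NP", [tok for tok, _ in tagged[i:k]]))
            i = k
            continue
        # Try a verb phrase: VERB+
        if tagged[i][1] == "VERB":
            k = i
            while k < n and tagged[k][1] == "VERB":
                k += 1
            chunks.append(("VP", [tok for tok, _ in tagged[i:k]]))
            i = k
            continue
        i += 1  # token belongs to no chunk (e.g. a preposition)
    return chunks


tagged_sentence = [
    ("The", "DET"), ("MPs", "NOUN"), ("discussed", "VERB"),
    ("the", "DET"), ("policy", "NOUN"), ("with", "PREP"),
    ("the", "DET"), ("ambassador", "NOUN"),
]
for label, words in chunk(tagged_sentence):
    print(label, " ".join(words))
```

The output is a flat list of phrases. Deciding how those phrases relate to one another is the job of deep parsing.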
Document Enhancement
Example: “The MPs discussed the policy with the ambassador”
• This is a relatively simple example, but there are many ways of interpreting it.
• Parsing techniques choose the most likely meaning based upon complex internal models.
• Complex sentences can take longer to process as many possibilities are available and need to be ruled out.

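One source of ambiguity in the example sentence is where "with the ambassador" attaches: to the discussing or to the policy. The sketch below counts the parse trees licensed by a tiny grammar using the CYK algorithm. The grammar and lexicon are hypothetical and written only for this sentence; a real parser would use a far larger grammar and a statistical model to pick the most likely tree.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (left child, right child) -> parents.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # "with ..." modifies the verb phrase
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],   # "with ..." modifies the noun phrase
    ("P", "NP"): ["PP"],
}
LEXICON = {
    "the": "Det", "mps": "N", "policy": "N", "ambassador": "N",
    "discussed": "V", "with": "P",
}


def count_parses(words):
    """Return the number of distinct S trees covering the whole sentence."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, symbol) -> number of trees
    for i, word in enumerate(words):
        chart[(i, i + 1, LEXICON[word.lower()])] += 1
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for (left, right), parents in BINARY_RULES.items():
                    left_count = chart.get((start, mid, left), 0)
                    right_count = chart.get((mid, end, right), 0)
                    if left_count and right_count:
                        for parent in parents:
                            chart[(start, end, parent)] += left_count * right_count
    return chart.get((0, n, "S"), 0)


sentence = "The MPs discussed the policy with the ambassador"
print(count_parses(sentence.split()))  # 2 under this toy grammar
```

Even this eight-word sentence has two readings under six rules. The number of candidate trees grows quickly with sentence length, which is why complex sentences take longer to process.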
MedIE

Term Discovery
• Keywords are often used when searching within documents. This reduces the noise created by common words that carry little information.
• Text mining can take this a stage further and identify significant terms (multi-word units).
• Terms can be used to:
  • Gain an overview of the document contents
  • Assist searching by allowing query expansion and browsing
  • Identify important concepts for generating ontologies

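A very rough way to find multi-word term candidates is to split text at stopwords and punctuation and count the runs of words that remain. This is a frequency heuristic written for illustration, not the method used by TerMine; the stopword list and the example text are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "both", "can", "for", "in", "is",
             "of", "the", "to", "with"}


def candidate_terms(text, min_words=2):
    """Count runs of consecutive non-stopwords that are at least min_words long."""
    counts = Counter()
    # Punctuation ends a candidate, so split the text into fragments first.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        run = []
        for word in fragment.split() + [None]:  # None flushes the final run
            if word is not None and word not in STOPWORDS:
                run.append(word)
            else:
                if len(run) >= min_words:
                    counts[" ".join(run)] += 1
                run = []
    return counts


text = ("Text mining can help with query expansion. "
        "Query expansion and text mining are both in the toolkit.")
print(candidate_terms(text).most_common())
```

Single words such as "help" and "toolkit" are dropped because they are not multi-word units. Ranking the remaining candidates well is where dedicated term recognition methods come in.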
TerMine

Named Entity Recognition
• Uses common forms or patterns in text to identify items belonging to particular semantic categories.
• Different methods can be used, including rule-based, template-driven or machine learning.
• Some examples include: names, addresses, organisations, dates, times, quantities…

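The rule-based approach can be sketched with a handful of regular expressions, one per semantic category. The patterns, category labels and example sentence below are all made up for illustration and cover only a tiny fraction of real-world forms.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# One hand-written pattern per category (illustrative, far from complete).
PATTERNS = {
    "DATE": re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
    "QUANTITY": re.compile(r"\b\d+(?:\.\d+)?\s?(?:GB|MB|kg|km)\b"),
}


def find_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    found.sort()
    return [(label, value) for _, label, value in found]


example = "The 3.2GB archive was released on 14 September 2006 at 09:30."
print(find_entities(example))
```

Rules like these are precise but brittle, which is why template-driven and machine learning methods are listed as alternatives.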
SemText

Document Similarity
• One of the most common models of document similarity is the Vector Space model.
• Each document is represented in a multi-dimensional space where each term acts as a dimension.
• The distance along each dimension is given by the contribution or strength of that term in the given document.
• Similarity can be calculated using the cosine of the angle between the two vectors.

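A minimal sketch of the vector space model, assuming raw term counts as the "strength" of each term (real systems usually apply a weighting scheme such as tf-idf) and whitespace tokenisation:

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as term -> count, one dimension per term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc1 = term_vector("text mining extracts information")
doc2 = term_vector("text mining discovers information")
doc3 = term_vector("parliament debates foreign policy")
print(cosine_similarity(doc1, doc2))  # three of four terms shared
print(cosine_similarity(doc1, doc3))  # no shared terms
```

The score ranges from 0 for documents with no terms in common to 1 for documents whose vectors point in the same direction, regardless of document length.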
Dimensionality Reduction
• Due to computational limitations it is impractical to search the entire document space.
• Where possible we can reduce the space by mapping all synonyms of a term to a single label.
• For larger-scale reduction we can use Latent Semantic Indexing, which merges terms that regularly co-occur, with remarkably good results.
• Benefits of this include noise reduction and removal of redundant terms.
• Drawbacks include the expensive matrix operations needed to generate the mapping rules.

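The simpler of the two reductions, mapping synonyms to a single label, can be sketched as below. The synonym table is hypothetical. Latent Semantic Indexing is not shown here, since it needs a matrix decomposition over the whole collection.

```python
from collections import Counter

# Hypothetical synonym table: every variant maps to one canonical label.
SYNONYMS = {
    "tumour": "tumor",
    "neoplasm": "tumor",
    "automobile": "car",
}


def reduce_vector(vector):
    """Merge the weights of synonymous terms into a single dimension."""
    reduced = Counter()
    for term, weight in vector.items():
        reduced[SYNONYMS.get(term, term)] += weight
    return reduced


original = Counter({"tumour": 2, "neoplasm": 1, "tumor": 1, "growth": 3})
print(reduce_vector(original))  # four dimensions become two
```

After the mapping, two documents that use different words for the same concept share a dimension and so score as more similar.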
Online or Offline Processing
• Many of the techniques introduced so far can be processed at any time, not just at run time.
• This allows us to handle the bulk of processing well in advance of our services becoming available.
• For larger document collections the scale of this processing makes it impractical for a single machine.
• We are currently preparing our tools for use on national computing resources, i.e. HPC and Grid.

Associative Search
• Relies upon vector space similarity to identify a set of documents whose content is related to a target collection.
• Single document targets are treated like a normal query.
• Multiple document targets involve extra effort to identify the related set of terms that best represents the collection.
• This process not only identifies similar documents but may also recognise previously unknown yet related areas.

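One simple way to represent a multi-document target is to average its term vectors into a centroid and then rank the collection against that centroid. This is an assumption made for the sketch; the slide does not say how the representative set of terms is chosen. The documents are invented examples.

```python
import math
from collections import Counter


def term_vector(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average several term vectors into one representative vector."""
    total = Counter()
    for vector in vectors:
        total.update(vector)  # adds counts term by term
    return {term: weight / len(vectors) for term, weight in total.items()}


def associative_search(targets, collection, top_n=3):
    """Rank collection documents by similarity to the target set."""
    query = centroid([term_vector(text) for text in targets])
    scored = [(cosine(query, term_vector(text)), doc_id)
              for doc_id, text in collection.items()]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:top_n] if score > 0]


targets = ["gene expression in yeast", "protein expression in yeast cells"]
collection = {
    "d1": "yeast gene expression profiles",
    "d2": "parliament debates foreign policy",
    "d3": "protein folding in cells",
}
print(associative_search(targets, collection))
```

Document d3 is returned even though it shares few words with either target, which hints at how associative search can surface related areas.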
Document Similarity

Information Extraction
• IE brings together term discovery, pattern matching and named entity recognition to identify and extract facts.
• We define the form of the information we are interested in as fact templates.
• Each template has attribute slots that can be filled by named entities or other facts.
• Example: Person_X is a programme manager for Programme_Y with JISC.

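A fact template with attribute slots can be sketched as a small data class plus a pattern that fills it. The class name, slot names and regular expression are assumptions made for the example. In a full system the slots would be filled from named entity recognition output rather than a single regular expression.

```python
import re
from dataclasses import dataclass


@dataclass
class ProgrammeManagerFact:
    """Fact template: each attribute is a slot to be filled."""
    person: str
    programme: str
    organisation: str


# Surface pattern for the slide's example sentence (illustrative only).
TEMPLATE = re.compile(
    r"(?P<person>[A-Z]\w*(?: [A-Z]\w*)*) is a programme manager "
    r"for (?P<programme>.+?) with (?P<organisation>[A-Z]\w*)"
)


def extract_facts(text):
    """Return one filled template per match in the text."""
    return [ProgrammeManagerFact(**match.groupdict())
            for match in TEMPLATE.finditer(text)]


print(extract_facts("Person_X is a programme manager for Programme_Y with JISC."))
```

Because the result is structured, extracted facts can be stored, queried and used to fill slots in other templates.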
InfoPubMed

Document Summarisation
• Two main methods of manual summarisation:
  • Abstractive: relies upon an understanding of the content to rewrite a new version in a shorter form
  • Extractive: draws upon key sections to form a readable shorter form
• Preserve the important informative content
• Reduce redundancy through knowledge of terms and synonyms
• It is much harder across multiple documents
• Potentially important to link back to key evidence

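The extractive approach can be sketched by scoring each sentence on how frequent its words are in the whole document and keeping the top scorers in their original order. The scoring scheme, stopword list and example text are assumptions for illustration, not the method used in any project described here.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "from", "in", "is", "of", "the", "to", "was"}


def content_words(text):
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]


def summarise(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


text = ("Text mining extracts facts from text. The weather was pleasant. "
        "Text mining tools summarise text.")
print(summarise(text))
```

Every sentence in the summary is copied verbatim from the source, so each can be linked back to its evidence. An abstractive summary would instead have to generate new text.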
Case Study: ASSERT
Automatic Summarisation for Systematic Reviews using Text Mining
[Diagram labels: Document Collections; review stages Search, Screen, Synthesize; components Query Expansion, Term Extraction, Document Clustering, Document Classification, Document Sectioning, Sentence Extraction, Multi-Document Summarisation]

Case Study: BBC News Feeds
• Analyse, structure and visualise BBC news online according to a user’s query, using advanced text mining techniques
• Concept discovery and retrieval
  • The interface allows a user to enter a query across the document collection and automatically calculates a list of concepts specific to the query, ranked by perceived importance.
• Creation of user-oriented knowledge maps
  • Based on clusters of articles and their automatic concept categorisation.

Future Developments
• Ongoing development of key text mining services to support the UK academic community
• Further application of HPC and Grid technology
  • processing for document enhancement
  • handling data and processing for intermediary results
  • responsive and efficient service implementations
• Transformation of components to web services and integration with workflow solutions
• Investigation into interoperability issues between components and intermediary formats

Conclusions
• NaCTeM has made strong progress in
  • Provision of core text mining services and support
  • Leveraging strengths in BioSciences out to social sciences, arts and humanities
• Text Mining is integral to UK infrastructure for eResearch, but requires closer integration into existing research methodology and practice
• Links with infrastructure are essential to support scalable solutions for future challenges
• Interoperability between tools and formats is necessary for true flexibility between text mining components
• IPR issues and policy require further investigation

How to contact us
Visit the Text Mining Centre Website at http://www.nactem.ac.uk
brian.rea@manchester.ac.uk
sophia.ananiadou@manchester.ac.uk