Medical text data: are usually short and encapsulates complex concepts.

Adaptive Query Expanded Latent Semantic Analysis for Short Text Analytics Abhijit K Nag, Mohammed Yeasin CVPIA Lab, Department of Electrical and Computer Engineering University of Memphis Observations As the document size increases performance of different tools to analyze the dataset varies. For instance for short to medium size text LSA performs better, and for larger text document tree-based model are preferred. Semantic meaning Part-of-Speech tagging (tree-based models) Introduction Bag-of-word model with semantic meaning Query-Expanded Latent Semantic Analysis (QE-LSA) • Medical text data: • are usually short and encapsulates complex concepts. • often lack of background information • Social media texts • are similar in nature • filled with abbreviations and incomplete sentences. • Therefore, there is a need for an efficient and robust data driven retrieval method to extract meaningful and relevant information. System Model Query Expansion Document size (with complete sentences) Bag-of-word model Ex: Latent Semantic Analysis (LSA) Text Document Sample query expansion when k= 3 for ImageClef Dataset. Query 1 2 Text Corpus Query-Expansion 3 Our proposed method outperforms LSA in capturing some semantic meaning from the text. The latter makes QE-LSA more suitable for a wider range of document sizes, especially when the documents are very short and therefore sparse. 4 Challenges Retrieval System 5 • Extraction performance depends on the quality of the dictionary used. • Difficult to create a suitable dictionary that fits the given dataset. • Detection of relevant words for a given query to perform query expansion when domain is specific. • Identification of a method that focuses on the semantic meaning. 6 7 Conclusion Extracted Relevant Documents • Adaptive QE-LSA • outperforms conventional LSA with single key-word query structure. • performs better result where there is limited contextual knowledge. • is completely a data-driven approach that generates dynamic dictionary for a given dataset. • is a robust and its performance can be controlled through proper tuning of parameters. Dataset Features Results Methods Comparison between model with and without query expansion -Image Clef 2010 dataset is used[1] which contains 77k images and their corresponding descriptive text document. -For short text, documents with words between 6-50 words are selected[2] for simulation. • Creation of Text Corpus: • Generate customized list for stop words. • Generate dictionary words that satisfies specific criteria, Ex: • choose words with length 3-10. • Not containing only numbers, articles. • Convert the dataset to a tf-idf (term frequency- inverse document frequency) matrix. Rows represents terms and column represents the documents of dataset. • Apply sparse SVD(singular value decomposition) to create encoding matrix, U. • Query Expansion: • M = UUT,, that captures the relationship among words in semanticspace. • The diagonal entries of the M represent self-similarity and the off-diagonal entries reflect similar words based on their co-occurrences in the dataset. • The highest k entries from the row of M containing the query word are retrieved. • Extraction: • use k-word query to extract from the given dataset using LSA[3] and return documents: • Top 10 or 25 documents • All documents which cosine similarity is above a fixed threshold. References • http://www.imageclef.org/2010/medical • Wang, J.; Zhou, Y.; Li, L.; Hu, B. & Hu, X.Wang, H.; Shen, Y.; Huang, T. & Zeng, Z. (ed.) Improving Short Text Clustering Performance with Keyword Expansion, The Sixth International Symposium on Neural Networks (ISNN 2009), Springer Berlin / Heidelberg, 2009, 56, 291-298. • Landauer, T. K., Foltz, P. W., & Laham, D. Introduction to latent semantic analyses. Discourse Processes, 1998. Cosine similarity measure with different corpus Acknowledgement This work was supported by the Electrical and Computer Engineering Department at the University of Memphis, as well as by grant NSF-IIS-0746790. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding institution.

Medical text data: are usually short and encapsulates complex concepts.

Medical text data: are usually short and encapsulates complex concepts.

Presentation Transcript

Data Mining: Concepts and Techniques

Basic concepts in nursing science

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Medical & Surgical Asepsis

Complex PTSD

Chapter 6: Relational Data Modeling -- CONCEPTS, PRINCIPLES & APPROACHES--

Unit 5 Concepts 4 & 5 Test Re-take

4.RL.1

300+ Frequently Used Templates

Anchor Text : Dangerous Crossing by Stephen Krensky

PM Master Data 2

Beyond Text

Regression, correlation and liquid association in complex genomic data analysis

Mining Complex Types of Data

Introduction to Database Concepts and Access

Medical Instrumentation Application and Design, 4th Edition

Complex Variables

Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 10 —

Temple University – CIS Dept. CIS616– Principles of Data Management

Introduction to Database Concepts and Access

PPP (Point to Point Protocol)

Standard 10: The Thinking Standard - Stretching All Readers with Complex Text

Medical text data: are usually short and encapsulates complex concepts.