1 / 1

Medical text data: are usually short and encapsulates complex concepts.

Adaptive Query Expanded Latent Semantic Analysis for Short Text Analytics Abhijit K Nag, Mohammed Yeasin CVPIA Lab, Department of Electrical and Computer Engineering University of Memphis. Observations.

irish
Télécharger la présentation

Medical text data: are usually short and encapsulates complex concepts.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Adaptive Query Expanded Latent Semantic Analysis for Short Text Analytics Abhijit K Nag, Mohammed Yeasin CVPIA Lab, Department of Electrical and Computer Engineering University of Memphis Observations As the document size increases performance of different tools to analyze the dataset varies. For instance for short to medium size text LSA performs better, and for larger text document tree-based model are preferred. Semantic meaning Part-of-Speech tagging (tree-based models) Introduction Bag-of-word model with semantic meaning Query-Expanded Latent Semantic Analysis (QE-LSA) • Medical text data: • are usually short and encapsulates complex concepts. • often lack of background information • Social media texts • are similar in nature • filled with abbreviations and incomplete sentences. • Therefore, there is a need for an efficient and robust data driven retrieval method to extract meaningful and relevant information. System Model Query Expansion Document size (with complete sentences) Bag-of-word model Ex: Latent Semantic Analysis (LSA) Text Document Sample query expansion when k= 3 for ImageClef Dataset. Query 1 2 Text Corpus Query-Expansion 3 Our proposed method outperforms LSA in capturing some semantic meaning from the text. The latter makes QE-LSA more suitable for a wider range of document sizes, especially when the documents are very short and therefore sparse. 4 Challenges Retrieval System 5 • Extraction performance depends on the quality of the dictionary used. • Difficult to create a suitable dictionary that fits the given dataset. • Detection of relevant words for a given query to perform query expansion when domain is specific. • Identification of a method that focuses on the semantic meaning. 6 7 Conclusion Extracted Relevant Documents • Adaptive QE-LSA • outperforms conventional LSA with single key-word query structure. • performs better result where there is limited contextual knowledge. • is completely a data-driven approach that generates dynamic dictionary for a given dataset. • is a robust and its performance can be controlled through proper tuning of parameters. Dataset Features Results Methods Comparison between model with and without query expansion -Image Clef 2010 dataset is used[1] which contains 77k images and their corresponding descriptive text document. -For short text, documents with words between 6-50 words are selected[2] for simulation. • Creation of Text Corpus: • Generate customized list for stop words. • Generate dictionary words that satisfies specific criteria, Ex: • choose words with length 3-10. • Not containing only numbers, articles. • Convert the dataset to a tf-idf (term frequency- inverse document frequency) matrix. Rows represents terms and column represents the documents of dataset. • Apply sparse SVD(singular value decomposition) to create encoding matrix, U. • Query Expansion: • M = UUT,, that captures the relationship among words in semanticspace. • The diagonal entries of the M represent self-similarity and the off-diagonal entries reflect similar words based on their co-occurrences in the dataset. • The highest k entries from the row of M containing the query word are retrieved. • Extraction: • use k-word query to extract from the given dataset using LSA[3] and return documents: • Top 10 or 25 documents • All documents which cosine similarity is above a fixed threshold. References • http://www.imageclef.org/2010/medical • Wang, J.; Zhou, Y.; Li, L.; Hu, B. & Hu, X.Wang, H.; Shen, Y.; Huang, T. & Zeng, Z. (ed.) Improving Short Text Clustering Performance with Keyword Expansion, The Sixth International Symposium on Neural Networks (ISNN 2009), Springer Berlin / Heidelberg, 2009, 56, 291-298. • Landauer, T. K., Foltz, P. W., & Laham, D. Introduction to latent semantic analyses. Discourse Processes, 1998. Cosine similarity measure with different corpus Acknowledgement This work was supported by the Electrical and Computer Engineering Department at the University of Memphis, as well as by grant NSF-IIS-0746790. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding institution.

More Related