This study presents an augmented index that uses neural networks to improve genomic information retrieval. The indexing approach combines traditional TF-IDF weighting with keyword extraction performed by a neural network. Retrieval experiments on a test collection of TREC queries show a statistically significant increase in Mean Average Precision (MAP) over the baseline index. These findings offer insight into improving retrieval in genomic research and related fields.
Building an Augmented Index for Genomic Information Retrieval
Hohyon Ryu, Xiangming Mu, Kun Lu
University of Wisconsin-Milwaukee, School of Information Science
Information Intelligence & Architecture Research Lab
Augmented Index using Neural Networks
Baseline path: Document Collection -> TF, IDF -> Baseline Index
Augmented path: Document Collection -> word features (TF, IDF; Part of speech; Word Location: FO, LO, WD) -> Neural Networks -> Keywords and Keyphrases -> Augmented Index
Training: Author-assigned Keyphrase (example: A: information, B: science)
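The word-level features named on this slide can be illustrated with a minimal Python sketch. The slides do not define the features precisely, so the definitions here are assumptions: TF is the length-normalized term count, IDF is log(N/df), and first and last occurrence (FO, LO) are token positions divided by document length. The function name `word_features` is hypothetical, and part of speech and word distribution (WD) are left out because the slides give no definition to work from.

```python
import math
from collections import Counter


def word_features(doc_tokens, collection):
    """Return {word: (tfidf, first_occurrence, last_occurrence)} for one document.

    doc_tokens: list of tokens for the document.
    collection: list of token lists; assumed to include doc_tokens itself,
                so every word has a document frequency of at least 1.
    """
    n_docs = len(collection)

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))

    tf = Counter(doc_tokens)
    length = len(doc_tokens)
    features = {}
    for word, count in tf.items():
        idf = math.log(n_docs / df[word])
        # Relative position of the first and last occurrence (assumed definition).
        first = doc_tokens.index(word) / length
        last = (length - 1 - doc_tokens[::-1].index(word)) / length
        features[word] = (count / length * idf, first, last)
    return features


# Toy collection, invented for illustration only.
collection = [["cftr", "degradation", "cftr"], ["gene", "mutation"]]
features = word_features(collection[0], collection)
```

Each tuple in `features` could then serve as part of the input vector for one candidate word.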
Keyword Extraction: Learning (Training)
Inputs (various features of each word): TF*IDF, Part of Speech, First Occurrence, Last Occurrence, Word Distribution
Hidden Layer
Target output: Author Assigned Keyword? (1 or 0)
Keyword Extraction: Learning (Training)
Inputs (various features of each word): TF*IDF, Part of Speech, First Occurrence, Last Occurrence, Word Distribution
Hidden Layer
Output: Keyword Suitability Score (0 ≤ x ≤ 1)
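The two slides above describe a network with one hidden layer that is trained on 0/1 author-assigned labels and then produces a suitability score between 0 and 1. A minimal sketch of that idea in plain Python follows. It is not the authors' implementation: the class name `TinyMLP`, the sigmoid activations, the cross-entropy gradient, the layer size, the learning rate, and the two-feature toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, sigmoid units, a single sigmoid output in [0, 1]."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One weight row per hidden unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def score(self, x):
        """Keyword suitability score for one feature vector."""
        return self._forward(list(x))[1]

    def train(self, samples, labels, epochs=2000, lr=0.5):
        """Stochastic gradient descent on the cross-entropy loss."""
        for _ in range(epochs):
            for x, y in zip(samples, labels):
                x = list(x)
                hidden, out = self._forward(x)
                delta_out = out - y  # gradient at the output pre-activation
                # Hidden deltas are computed before the output weights change.
                delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden)]
                for j, h in enumerate(hidden + [1.0]):
                    self.w_out[j] -= lr * delta_out * h
                for j, row in enumerate(self.w_hidden):
                    for i, v in enumerate(x + [1.0]):
                        row[i] -= lr * delta_hidden[j] * v


# Toy feature vectors (invented): label 1 = author-assigned keyword, 0 = not.
samples = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.7)]
labels = [1, 1, 0, 0]
net = TinyMLP(n_inputs=2, n_hidden=3)
net.train(samples, labels)
keyword_score = net.score((0.9, 0.1))
non_keyword_score = net.score((0.1, 0.9))
```

After training, `score` plays the role of the suitability score on the slide: words whose features resemble those of author-assigned keywords should score closer to 1.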
Text Retrieval Experiment
Queries: 26 TREC Queries
Search engine: Indri (based on Lemur and Language Modeling)
Comparison: Baseline Index vs. Augmented Index, each evaluated by Mean Average Precision (MAP)
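For readers unfamiliar with the metric, here is a short Python sketch of how average precision (AP) and MAP are commonly computed. The function names, the optional `cutoff` parameter, and the toy rankings are illustrative assumptions; the slides do not say which evaluation tool produced the reported figures.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=None):
    """AP for one query: mean of precision at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels, cutoff=None):
    """MAP: the mean of AP over all queries that have relevance judgments.

    runs:  {query_id: ranked list of document ids}
    qrels: {query_id: set of relevant document ids}
    """
    aps = [average_precision(runs.get(q, []), rel, cutoff)
           for q, rel in qrels.items()]
    return sum(aps) / len(aps)


# Toy data, invented for illustration only.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}
map_score = mean_average_precision(runs, qrels)
```

Running the same function on the baseline ranking and the augmented ranking for each query gives the two MAP values being compared.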
Results
+4.54% (df=25, p<0.01)
+3.12% (df=25, p<0.05)
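The reported df=25 is consistent with a paired comparison over the 26 queries (df = n - 1), although the slides do not name the test. Assuming a paired t-test on per-query AP values, the statistic could be computed as in this sketch. The function name `paired_t` and the input scores are hypothetical and are not the study's data.

```python
import math
import statistics


def paired_t(baseline_ap, augmented_ap):
    """Paired t statistic and degrees of freedom for per-query AP scores.

    Assumes the per-query differences are not all identical; otherwise the
    standard deviation is zero and the statistic is undefined.
    """
    if len(baseline_ap) != len(augmented_ap):
        raise ValueError("both runs must cover the same queries")
    diffs = [a - b for a, b in zip(augmented_ap, baseline_ap)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical AP scores for 26 queries, invented for illustration only.
baseline = [0.0] * 26
augmented = [i / 100 for i in range(26)]
t_stat, dof = paired_t(baseline, augmented)
```

The resulting statistic would then be compared against the t distribution with `dof` degrees of freedom to obtain a p-value.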
AP Difference by Topic (for top 50 returned documents)
Topic 176
Panels: Retrieval by the augmented Index; Baseline Retrieval
MAP=0.25, MAP=0.32
Ranks shown: 1 to 20
Term counts per document, in slide order:
12426234: CFTR 94, cystic 59, degradation 3, fibrosis 60, Sec61 0
16166089: CFTR 191, cystic 21, degradation 9, fibrosis 21, Sec61 20
Neural network processing
16166089: CFTR 182, cystic 6, degradation 9, fibrosis 6, Sec61 14
12426234: CFTR 106, cystic 71, degradation 3, fibrosis 72, Sec61 0
Legend: Irrelevant Document, Relevant Document
Topic 170
Panels: Retrieval by the augmented Index; Baseline Retrieval
MAP=0.87, MAP=1
Ranks shown: 1 to 40
Term counts per document, in slide order:
11799116: CFTR 247, endoplasm 15, reticulum 16
11799116: CFTR 238, endoplasm 3, reticulum 4
Neural network processing
15459206: CFTR 12, endoplasm 5, reticulum 5
15459206: CFTR 12, endoplasm 5, reticulum 5
Legend: Irrelevant Document, Relevant Document
Topic 183
Panels: Retrieval by the augmented Index; Baseline Retrieval
MAP=0.59, MAP=0.56
Ranks shown: 10 to 44
Term counts per document, in slide order:
16106028: NM23 14, development 4, gene 14, mutation 0, tracheal 0
10952986: NM23 91, development 1, gene 1, mutation 2, tracheal 0
Neural network processing
14960567: NM23 174, development 9, gene 9, mutation 45, tracheal 0
16106028: NM23 2, development 4, gene 2, mutation 0, tracheal 0
10952986: NM23 109, development 1, gene 1, mutation 2, tracheal 0
14960567: NM23 159, development 9, gene 9, mutation 21, tracheal 0
Legend: Irrelevant Document, Relevant Document