Hugh E. Williams and Justin Zobel IEEE Transactions on knowledge and data engineering

Indexing and Retrieval for Genomic Databases
Hugh E. Williams and Justin Zobel
IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, January/February 2002
Presented by Jitimon Keinduangjun

Agenda
1. Introduction
2. What are the problems?
3. What are other people doing?
4. Indexed Genomic Retrieval with CAFÉ
5. Experimental Results
6. Conclusion

1. Introduction
• Biological sequence databases contain many sequences of both DNA and protein.
• DNA (deoxyribonucleic acid) is the primary genetic material in all living organisms.
• A DNA molecule is composed of two complementary nucleotide strands connected by base pairs. Each base pairs with only one other base: adenine (A) pairs with thymine (T), and guanine (G) pairs with cytosine (C).
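The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the names `COMPLEMENT`, `complement`, and `reverse_complement` are hypothetical, and mapping the unknown-base letter N to itself is an assumption.

```python
# Base-pairing rule: A pairs with T, G pairs with C.
# Mapping N (unknown base) to itself is an assumption made for this sketch.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]
```

For example, `complement("ATCG")` gives the paired bases of the opposite strand, and `reverse_complement` reads that strand in its own direction.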

1. Introduction (1)
• A DNA sequence is written with
• 4 letters: A, G, C, T
• 1 extra letter: N for unknown bases
• DNA sequence database
>gi|1786692|gb|AE000155|ECAE000155 Escherichia coli, tesA, ybbA genes from bases 510705 to 522297 (section 45 of 400) of the complete genome
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAAT
AGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTTCTACGA
AATAGACTAGAAATAGTCTAGTCTACG
>gb|L02373|ECORHSCA Escherichia coli Rhs core genes, complete cds
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAAT
AGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAAATAGACTAGAAATAGTCTAGT
CTACGAAATAGACTAGAAATAGCCTAGTTCTGTT
: :
• The character '>' starts each record and introduces the line that describes the sequence
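A database in this layout can be read with a small parser. The sketch below assumes that every record starts with a '>' description line and that all following lines up to the next '>' belong to the same sequence; `parse_fasta` is a hypothetical helper name, not part of any tool mentioned in these slides.

```python
def parse_fasta(text):
    """Split FASTA-style text into (description, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated and upper-cased.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Each returned pair holds the description line (without '>') and the full sequence with line breaks removed.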

2. What are the problems?
2.1 Databases and query sequences contain low-quality sequences, so every technique must tolerate errors to return accurate query results
2.2 All techniques require long computation times

2.1 Low quality DNA sequences
• Substitutions, insertions, deletions
• Exact matching alone is not effective
• Similarity search is required
• Algorithms look for segment pairs whose alignment scores can be improved by allowing insertions and deletions
Query:      3 LTRYCA--GFTSLLKCNDADTIYDG 28
              ||||||  ||  |||||||||  ||
Subject: 3325 LTRYCAPAGFXALLKCNDADT--DG 3350

2.2 Long computation time required
• Databases are very large and heterogeneous
• A database contains many different sequences of variable lengths, so database search requires local similarity

3. What are other people doing?
3.1 SSEARCH Algorithm
• Uses Dynamic Programming (DP)
• Very slow, very sensitive
3.2 BLAST Algorithm
• BLAST 1.4 (old version): ungapped alignment
• Fast, sensitive
• BLAST 2.0 (new version): gapped alignment
• Very fast, less sensitive
3.3 FASTA Algorithm
• Uses DP-based techniques: gapped alignment
• Slow, more sensitive

3.1 SSEARCH Algorithm: Edit distance and Dynamic Programming
• Assume that the two given sequences are A and B
• n and m are the lengths of sequence A and sequence B, respectively
• s(a_i, b_j): similarity score between the aligned symbols a_i and b_j
• Identical aligned pairs score 1 and non-identical pairs score 0
• Distance matrix D: $D_{i,0} = D_{0,j} = 0$ for i = 0,1,…,n and j = 0,1,…,m
• Recurrence: $D_{i,j} = \max\{\, D_{i-1,j},\; D_{i,j-1},\; D_{i-1,j-1} + s(a_i, b_j) \,\}$
• Time complexity is O(n*m)

3.1 SSEARCH Algorithm (1)
• Example: Pairwise alignment via DP
• Sequence a : ACGACA
• Sequence b : AGCAC
• Each cell d_{i,j} is computed from its neighbours d_{i-1,j-1}, d_{i-1,j}, and d_{i,j-1} (match, insert, and delete moves)
DP matrix (columns: sequence a, rows: sequence b)
      -  A  C  G  A  C  A
  -   0  0  0  0  0  0  0
  A   0  1  1  1  1  1  1
  G   0  1  1  2  2  2  2
  C   0  1  2  2  2  3  3
  A   0  1  2  2  3  3  4
  C   0  1  2  2  3  4  4
Three possible alignments
(1) a: ACGACA-
    b: A-G-CAC
(2) a: ACG-ACA
    b: A-GCAC-
(3) a: A-CGACA
    b: AGC-AC-
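The matrix for this example can be reproduced with a short Python sketch of the recurrence from the previous slide (score 1 for identical pairs, 0 otherwise, zero boundaries). The function names `similarity_table` and `traceback` are hypothetical. Here `D[i][j]` is indexed by position i in sequence a and j in sequence b, so it is the transpose of the matrix as drawn on the slide. The traceback returns just one optimal alignment; several alignments with the same score can exist.

```python
def similarity_table(a, b):
    """Fill D using D[i][j] = max(D[i-1][j], D[i][j-1], D[i-1][j-1] + s)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0
            D[i][j] = max(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1] + s)
    return D


def traceback(a, b, D):
    """Recover one optimal alignment by walking back from D[n][m]."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and D[i][j] == D[i - 1][j - 1] + 1:
            # Diagonal move: the two symbols are aligned and identical.
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j]:
            # Symbol of a aligned with a gap in b.
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            # Symbol of b aligned with a gap in a.
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))
```

Calling `similarity_table("ACGACA", "AGCAC")` fills the whole table in O(n*m) steps, and the bottom-right cell holds the best score.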

3.2 BLAST Algorithm for DNA
• Sequence A: length N; Sequence B: length M
• Generate the keyword tree (the list of words), W = 12
• Scan for exact matches (hits)
• Extend each hit
• Similarity scores for DNA: Match = 5, Mismatch = -4 (WU-BLAST); Match = 1, Mismatch = -3 (NCBI)
• Note: Extension consumes > 90% of all processing time.
[Figure: keyword tree built from the words of sequence A]
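The hit-and-extend idea can be sketched in Python as follows. This is a simplified illustration, not the actual BLAST implementation: a dictionary stands in for the keyword tree, extension is ungapped only, and the stopping threshold `x_drop` and the function names `find_hits` and `extend_hit` are assumptions made for the sketch. The default scores follow the NCBI values quoted on the slide.

```python
from collections import defaultdict


def find_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    words = defaultdict(list)  # word -> start positions in the query
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, w, match=1, mismatch=-3, x_drop=5):
    """Ungapped extension of a hit; returns (score, q_start, s_start, length)."""

    def walk(qi, sj, step):
        # Extend in one direction and remember the best-scoring prefix.
        best = score = 0
        best_len = n = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + w, j + w, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    total = w * match + left_score + right_score
    return total, i - left_len, j - left_len, left_len + w + right_len
```

Scanning is a cheap lookup per subject position, while `extend_hit` compares symbols one by one around every hit, which is consistent with extension dominating the running time.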

3.3 FASTA Algorithm for DNA
• Sequence A: length N; Sequence B: length M
• Generate the keyword tree (the list of words), W = 12
• Scan for exact matches (hits)
• Align the subsequences around the hits
[Figure: keyword tree built from the words of sequence A]

4. Indexed Genomic Retrieval with CAFÉ
4.1 Indexing with CAFÉ
4.2 Coarse Searching with CAFÉ (Filtering)
4.3 Fine Searching with CAFÉ, using the same method as FASTA

4.1 Indexing with CAFÉ
• An inverted index consists of two components:
• A search structure
• Posting lists
• Example of an inverted index entry: ACCC  12,(3:144,154,962), 38,(2:47,1045)
• The pattern ACCC occurs
• 3 times in the 12th sequence, at offsets 144, 154, and 962
• 2 times in the 38th sequence, at offsets 47 and 1045
• The indexes are compressed to reduce space; the compression scheme is described in detail elsewhere.
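A minimal Python sketch of such an index is shown below. It assumes that the indexed patterns are overlapping substrings of a fixed length n, that sequences are numbered from 1, and that offsets are counted from 0; these choices, and the names `build_inverted_index` and `format_postings`, are illustrative and not taken from the CAFÉ implementation. Compression is not shown.

```python
from collections import defaultdict


def build_inverted_index(sequences, n):
    """Map each length-n pattern to {sequence number: [offsets]}."""
    index = defaultdict(dict)  # the dictionary plays the role of the search structure
    for seq_id, seq in enumerate(sequences, start=1):
        for offset in range(len(seq) - n + 1):
            pattern = seq[offset:offset + n]
            index[pattern].setdefault(seq_id, []).append(offset)
    return index


def format_postings(postings):
    """Render a posting list in the 'seq,(count:offsets)' style of the slide."""
    parts = []
    for seq_id, offsets in sorted(postings.items()):
        parts.append(f"{seq_id},({len(offsets)}:{','.join(map(str, offsets))})")
    return ",".join(parts)
```

For two toy sequences `["ACCCACCC", "TTACCC"]` and n = 4, `format_postings(index["ACCC"])` renders the posting list in the same shape as the example entry above.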

4.2 Coarse Searching with CAFÉ
• A novel ranking technique using the index structure
• Score for ranking: COMBINED = COVERAGE - k*(LENGTH - COVERAGE)
• Examples from the figure:
• COVERAGE = 9, LENGTH = 9
• COVERAGE = 21, LENGTH = 55
• COVERAGE = 6, LENGTH = 55
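The ranking score can be evaluated directly for the three examples. The slide does not give a value for k, so k = 0.5 below is purely an illustrative assumption, and `combined_score` and `rank` are hypothetical helper names.

```python
def combined_score(coverage, length, k):
    """COMBINED = COVERAGE - k * (LENGTH - COVERAGE)."""
    return coverage - k * (length - coverage)


def rank(candidates, k):
    """Sort (coverage, length) pairs by COMBINED score, best first."""
    return sorted(candidates, key=lambda c: combined_score(c[0], c[1], k), reverse=True)


# The three (COVERAGE, LENGTH) examples from the slide; k = 0.5 is an assumed value.
examples = [(21, 55), (9, 9), (6, 55)]
ranking = rank(examples, k=0.5)
```

With this assumed k, the short but fully covered region (9, 9) scores 9, while (21, 55) scores 4 and (6, 55) scores -18.5, so the penalty term favours regions whose matches are dense rather than spread over a long span.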

Homologous -chain hemoglobin Example: Ranking by CAFÉ Human - Chimpanzee Human - Rat Human - Potato

5. Experimental Results
5.1 Test Data
5.2 Space
5.3 Retrieval Effectiveness
5.4 Speed

5.1 Test Data
• PIR database for assessing the accuracy of the search system.
• GenBank database for assessing speed and index space requirements.

5.2 Space
• Uncompressed index size: ~9.7 times the collection size
• Compressed index size (CAFÉ index): ~2.2 times the collection size
• Retrieving uncompressed nucleotide data reduces the speed of the CAFÉ system

5.3 Retrieval Effectiveness

5.4 Speed

6. Conclusion
• The CAFÉ system affords much faster query evaluation than exhaustive searching.
• It gives better accuracy than the most widely used search tool, BLAST 2.
• CAFÉ indexes are smaller than the annotated source databases and the indexes of previous indexed systems.

The End
