Vertical Data Organization in Bioinformatics Systems

Vertical Data in Bioinformatics Systems
William Perrizo, Comp. Sci., NDSU, wiliam.perrizo@ndsu.nodak.edu
• In bioinformatics systems, as in all data systems, horizontal record-based data organization is ubiquitous.
• In this presentation, a vertical data organization alternative is presented.
• For many bioinformatics computations, this vertical alternative holds promise for improved scalability:
  • across massive heterogeneous data sets,
  • for vast quantities of microarray experiment data,
  • and for data from other high-throughput techniques.

OUTLINE
1. Data mining and knowledge discovery are important in bioinformatics.
2. These techniques can be viewed as less structured database querying.
3. As such, one immediately runs up against 2 curses:
  • Curse of cardinality: solutions do not scale with respect to cardinality (the # of rows in the data table).
  • Curse of dimensionality: solutions do not scale with respect to high dimension (the # of columns in the data table).
4. In this talk, a vertical data structure alternative is considered as a solution to these problems.

Curses
1. The curse of cardinality is a well-known problem.
2. The curse of dimensionality, on the other hand, is not as well known, since in horizontal (relational) databases it is masked as the “curse of the slow join”:
  • DBs are decomposed into many narrow relations to achieve good design (e.g., 3rd normal form),
  • then many joins are required to get answers.
3. In the vertical world, both curses are very obvious.

Everyone is awash with data (cardinality and dimensionality)
1. Networks: hi-speed, DWDM, …: 10 terabytes by 2004 ~ 10^13 B
2. Remotely sensed imagery: 15 petabytes by 2007 ~ 10^16 B
3. Astronomical data, Digital Sky Surveys: 10 exabytes by 2010 ~ 10^19 B
4. Sensor data, nano-sensor, …: 10 zettabytes by 2015 ~ 10^22 B
5. WWW: 10 yottabytes by 2020 ~ 10^25 B
6. Genomic/Proteomic/… data: 10 gazillabytes by 2030 ~ 10^28 B?
7. Stock market data: 10 super-gazillabytes by 2040 ~ 10^31 B?
8. I had to make up the last 2 orders-of-magnitude. Cardinalities are overrunning our ability to name them!

Data mining process (figure)
1. Data Cleaning and Integration
2. Data Warehousing
3. Relevant analysis
4. Data Mining: OLAP, classification, clustering, rule mining, visualization
• Figure labels: Loop backs, Smart files.

Data Mining versus Querying
1. Querying
2. Search and Aggregate
3. Machine Learning/data mining
4. Data Prospecting
• Techniques along this spectrum: SQL SELECT FROM WHERE; complex queries (nested, EXISTS..); FUZZY query, BLAST search; OLAP rollup, drilldown, slice/dice; supervised learning (classify); unsupervised learning (cluster); Assoc. Rule Mining; Fractals, …
5. Data mining has barely scratched the surface. But an early scratcher (in SCM) is now the world’s largest corporation, while a non-scratcher is under bankruptcy protection (Walmart vs. KMart).
6. Our vertical data structures, Predicate-trees (or P-trees) [1],
  • are data-mining-ready, compressed data structures, intended to
  • address the curses of scalability (and dimensionality).
[1] Ptree Technology is patent pending by North Dakota State University.

Vertical structures, horizontal processing
• Horizontal processing of vertically structured data (instead of the ubiquitous vertical processing of horizontal (record) structures).
• Parallelization:
  • Parallel software analysis engines on compute clusters.
  • Parallel greyware analysis engines on clusters of people, via the internet and visual data mining.

Predicate trees (P-trees): vertical structures, ANDed horizontally
1. Current practice: horizontal records, scanned vertically.
R(A1 A2 A3 A4), in decimal and as 3-bit binary values R[A1] R[A2] R[A3] R[A4]:
  A1 A2 A3 A4     R[A1] R[A2] R[A3] R[A4]
  2  7  6  1      010   111   110   001
  3  7  6  0      011   111   110   000
  2  7  5  1      010   110   101   001
  2  7  5  7      010   111   101   111
  5  2  1  4      101   010   001   100
  2  2  1  5      010   010   001   101
  7  0  1  4      111   000   001   100
  7  0  1  4      111   000   001   100
2. Alternatively, vertical structures ANDed horizontally:
  • vertically project each attribute,
  • vertically project each bit position,
  • compress each bit slice into a basic P-tree (P11, P12, P13, …, P41, P42, P43).
Bit slices:
  R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
  0   1   0   1   1   1   1   1   0   0   0   1
  0   1   1   1   1   1   1   1   0   0   0   0
  0   1   0   1   1   0   1   0   1   0   0   1
  0   1   0   1   1   1   1   0   1   1   1   1
  1   0   1   0   1   0   0   0   1   1   0   0
  0   1   0   0   1   0   0   0   1   1   0   1
  1   1   1   0   0   0   0   0   1   1   0   0
  1   1   1   0   0   0   0   0   1   1   0   0
3. P-tree for R11 = 0 0 0 0 1 0 1 1: record the truth of the predicate “pure 1” in a tree, recursively on halves, until purity.
  a. Whole slice pure1? No: 0
  b. Left ½? No, but it is pure (pure0), so this branch ends: 0
  c. Right ½? No: 0
  d. Left ½ of right ½? No: 0
  e. Right ½ of right ½? Yes: 1
  f. Left ½ of left ½ of right ½? Yes: 1
  g. Right ½ of left ½ of right ½? No: 0

How are P-trees used?
1. E.g., to count the rows of R(A1 A2 A3 A4) with A1 = 7 = 111 (binary) and A3 = 1 = 001 (binary).
2. Complement the P-tree corresponding to each 0-bit, then AND horizontally.
3. P11 ^ P12 ^ P13 ^ P’31 ^ P’32 ^ P33 yields the predicate tree (figure).
4. The single pure1 node accounts for 2 rows, so the answer is 2.
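The counting step can be sketched with plain, uncompressed bit slices. This is only an illustration of the complement-then-AND logic, assuming Python integers as bitmaps; it is not the compressed P-tree structure itself, and the names `bit_slice` and `count_equal` are made up for this sketch.

```python
# Table R from the slide, written in its 3-bit binary form (A1, A2, A3, A4).
ROWS = [
    ("010", "111", "110", "001"),
    ("011", "111", "110", "000"),
    ("010", "110", "101", "001"),
    ("010", "111", "101", "111"),
    ("101", "010", "001", "100"),
    ("010", "010", "001", "101"),
    ("111", "000", "001", "100"),
    ("111", "000", "001", "100"),
]
N = len(ROWS)
FULL = (1 << N) - 1  # bitmap with one bit set per row


def bit_slice(attr, pos):
    """Vertical slice for attribute index attr, bit position pos (row r -> bit r)."""
    mask = 0
    for r, row in enumerate(ROWS):
        if row[attr][pos] == "1":
            mask |= 1 << r
    return mask


def count_equal(conditions):
    """Count rows matching {attribute index: 3-bit string} using ANDs only."""
    result = FULL
    for attr, bits in conditions.items():
        for pos, b in enumerate(bits):
            s = bit_slice(attr, pos)
            # Use the slice for a 1-bit, its complement for a 0-bit.
            result &= s if b == "1" else (~s & FULL)
    return bin(result).count("1")


# A1 = 7 (111) and A3 = 1 (001)
print(count_equal({0: "111", 2: "001"}))
```

No row is ever read as a record here: the answer comes from six slice-level ANDs followed by one bit count.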

Bottom-up construction
1. The previous top-down construction is good for illustration purposes, but the following bottom-up construction is much more efficient for implementations.
2. One scan of the bit slice and one in-order traversal of the tree, collapsing pure siblings along the way.
• Example (figure): R11 = 0 0 0 0 1 0 1 1 is collapsed step by step into P11.
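A minimal sketch of the bottom-up idea follows, assuming a slice whose length is a power of two. The node encoding (0 for pure0, 1 for pure1, a pair for a mixed node) and the function names are choices made for this sketch, not the presentation's storage format.

```python
def build_ptree(bits):
    """Build a 1-D P-tree bottom-up, merging sibling pairs level by level."""
    level = list(bits)  # leaves: each single bit is trivially pure
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            left, right = level[i], level[i + 1]
            if left == right and left in (0, 1):
                merged.append(left)           # two pure siblings collapse
            else:
                merged.append((left, right))  # mixed node keeps its children
        level = merged
    return level[0]


def root_count(node, size):
    """Number of 1-bits represented by a node covering `size` rows."""
    if node == 1:
        return size
    if node == 0:
        return 0
    left, right = node
    return root_count(left, size // 2) + root_count(right, size // 2)


R11 = [0, 0, 0, 0, 1, 0, 1, 1]
P11 = build_ptree(R11)
print(P11)                 # nested pairs; the left half is a single pure0 node
print(root_count(P11, 8))  # count of 1-bits in R11
```

The result for R11 has the same shape as the top-down tree: a pure0 left half, and a right half whose own right half is pure1.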

Two-Dimensional P-trees
1. E.g., a bit-band from a microarray experiment representing the predicate: log-ratio > threshold.
  1 1 1 1 1 1 0 0
  1 1 1 1 1 0 0 0
  1 1 1 1 1 1 0 0
  1 1 1 1 1 1 1 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  0 1 1 1 0 0 0 0
2. Top-down construction of the 2-D P-tree: record the truth of the predicate “pure 1” in a tree, recursively on quarters, until purity is achieved (tree figure).

Bottom-up construction of this 2-D P-tree? (figure: the same 8 x 8 bit-band and its tree)
• in-order tree traversal (collapsing pure siblings)
• Z-order (Peano-order) array traversal
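The two bullets can be combined in a short sketch: list the cells in Z-order, then collapse groups of four pure siblings level by level. The grid is the 8 x 8 bit-band from the slide; the bit-interleaving convention (upper-left, upper-right, lower-left, lower-right) and the tuple encoding of mixed nodes are assumptions of this sketch.

```python
GRID = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11110000",
    "11110000",
    "11110000",
    "01110000",
]


def z_order_bits(grid):
    """Return the cells in Z (Peano) order by de-interleaving the index bits."""
    n = len(grid)                # side length, assumed to be a power of two
    levels = n.bit_length() - 1
    cells = []
    for z in range(n * n):
        row = col = 0
        for k in range(levels):
            col |= ((z >> (2 * k)) & 1) << k      # even bits -> column
            row |= ((z >> (2 * k + 1)) & 1) << k  # odd bits  -> row
        cells.append(int(grid[row][col]))
    return cells


def build_quadtree(cells):
    """Bottom-up 2-D P-tree: merge groups of 4, collapsing pure quadrants."""
    level = list(cells)
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 4):
            quad = tuple(level[i:i + 4])
            if quad in ((0, 0, 0, 0), (1, 1, 1, 1)):
                merged.append(quad[0])  # pure quadrant becomes one node
            else:
                merged.append(quad)     # mixed quadrant keeps four children
        level = merged
    return level[0]


tree = build_quadtree(z_order_bits(GRID))
print(tree[0], tree[3])  # upper-left and lower-right quadrants
print(tree[2])           # lower-left quadrant, partly collapsed
```

Because Z-order keeps each quadrant contiguous in the array, one left-to-right pass per level is enough to collapse it.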

3-Dimensional P-trees: used in the Gene-Experiment-Organism Data Warehouse cube.

Logical Operations on P-trees
• Figure: Ptree 1, Ptree 2, the AND result and the OR result.
• P-tree AND is much faster than bit-by-bit AND, since any one pure0 operand node makes the corresponding result node pure0.
• The more operands, the greater the benefit due to this shortcut (this addresses the curse of dimensionality).
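The pure0 shortcut can be sketched on a simple nested-pair encoding (0 = pure0, 1 = pure1, a pair = mixed node). The encoding and the helper names are assumptions made for illustration; the slices are the A1 and A3 bit slices of the example table R.

```python
from functools import reduce


def build(bits):
    """Bottom-up 1-D P-tree over a slice whose length is a power of two."""
    level = list(bits)
    while len(level) > 1:
        level = [
            a if (a == b and a in (0, 1)) else (a, b)
            for a, b in zip(level[0::2], level[1::2])
        ]
    return level[0]


def ptree_not(node):
    """Complement: flip pure nodes, recurse into mixed ones."""
    if node in (0, 1):
        return 1 - node
    return (ptree_not(node[0]), ptree_not(node[1]))


def ptree_and(a, b):
    """AND of two P-trees; a pure0 operand ends the recursion immediately."""
    if a == 0 or b == 0:
        return 0           # shortcut: the other subtree is never visited
    if a == 1:
        return b
    if b == 1:
        return a
    left = ptree_and(a[0], b[0])
    right = ptree_and(a[1], b[1])
    return left if (left == right and left in (0, 1)) else (left, right)


def root_count(node, size):
    if node in (0, 1):
        return node * size
    return root_count(node[0], size // 2) + root_count(node[1], size // 2)


P11 = build([0, 0, 0, 0, 1, 0, 1, 1])
P12 = build([1, 1, 1, 1, 0, 1, 1, 1])
P13 = build([0, 1, 0, 0, 1, 0, 1, 1])
P31 = build([1, 1, 1, 1, 0, 0, 0, 0])
P32 = build([1, 1, 0, 0, 0, 0, 0, 0])
P33 = build([0, 0, 1, 1, 1, 1, 1, 1])

# Rows with A1 = 111 and A3 = 001
result = reduce(ptree_and, [P11, P12, P13, ptree_not(P31), ptree_not(P32), P33])
print(result, root_count(result, 8))
```

Since P11 is pure0 on its whole left half, every later operand's left half is skipped, which is the effect the slide attributes to adding more operands.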

Some other useful P-trees
1. Basic P-trees: P11, P12, … , P3,2
2. Value P-trees (predicate: purely target value in target attribute), built from basic P-trees by AND:
   $P_{1,5} = P_{1,101} = P_{11} \wedge P'_{12} \wedge P_{13}$
3. Tuple P-trees (predicate: purely target tuple), built from value P-trees by AND:
   $P_{(1,2,3)} = P_{(001,010,111)} = P_{1,001} \wedge P_{2,010} \wedge P_{3,111}$
4. Range P-trees (predicate: purely in target rectangle), built by AND/OR:
   $P_{([1,3],\ ,[0,2])} = (P_{1,1} \vee P_{1,2} \vee P_{1,3}) \wedge (P_{3,0} \vee P_{3,1} \vee P_{3,2})$
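How value, tuple and range P-trees derive from the basic ones can be sketched with uncompressed bitmaps. Applying the formulas to the 8-row table R from the earlier slides is an assumption of this sketch, as are the function names; real P-trees would do the same ANDs and ORs on compressed trees.

```python
# Table R in 3-bit binary form (A1, A2, A3, A4), as on the earlier slide.
ROWS = [
    ("010", "111", "110", "001"), ("011", "111", "110", "000"),
    ("010", "110", "101", "001"), ("010", "111", "101", "111"),
    ("101", "010", "001", "100"), ("010", "010", "001", "101"),
    ("111", "000", "001", "100"), ("111", "000", "001", "100"),
]
FULL = (1 << len(ROWS)) - 1
WIDTH = 3


def basic(attr, pos):
    """Basic bitmap for one bit position of one attribute (row r -> bit r)."""
    return sum(1 << r for r, row in enumerate(ROWS) if row[attr][pos] == "1")


def value_ptree(attr, value):
    """AND of basic bitmaps, complemented wherever the value has a 0-bit."""
    result = FULL
    for pos in range(WIDTH):
        bit = (value >> (WIDTH - 1 - pos)) & 1
        s = basic(attr, pos)
        result &= s if bit else (~s & FULL)
    return result


def tuple_ptree(values):
    """AND of one value bitmap per attribute."""
    result = FULL
    for attr, value in enumerate(values):
        result &= value_ptree(attr, value)
    return result


def range_ptree(attr, lo, hi):
    """OR of the value bitmaps for every value in [lo, hi]."""
    result = 0
    for value in range(lo, hi + 1):
        result |= value_ptree(attr, value)
    return result


# A1 in [1,3], A2 unrestricted, A3 in [0,2]
rectangle = range_ptree(0, 1, 3) & range_ptree(2, 0, 2)
print(bin(rectangle).count("1"))
```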

DataMIME™ System: information with no noise (figure)
1. Raw data
2. Data Integration Language (DIL)
3. DII (Data Integration Interface)
4. P-tree bioinformatics data warehouse: lossless, compressed, distributed, vertically-structured database
5. DMI (Data Mining Interface)
6. Ptree (Predicates) Query Language (PQL)
7. Information
• Figure label: Internet.

Bioinformatics Data Warehouse Schema (basic star schema)
1. Organism Dimension Table
  Organism  Species                   Vert  Genome Size
  human     Homo sapiens              1     3000
  fly       Drosophila melanogaster   0     185
  yeast     Saccharomyces cerevisiae  0     12.1
  mouse     Mus musculus              1     3000
2. Experiment Dimension Table (MIAME)
  Column labels: LAB, PI, UNV, STR, CTY, STZ, ED, AD, S, H, M, N
  Rows: 3 2 a c h 1; 2 2 b s h 0; 2 4 a c a 1; 2 4 a s a 1
3. Gene Dimension Table
  Gene  SubCell-Location  Function  StopCodonDensity  PolyA-Tail
  g0    Myta              apop      .1                1
  g1    Ribo              meio      .1                1
  g2    Nucl              mito      .1                0
  g3    Ribo              apop      .9                0
4. Gene-Experiment-Organism cube (figure): a 3-D bit cube over genes g0..g3, experiments e0..e3 and organisms o0..o3.
5. Gene-Organism dimension table (figure), with cell values: 17, 78; 12, 60; Mi, 40; 1, 48; 10, 75; 0; 0; 7, 40; 0; 14, 65; 0; 16, 76; 0; 9, 45; Pl, 43.

Constellation schema
1. The basic star schema is part of a constellation schema:
2. multiple fact cubes, including 2-way and 3-way interaction pyramids, pathway cubes, phylogenetic tree/ring, …
3. e.g., an added 3-way Protein-Protein-Protein Interaction Pyramid over the Genes dimension (g0..g3), shown with the gene dimension table (figure).

Phylogenetic tree/ring as a DW constellation cube (figure)
• A sparse bit cube over the Organisms dimension (o0..o3), shown with the organism dimension table.
• Phylogenetic tree/ring dimensions: root dimension, level-2 dimension, level-1 dimension.

The Constellation schema (figure)
• Dimensions: Genes, Organisms, Experiments.
• Cubes: Gene-Exp-Org, PPIs (Prot-Prot-Prot-ints), Phylogenetic tree/ring, Pathways.
• The gene, organism and experiment dimension tables are those of the basic star schema.

Gene-Experiment-Organism (GEO) cube (figure: bit cube over g0..g3, e0..e3, o0..o3)
1. Each of these cubes will become massive! (curse of cardinality)
  • e.g., the GEO cube may have |genes| * |experiments| * |organisms| ~ 100K * 1K * 1K = 100 billion cells.
  • In order to data mine it, relevant attributes from other cuboids in the constellation schema will need to be joined to it, making the result even more massive.
  • P-trees will drastically compress these data.
  • The basic P-trees for this data set can be created directly from the component basic P-tree sets (without requiring joins).

An example data mining application: Nearest Neighbor Search (NNS), one of the most common data mining techniques
• In NNS classification, a sample a = (a1,…,an) is assigned a class c ∈ C based on votes by its nearest neighbors in a training database D(A1,…,An, C).
• C is the class, or attribute of interest, e.g.,
  • for tissue analysis, C may be {cancer, no_cancer};
  • for sequence alignment, C may be the function;
  • or we may just want to see the neighbors ordered by closeness to a.
• NNS classification employs a distance measure (similarity score) on the A1,…,An attributes to determine the nearest neighbors (e.g., based on BLOSUM62 or …).

Sequence alignment versus Nearest Neighbor Search
• In NNS, one chooses a “distance” measure to determine neighbors:
  a. by first choosing a “dimension distance” for each dimension:
    • |a-b| for numeric dimensions,
    • mismatch=1 / match=0 for categorical dimensions, …
  b. and second, choosing an inter-dimension distance (e.g., Lq with q = 1, 2, …, ∞, i.e., Manhattan, Euclidean, …, Max).
• Hamming distance: mismatch=1 / match=0 combined with Manhattan.
• In sequence alignment the distance function is Hamming, but with an inter-dimension modification to accommodate indel evolutionary events (e.g., the BLOSUM twist).
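The two-step choice above (a per-dimension distance, then an inter-dimension Lq combination) can be sketched as follows. The function names are hypothetical, and the indel modification used in sequence alignment is deliberately left out.

```python
import math


def dimension_distance(x, y):
    """|x - y| for numeric values, mismatch=1 / match=0 for anything else."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y)
    return 0 if x == y else 1


def lq_distance(a, b, q):
    """Combine the per-dimension distances with the Lq norm (q may be math.inf)."""
    d = [dimension_distance(x, y) for x, y in zip(a, b)]
    if math.isinf(q):
        return max(d)                          # Max distance
    return sum(v ** q for v in d) ** (1.0 / q)


print(lq_distance((0, 0), (3, 4), 1))         # Manhattan
print(lq_distance((0, 0), (3, 4), 2))         # Euclidean
print(lq_distance((0, 0), (3, 4), math.inf))  # Max
print(lq_distance("ACGT", "AGGA", 1))         # Hamming: categorical + Manhattan
```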

Is sequence alignment really Nearest Neighbor Search?
• Current sequence alignment practice can be viewed as standard NNS on a training database expanded to include all evolutionary homologs of the DB subject sequences.
• I.e., currently, the query (the unclassified sample) and a subject (the database sequence)
  • are treated symmetrically, and
  • the evolutionary events along both branches leading to a common ancestor are scored.
• These scores determine the distance measure.
• Figure: subject and query as the two branches below a common ancestor.

Is sequence alignment really Nearest Neighbor Search? (continued)
• If the DB is expanded to include all subject replicas representing potential paths up and down to the query (homologs),
• and if sequences include elements representing n-gaps,
• then sequence alignment is NNS.
• Then the many NNS methods developed over its long history can all be brought to bear on the problem, including recent strong offerings, e.g., Local Support Vector Machine?
• Figure: subject, subject replica and query.

A simple NNS classification example comparing the horizontal and the vertical approaches
• Given a training database:
  Key  a1 a2 C  a3 a4 a5 a6
  t12  0  0  1  0  1  1  0
  t13  0  0  1  0  1  0  0
  t15  0  0  1  0  1  0  1
  t16  0  0  1  1  0  1  0
  t21  1  1  1  1  0  1  0
  t27  1  1  1  0  0  1  1
  t31  1  0  1  1  0  1  0
  t32  1  0  1  0  1  1  0
  t33  1  0  1  0  1  0  0
  t35  1  0  1  0  1  0  1
  t51  0  0  0  1  0  1  0
  t53  0  0  0  0  1  0  0
  t55  0  0  0  0  1  0  1
  t57  0  0  0  0  0  1  1
  t61  1  0  0  1  0  1  0
  t72  0  0  0  0  1  1  0
  t75  0  0  0  0  1  0  1
• Search for the 3 nearest neighbors of a = (a1,a2,a3,a4,a5,a6) = (0,0,0,0,0,0) using Hamming distance (number of mismatches).
• First, assuming horizontal data (using vertical scans).
• Second, assuming vertical data (using horizontal ANDs).

Horizontal approach, first scan (sample a = 0 0 0 0 0 0)
1. Scan for the closest three training tuples using simple replacement.
  • t12 (C=1, dis=2), t13 (C=1, dis=1), t15 (C=1, dis=2): the initial three neighbors.
  • t16: dis=2, don’t replace.
  • t21: dis=4, don’t replace.
  • t27, t31, t32, t33, t35, t51: don’t replace any of these 6, since their distances exceed 1.
  • t53 (C=0): dis=1, replace t15.
  • t55, t57, t61, t72, t75: don’t replace any of these 5, since their distances exceed 1.
  • Resulting neighbors: t12 (dis=2, C=1), t13 (dis=1, C=1), t53 (dis=1, C=0).
2. Only 1 of the many tuples at dist=2 voted!
3. To find the others, a 2nd scan is needed.
4. Votes: C=0 gets 1, C=1 gets 2. 1 wins!
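The first scan can be sketched as a single pass with simple replacement. The tie rule (replace the most recently stored of the equally bad neighbors) is an assumption chosen so that the trace matches the slide's "replace t15"; the slide does not state the rule. Attribute strings are a1..a6 with the class column kept separate.

```python
from collections import Counter

# (key, class C, a1..a6) from the training table on the slide
TRAINING = [
    ("t12", 1, "000110"), ("t13", 1, "000100"), ("t15", 1, "000101"),
    ("t16", 1, "001010"), ("t21", 1, "111010"), ("t27", 1, "110011"),
    ("t31", 1, "101010"), ("t32", 1, "100110"), ("t33", 1, "100100"),
    ("t35", 1, "100101"), ("t51", 0, "001010"), ("t53", 0, "000100"),
    ("t55", 0, "000101"), ("t57", 0, "000011"), ("t61", 0, "101010"),
    ("t72", 0, "000110"), ("t75", 0, "000101"),
]


def hamming(a, b):
    """Number of mismatching positions."""
    return sum(x != y for x, y in zip(a, b))


def three_nn_scan(sample, k=3):
    """One scan; a tuple replaces the worst stored neighbor only if strictly closer."""
    nearest = []  # entries are (distance, key, class)
    for key, cls, attrs in TRAINING:
        d = hamming(sample, attrs)
        if len(nearest) < k:
            nearest.append((d, key, cls))
            continue
        # Worst neighbor; among ties, the one stored last (assumed rule).
        worst = max(range(k), key=lambda i: (nearest[i][0], i))
        if d < nearest[worst][0]:
            nearest[worst] = (d, key, cls)
    return nearest


neighbors = three_nn_scan("000000")
votes = Counter(cls for _, _, cls in neighbors)
print(neighbors)
print(votes)
```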

Horizontal approach, second scan for all other tuples at dist=2
• t12, t13: already voted.
• t15, t16: dis=2, 2 more votes for C=1.
• t21, t27, t31, t32: dis > 2, 4 non-voters.
• t33: dis=2, 1 vote for C=1.
• t35: dis=3, non-voter.
• t51: dis=2, 1 vote for C=0.
• t53: already voted.
• t55, t57: dis=2, 2 more votes for C=0.
• t61: dis=3, non-voter.
• t72, t75: dis=2, 2 more votes for C=0.
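The effect of the second scan, letting every tuple within the distance of the 3rd nearest neighbor vote, can be sketched in two passes over the same table. The name `closed_neighborhood_votes` is made up for this sketch.

```python
from collections import Counter

# (key, class C, a1..a6) from the training table on the slide
TRAINING = [
    ("t12", 1, "000110"), ("t13", 1, "000100"), ("t15", 1, "000101"),
    ("t16", 1, "001010"), ("t21", 1, "111010"), ("t27", 1, "110011"),
    ("t31", 1, "101010"), ("t32", 1, "100110"), ("t33", 1, "100100"),
    ("t35", 1, "100101"), ("t51", 0, "001010"), ("t53", 0, "000100"),
    ("t55", 0, "000101"), ("t57", 0, "000011"), ("t61", 0, "101010"),
    ("t72", 0, "000110"), ("t75", 0, "000101"),
]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def closed_neighborhood_votes(sample, k=3):
    """Pass 1 finds the k-th smallest distance; pass 2 lets all tuples within it vote."""
    distances = sorted(hamming(sample, attrs) for _, _, attrs in TRAINING)
    radius = distances[k - 1]
    votes = Counter()
    for _, cls, attrs in TRAINING:
        if hamming(sample, attrs) <= radius:
            votes[cls] += 1
    return radius, votes


radius, votes = closed_neighborhood_votes("000000")
print(radius, dict(votes))
```

On this table the radius is 2, and once all tuples at that distance are counted the tally is 5 votes for C=1 against 6 for C=0, so the full neighborhood does not agree with the 3 tuples that happened to survive the first scan.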

Vertical method: form the interval P-tree $P = \bigvee_i P_i$, where $P_i = \bigl(\bigwedge_{j \ne i} P'_j\bigr) \wedge P_i$ (on the right-hand side, $P_i$ and $P_j$ are the basic P-trees of attributes a_i and a_j)
• Vertical data, one bit vector per attribute, in key order t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75:
  a1  0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 0
  a2  0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
  C   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
  a3  0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0
  a4  1 1 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1
  a5  1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 1 0
  a6  0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1
• Complements:
  a1’ 11110000001111011
  a2’ 11110011111111111
  C’  00000000001111111
  a3’ 11100101110111011
  a4’ 00011110001001100
  a5’ 01100000110110001
  a6’ 11011011101100110
• P6 is the AND of a1’, a2’, a3’, a4’, a5’ and a6.

Vertical method: find the dis=1 neighbors by $P = \bigvee_i P_i$, where $P_i = \bigl(\bigwedge_{j \ne i} P'_j\bigr) \wedge P_i$
• P6 is the AND of a1’, a2’, a3’, a4’, a5’ and a6.
• Similarly, P5 through P1 are 5-way ANDs.
• Finally, P = OR of the Pi. The result shown is 01000000000100000, which selects t13 and t53.
• C=1 vote = rootcount(PC ^ P); C=0 vote = rootcount(P’C ^ P).
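A sketch of the dis=1 computation, using Python integers as uncompressed stand-ins for the P-trees (the leftmost character of each string is t12). Since the sample is all zeros, "distance 1" means exactly one attribute bit is set. The names `BASIC`, `CLASS` and `distance_one` are made up for this sketch.

```python
# One bit vector per attribute, key order t12 ... t75 (from the slide)
BASIC = {
    "a1": "00001111110000100",
    "a2": "00001100000000000",
    "a3": "00011010001000100",
    "a4": "11100001110110011",
    "a5": "10011111001001110",
    "a6": "00100100010011001",
}
CLASS = "11111111110000000"
N = len(CLASS)
FULL = (1 << N) - 1


def distance_one():
    """P = OR_i ( a_i AND complement of every other a_j )."""
    masks = {name: int(bits, 2) for name, bits in BASIC.items()}
    p = 0
    for i, m_i in masks.items():
        p_i = m_i
        for j, m_j in masks.items():
            if j != i:
                p_i &= ~m_j & FULL
        p |= p_i
    return p


P = distance_one()
PC = int(CLASS, 2)
vote_c1 = bin(PC & P).count("1")          # rootcount(PC ^ P)
vote_c0 = bin(~PC & FULL & P).count("1")  # rootcount(P'C ^ P)
print(format(P, "017b"), vote_c1, vote_c0)
```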

Similarly, find the dis=2 neighbors by $P = \bigvee_{i<k} P_{i*k}$, where $P_{i*k} = \bigl(\bigwedge_{j \ne i,k} P'_j\bigr) \wedge P_i \wedge P_k$
• The 15 pairwise P-trees are P1*2, P1*3, P1*4, P1*5, P1*6, P2*3, P2*4, P2*5, P2*6, P3*4, P3*5, P3*6, P4*5, P4*6 and P5*6.
• Each one ANDs the two basic P-trees a_i and a_k with the complements of the other four attributes (figure: the operand bit vectors for each of the 15 ANDs).
• Votes are again counted against C = 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0.
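The same idea extends to any distance d from the all-zero sample by ORing one AND per choice of d attributes. This is a sketch on uncompressed bitmaps under that assumption; `at_distance` is a hypothetical helper name, and d=2 gives the 15 pairwise terms listed above.

```python
from itertools import combinations

# One bit vector per attribute, key order t12 ... t75 (from the slide)
BASIC = {
    "a1": "00001111110000100",
    "a2": "00001100000000000",
    "a3": "00011010001000100",
    "a4": "11100001110110011",
    "a5": "10011111001001110",
    "a6": "00100100010011001",
}
CLASS = "11111111110000000"
N = len(CLASS)
FULL = (1 << N) - 1
MASKS = {name: int(bits, 2) for name, bits in BASIC.items()}


def at_distance(d):
    """OR over all d-subsets: chosen attributes as-is, all others complemented."""
    p = 0
    for chosen in combinations(MASKS, d):
        term = FULL
        for name, mask in MASKS.items():
            term &= mask if name in chosen else (~mask & FULL)
        p |= term
    return p


P2 = at_distance(2)
PC = int(CLASS, 2)
vote_c1 = bin(PC & P2).count("1")
vote_c0 = bin(~PC & FULL & P2).count("1")
print(format(P2, "017b"), vote_c1, vote_c0)
```

The number of AND terms grows as C(6, d), so this enumeration suits small d; each term is still a purely horizontal operation on vertical data.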
