Basics of Comparative Genomics Dr G. P. S. Raghava


AIM: To understand the biology of organisms • Importance: More than 100 genomes sequenced, more than 250 in progress • Definition: Comparison of the set of proteins of one genome with that of another genome, plus comparison of gene location, gene order and gene regulation • Application • Visualization of information on the genome • Genome annotation (prediction of genes, repeats, regulatory regions) • Evolutionary information (gene loss, duplication, horizontal gene transfer, ancestor) • Essential genes for cell survival • Classification of genes based on function • Tools and Databases

What is comparative genomics? • Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease • Understanding what is unique to each species

Why Comparative Genomics? • It tells us what is common and what is unique between different species at the genome level. • Genome comparison may be the surest and most reliable way to identify genes and predict their functions and interactions. – e.g., to distinguish orthologs from paralogs • The functions of human genes and other DNA regions can be revealed by studying their counterparts in lower organisms.

What is compared? • Gene location • Gene structure • Exon number • Exon lengths • Intron lengths • Sequence similarity • Gene characteristics • Splice sites • Codon usage • Conserved synteny

Few facts from genome comparison • High degree of conservation of microbial proteins (~70% ancestral conserved region) • Proteins related to ENERGY processes are generally found in all genomes • Proteins related to COMMUNICATION represent the most distinctive function in each genome • INFORMATION-related proteins have complex behaviour • High frequency (~10%) of non-orthologous gene displacement

Few Terminologies • Homology: Homology is the relationship of any two characters (such as two proteins that have similar sequences) that have descended, usually through divergence, from a common ancestral character. Homologues are thus components or characters (such as genes/proteins with similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.

Homologues can be orthologues, paralogues or xenologues. • Orthologues are homologues that have evolved from a common ancestral gene by speciation. They usually have similar functions. • Paralogues are homologues that are related or produced by duplication within a genome followed by subsequent divergence. They often have different functions. • Xenologues are homologues that are related by an interspecies (horizontal) transfer of the genetic material for one of the homologues. The functions of the xenologues are quite often similar.

Analogues • Analogues are non-homologous genes/proteins that have evolved convergently from unrelated ancestors. They have similar functions although they are unrelated in either sequence or structure.

Frequently used terms • Homology • Orthologous: derived from a common ancestral gene by speciation; usually have similar functions • Paralogous: arising from duplication of a gene within a genome; usually have different functions • Xenologous: related by an interspecies (horizontal) transfer of genetic material; often have similar functions • Analogous: not evolved from the same ancestor • Similarity: sequence similarity • Percent Identity
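Percent identity can be illustrated with a small sketch. The function name `percent_identity` and the choice of denominator are assumptions made here: identity is counted only over aligned columns where neither sequence has a gap, and other conventions (for example dividing by the full alignment length) give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two already-aligned sequences.

    Assumption: columns containing a gap in either sequence are skipped,
    so the denominator is the number of gap-free aligned columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0
```

For example, `percent_identity("ACGT", "ACGA")` compares four columns, three of which match.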

Visualising Genome Information

Genome Annotation: The Process of Adding Biological Information and Predictions to a Sequenced Genome Framework

All-against-all Self-comparison • How? • Make a database of the proteome • Use each protein as a query in a similarity search against the database (BLAST, WU-BLAST or FASTA) • Generate a matrix of alignment scores (P or E value): A conservative cutoff E value: 10e-6 • Why? • Number of Gene Families: This comparison distinguishes unique proteins from proteins that have arisen from gene duplication, and also reveals the number of gene families. • Paralogs: Significantly matched pairs of protein sequences may be paralogs.
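One way to turn the matrix of pairwise scores into gene families is single-linkage clustering: any two proteins joined by a significant hit end up in the same family. The sketch below is only an illustration under stated assumptions. The function name `gene_families`, the `(query, subject, e_value)` tuple layout, and the default cutoff of `1e-6` (one reading of the cutoff on the slide) are all choices made here, and the hits are assumed to come from a search tool that has already been run.

```python
from collections import defaultdict

def gene_families(proteins, hits, e_cutoff=1e-6):
    """Group proteins into families by single-linkage clustering.

    proteins: iterable of protein identifiers in the proteome.
    hits: iterable of (query, subject, e_value) tuples from an
          all-against-all search; both identifiers must be in `proteins`.
    e_cutoff: hits with an E value above this are ignored (assumed value).
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the family representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, e_value in hits:
        if query == subject or e_value > e_cutoff:
            continue  # ignore self-hits and weak matches
        root_q, root_s = find(query), find(subject)
        if root_q != root_s:
            parent[root_s] = root_q  # merge the two families

    families = defaultdict(set)
    for p in parent:
        families[find(p)].add(p)
    # Sorted output so results are reproducible.
    return sorted((sorted(f) for f in families.values()), key=lambda f: f[0])
```

Families with a single member correspond to unique proteins; families with several members contain candidate paralogs. Single linkage can chain distantly related multi-domain proteins into one family, so it is a starting point rather than a final classification.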

Between-Proteome Comparisons : Why? • To identify orthologs, gene families, and domains • Orthologs: (proteins that share a common ancestry & function) • A pair of proteins in two organisms that align along most of their lengths with a highly significant alignment score. • These proteins perform the core biological functions shared by the two organisms. • Two matched sequences (X in A, Y in B) may not be orthologs (Y and Z are paralogs in B, X and Z are orthologs) • Identify true orthologs • highest-scoring match (best hit) • E value < 0.01 • > 60% alignment over both proteins
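The three criteria listed above (best hit, E value < 0.01, > 60% alignment over both proteins) can be sketched as a simple filter. The `Hit` record, its field names, and the function `ortholog_candidates` are hypothetical; the coverage fields are assumed to hold the fraction (0 to 1) of each protein included in the alignment. The sketch picks the highest-scoring hit per query first and then applies the thresholds, which is one possible ordering.

```python
from typing import Dict, Iterable, NamedTuple

class Hit(NamedTuple):
    query: str               # protein from organism A
    subject: str             # protein from organism B
    score: float             # alignment score (higher is better)
    e_value: float
    query_coverage: float    # fraction of the query in the alignment
    subject_coverage: float  # fraction of the subject in the alignment

def ortholog_candidates(hits: Iterable[Hit],
                        max_e: float = 0.01,
                        min_coverage: float = 0.60) -> Dict[str, str]:
    """Map each query to its best hit if that hit passes both thresholds."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.score > current.score:
            best[hit.query] = hit  # keep the highest-scoring match
    return {
        query: hit.subject
        for query, hit in best.items()
        if hit.e_value < max_e
        and hit.query_coverage > min_coverage
        and hit.subject_coverage > min_coverage
    }
```

As the slide notes, a best hit may still be a paralog of the true ortholog, so the result is a list of candidates, not a confirmed set of orthologs.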

Between-Proteome Comparisons: How? • 1. Choose a yeast protein and perform a database similarity search of the worm proteome (WU-BLAST): a yeast-versus-worm search • 2. Group the worm seqs that match the yeast query seq with a highly significant P value (10^-10 to 10^-100); also include the yeast query seq in the group • 3. From the group made in step 2, choose a worm seq and search the yeast proteome, using the same P limit • 4. Add any matching yeast seq to the group made in step 2 • 5. Repeat steps 3 & 4 for all initially matched seqs in the group • 6. Repeat steps 1-5 for every yeast protein • 7. As in steps 1-6, perform a comparable worm-versus-yeast search • 8. Coalesce the groups of related seqs and remove any redundancies so that every sequence is represented only once • 9. Eliminate any matched pairs in which less than 80% of each seq is in the alignment
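The grouping procedure above amounts to repeatedly following significant matches back and forth between the two proteomes until no new sequences are added, then merging overlapping groups. A minimal sketch follows, with these assumptions: the searches have already been run, the two input dictionaries (names hypothetical) contain only matches that pass the P limit and the 80% alignment requirement, and a match found in either search direction is treated as linking the two sequences.

```python
from collections import deque

def cluster_across_proteomes(yeast_to_worm, worm_to_yeast):
    """Group related yeast and worm sequences.

    yeast_to_worm: dict mapping a yeast protein to the set of worm
                   proteins it matched (yeast-versus-worm search).
    worm_to_yeast: dict mapping a worm protein to the set of yeast
                   proteins it matched (worm-versus-yeast search).
    Returns a list of groups; each sequence appears in exactly one group
    as a ("yeast", name) or ("worm", name) tuple.
    """
    neighbours = {}

    def link(a, b):
        # Assumption: a match in either direction links both sequences.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    for y, worms in yeast_to_worm.items():
        neighbours.setdefault(("yeast", y), set())
        for w in worms:
            link(("yeast", y), ("worm", w))
    for w, yeasts in worm_to_yeast.items():
        neighbours.setdefault(("worm", w), set())
        for y in yeasts:
            link(("worm", w), ("yeast", y))

    seen = set()
    groups = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            # Expand the group by following matches between proteomes.
            node = queue.popleft()
            group.append(node)
            for nxt in sorted(neighbours[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(group))
    return groups
```

Because each sequence is visited once, the coalescing and redundancy removal happen implicitly: two groups that share any sequence are discovered as a single group.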

Figure 1 Regions of the human and mouse homologous genes: Coding exons (white), noncoding exons (gray), introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regions in mouse, are shown (white) between the genes and the alignment regions.

Target Validation • Target validation involves taking steps to prove that a DNA, RNA, or protein molecule is directly involved in a disease process and is therefore a suitable target for development of a new therapeutic compound. • Genes that do not belong to an established family are critical to many disease processes and also need to be validated as potential drug targets.
