Gene Families and Functional Annotation

Once genes have been id.ed they need to be functionally annotated A computational first step is to group genes w/ other genes - some of which will hopefully have known fx.s Once genes are classified, we can begin to examine whether certain genes are missing or overrepresented in the given genome - possibly reflecting the niche of the organism As w/ earlier computational analyses, functional annotation based solely on in silico analyses is only a first step Gene Families and Functional Annotation

Sequence-similarity searches are a first pass in classification BLAST - Basic Local Alignment Search Tool BLASTn - nucleotide BLASTp- protein BLASTx - translates a nucleotide sequence into all possible reading frames and scans these against a protein database All give a Expectation, E, value score - to evaluate the significance of the match In both eukaryotes and prokaryotes, 1/3 to 1/2 of searched genes do not match a protein = orphan genes Gene Families and Functional Annotation

Proteins are made up of combinations of distinct structural units or domains Genes can be grouped based on the domains they contain These groupings depend on structural similarity - sequence similarity alone may be insufficient Protein Structural Domains

Gene clustering by seq. similarity BLAST searches generally return matches from more than one protein from more than one species This happens if the query protein is part of a gene (protein) family or contains multiple domains found in other proteins

BLAST output can be interpreted as a match to one or more protein domains - Searches of closely related sp. often id. genes/proteins w/ similar domain structure Domains shuffle over evolutionary time and are often found in different combinations across more distant comparisons Domains do tend to follow biologically reasonable patterns - DNA binding domains w/ other DNA binding domains, transmembrane domains w/ intra and extracellular domains

Genes can be classified by domain content The Enzyme Commission (EC) hierarchical classification of enzymes - each enzyme is assigned a number that reflects sub-classification of function, e.g. ADH is EC1.1.1.1 Other classification schemes are not as obvious - protein function is often context-specific PFAM - protein database that allows access to biochemical properties of predicted proteins Gene clustering by seq. similarity

InterPro - classifies individual protein domains Gene clustering by seq. similarity

Protein functional prediction ≠ assignment of genes to families Protein function prediction allows general conclusions about protein function and genome content based on protein domains Classification of gene families involves distinguishing between paralogs and orthologs Gene clustering by seq. similarity

Enzymes Signal transduction (receptors and kinases) Nucleic acid binding (transcription factors, nucleic acid enzymes) Structural (cytoskeletal, extracellular matrix, motor proteins) Channel (voltage and chemically gated) Immunoglobins Calcium-binding proteins Transporters Subclasses vary - as do the representation w/in each genome Major Classes of Protein Function

Alignment searches (BLAST) identify genes w/ similar sequence to the query If searches id. a single gene, or genes w/ a single fx then functional assignment to query seq. is simple - but searches often id lg # of seq.s w/ multiple functions The most similar sequence is not nec. the seq. w/ which the query seq. shares a fx Gene Clusters

One approach is to try and define as large a protein family as possible (including many possible functions) PSI-BLAST can be used to identify a large set of potential protein family members A BLAST search is conducted to create an initial protein sequence alignment - which is then used to initiate a fresh search The process is then iterated until no further matches are id.ed - this reduces the degree of seq. similarity required for inclusion in the family A “true” family of genes ought to be bounded by a significance cut-off to limit the proteins included Gene Clusters

Clusters of orthologous genes, COGs, can be used to classify proteins COGs are created by id.ing the best hit for each gene in complete pairwise comparisons across a set of genomes Gene Clusters

185,000 proteins from 66 microbial genomes id.ed 4,873 COGs - 75% of all predicted microbial proteins 50% of 110,00 proteins from fly, nematode, human, ariabidopsis, yeasts and a microsporidian form 4,852 COGs Gene Clusters COG0837

COGs include both orthologs and paralogs In (a) HuA and HuA’ are paralogs - distinguishing which retains the ancestral fx is not as simple as determining which has the most similar seq. Gene Clusters

HuA and MmA differ in 5 a.a., none affect fx HuA’ and MmA differ in 4 a.a., but one of which changes the charge of a critical residue Clustering based on similarity would lead to erroneous fx classification Gene Clusters

Clustering groups genes by seq. similarity Phylogentic analyses ascertain how groups of similar genes are related by descent In the HuA, MmA example, the 2 A’ genes can either result from one (orthologs) or two (paralogs) duplication events Paralogs are less likely to share a function Gene Phylogenies

Often gene fx can be inferred from phylogenetic analysis The first step is aligning the sequences A gene tree is then constructed using some algorithm Duplications and gene relatedness are then ascertained In the example on the lft, an ancient duplication splits 2 fx.al grps, on the rt protein 2 likely has the same fx as 5 and 6 Gene Phylogenies

Molecular function alone may not predict/describe biological fx (think crystallins) The Gene Ontology (GO) annotates and groups genes using a multi-character approach including cell biological and molecular fx and/or subcellular localization The GO project uses defined vocabulary and a hierarchical structure to classify genes and includes links indicating the type of evidence for the classification Gene Ontology

In this example, the gene INNER NO OUTER is at the center w/ the 3 separate classifications radiating out from it GO network

The GO vocabulary includes 7000 terms describing molecular fx, 5000 describing biological process, some annotations include as many as 12 levels w/ in hierarchy terms This is too deep for efficient computational searches - other simplified systems are also being developed to allow computationally screen and classify genes Gene Otology

Molecular Phylogenetics • Homology = similarity due to common ancestry • The Gpdh gene sequence from two different species are homologous sequences • All comparisons made in molecular evolution (biology) are based on comparing homologous sequences = apples to apples • Sequences must be aligned to allow comparison = homologous bases lined up in columns Human MVHLTP Baboon MVHLTP Cow MLTP Sheep MLTP Mouse MVHLTP Human MVHLTP Baboon ...... Cow .--... Sheep .--... Mouse ...... The cow and sheep β globin proteins are 2 a.a. shorter than the other sequences, so gaps are added to align the seqeunces

Gene Trees • Accumulation of sequence differences through time is the basis of molecular systematics, which analyses them in order to infer evolutionary relationships • A gene tree is a diagram of the inferred ancestral history of a group of sequences • A gene tree is only an estimate of the true pattern of evolutionary relations • UPGMA and Neighbor joining = simple ways to estimate a gene tree • Bootstrapping = sampling w/ replacement, a common technique for assessing the reliability of a node in a gene tree • Taxon = the source of each sequence

Analyses of a set of genes produces an unrooted tree Trees can be rooted, assigned polarity, by assignment of an outgroup - a sequence that is known to be more distantly related than any within the rest of the analysis (the ingroup) Tree branch length denotes the amount of change along that branch in some tree building methods Rooted and Unrooted Trees 3 distinct unrooted trees

The 3 primary methods (algorithms) for building gene trees are: 1. Parsimony - a character-based approach that surveys every possible tree topology. The most parsimonious tree is the topology that requires the minimum # of steps (changes) in a data set Position 1 of this example - tree1 requires 1 change, tree2 2 changes and tree3 2 changes. When the 4 positions are summed tree 3 is found to be the best (shortest) Tree Building methods

The 3 primary methods (algorithms) for building gene trees are: 2. Maximum Likelihood - also a character-based approach, surveys every possible tree topology and assigns all topologies a maximum likelihood estimate (score) based on a model of evolution describing the probability of changes (mutation) through time. The ML tree is the one with the highest probability This method can be accurate, but is computationally expensive Tree Building methods

The 3 primary methods (algorithms) for building gene trees are: 3. Distance Methods - are not character based, instead they calculate pairwise distances across entire aligned sequences and construct data matrixes. Trees are built by grouping pairs with the shortest distances between them. These methods can also incorporate complex evolutionary models This method is computationally cheap, will always return and answer, but are not always accurate. The simplest distance method, Unweighted Pair Group Method with Arithmatic Mean, UPGMA, simply counts the number of sequence changes in all pairwise comparisons Tree Building methods

UPGMA Trees

UPGMA Tree Construction Hu Ba Co Sh Mo Ha Ch Hu 2 6 98913 Ba 71071013 Co 3 11 12 16 Sh 12 9 15 Mo 7 16 Ha 14 1.0 Hu 2/2 = 1.0 Ba 1.0 HuBa Co Sh Mo Ha Ch HuBa 6.59.57.59.5 13 Co 3 11 12 16 Sh 12 9 15 Mo 7 16 Ha 14 1.5 Co 3/2 = 1.5 Sh 1.5

1.0 Hu UPGMA Tree Construction Ba 1.0 HuBa Co Sh Mo Ha Ch HuBa 6.5 9.5 7.5 9.5 13 Co 3 11 1216 Sh 12915 Mo 7 16 Ha 14 1.5 Co Sh 1.5 HuBa CoSh Mo Ha Ch HuBa 8 7.5 9.5 13 CoSh 11.510.515.5 Mo 7 16 Ha 14 3.5 Mo Ha 3.5 7/2 = 3.5

1.0 1.5 Co Hu UPGMA Tree Construction Sh Ba 1.5 1.0 HuBa CoSh Mo Ha Ch HuBa 8 7.5 9.5 13 CoSh 11.5 10.5 15.5 Mo 7 16 Ha 14 3.5 Mo Ha 3.5 1.0 3.0 Hu HuBa CoSh MoHa Ch HuBa 8 8.5 13 CoSh 11 15.5 MoHa 15 Ba 1.0 8/2 = 4 1.5 Co 2.5 Sh 1.5

UPGMA Tree Construction HuBa CoSh MoHa Ch HuBa 8 8.513 CoSh 1115.5 MoHa 15 1.0 3.0 Hu .875 Ba 1.0 1.5 Co 2.5 Sh ((HuBa)(CoSh)) MoHa Ch ((HuBa)(Cosh)) 9.75 14.25 MoHa 15 1.5 3.5 Mo 1.375 Ha 3.5 9.75/2 = 4.875

UPGMA Tree Construction ((HuBa)(CoSh)) MoHa Ch ((HuBa)(Cosh)) 9.75 14.25 MoHa 15 1.0 3.0 Hu .875 Ba ((HuBa)(CoSh))(MoHa) Ch ((HuBa)(Cosh))(MoHa) 14.625 1.0 1.5 Co 2.4375 2.5 Sh 14.625/2 = 7.3125 1.5 3.5 Mo 1.375 Ha 3.5 Ch 7.3125

1.0 a 3.0 .875 b 1.0 1.5 c 2.44 2.5 d 1.5 3.5 e 1.375 f 3.5 g 7.3125 Final UPGMA Tree

Phylogenetic trees are representations summarizing a reconstructed evolutionary history A phylogenetic tree is a diagram that proposes a hypothesis for reconstructed evolutionary relationships between a set of objects (taxa or OTUs) Phylogenetic trees can represent relationships between species or genes Phylogenetic Trees

Phylogenetic Trees OTUs are connected by a set of lines - branches or edges External nodes or leaves are existing OTUs or extinct objects tht did not give rise to descendents Internal nodes represent ancestral states hypothesized to have occurred during evolution

Human Monkey Rat Mouse Chicken Strugeon Internal nodes can represent speciation or gene duplication events A gene tree does not necessarily coincide with a species tree Gene duplications will cause a gene tree to differ from a species tree Platy Zebrafish Lamprey Hagfish

Human Human Monkey Monkey Rat Rat Mouse Mouse Chicken Chicken Strugeon Strugeon Resolution Platy Platy Zebrafish Zebrafish Lamprey Lamprey Hagfish Hagfish Trees may be fully or only partially resolved Every node in a fully resolved tree is bifurcating or dichotomous Some nodes in unresolved trees are multifurcating or polytomous

Human Monkey Rat Mouse Chicken Human Monkey Rooting Human Rat Mouse Rat Monkey Mouse Mouse Human Monkey Rat Unrooted trees establish the relationships among taxa, but not the evolutionary pathway For 4 taxa there are 3 unrooted trees, but 15 rooted trees

Human Monkey Rat Mouse Human Human Monkey Chicken Monkey Monkey Human Rat Rat Rat Rooting Human Rat Mouse Mouse Mouse Mouse Monkey Mouse Rat Rat Mouse Human Human Monkey Monkey Unrooted trees establish the relationships among taxa, but not the evolutionary pathway For 4 taxa there are 3 unrooted trees, but 15 rooted trees

Human Human Monkey Monkey Rat Rat Types of Trees Mouse Mouse Cladograms show the genealogy of taxa, but do not include timing or divergence (branch lengths have no meaning)

Human Monkey Rat Types of Trees Mouse Additive trees show the genealogy of taxa and branch lengths represent divergence between taxa Comparison of branch lengths gives a meaningful estimate of evolutionary divergence

time Human Monkey Types of Trees Rat Mouse Ultrametric trees are similar to additive trees, but assume a constant rate of change between characters used to build the tree - a molecular clock Comparison of branch lengths gives a meaningful estimate of evolutionary divergence Ultrametric trees are always rooted

time Human Monkey Outgroups Rat Mouse Chicken The most accurate way to root a tree is to use an “outgroup” a taxon or group of taxa more distantly related than any member of the “ingroup”

Phylogenetic relationships can be represented as graphical trees, tables or parenthetical statements (Newick or New Hampshire format) Representing Phylogenies ((raccon, bear),((sea_lion, seal), ((monkey,cat), weasel)), dog); ((raccon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08, ((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);

Many tree building algorithms will give a single, fully resolved, tree from any data set. Nodes will all be equally represented even if one is supported by many characters and another by very few. How to quantify support for any given tree? We can’t re-run evolution. We can sample many different genes and we can bootstrap our data. Bootstrapping is sampling a data set, with replacement, to generate a new data set. We then use this new set in a phylogenetic analysis - and repeat this process hundreds or thousands of times. We can then present bootstrap scores at each node, the % of bootstrap trees that contained that specific node Bootstrapping

1- G A D D Y T T K L P 2- G V E D Y T T K - P 3- G A D D Y T T R L P 4- C V E D Y T T R - P 1- T K L L T P D A D G 2- T K - - T P E V D G 3- T R L L T P D A D G 4- T R - - T P E V D C Bootstrapping 1- L P Y D A D D P T G 2- - P Y E V D E P T G 3- L P Y D A D D P T G 4- - P Y E V D E P T C 1- G P K D K K T P D P 2- G P K D K K T P E P 3- G P R D R R T P D P 4- C P R D R R T P E P

1- G A D D Y T T K L P 2- G V E D Y T T K - P 3- G A D D Y T T R L P 4- C V E D Y T T R - P 1- T K L L T P D A D G 2- T K - - T P E V D G 3- T R L L T P D A D G 4- T R - - T P E V D C Bootstrapping 1 2 3 4 1 2 3 3 1 4 4 5 2 4 1 1 2 3 4 1 2 4 3 1 5 4 6 2 5 1 3 3 2 2 4 4 1- L P Y D A D D P T G 2- - P Y E V D E P T G 3- L P Y D A D D P T G 4- - P Y E V D E P T C 1- G P K D K K T P D P 2- G P K D K K T P E P 3- G P R D R R T P D P 4- C P R D R R T P E P 1 2 3 4 1 2 4 3 0 4 4 4 1 5 1 2 3 4 1 2 1 3 3 4 4 5 4 2 1 1 3 2 2 3 4 4

In this example, bear and raccoon form a pair in 50% of the data sets We can choose to present a tree that condenses branches of less than some threshold bootstrap support - a condensed tree Bootstrapping and Condensed Trees

Some tree building methods will produce multiple equally “good” trees A consensus tree shows the features that are shared by all or some trees. A strict consensus tree only includes features found in all trees A majority-rule consensus tree includes features found ≥ a set % Consensus Trees

Reconciled Trees Tree showing duplications Species tree Reconciled trees attempt to combine gene trees and species trees, clearly identifying both speciation and duplication events

Gene Families and Functional Annotation