Bioinformatics Workshop 2 Recap & Warm-Up Exercise

Recap & Warm-Up Exercise: Determine whether there is an available Xenopus clone (laevis or tropicalis) for Claudin-2. Start by retrieving the amino acid sequence for human claudin-2 from Entrez Gene (see the list of useful websites). Then, by appropriate use of the various BLAST flavours, search parameters and notions of orthology, see if you can get to an answer. Use scratch-pad.html (the first item in the list of useful websites) to keep notes, accession numbers, sequences, etc. as you go along.

Answer to: Recap & Warm-Up Exercise

1. Get the FASTA protein sequence:

>gi|9966781|ref|NP_065117.1| claudin 2 [Homo sapiens]
MASLGLQLVGYILGLLGLLGTLVAMLLPSWKTSSYVGASIVTAVGFSKGLWMECATHSTGITQCDIYSTL
LGLPADIQAAQAMMVTSSAISSLACIISVVGMRCTVFCQESRAKDRVAVAGGVFFILGGLLGFIPVAWNL
HGILRDFYSPLVPDSMKFEIGEALYLGIISSLFSLIAGIILCFSCSSQRNRSNYYDAYQAQPLATRSSPR
PGQPPKVKSEFNSYSLTGYV

2a. tBLASTn against the ‘est_others’ database, restricted to Xenopus laevis.
2b. tBLASTn against the ‘est_others’ database, restricted to Xenopus tropicalis.
(This gives us the ESTs in each species which best match our human protein.)

3. Get the top EST sequence for each species, and search each in turn against the human proteins: BLASTx against ‘nr’, restricted to Homo sapiens (this is a check for orthologs).

The best laevis EST (CF520733.1) gave these top 2 human hits:

gi|6912314|ref|NP_036262.1| claudin 14 [Homo sapiens] >gi|215...    274    1e-73
gi|9966781|ref|NP_065117.1| claudin 2 [Homo sapiens] >gi|1568...    197    2e-50

So this EST was probably Xl claudin 14.

The best tropicalis EST (DT398005.1) gave these top 2 human hits:

gi|4502875|ref|NP_001297.1| claudin 3 [Homo sapiens] >gi|1635...    340    3e-93
gi|4502877|ref|NP_001296.1| claudin 4 [Homo sapiens] >gi|1265...    317    2e-86

So this EST was probably Xt claudin 3.

Neither top EST has claudin 2 as its best human hit, so these searches give no evidence of a Xenopus claudin-2 clone among the best-matching ESTs.
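The ortholog check in step 3 can be sketched in a few lines of Python. This is only an illustration of the reasoning, not part of the workshop material: the hit lists are typed in by hand from the results above, and the function names (`best_hit`, `is_reciprocal_best`) are made up for this example.

```python
# Top human BLASTx hits for each best Xenopus EST, copied from the answer above.
# Each hit is (human protein accession, description, bit score, E-value).
QUERY_ACCESSION = "NP_065117.1"  # human claudin 2, the protein we started from

top_hits = {
    "CF520733.1": [  # best X. laevis EST
        ("NP_036262.1", "claudin 14", 274, 1e-73),
        ("NP_065117.1", "claudin 2", 197, 2e-50),
    ],
    "DT398005.1": [  # best X. tropicalis EST
        ("NP_001297.1", "claudin 3", 340, 3e-93),
        ("NP_001296.1", "claudin 4", 317, 2e-86),
    ],
}


def best_hit(hits):
    """Return the hit with the highest bit score."""
    return max(hits, key=lambda hit: hit[2])


def is_reciprocal_best(hits, query_accession):
    """True if the EST's best human hit is the protein we started from."""
    return best_hit(hits)[0] == query_accession


for est, hits in top_hits.items():
    accession, description, score, evalue = best_hit(hits)
    if is_reciprocal_best(hits, QUERY_ACCESSION):
        verdict = "candidate claudin 2 ortholog"
    else:
        verdict = "probably not claudin 2"
    print(f"{est}: best human hit is {description} ({accession}, E={evalue:g}) -> {verdict}")
```

An EST only counts as a candidate ortholog here if searching it back against human returns the original protein as the top hit; a strong second-place hit, as for the laevis EST, is not enough.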

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html. Copy the sequence >sumo-binding-motif and go to the NCBI BLAST home page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence. Run the search with the default values. Now re-run the search three times, setting the maximum E-value in the box to:
• 100
• 1000
• 10000
What difference does this make? Have you found related proteins in your results?

Bioinformatics Workshop 2: Identifying Unknown Genes
• Open a web browser and type in the URL: informatics.gurdon.cam.ac.uk/online/workshops
  • bookmark this page
• Click on the link to the file useful-websites.html
  • bookmark this page too
  • it also contains links to the example sequence files used in the workshop, and the presentations themselves

Part 1: Genome Browsers

Now that most model organisms have had their genomes sequenced, we can get a lot more information about how a gene works than by just doing a BLAST search against the protein databases. Even if ‘your’ favourite genome is still just in ‘scaffolds’ and not yet assembled into chromosomes, we can still add a lot of value. The main task carried out on a genome before it is released to the user community is annotation. In practice this means adding gene models, based on known expressed sequences (both from the same organism and from other fairly closely related ones), and possibly also purely predicted ones based on sequence composition analysis and ‘features’ like start and stop codons and splice sites. Then come known mapping markers, SNPs, etc. With ~3,000,000,000 nucleotides in the genome sequence (human), this presents a considerable challenge to display on a web browser page, which is of course the preferred option. Most genome browsers (software designed to display genome-based data in a web browser) have taken roughly the same approach, which we’ll take a quick look at…

[Figure: a gene model on the genome, with aligned cDNA and aligned ESTs]

[Figure: Schematic Genome Browser. Mus musculus, chromosome 12, genome coordinates 24000 to 27000, with navigate and zoom (+/-) controls. Tracks: your sequence, genes, ESTs, conservation (human, fish).]

How to Use UCSC Browser

Displaying your own data

You can also use the UCSC browser to display your own data, not just your BLASTed sequence. Simply create a text file in one of several specified formats, e.g.

------------------------------------------------------------
browser position chr1:1,000,000-1,050,000
track name=track1 visibility=1 description="My display data" itemRgb="On" priority=1
chr1 1006500 1008500 1006500 0 + 1006500 1008500 0,0,255
chr1 1011500 1012750 1011500 0 + 1011500 1012750 0,100,150
chr1 1015250 1016500 1015250 0 + 1015250 1016500 0,100,150
chr1 1018000 1021000 1018000 0 + 1018000 1021000 0,170,80
chr1 1024500 1028000 1024500 0 + 1024500 1028000 80,170,0
:
:
------------------------------------------------------------

Then load it via the ‘Genomes’ / ‘manage custom tracks’ facility. These mechanisms are well documented on the UCSC site.
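If you have many features, it may be easier to generate the track text with a short script than to type it by hand. The sketch below is one possible way to do it, assuming the nine-column BED layout used in the example above (chrom, start, end, name, score, strand, thickStart, thickEnd, itemRgb); the function name `format_custom_track` is invented for this illustration.

```python
def format_custom_track(position, name, description, features):
    """Build custom-track text: a browser line, a track line, then BED rows.

    features is a list of (chrom, start, end, rgb) tuples.
    """
    lines = [
        f"browser position {position}",
        f'track name={name} visibility=1 description="{description}" itemRgb="On" priority=1',
    ]
    for chrom, start, end, rgb in features:
        if end <= start:
            raise ValueError(f"feature end {end} must be greater than start {start}")
        # As in the example above, the feature name reuses the start coordinate,
        # the score is 0, the strand is +, and the thick region spans the feature.
        lines.append(f"{chrom} {start} {end} {start} 0 + {start} {end} {rgb}")
    return "\n".join(lines) + "\n"


features = [
    ("chr1", 1006500, 1008500, "0,0,255"),
    ("chr1", 1011500, 1012750, "0,100,150"),
]
track_text = format_custom_track(
    "chr1:1,000,000-1,050,000", "track1", "My display data", features
)
print(track_text)
```

The resulting text can be saved to a file, or pasted directly, when adding a custom track.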

Exercises 1.

Find the web site for the Santa Cruz Genome Browser (sometimes called the Golden Path), and investigate the three genes for which you have the full-length cDNA sequence, or the protein sequence, in the file example-sequences.html.

>TNeu084i05 (Xenopus)
• How many exons does the gene appear to have?
• Has it been mapped already?
• Are there any likely upstream regulatory elements (look for conservation across species)?
• Are there other genes nearby?

>TGas122d03 (Xenopus)
• Is this a relatively unique gene, or a member of a gene family?
• What can we learn from the comparison with human genes?
• Are there any differences between the gene model predicted from your cDNA and the existing predictions?

>hsp70-5 (human)
• Starts with the protein sequence. How might this be better?

Exercise 1. Results >TNeu084i05

Exercises 2.

Now go to the two other main genome browsers, Ensembl and NCBI. Find the Xenopus genome (at the moment you won’t find it at NCBI, so use the mouse genome there instead), and see if you get the same sort of functionality from them. Use the same two sequences. Are there different features? Are they easier or harder to use?

Part 2: Identifying Novel Proteins

[Figure: the sequence to analyse ("what is its function?") is searched with BLAST against a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3), leading to a FUNCTIONAL ANNOTATION, here "Gravin-like".]

Different Possible Outcomes

Suppose you have a cDNA sequence and you run BLASTx:

1 - KNOWN: genes of identifiably the same function in several different species
    4e-014 - polyunsaturated fatty acid elongase [Xenopus laevis]
    7e-140 - fatty acid elongase 2 [Rattus norvegicus]
    1e-140 - ELOVL6 protein [Homo sapiens]

2 - NOVEL: genes of unknown function in several different species
    2e-103 - unnamed protein product [Tetraodon nigroviridis]
    3e-115 - 2310009N05Rik protein [Mus musculus]
    5e-117 - hypothetical protein FLJ22378 [Homo sapiens]

3 - ORPHAN: genes with no significant BLASTx hits in other species
    7.3 - 1-deoxy-D-xylulose 5-phosphate synthase [Chlamydophila abortus]
    4.7 - PREDICTED: similar to tweety 2 isoform 1 [Bos taurus]

4 - OUCH..!: significant BLASTx hits in phylogenetically distant species
    2e-200 - coat maintenance protein [Escherichia coli]

Different Ways not to Know Anything

Your lack of knowledge about protein function, having directly compared your sequence with all known proteins in the database, will manifest itself in two rather different ways.

1. It looks like a NOVEL gene: we find plenty of evidence for orthologous genes, but their annotations are just different ways of saying that we know nothing about their function either.

2. It looks like an ORPHAN gene: this is a sign that this protein may only exist in your organism. The phenomenon is quite well documented (see reference). Obviously these are going to be quite tough to work on, as nothing like them has been seen before…

Special case: there are good BLASTx matches with phylogenetically DISTANT organisms. Check for contamination!

Reference: Domazet-Loso T, Tautz D. An Evolutionary Analysis of Orphan Genes in Drosophila. Genome Res. 2003 Oct; 13(10): 2213-2219.

Indirect Functional Identification

So you’ve found a gene you’re interested in, you’ve blasted it against the biggest protein database you can find, and you still have no real clues as to what its function might be. What do you do next? (First, make sure you really have a gene on your hands.)

1. LOOK FOR MORE DISTANTLY RELATED GENES WITH ANNOTATION
If there are believable BLASTx matches, but they are all predicted genes with no functional annotation, it might still be possible to use them as stepping stones to other, more informative, BLASTx matches which would not show up as similar to the original sequence. Think of this as traversing the phylogenetic tree.

2. FIND PARTIAL OR INDIRECT DATA: DOMAINS, EXPRESSION, ETC.
Accumulate as much partial data about the sequence as you can, in the hope that it sheds light on the function. This will include functional protein domains, expression data, genomic alignment and secondary structure. It’s unlikely that you will become casually involved with higher order structures, as solving or comparing these is a complex and specialised task.

Phylogenetic Stepping Stones

[Figure: phylogenetic tree with leaves "your species", species B, species C, species D, and species E (function known)]

Consider a gene which has the same function across many phyla, and suppose we consider a phylogenetic tree based on sequence similarity. It’s possible that the sequence of the gene in your species is sufficiently similar to its orthologs in species B and C that these will show up in a BLAST search, but not similar enough to those in species D or E. The sequence of the gene in species C, however, is more similar to those in D and E. So once you get to C and BLAST from there, you might get to E, which happens to have been researched and whose function is known. This could be done manually, but it has been formalised in PSI BLAST, which uses iterative rounds of BLAST searching to build a more generalised model of the gene sequence, and uses this ‘evolving’ model to gradually traverse the tree. If it is not used carefully, though, it can go horribly wrong…
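The manual version of the stepping-stone idea amounts to a search over "who hits whom". As a toy illustration only, the sketch below encodes the scenario from this slide as a hand-written table of hypothetical BLAST hits (the `blast_hits` dictionary is invented, not real search output) and uses a breadth-first search to find the shortest chain of hits from your sequence to an annotated one.

```python
from collections import deque

# Hypothetical table: for each species' sequence, which others show up as
# believable BLAST hits. This mirrors the scenario described on the slide.
blast_hits = {
    "your species": ["species B", "species C"],
    "species B": ["your species", "species C"],
    "species C": ["your species", "species B", "species D", "species E"],
    "species D": ["species C", "species E"],
    "species E": ["species C", "species D"],
}
annotated = {"species E"}  # sequences whose function is known


def stepping_stones(start, hits, annotated):
    """Breadth-first search for the shortest chain of hits to an annotated sequence."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] in annotated:
            return path
        for neighbour in hits.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no annotated sequence is reachable


path = stepping_stones("your species", blast_hits, annotated)
print(" -> ".join(path))
```

In practice each "edge" in the table is a separate BLAST search whose hits you have judged believable, which is exactly where the care is needed.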

PSI BLAST (Position Specific Iterated, since you asked)

Initial Query
SREFTHYQWERLIKKTYFARFHNCMLISFSWER

Matches from database
SREKTSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHNCMLIPKSWER

New Composite Query (alternative residues shown under each position)
SREFTHYQWERLIKKTYFARFHNCMLISFSWER
   K S  A    IWER     I    PQ
    D   U    P T            K

2nd Round Matches from database
SREKTSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHNCMLIPKSWER
PRAKDTRQIQRLSYWTTFLLFVITSLQRKITER
PRAKDTRQIQRLSYWTTFLLFVITSLQRKITER

And so on…
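The ‘composite query’ step can be mimicked with a few lines of Python. This is a deliberately simplified sketch using the toy sequences from this slide: it just collects the residues seen at each position, whereas the real PSI BLAST builds a weighted position-specific scoring matrix. The function name `composite_query` is made up for this example, and it assumes ungapped matches of equal length.

```python
def composite_query(query, matches):
    """For each position, return the query residue followed by any new
    residues seen at that position in the matches."""
    if any(len(match) != len(query) for match in matches):
        raise ValueError("this sketch assumes ungapped matches of equal length")
    columns = []
    for position, residue in enumerate(query):
        seen = [residue]
        for match in matches:
            if match[position] not in seen:
                seen.append(match[position])
        columns.append("".join(seen))
    return columns


query = "SREFTHYQWERLIKKTYFARFHNCMLISFSWER"
matches = [
    "SREKTSYQAERLIIWERFARFHICMLIPQSWER",
    "SREKDSYQUERLIPWTYFARFHNCMLIPKSWER",
]
columns = composite_query(query, matches)

# Report only the positions (1-based) where the matches add something new.
variable = {i + 1: column for i, column in enumerate(columns) if len(column) > 1}
for position, residues in variable.items():
    print(position, residues)
```

Each later round would search with this widened model, which is how matches that look nothing like the original query (such as the PRAKD… sequences above) can eventually be pulled in.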

PSI BLAST Round 1 results

Finally some function!

Functional Domain Analysis

Proteins are considered to have functional domains within them: specific regions of the protein which have specific tasks. These domains are recognisably conserved between different proteins, even though the overall similarity of the proteins may be quite low.

[Figure: Typical Diagram of Functional Domains on a Protein]

Functional Domain

If you can find functional domains, you may know something about the general behaviour of your protein, even if you don’t know exactly what its function is. But, as usual, be aware that non-significant matches are quite likely to be displayed on any analysis website, so at least look for a confidence score or some other measure of significance, and treat everything with a degree of caution.

The main specialised sites for this type of analysis are SMART and Pfam, which have considerable overlapping functionality, and InterProScan, which attempts to integrate all the available tools.

The search methods are rather different from BLAST, and rely primarily on building up a model of the functional domain from known examples. The model is then a generalised pattern for a given domain, and your unknown sequences are searched against the models using rather more advanced methods, typically involving Hidden Markov Models.

Functional Domains and Hidden Markov Models

Once a functional domain has been identified in a number of sequences, we can build a model of it, by which we just mean a summation of our understanding of the linear sequence variants.

1234567890
YSCMVGHEAL
FSCVVGHEAL
YTCKVDHETL
FTCQVTHEGD
YSCRVKHVTL
YTCVVGHEAL

position  1    2    3    4    5    6    7    8    9    0
model     Y/F  S/T  C    ?    V    ?    H    ~E   ?    ~L
score     5    5    10        10        10   8         8

The scores may be arbitrary, but they constitute the Hidden Markov Model by which we evaluate other proteins to see if they contain this domain. As you accumulate more examples the model gets more refined, and hopefully more accurate. The higher the score of your test protein sequence against the model, the more likely it is presumed to contain the domain. The model will also allow for the possibility of (expensive) gaps if the spacing of your real sequence doesn’t fit the model. Known variable regions can be modelled as cheaper gaps.
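To make "building a model from examples" concrete, here is a much-simplified sketch in Python. It is not a Hidden Markov Model: it only counts residues per column of the six ungapped example sequences from this slide and scores a test sequence by how often its residues were seen at each position, with no gaps or transition probabilities. The function names `build_profile` and `score` are invented for this illustration.

```python
from collections import Counter

EXAMPLES = [
    "YSCMVGHEAL",
    "FSCVVGHEAL",
    "YTCKVDHETL",
    "FTCQVTHEGD",
    "YSCRVKHVTL",
    "YTCVVGHEAL",
]


def build_profile(examples):
    """Count how often each residue occurs in each column of an ungapped alignment."""
    length = len(examples[0])
    if any(len(sequence) != length for sequence in examples):
        raise ValueError("this sketch assumes ungapped examples of equal length")
    return [Counter(sequence[i] for sequence in examples) for i in range(length)]


def score(sequence, profile, n_examples):
    """Sum, over positions, the fraction of examples that share the test residue."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the model (no gaps in this sketch)")
    return sum(column[residue] / n_examples for residue, column in zip(sequence, profile))


profile = build_profile(EXAMPLES)

# Columns where every example agrees (1-based positions).
conserved = [i + 1 for i, column in enumerate(profile) if len(column) == 1]
print("fully conserved positions:", conserved)

for test in ["YSCMVGHEAL", "FTCAVAHEAL", "AAAAAAAAAA"]:
    print(test, round(score(test, profile, len(EXAMPLES)), 2))
```

Sequences that match the conserved columns score highly even where the variable positions differ, which is the behaviour the hand-assigned scores on the slide are meant to capture.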

Problems with Models by Example

There are two conceptual problems with building models from examples.

Firstly, the behaviour of the protein domain is likely to be related to the three-dimensional shape of the molecule and the nature of its interactions with other molecules. As we are not taking these into account at all, we cannot expect our model to be very realistic.

Secondly, the model is (by its nature) highly biased towards the examples already found, and further examples found with the help of the model will tend to reinforce any initial bias. So our model may tend to grow away from the actual consensus across all possible proteins, and lock us out of whole subsets of data.

Incidentally, this problem of bias is very similar to what can happen with PSI BLAST if your choice of proteins to include in your growing model diverges from your original sequence too much, which can quickly take you off into strange territory…

Using SMART

Exercise 1: Using Pfam and SMART

Online Scratch Pad
For the following exercises, you may find a scratch pad useful for keeping information from previous stages of a search. If you open up the file scratch-pad.html you’ll find you can keep text data in the outlined box. You cannot save the data, and it’ll vanish if you close the window or refresh it!

Go to the example-sequences.html file and the Protein Domain Searches section, and copy the sequence for >igf4D. Then go to the SMART web site, paste your sequence, tick at least the signal peptides box, and run the search. While that’s running, go to the Pfam site (in a new browser window) and search the same sequence there. Compare the two result sets. Is there any difference? Should we expect any?

Now go to the NCBI BLAST page and do a protein-protein BLASTp search; this may be a useful way of getting to the same data. What could you have learned about the function of this gene?

If you are ahead of the rest of the group, check out the results for the much longer >titin sequence.

Using SMART

Exercise 2: Random Sequences Again

We recall that random DNA sequences gave us alignments against real proteins when using BLASTx, and that E-values can give us a good idea of whether alignments are biologically meaningful or not. This becomes even more important when searching for subtler matches, which are generally shorter sequences with considerable variation allowed at most positions.

Go to the file random-protein-sequences.html and copy the sequence assigned to you. Go to whichever of the Pfam or SMART web sites you preferred, and run the search on your sequence. Did you find any domain hits? Were they significant? Was it possible to tell? Look at the actual alignments, if you can find out how to, and also see if you can find the model that the domain is based on. Repeat with a second sequence if you have time.

Functional Motifs in Proteins

You may be more familiar with functional motifs in DNA sequences, e.g. transcription factor binding sites. Here for example is the (Xenopus) TBox motif:

T[CG]A[CG]AC[CG]T

But short motifs are also present in protein sequences, e.g. FHA domain interaction motif 1:

T..[ILA]

(The forkhead-associated (FHA) domain binds phosphothreonine- or phosphoserine-containing peptides.)

The general problem with motifs is the number of false positives, as they are generally pretty short. For the above example we can easily see that approximately every 20th amino acid will be a T, and about 1 in 7 of these will have I, L or A in the third position following. So this motif should appear about every 140 amino acids in a random sequence. This implies a pretty high rate of (probably) false positives, and the almost certain need for confirmatory biology!
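The back-of-envelope estimate above is easy to check by simulation. The sketch below is only an illustration under a strong assumption, namely that all 20 amino acids are equally likely, which is not true of real proteins. Under that assumption the exact expectation is 20 × 20/3, about 133 residues between occurrences, in line with the rough figure of 140. The function names are invented for this example.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The FHA motif T..[ILA], wrapped in a lookahead so that overlapping
# occurrences are all counted.
FHA_MOTIF = re.compile(r"(?=T..[ILA])")


def expected_spacing(alphabet_size=20, allowed_at_last=3):
    """Expected number of residues per motif occurrence, assuming uniform residues."""
    probability = (1 / alphabet_size) * (allowed_at_last / alphabet_size)
    return 1 / probability


def count_motif(sequence):
    """Count (possibly overlapping) occurrences of the motif in a sequence."""
    return sum(1 for _ in FHA_MOTIF.finditer(sequence))


rng = random.Random(0)  # fixed seed so the run is repeatable
sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(100_000))
occurrences = count_motif(sequence)

print("expected residues per occurrence:", round(expected_spacing(), 1))
print("observed residues per occurrence:", round(len(sequence) / occurrences, 1))
```

The same regular expression could be run over a real protein, but a match found this way is only a candidate site and says nothing about whether the motif is functional.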

The ELM Server (Eukaryotic Linear Motif)

The ELM server (http://elm.eu.org/):

“ELM is a resource for predicting functional sites in eukaryotic proteins. Putative functional sites are identified by patterns (regular expressions). To improve the predictive power, context-based rules and logical filters are applied to reduce the amount of false positives.”

We can judge the problem of interpreting these searches if we use a randomly generated sequence and send it to the ELM server…

Functional Motifs Reported by ELM in a Random Amino Acid Sequence

Secondary Structure Analysis

The weak neighbour-neighbour interactions between amino acids in a protein molecule give rise to a small number of basic structural arrangements. The two main forms are linear helical structures (alpha-helices) and sheets of parallel chains (beta-sheets); hydrogen bonds between residues stabilise both structures. We may consider that the larger scale structure of the whole protein is built from these smaller scale structures, and as such they may give us some insight into the role of the protein even in the absence of much functional data.

The 3-dimensional protein structures that you see pictures of are often composed of alpha-helices and beta-sheets linked by less well structured sections of the protein.
http://www.chemsoc.org/exemplarchem/entries/2004/durham_mcdowall/prot-3.html

There are a large number of web pages devoted to analysing proteins for secondary structure, and even some which attempt to aggregate the results of several different methods (at PBIL).

Is it Really a Gene?

If you are really getting nowhere with your functional analysis, it may be worth checking whether you have got a gene at all. There are several circumstances in which this might arise.

If you are using a physical reagent like a cDNA clone, it’s possible that it contains an incomplete mRNA sequence, and you are just looking at a plausible but unreal ORF in the 3’ UTR. Or it could contain an unspliced immature transcript. Or it could even be contamination from some other, very different species, e.g. bacteria. You may learn a lot by aligning your sequence with the organism’s genome, to check that it is there and that it appears to have exons (if you would expect them).

Or, if you found the gene by some sort of mapping/positional analysis and you are analysing sequences from gene models shown on the genome, check that there is real (e.g. EST) evidence for this gene. It may be purely theoretical, and entirely bogus…

Genomic Analysis It is possible that analysing the position of your gene on the genome can tell you something about its possible function. Genes sometimes function in ‘expression cassettes’, where neighbouring genes are either co-expressed, or under closely related (temporal or spatial) regulation. So if nearby genes are well characterised it would be worth considering this as a possibility. Equally, if there are obvious orthologs of this gene in other species, check out the genomic context there too. You should also be able to find out if your gene is a member of a gene family, or whether it shares small regions of coding sequence with other genes. Is there a way of doing tBLASTn or tBLASTx against the genome in your preferred browser?

Expression Data

Genes that are co-expressed may well be involved in the same pathways; the more intricate the pattern of co-expression, the greater the likelihood. You may find genes of known function that yours is associated with. If you found the gene originally in an expression array experiment, this may be an easy way in. Alternatively, there is a growing amount of expression data out there in databases, although at the moment it’s pretty difficult to mine it systematically. Various efforts are underway to facilitate this (FlyMine, ArrayExpress), though it’s not clear yet how effective these are. It may also be difficult to track ‘your gene’ down in the data sets.

If your gene is from an EST or cDNA sequence, see if the ESTs are clustered and check out which libraries they come from. This may tell you whether your gene is expressed in specific stages/tissues, or whether it is more ubiquitous.

Exercise 3: Genuine Unknowns
• The sequence file identification-example-sequences.html contains 12 gene sequences from Xenopus tropicalis which superficially look hard to identify. The full cDNA sequence is given, along with the amino acid sequence translated from the presumed ORF.
• Start with the first sequence, and accumulate data about it, then work your way on down the list…
• Consider doing the following searches:
  • Check BLASTx/p (new sequences are arriving on the database all the time)
  • Consider whether PSI BLAST might be useful
  • Check against the genome
  • Look for functional protein domains
  • Look for secondary structure
• If you find anything that looks useful, keep a note of it. But bear in mind that, in the real world, you may soon be thinking about going back to the laboratory for further experimental work!

Exercise 3: Results

>u-one Xt6.1-CAAL21151.3: Dpy30, SCOP domains; PSI 2 rounds -> chloroplast enolase?ADP-ribosylation factor-like
>u-two Xt6.1-CABJ8169.5: sipP, RUN, PDZ, PTB domains; PSI 2 rounds -> rap2 interacting protein x
>u-three TEgg047e16: clear orphan, no domains, no results with PSI BLAST, Egg/Ova/Gas EST expression
>u-four IMAGE:7016814: Globin domains, odd organisms, no hit on genome; worm contamination, adult whole body lib.
>u-five IMAGE:5384335: signal peptide, seven transmembrane regions (!)
>u-six TEgg044i21: signal peptide, coiled coils domain; PSI 2 rounds -> yeast-tht1
>u-nine CABE11813: long protein, no domains, no more additions after 2 rounds of PSI BLAST, all_predicted
>u-ten TGas024h08: long protein, no domains, sort-of-name; PSI 2 rounds -> chloroplast RNA processing 1 1e-05...
