Bioinformatics: gene-protein-structure-function

Bioinformatics: gene-protein-structure-function Teresa K.Attwood School of Biological Sciences University of Manchester, Oxford Road Manchester M13 9PT, UK http://www.bioinf.man.ac.uk/dbbrowser/

Foreword • Predicting genes in uncharacterised genomic DNA is one of the main problems facing sequence annotators. De novo prediction methods (searching for splice-site consensus motifs, biased codon usage, etc.) have been only partially successful, & investigators have found that the surest way of predicting a gene is by alignment with a homologous protein sequence.

Overview • In silico structure &function prediction • the Holy Grail • a reality check • What methods are available • PROSITE, PRINTS, Pfam, etc. • Why not just use PSI-BLAST? • Expert systems & other integrated approaches • Conclusions

The Holy Grail of bioinformatics • ...to be able to understand thewordsin a sequence sentence that form a particular protein structure

The reality of sequence analysis • ...isn't so glamorous....but means we can recognise words that form characteristicpatterns, even if we don't know the precise syntax to build complete protein sentences

Science fact & fiction • The state of the art is pattern recognition • Sequence pattern recognition is easier to achieve & more reliable than fold recognition • which is ~50% reliable even in expert hands • Prediction is still not possible • & is unlikely to be so for decades to come (if ever) • Structural genomics will yield representative structures for many (not all) proteins in future • structures of new sequences will be determined by modelling • prediction will become an academic exercise • But, to debunk a popular myth, knowing structure alone does not inherently tell us function

In silico function prediction…a reality check • What is the function of this sequence? • What is the function of this motif? • the fold provides ascaffold,which can be decorated in different ways by different sequences to confer different functions - knowing the fold & function allows us to rationalise how the structureeffectsits function at the molecular level • What is the function of this structure?

What's in a sequence?

Methods for family analysis Single motif methods Fuzzy regex (eMOTIF) Full domain alignment methods Exact regex (PROSITE) Profiles (PROFILE LIBRARY) HMMs (Pfam) Identity matrices (PRINTS) Multiple motif methods Weight matrices (BLOCKS)

The challenge of family analysis • highly divergent family with single function? • superfamily with many diverse functional families? • must distinguish if function analysis done in silico • a tough challenge!

In the beginning was PROSITE TM domain • [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-LIVMFYWSTAC]-[DENH]-R

Diagnostic limitations of PROSITE • G_PROTEIN_RECEPTOR; PATTERN • PS00237; • G-protein coupled receptor signature • [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]- • X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R • /TOTAL=1121(1121); /POS=1057(1057); /FALSE_POS=64(64); • /FALSE_NEG=112; /PARTIAL=48; UNKNOWN=0(0) • This represents an apparent 20% error rate • the actual rate is probablyhigher • Thus, a match to a pattern is not necessarily true • & a mis-match is not necessarilyfalse! • False-negatives are a fundamental limitation to this type of pattern matching • if you don't know what you're looking for,you'll never know you missed it!

Then came PRINTS TM domain loop region TM domain

Hierarchical family analysis loop region TM domain TM domain

What is PRINTS?(not the best thing since sliced bread, but....) • A db of diagnostic fingerprints that characterise proteins • family analysis ishierarchical, allowing fine-grained diagnoses • Fingerprints are groups of conserved motifs, used for iterative db searching • iteration refines the fingerprint • potency is gained from themutual contextofmotif neighbours • results are biologically more meaningful than from single motifs • results aremanually annotatedprior to inclusion in the db • PRINTS has many applications, e.g.: • basis ofBLOCKS&eMOTIF • EditToTrEMBL- to annotateTrEMBL • provide annotation & hierarchical protein classification inInterPro

Visualising fingerprints N C

Diagnosing partial matches

m-opioid receptor true m-opioid receptor k-opioid receptor

Why bother with family dbs? • Seq searches won't always allow outright diagnosis • BLAST & FASTA are notinfallible & often can't assign significant scores • outputs may be complicated by the multi-domain or modular nature of the protein, compositionally biased regions, repeats & so on • annotations of retrieved hits may be incorrect • Pattern dbs contain potent descriptors • so, distant relationships missed by pairwise search tools may be captured by one or more of the family or functional site distillations

Overview of resources • PROSITE (SIB) - 1144entries • single motifs (regexs) - best withsmall highly conserved sites • Profile library (ISREC) - ~300entries • weight matrices - good withdivergent domains & superfamilies • PRINTS (Manchester) - 1750entries • multiple motifs (fingerprints) - best forfamilies and sub-families • Pfam (Sanger Centre) - 3849entries • HMMs - good withdivergent domains & superfamilies • Blocks (FHCRC) - ~2608entries • multiple motifs (derived from InterPro & PRINTS) • eMOTIF (Stanford) • permissive regexs (derived from PRINTS & BLOCKS)

Designing a search protocol • Given a newly-determined sequence, want to know • what is my protein? • to what family does it belong? • what is its function? • how can we explain this in structural terms? • Given the variety of dbs available, rather than rely on just one, it is important to devise a search protocol • search thesequence & family dbs • estimate significance - compare results & find aconsensus

This does not simply mean.... • BLAST + PROSITE (e.g., on the Web) • or • FASTA + motifs/profiles (e.g., using GCG) • But this is still what most people do • including so-calledexpert systemsfor genome analysis

Expert systems for functional analysis...from genome data to biological knowledge GeneQuiz- Automatic protein function annotation MAGPIE-Automatic genome analysis PEDANT-Automatic analysis of proteins • How they describe themselves: GeneQuiz-Expert systemfor derivation of functional information MAGPIE-Automated Genome Project InvestigationEnvironment PEDANT-Completefunctional & structural characterisation of protein sequences • What they do: GeneQuiz-BLAST/FASTA, PROSITE, BLOCKS MAGPIE-BLAST/FASTA, PROSITE, BLOCKS PEDANT- BLAST/FASTA, PROSITE, BLOCKS

Challenges for expert systems D10226 R13F63 450 320 UL78_HCMVA R05H51 320

What GeneQuiz said… a thrombin receptor?

What GeneQuiz said later… *

Other integrated approachesThe European InterPro project • To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource– InterPro • current release has5312entries • a central annotation resource, with pointers to its satellite dbs • initial partners were PRINTS, PROSITE, profiles & Pfam • new partners include ProDom, TIGRfam, SMART & hopefully others (e.g., BLOCKS, MetaFam) • lags behind its sources • major role in fly & human genome annotation

InterPro – method comparison

Ground rules for bioinformatics • Don't always believe what programs tell you • they're often misleading & sometimes wrong! • Don't always believe what databases tell you • they're often misleading & sometimes wrong! • Don't always believe what lecturers tell you • they're sometimes misleading & oftenwrong! • In short, don't be a naive user • when computers are applied to biology, it is vital to understand the difference between mathematical & biological significance • computers don’t do biology • they do sums • quickly!

Conclusions • Success of search protocols based only on BLAST & PROSITE is likely to be limited • beware‘expert’systems • understand the methods • No db is best - use several • different methods provide different perspectives • dbs aren’t complete & their contents don’t fully overlap • The more dbs searched, harder to interpret results • hence s/w being designed to give"intelligent"consensus outputs • The more computers are used to automate analysis, the greater the need for collaboration • between s/w developers, annotators & ‘wet’ experimentalists • Long way from having reliable analytical tools • but with the right approach…

Bioinformatics: gene-protein-structure-function