Functional Annotation

sabina

Presentation Transcript


  1. Functional Annotation Episode 2: Preliminary Results The Group

  2. Recap • What is Functional Annotation • The Importance of Functional Annotation • The Biology of H. haemolyticus • Background for Functional Annotation • Pros/Cons of Available Approaches • Planned Approach • Breadth • Depth

  3. Flowchart

  4. Flowchart

  5. Preliminary Results

  6. Subject Organisms • fucK: encoding fuculose kinase. fucK deletion has been observed in some Hi isolates • hpd: encoding lipoprotein protein D

  7. BLAST: Output and Parsing • Once the results are received from the gene prediction tools, they are searched with BLAST against several databases • Selected threshold: 0.005 • This is done automatically for all 6 strains by ad hoc scripts that use the BioPerl library • The results are then processed and selected metrics are extracted for further analysis
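The filtering step on this slide can be sketched as follows. This is only an illustration in Python, not the BioPerl scripts the group used. It assumes the BLAST results were saved in the standard 12-column tabular format and that the 0.005 threshold is an E-value cutoff, neither of which the slide states. The gene and subject identifiers in SAMPLE are made up.

```python
import csv
import io
from collections import namedtuple

Hit = namedtuple("Hit", "query subject identity aln_len evalue bitscore")

# Hypothetical tabular BLAST output (12 standard columns, tab separated).
SAMPLE = (
    "gene_1\tsp|P00001|X\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "gene_1\tsp|P00002|Y\t40.0\t150\t90\t2\t5\t154\t10\t160\t0.003\t45.1\n"
    "gene_2\tsp|P00003|Z\t30.0\t80\t56\t1\t1\t80\t3\t82\t0.2\t28.0\n"
)


def parse_tabular(handle, max_evalue=0.005):
    """Yield hits whose E-value is at or below max_evalue (assumed cutoff)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = Hit(row[0], row[1], float(row[2]), int(row[3]),
                  float(row[10]), float(row[11]))
        if hit.evalue <= max_evalue:
            yield hit


def best_hits(hits):
    """Keep the highest-bitscore hit for each query sequence."""
    best = {}
    for hit in hits:
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


def coverage(best, n_queries):
    """Fraction of query genes that have at least one accepted hit."""
    return len(best) / n_queries


hits = list(parse_tabular(io.StringIO(SAMPLE)))
best = best_hits(hits)
print(len(hits), sorted(best), coverage(best, n_queries=2))
```

Keeping one best hit per gene is a design choice made here for brevity; the actual metrics extracted by the group are not listed on the slide.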

  8. BLAST v/s UniProt: Coverage

  9. BLAST v/s UniProt: M19107

  10. BLAST v/s UniProt: M21709

  11. Conserved Domain Database (CDD)

  12. Introduction • CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.  • These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. • The PSSMs are meant to be used for compiling RPS-BLAST search databases only.

  13. RPS-BLAST • Reverse Position-Specific BLAST • Searches a query sequence against a database of profiles (the opposite of PSI-BLAST) • Uses a precomputed lookup table for the profiles so that the search proceeds faster (architecture dependent) • The CD-Search databases for RPS-BLAST: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
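To make the idea of searching a query against a profile concrete, here is a toy ungapped scan of a PSSM along a protein sequence. It is a simplification, not RPS-BLAST itself, which seeds hits through its lookup table and reports alignment statistics. The three-column matrix, its scores and the default penalty are invented for illustration.

```python
# Toy PSSM: one dict per profile column, mapping residue -> score.
# All values are hypothetical.
TOY_PSSM = [
    {"G": 5},
    {"K": 4, "R": 3},
    {"S": 4, "T": 4},
]


def scan_pssm(query, pssm, default=-1):
    """Return (best_score, start) of the best ungapped window, or None.

    Residues missing from a column receive the default (penalty) score.
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


print(scan_pssm("MAGKTLL", TOY_PSSM))
```

A query shorter than the profile yields None, since no full window fits.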

  14. Strategy

  15. FORMATRPSDB • formatrpsdb is a utility that converts a collection of input sequences into a database suitable for use with RPS-BLAST. • formatrpsdb is designed to perform the work of formatdb, makemat and copymat simultaneously, without generating the large number of intermediate files these utilities would need to create an RPS-BLAST database.

  16. Build Database • For scoremats that contain only residue frequencies, the scaling factor to apply when creating PSSMs • Threshold for extending hits for RPS database • Input file containing list of ASN.1 Scoremat filenames • Create index files for database • Base name of output database • Title for database file

  17. RUN RPS-BLAST

  18. Results for CDD: COGs Organism: M19107 >10

  19. Results for CDD: COGs Organism: M21709 >10

  20. LipoP

  21. LipoP • LipoP classifies each predicted protein into one of 4 classes: • SpI: Signal peptide I • SpII: Lipoprotein signal peptide • TMH: N-terminal transmembrane helix (not very reliable; it is used to avoid TMHs being falsely predicted as signal peptides) • CYT: Cytoplasmic (all the rest) • The classification system in LipoP uses an HMM with four branches, one each for SpI, SpII, TMH and CYT. • Protein sets for training and testing were extracted from SWISS-PROT. • They consisted of lipoproteins, SPase I-cleaved proteins and cytoplasmic proteins from the two Gram-negative phyla Proteobacteria and Spirochetes. • Transmembrane proteins were taken from the phyla Proteobacteria and Gracilicutes.

  22. Output Example
      # M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32
      # Cut-off=-3
      M19107_final_1488 LipoP1.0:Best SpI 1 1 11.1193
      M19107_final_1488 LipoP1.0:Margin SpI 1 1 11.320213
      M19107_final_1488 LipoP1.0:Class CYT 1 1 -0.200913
      M19107_final_1488 LipoP1.0:Class SpII 1 1 -1.80091
      M19107_final_1488 LipoP1.0:Signal CleavI 31 32 11.119 # PISHA|SDLNQ
      M19107_final_1488 LipoP1.0:Signal CleavI 30 31 -2.18348 # SPISH|ASDLN
      M19107_final_1488 LipoP1.0:Signal CleavII 19 20 -1.80091 # TALFS|CGLLI Pos+2=G
      Fields, in order: • Sequence ID • Type of prediction: Best is the highest-scoring class, Margin is the difference between the best and the second-best score, Class gives the scores of the other classes, and Signal lines contain predicted cleavage sites • Feature type • Location in the sequence: always 1 for lines with a class prediction; for cleavage sites, the last amino acid of the signal peptide • Location, as above, except that for cleavage sites it is the first amino acid after the cleavage site • Score: for the Margin type, the difference between the best and the second-best class score
      For cleavage sites the ±5 context is shown after the #, and for lipoprotein cleavage sites the amino acid in position +2 is shown (which may determine whether the lipoprotein is attached to the inner or outer membrane). An aspartic acid (D) in position +2 after the cleavage site of a lipoprotein means that it is attached to the inner membrane; most other lipoproteins are attached to the outer membrane ("Testing the '+2 rule' for lipoprotein sorting in the Escherichia coli cell envelope with a new genetic selection", Seydel et al (1999) Molecular Microbiology 34: 810-821)
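A minimal sketch of how the summary line and the +2 residue in this output example might be parsed. The regular expressions are written against the lines shown on this slide only. Treating margin= and cleavage= as optional is an assumption about other LipoP output lines, and the function names are hypothetical.

```python
import re

# Matches summary lines such as:
# "# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32"
SUMMARY_RE = re.compile(
    r"^#\s+(?P<id>\S+)\s+(?P<cls>SpII|SpI|TMH|CYT)\s+score=(?P<score>-?[\d.]+)"
    r"(?:\s+margin=(?P<margin>-?[\d.]+))?"
    r"(?:\s+cleavage=(?P<start>\d+)-(?P<end>\d+))?"
)
POS2_RE = re.compile(r"Pos\+2=(?P<aa>[A-Z])")


def parse_summary(line):
    """Return a dict for a LipoP summary line, or None for any other line."""
    m = SUMMARY_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "id": d["id"],
        "class": d["cls"],
        "score": float(d["score"]),
        "margin": float(d["margin"]) if d["margin"] else None,
        "cleavage": (int(d["start"]), int(d["end"])) if d["start"] else None,
    }


def lipoprotein_sorting(line):
    """Apply the +2 rule quoted on the slide to a CleavII line, if present."""
    m = POS2_RE.search(line)
    if m is None:
        return None
    return "inner membrane" if m.group("aa") == "D" else "outer membrane (likely)"


summary = parse_summary(
    "# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32"
)
print(summary)
```

Lines such as "# Cut-off=-3" do not match the summary pattern and are returned as None, so a caller can simply skip them.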

  23. Results Hh Hi

  24. SignalP

  25. Biological background • Many different types of secretory signals are found. SignalP focuses on predicting classical signal peptides, by far the most common type, which are cleaved by signal peptidase I (SPase I). • In bacteria, proteins carrying a signal peptide are targeted directly to the cell membrane.

  26. SignalP • SignalP 3.0 was the best method in a comparison with PrediSi, SPEPlip, Signal-CF, Signal-3L and Signal-BLAST (Choo, K., Tan, T. & Ranganathan, S. BMC Bioinformatics 10, S2 (2009)). • SignalP 4.0 is reported to perform better still, and was therefore included in our method (SignalP 4.0: discriminating signal peptides from transmembrane regions. Thomas Nordahl Petersen, et al. Nature Methods, 8:785-786, 2011).

  27. SignalP • SignalP 4.0 is a purely neural network-based method. • Two types of networks in SignalP 4.0: • SignalP-TM networks • SignalP-noTM networks • Network selection: if SignalP-TM predicts four or more positions as transmembrane, SignalP-TM is used for the final prediction; otherwise SignalP-noTM is used.
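The selection rule on this slide amounts to a single comparison. The sketch below only restates that rule; the function name and the idea of passing in a count of predicted transmembrane positions are illustrative assumptions, not SignalP's actual interface.

```python
def choose_network(tm_positions_predicted, threshold=4):
    """Pick the network used for the final prediction.

    tm_positions_predicted: number of positions that the SignalP-TM
    network labels as transmembrane (hypothetical input).
    """
    if tm_positions_predicted >= threshold:
        return "SignalP-TM"
    return "SignalP-noTM"


for count in (0, 3, 4, 12):
    print(count, choose_network(count))
```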

  28. Results from SignalP

  29. Comparison between LipoP and SignalP • The results obtained from LipoP and SignalP were compared using a script. • Both SpI and SpII predictions were taken from LipoP, and all positive predictions were taken from SignalP. • The shared predictions were also checked for matching cleavage sites.
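The comparison described on this slide could be sketched with plain set operations. The group's actual script is not shown, so the data layout here is an assumption: LipoP results as a mapping from gene ID to (class, cleavage position), SignalP positives as a mapping from gene ID to cleavage position. All IDs and positions are made up.

```python
def compare_predictions(lipop, signalp):
    """Split genes into shared and tool-specific signal peptide predictions."""
    # Only SpI and SpII count as LipoP signal peptide predictions.
    lipop_pos = {g for g, (cls, _site) in lipop.items() if cls in ("SpI", "SpII")}
    signalp_pos = set(signalp)
    shared = lipop_pos & signalp_pos
    return {
        "both": shared,
        "lipop_only": lipop_pos - signalp_pos,
        "signalp_only": signalp_pos - lipop_pos,
        # Shared genes for which both tools report the same cleavage position.
        "same_cleavage": {g for g in shared if lipop[g][1] == signalp[g]},
    }


# Hypothetical example data.
lipop = {"g1": ("SpI", 31), "g2": ("SpII", 19), "g3": ("CYT", None), "g4": ("SpI", 25)}
signalp = {"g1": 31, "g4": 27, "g5": 22}

result = compare_predictions(lipop, signalp)
print({name: sorted(genes) for name, genes in result.items()})
```

The sizes of "both", "lipop_only" and "signalp_only" are the three regions of a two-set Venn diagram for one strain.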

  30. Comparison table

  31. [Venn diagrams: SignalP vs. LipoP predictions for strains M19501, M19107, M21639, M21127, M21709 and M21621]

  32. Comparison between LipoP and SignalP • Bottom line: as the Venn diagrams show, SignalP provided little new information compared with LipoP.

  33. TMHMM: Prediction of transmembrane helices in proteins

  34. TMHMM

  35. Member signature databases Similar coverage in size; Different content

  36. Querying with InterProScan • About • A wrapper for sequence analysis applications • Database and output file scanning • Bulk data processing • Efficient (parallel) internal architecture • [Diagram labels: InterProScan, Query Sequence]

  37. Querying with InterProScan • Input • Nucleotide* or protein sequences • Recognized sequence formats: raw, FASTA or EMBL • Reformat and translate (if necessary) *Nucleotide sequences will be translated and scanned in all 6 frames without any further assumption
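What scanning "in all 6 frames" involves can be illustrated with a short, dependency-free sketch. It uses the standard codon assignments and marks incomplete or ambiguous codons as X. This is not InterProScan's own translation code, and the frame labels (+1 to +3, -1 to -3) are a naming choice made here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become X, stops become *."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations for the three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```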

  38. Querying with InterProScan • Running InterProScan screenshot at<60s

  39. Querying with InterProScan

  40. Querying with InterProScan • Output • InterProScan makes results available in five formats: raw, ebixml, xml, txt, html • Parse InterProScan output (BioPerl) • Bio::SeqIO::interpro • Interpretation of output data (example)

  41. Querying with InterProScan

  42. Preliminary Results 1,391 53 325

  43. Next Up • Major Challenge: Funneling all the annotation information into a consolidated GenBank/GFF3 entry. • Level 2!

  44. Level 2: Operons, Virulence Factors and Metabolic Pathways

  45. Virulence • Likelihood of a pathogen causing disease

  46. H. haemolyticus • As the species name implies, it is generally hemolytic on blood agar plates • The beta-hemolytic phenotype is routinely used in the clinical setting to distinguish H. haemolyticus from NTHi • Nonhemolytic H. haemolyticus strains are being isolated and misidentified as NTHi • Gene(s) encoding hemolysin: unknown (Xin Wang, Meningitis Laboratory, CDC) • Photograph from MicrobeLibrary.org
