Removing redundancy in SWISS-PROT and TrEMBL
SWISS-PROT
• is a curated protein sequence data bank, established in 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since 1987
• currently contains about 75,000 protein sequence entries
Essential criteria for a sequence data bank
• it must be complete, with minimal redundancy
• it must contain as much up-to-date information as possible on each sequence
• all the information items must be retrievable by computer programs in a consistent manner
• it should be integrated (cross-referenced) with other sequence-related data banks
Annotation consists of the description of:
• Function(s) of the protein
• Post-translational modification(s)
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Disease(s) associated with deficiencies in the protein
• Sequence conflicts, variants, etc.
TrEMBL
• is a computer-annotated supplement to SWISS-PROT
• consists of entries in SWISS-PROT format
• contains the translations of the coding sequences (CDS) in the EMBL Nucleotide Sequence Database that are not in SWISS-PROT
• the translation tools used are based on the program trembl, written by Thure Etzold at the EMBL in Heidelberg
TrEMBLNEW
• is the weekly update of TrEMBL and contains the protein coding sequences derived from EMBLNEW
• TrEMBLNEW entries are moved into TrEMBL during the quarterly release-building procedure
The Production of TrEMBL
• Translation and entry creation
• Sorting the entries
• Automated post-processing of the SP-TrEMBL entries
Automated post-processing of TrEMBL entries
• Redundancy removal: currently affects >10% of the entries
• Improvements to annotation: currently affects >20% of the entries
Removing Redundancy
• Causes of redundancy and the detection of redundancy
• Removing redundancy
Causes of redundancy
• Different literature and sequence reports for the same protein
• Subfragments of longer sequences
• Mutations, polymorphisms, variations and conflicts of a sequence are often given as separate entries in EMBL
Redundancy detection
• The Cyclic Redundancy Check (CRC32) calculates a nearly unique and very compact checksum for each sequence
• The Boyer-Moore sequence comparison algorithm provides fast string searching
• An algorithm that finds strings with errors (Landau-Vishkin)
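
The first two detection steps can be illustrated with a minimal Python sketch. It is not the production code: it assumes sequences are held as plain strings keyed by a hypothetical entry ID, uses the standard-library `zlib.crc32` as the checksum, and substitutes the simpler Horspool variant of Boyer-Moore for the search. Because CRC32 is only nearly unique, each checksum bucket is confirmed with an exact comparison. The Landau-Vishkin search for strings with errors is not covered here.

```python
import zlib
from collections import defaultdict


def crc32_of(sequence):
    """Return the CRC32 checksum of a protein sequence (case-insensitive)."""
    return zlib.crc32(sequence.upper().encode("ascii"))


def group_identical(entries):
    """Group entry IDs whose sequences are identical.

    `entries` maps a (hypothetical) entry ID to its sequence string.
    Candidates are bucketed by checksum first, then confirmed by an
    exact string comparison to rule out checksum collisions.
    """
    buckets = defaultdict(list)
    for entry_id, seq in entries.items():
        buckets[crc32_of(seq)].append(entry_id)

    groups = []
    for ids in buckets.values():
        by_sequence = defaultdict(list)
        for entry_id in ids:
            by_sequence[entries[entry_id].upper()].append(entry_id)
        groups.extend(g for g in by_sequence.values() if len(g) > 1)
    return groups


def horspool_find(text, pattern):
    """Boyer-Moore-Horspool search: index of the first match, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: distance from a character's last occurrence
    # (excluding the final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        # Skip ahead based on the text character under the pattern's end.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

For example, `group_identical({"A": "MKTAYIAK", "B": "MKTAYIAK", "C": "MKTAYIAR"})` groups `A` with `B`, and `horspool_find("MKTAYIAKQR", "AYIA")` locates the fragment at offset 3.
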
Removing Redundancy
• Identical full-length proteins are merged into one entry
• Identical fragment proteins and subfragments of longer sequences from the same organism are merged
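
The merging rule can be sketched as follows. The `Entry` record, its field names, and the `merged_from` list are hypothetical and exist only for illustration. Python's built-in substring test stands in for the Boyer-Moore search, and the sketch assumes that the same-organism condition applies to identical sequences as well as to subfragments, which the slide does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str   # hypothetical identifier
    organism: str
    sequence: str
    merged_from: list = field(default_factory=list)


def merge_subfragments(entries):
    """Fold each entry into a same-organism entry whose sequence contains it."""
    # Longest first, so a containing sequence is always seen before
    # its fragments; the sort is stable, so ties keep input order.
    ordered = sorted(entries, key=lambda e: len(e.sequence), reverse=True)
    kept = []
    for entry in ordered:
        host = next(
            (k for k in kept
             if k.organism == entry.organism and entry.sequence in k.sequence),
            None,
        )
        if host is None:
            kept.append(entry)
        else:
            # Record the merged accession on the surviving entry.
            host.merged_from.append(entry.accession)
    return kept
```

With this rule, a human fragment `TAYIA` is folded into a human entry `MKTAYIAKQR`, while the same fragment from mouse stays a separate entry.
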
Removing Redundancy
The ‘MERGE’ procedure
- Match CRC32
  • match TrEMBLNEW vs TrEMBLNEW (automatic merge)
  • match TrEMBLNEW vs TrEMBL (automatic merge)
  • match TrEMBLNEW vs SWISS-PROT (manual merge)
- Subfragment assembly (LASSAP)
  • match TrEMBLNEW vs TrEMBLNEW (automatic merge and manual check)
  • match TrEMBLNEW vs TrEMBL (automatic merge and manual check)
  • match TrEMBLNEW vs SWISS-PROT (manual merge)
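
The six comparisons above amount to a small policy table: which stage was matched, against which data bank, and how the resulting merge is handled. A minimal sketch, with hypothetical names (`Handling`, `MERGE_POLICY`, `handling_for`) chosen only for illustration:

```python
from enum import Enum


class Handling(Enum):
    AUTOMATIC = "automatic merge"
    AUTOMATIC_CHECKED = "automatic merge and manual check"
    MANUAL = "manual merge"


# (matching stage, data bank that TrEMBLNEW is compared against) -> handling
MERGE_POLICY = {
    ("crc32", "TrEMBLNEW"): Handling.AUTOMATIC,
    ("crc32", "TrEMBL"): Handling.AUTOMATIC,
    ("crc32", "SWISS-PROT"): Handling.MANUAL,
    ("subfragment", "TrEMBLNEW"): Handling.AUTOMATIC_CHECKED,
    ("subfragment", "TrEMBL"): Handling.AUTOMATIC_CHECKED,
    ("subfragment", "SWISS-PROT"): Handling.MANUAL,
}


def handling_for(stage, target):
    """Look up how a match found at `stage` against `target` is merged."""
    return MERGE_POLICY[(stage, target)]


def needs_curator(stage, target):
    """True when a person is involved in the merge at any point."""
    return handling_for(stage, target) is not Handling.AUTOMATIC
```

Laid out this way, the pattern is easy to read off: anything touching SWISS-PROT is merged by hand, and only CRC32 matches within TrEMBL and TrEMBLNEW proceed without a curator.
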
[Diagram: production timeline. Between releases: EMBLNEW (Day 1, Day 2, ..., Day n) → trembl → PIDCheck against SP + TREMBL PIDs (work release); TREMBLNEW (Week 1, Week 2, ..., Week n); SP updates. Building release: replace PIDs in SP + TREMBL; merge TREMBL.]
Results
• EMBL Nucleotide Sequence Database (rel 55) has 326,000 CDS
• SWISS-PROT (rel 36) has 74,019 entries
• TrEMBL (rel 7) has 193,860 entries
• 110,000 CDS were already in 74,000 SWISS-PROT entries
• 207,000 CDS were in 194,000 TrEMBL entries
• 9,000 are currently being processed due to redundancy procedures
Results
Results of redundancy removal within TrEMBL 7 production
- 743 were already in SWISS-PROT
- 3380 were merged due to CRC32 matches
- 4736 were removed by subfragment matches
8,859 entries were removed in total
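
As a quick consistency check, the figures on the two Results slides can be added up. The sketch below uses only numbers quoted on the slides; the dictionary keys are descriptive labels added here, and the CDS figures are the rounded values as given.

```python
# Entries removed during TrEMBL 7 production, by cause.
removed = {
    "already in SWISS-PROT": 743,
    "merged due to CRC32 matches": 3380,
    "removed by subfragment matches": 4736,
}
total_removed = sum(removed.values())

# Where the CDS of EMBL release 55 ended up (rounded slide figures).
cds_accounted = {
    "in SWISS-PROT entries": 110_000,
    "in TrEMBL entries": 207_000,
    "still being processed": 9_000,
}
total_cds = sum(cds_accounted.values())

print(total_removed)  # 8859
print(total_cds)      # 326000
```

Both sums agree with the slides: the three removal causes account for the 8,859 removed entries, and the three CDS categories account for the 326,000 CDS in EMBL release 55.
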
Credits
SWISS-PROT at EBI: Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Henning Hermjakob, Viv Junker, Fiona Lang, Claire O'Donovan, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Steffen Moeller, Youla Karavidopoulou, Gill Fraser, Evguenia Kriventseva
Collaborators: Amos Bairoch, Eric Glemet, Jean-Jacques Codani