Protein Information Resource

Protein Information Resource Oversight and Scientific Advisory Board Meeting November 14, 2005 Georgetown University Medical Center

Welcome and Introduction Vassilios Papadopoulos, Ph.D. Associate Vice President & Director, Biomedical Graduate Research Organization Georgetown University Medical Center David States, M.D., Ph.D. Chair, PIR Oversight and Scientific Advisory Board Professor & Director of Bioinformatics, University of Michigan

PIR/UniProt Overview: Project Overview, Organization, Infrastructure Cathy H. Wu, Ph.D. Director, PIR Professor, Georgetown University Medical Center

Protein Information Resource (PIR) Integrated Protein Informatics Resource for Genomic/Proteomic Research • UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function • PIRSF Family Classification System: Protein Classification and Functional Annotation • iProClass Integrated Protein Database: Data Integration and Protein Mapping • Cyber Infrastructure (Interoperability and Dissemination): Ontology, XML, Object/Relational DB, J2EE Architecture http://pir.georgetown.edu

UniProt: Universal Protein Resource Central Resource of Protein Sequence and Function • International Consortium • Protein Information Resource (PIR) • European Bioinformatics Institute (EBI) • Swiss Institute of Bioinformatics (SIB) • NIH U01 Grant (NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR) • Phase I (09/02-08/05): $6 Million Annual • Bridge (09/05-?/06): $6.6M • Phase II (?/06-?/09): $6.6-8.0(?)M http://www.uniprot.org NHGRI

UniProt Databases • UniProt Archive (UniParc): Comprehensive sequence archive with sequence history. Produced at EBI • UniProt Reference Clusters (UniRef): Non-redundant reference clusters for sequence search. Produced at PIR • UniProt Knowledgebase (UniProtKB): Integration of PIR-PSD, Swiss-Prot and TrEMBL databases. Stable, comprehensive, fully classified, richly and accurately annotated knowledgebase • UniProtKB/Swiss-Prot: Produced at SIB • UniProtKB/TrEMBL: Produced at EBI • Literature-based and automated annotation at SIB, PIR, EBI

UniProt Management Structure • Scientific Advisory Panel (SAP) to be established by NHGRI

UniProt Project Coordination • UniProt email discussion groups • Project Liaisons and Ad hoc teams • Tri-weekly teleconference calls • Tri-annual face-to-face Consortium meetings • January 12-13, 2006 at Geneva • April 10-11, 2006 at Georgetown University • Exchange visits of scientific and technical staff • Five PIR staff at SIB (1-2 weeks, Nov 05) for annotation integration • Retreats France, 2004

UniProt Activities at PIR • Integration of PIR-PSD into UniProtKB Swiss-Prot/TrEMBL • Incorporation of unique PIR entries • Incorporation of PIR annotations: references, experimental features with literature evidence tag • Functional annotation of UniProtKB proteins • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins • Development of rule-based annotation system & PIRNR (name rule) /PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site) • Production of UniRef100/90/50 databases =>Enhancement & scaling • Creation of UniProt web site and help system => Unified UniProt web site & user community interaction

PIRSF Classification System Protein Classification and Functional Annotation • PIRSF: Evolutionary relationships of proteins from super- to sub-families • Curated families with name rules and site rules • Curation platform with classification/visualization tools • Deliverables: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform PIRSF Work Group Meeting, April 2003

iProClass Integrated Protein Database Data Integration and Protein Mapping • Data integration from >90 databases • Underlying data warehouse for protein ID/name/bibliography mapping • Integration of protein family, function, structure for functional annotation • Rich link (link + summary) for value-added reports of UniProt proteins Funded by NSF

iProLINK Literature Mining Resource • Bibliography report: Annotated bibliography for UniProtKB proteins • BioThesaurus reports: Protein and gene names for UniProtKB proteins • RLIMS-P program: Tag PubMed abstracts for phosphorylation objects • Protein ontology DAG: PIRSF-based ontology Funded by NSF

NIAID Proteomic Admin Center • NIAID Proteomic Master Catalog & Complete Proteomes • iProXpress for Protein Function and Pathway Analysis • Gene/Peptide-Protein Mapping • Sequence Analysis & Data Mining • Function/Pathway Discovery http://pir.georgetown.edu/proteomics/ Funded by NIAID

Bioinformatics Infrastructure • NCI caBIG: PIR grid-enablement (Programming access to UniProtKB) • NSF TeraGrid: All-against-all BLAST (UniProtKB related sequences) • PIR Bioinformatics Framework • Software Framework: J2EE n-Tier Architecture with Object Models • Database Distribution: XML, FASTA, Relational (Oracle 9i, MySQL) • Other Deliverables: Object Models, Web Services Funded by NCI

Computing Environment • Computers: Two Sun V880, IBM P690, 100-CPU Linux Cluster, Compaq 4100 Alpha • Networking: Internet2, GU Network (1Gbps) • GU UIS Advanced Research Computing

PIR Environment • Funding: ~$3 Million Annual Total (2/3 UniProt, 1/3 Other) • Home Institution: Georgetown University Medical Center (GUMC) • Subcontract: National Biomedical Research Foundation (NBRF) • New Location: Off-Campus (GU North Campus), 6,250 sq ft, Suite 1200, 3300 Whitehaven Street NW, Washington, DC 20007

PIR Organization • 25 Staff Members • 14 GU, 11 NBRF • 22 FTEs • 12.7 GU, 9.3 NBRF • 17 with Doctorate Degree • 11 GU Faculty • 2 Professors • 1 Research Associate Professor • 6 Research Assistant Professors • 2 Research Instructors

PIR Community Interactions (since 2004) • Presentations and Invited Seminars • NIH Proteomics Workshop (Bi-Annual): Bioinformatics Day • Conference Demos/Posters: ISMB-05, US HUPO-05, SOFG04 • Over 20 Invited Presentations: Keystone, Human Brain Project Satellite Symposium, PDB Symposium, HUPO-05 • Policy Forums, Committees: NSF Plant Cyberinfrastructure, NIH Protein Structure Initiative, HUPO Proteomics Standards Initiative • Publications: Over 25 Refereed Papers and Book Chapters • Collaborations and Interactions • Collaborated and interacted with over 10 research institutions • Hosted face-to-face meetings for NIAID/caBIG projects • Paper and Grant Reviews • Reviewed over 20 papers for refereed journals and conferences • Served on NSF/NIH grant review panels

PIR-Georgetown Interactions • Teaching • Courses: Bioinformatics (BCHB 521), Advanced Bioinformatics (BCHB 621) • Lectures: Medical Biochemistry, Protein Biomarker, Introductory Biology • Mentoring • Mentored 9 graduate students (PhD students, MS Internship projects) • Intercampus Seminars • Proposal Submission by PIR Young Investigators as PI • Six proposals to federal and other agencies

PIR/UniProt - Summary & Statistics • Database Growth • Database Usage • Unified UniProt Web Site • PIR UniProt Consortium Interactions Peter McGarvey, Ph.D.

UniProt: Universal Protein Resource http://www.uniprot.org

Database Growth

Customer Email Topics help@uniprot.org & pirmail@georgetown.edu • 550 UniProt emails • 720 PIR emails • 1 Day Turnaround “PIR is a wonderful resource.” – Craig “Thank you for your prompt response, as always UniProt is on the ball!” – Fiona

PIR/UniProt – Unified UniProt Web Site • Dec. 03, Three Synchronized Sites based on PIR Design • Nov. 04, Established Goals for Unified Web Sites. • 2005, Back-end Data and Software Platform Developed. • Nov. 05, PIR Playing a Lead Role in Developing Specifications for the Interface. • June 06, Release of Unified UniProt Web Site Hosted by PIR and EBI

PIR/UniProt - Consortium Interactions • UniProt liaison group (discussion of high-level issues) • UniProt web site committee (Unified UniProt web site planning) • UniProt Link committee (working with external databases) • UniProt help-mail (answering user inquiries) • UniProt document committee (documentation, tutorials and FAQs) • UniProt XML group (XML documentation and maintenance) • UniProt group for automatic annotation pipeline • Manual curation of Swiss-Prot template sequences • Manual curation of site rules and controlled vocabularies • Development of automatic annotation rules • Development of protein naming guidelines • Incorporation of new protein families into InterPro • PIR routinely visits or hosts colleagues from EBI and SIB for discussions. • Biweekly update of UniRef, UniParc and UniProtKB databases

Protein Classification and Annotation Darren Natale, Ph.D. Team Lead, Protein Science, PIR Research Assistant Professor, GUMC

Protein Curation Activities • PIRSF – classification of homeomorphic proteins based on evolutionary relationships • PIRNR – family-based “Name Rules” that define the parameters for propagating specific name, EC and GO annotation to members • PIRSR – family-based “Site Rules” that define the parameters for propagating specific feature annotation to members

Specialized Tools (I) • Pfam/PIRSF Hierarchy • Domain Relatives • Domain Composition The DAG preserves these three features in a navigable format. In edit mode, it allows easy creation, destruction, and movement of PIRSFs.

Specialized Tools (II) PIR Tree and Alignment Viewer (PIRTAV) [Figure: phylogenetic tree, classification/annotation, and alignment views for HPS and KGPDC] HPS = 3-hexulose-6-phosphate synthase; KGPDC = 3-keto-L-gulonate 6-phosphate decarboxylase

PIRSF Curation Pipeline • Uncurated level – computer-generated • Preliminary Curation Level • Curate membership (principal tools: BLAST results, iterative blastclust, on-the-fly HMM) • Curate domain architecture • Select seeds • Full Curation Level • Curate name and some references • Optional: write abstract indicating function, structure, etc. (Full level only) After name review session and HMM performance check, all information (HMM, membership, annotation) is sent to EBI for integration into InterPro.

PIRNR Curation Pipeline • Start with PIRSF curated to Full level • Define match criteria for application of the rule • Review protein name, synonyms, EC numbers, GO terms • Find those that are appropriate to propagate to members that match rule criteria After review of propagable information, send match conditions, exclusion conditions, and propagated fields to EBI for inclusion into automatic annotation pipeline. Results are displayed in EBI’s UniProt entry extended view.
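The name-rule idea on this slide (match conditions, exclusion conditions, propagated fields) can be sketched in a few lines of Python. This is only an illustration: the field names, the length-based match condition, the taxon-based exclusion, and all identifiers and values are assumptions made for the example, not the actual PIRNR rule format.

```python
from dataclasses import dataclass, field


@dataclass
class NameRule:
    """Hypothetical shape of a family-based name rule."""
    rule_id: str
    pirsf_id: str                               # family the rule is attached to
    min_length: int                             # assumed match condition
    max_length: int                             # assumed match condition
    excluded_taxa: frozenset = frozenset()      # assumed exclusion condition
    propagated: dict = field(default_factory=dict)  # fields copied to members


def apply_rule(rule, protein):
    """Return the annotation to propagate, or {} if the protein does not qualify."""
    if protein["pirsf"] != rule.pirsf_id:
        return {}                               # not a member of the family
    if not rule.min_length <= protein["length"] <= rule.max_length:
        return {}                               # fails the match condition
    if protein["taxon"] in rule.excluded_taxa:
        return {}                               # hits an exclusion condition
    return dict(rule.propagated)                # copy, so callers cannot edit the rule


# Placeholder rule and proteins (all values invented for illustration).
demo_rule = NameRule(
    rule_id="PIRNR_DEMO",
    pirsf_id="PIRSF_DEMO",
    min_length=200,
    max_length=260,
    excluded_taxa=frozenset({"Archaea"}),
    propagated={"name": "Example family protein", "go": ["GO:0000000"]},
)
member = {"pirsf": "PIRSF_DEMO", "length": 230, "taxon": "Bacteria"}
fragment = {"pirsf": "PIRSF_DEMO", "length": 90, "taxon": "Bacteria"}

print(apply_rule(demo_rule, member))    # annotation is propagated
print(apply_rule(demo_rule, fragment))  # {} because the length check fails
```

Keeping the propagated fields separate from the conditions mirrors the slide's split between "match conditions, exclusion conditions, and propagated fields".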

PIRSR Curation Pipeline • Start with PIRSF with curated membership and seeds. At least one member must have solved structure. • Edit seed-to-structure alignment to define and retain conserved regions covering pertinent residues • Build Site HMM from concatenated conserved regions • Define feature annotation using controlled vocabulary with evidence attribution Apply rules to PIRSF members, create log files to send to SIB (UniProtKB/Swiss-Prot) or EBI (UniProtKB/TrEMBL). Results are incorporated into UniProtKB flat files.

Progress on Protein Curation Activities [Figure: curation progress charts. Labeled counts: 162 Preliminary, 693 Full, 352 Full + Desc; 428 DE/GO/EC, 342 DE/GO, 157 DE; 35 Active, 34 Metal/Binding, 14 Misc. Unlabeled values: 1207, 1001, 83; 561, 420, 251, 112, 38, 14; 4222, 1595, 1266]

Impact Measurements • PIRSFs integrated into InterPro • Sent: 1,775 • PIRSF-unique: 840 • PIRNR touches on UniProtKB/TrEMBL • Entries: 60,300 • Annotation lines: 281,400 • PIRSR touches on UniProtKB • Entries: 41,000 (9,800) • Feature lines: 100,000 (27,000)

Increasing Throughput & Impact [Diagram labels: Curated, With Structure, Full, Active + Ligand, To InterPro, AutoAnno, Active, Increased specificity] PIRSF PIRNR PIRSR • Emphasize Full/InterPro • Rules to EBI • Active sites • Comprehensive coverage • Curation “push” • Propagation at PIR • Add ligand-binding All three will be integrated into the Swiss-Prot annotation platform

UniRef Databases Hongzhan Huang, Ph.D. Bioinformatics Team Lead Protein Information Resource, GUMC

UniRef (UniProt Reference Clusters) • Non-Redundant Reference Clusters for Sequence Searching • Derived from UniProtKB and Selected UniParc Sources • UniRef100: 100% sequence identity • UniRef90: 90% sequence identity (1/3 size reduction from UniRef100) • UniRef50: 50% sequence identity (2/3 size reduction) Release 6.4 (Nov 05)

UniRef100 • The most comprehensive sequence dataset for sequence similarity search • 3,176K sequences in UniRef100 vs. 3,022K sequences in NCBI nr • Source Sequences • Complete UniProtKB - Splice Variants as separate entries • Selected UniParc (e.g. Ensembl and RefSeq) • Non-Redundancy • Combine identical sequences from all species • Merge sub-fragments
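The two non-redundancy steps named on this slide can be illustrated with a toy Python sketch. It assumes that a "sub-fragment" is any sequence that occurs as an exact substring of a longer one; the slide does not give the actual criteria, so that definition, the function name, and the sample records are all hypothetical.

```python
def build_nonredundant(records):
    """Group entry IDs under one representative sequence.

    records: iterable of (entry_id, sequence) pairs.
    Returns {representative_sequence: [entry_ids]}.
    """
    # Step 1: combine identical sequences, whatever their source.
    by_sequence = {}
    for entry_id, seq in records:
        by_sequence.setdefault(seq, []).append(entry_id)

    # Step 2: merge sub-fragments into a longer sequence that contains them.
    # Longest sequences are visited first so they become representatives.
    clusters = {}
    for seq in sorted(by_sequence, key=lambda s: (-len(s), s)):
        parent = next((rep for rep in clusters if seq in rep), None)
        if parent is None:
            clusters[seq] = list(by_sequence[seq])
        else:
            clusters[parent].extend(by_sequence[seq])
    return clusters


# Invented toy records: A and B are identical, C is a fragment of both.
toy = [("A", "MKTAYIAK"), ("B", "MKTAYIAK"), ("C", "TAYI"), ("D", "GGGG")]
for representative, ids in build_nonredundant(toy).items():
    print(representative, ids)
```

The substring scan is quadratic in the number of distinct sequences, which is fine for a sketch but would need an index to work at database scale.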

UniRef90 & UniRef50 • Reduced sequence datasets for faster sequence similarity search • Representative sequence for each cluster • Clustering Algorithm • CD-HIT: Fast, top down, non-overlapping • PIR’s parallelized version running on Linux Cluster UniRef90: 1/3 size reduction UniRef50: 2/3 size reduction
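As a rough illustration of the greedy, non-overlapping clustering style that CD-HIT is known for, the Python sketch below sorts sequences by length, lets the longest unassigned sequence found a cluster, and assigns each later sequence to the first representative it matches at or above the threshold. The `identity` function is a deliberately crude stand-in (best ungapped match fraction over the shorter sequence), not CD-HIT's alignment, and nothing here reflects PIR's parallelized version.

```python
def identity(short, long_seq):
    """Fraction of `short` matched in its best ungapped placement on `long_seq`.

    Assumes len(short) <= len(long_seq).
    """
    best = 0
    for offset in range(len(long_seq) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long_seq[offset:]))
        best = max(best, matches)
    return best / len(short)


def greedy_cluster(seqs, threshold):
    """Cluster {id: sequence} greedily; returns {representative_id: [member_ids]}."""
    reps = []  # (representative_id, representative_sequence, member_ids)
    # Longest first, so every representative is at least as long as its members.
    for seq_id, seq in sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        for _, rep_seq, members in reps:
            if identity(seq, rep_seq) >= threshold:
                members.append(seq_id)   # joins exactly one cluster
                break
        else:
            reps.append((seq_id, seq, [seq_id]))  # founds a new cluster
    return {rep_id: members for rep_id, _, members in reps}


# Invented toy sequences.
toy = {
    "s1": "MKTAYIAKQR",
    "s2": "MKTAYIAKQL",   # differs from s1 at one of ten positions
    "s3": "MKTAY",        # exact fragment of s1
    "s4": "WWWWWWWWWW",   # unrelated
}
print(greedy_cluster(toy, 0.9))
print(greedy_cluster(toy, 1.0))
```

Lowering the threshold merges more sequences per representative, which is the mechanism behind the size reductions quoted for UniRef90 and UniRef50.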

UniRef50 Sequence Classification • Completely automated, biweekly-updated classification of all proteins • How good are the UniRef50 clusters? • Evaluated by all-against-all BLAST search results • 98% of the clusters are of good quality: each sequence matches every other sequence within the cluster • Problematic clusters • One long sequence bridges two or more non-related sub-clusters. • May result from incorrect gene models, domain fusion, or polyproteins • New algorithm will be developed with length/overlap parameters to detect and regroup such clusters.
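The quality criterion stated here (each sequence matches every other sequence within the cluster) amounts to checking that the cluster is fully connected in the all-against-all hit graph. A minimal Python sketch follows, assuming the BLAST results have already been reduced to a set of unordered matching pairs; the pair representation and the sample IDs are assumptions for illustration.

```python
from itertools import combinations


def missing_pairs(members, hits):
    """Return member pairs with no recorded match, as sorted tuples.

    hits: set of frozenset({id_a, id_b}) for every pair that matched.
    """
    return [
        tuple(sorted(pair))
        for pair in combinations(members, 2)
        if frozenset(pair) not in hits
    ]


def is_good_cluster(members, hits):
    """A cluster is 'good' when every member matches every other member."""
    return not missing_pairs(members, hits)


# Invented hit set: a, b, c all match each other; x and y match only 'long'.
hits = {
    frozenset(("a", "b")), frozenset(("a", "c")), frozenset(("b", "c")),
    frozenset(("x", "long")), frozenset(("y", "long")),
}
print(is_good_cluster(["a", "b", "c"], hits))     # fully connected
print(missing_pairs(["x", "long", "y"], hits))    # 'long' bridges x and y
```

In the second cluster the only missing pair is the one between the two short sequences, which is the bridging pattern the slide describes as problematic.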

Usages of UniRef Clusters • UniRef90/50 for comprehensive automated classification of proteins • Faster searches and less cluttered similarity search outputs • More even sampling of sequence space and reduction of search bias • UniRef for integrity check of database annotation • UniRef100 to annotate EST sequences • UniRef50 to detect incorrect gene models • UniRef90/50 for PIRSF family classification • UniRef90 to recruit new PIRSF family members • UniRef50 to create new PIRSF families [Diagram labels: UniRef50 Clusters; Merge related clusters; PIRCF Families (Computer-generated Families); Checked by curator; PIRSF Families]

Literature Mining Zhang-Zhi Hu, M.D. Associate Team Lead, Protein Science, PIR Research Assistant Professor, GUMC

iProLINK: An Integrated Resource for Protein Literature Mining • Complete UniProtKB bibliography mapping • RLIMS-P text mining tool for protein phosphorylation • BioThesaurus: protein/gene names

PIR/UniProt Protein Bibliography • 355,629 unique citations (PMID) are in iProClass for 2.4 million UniProtKB entries. • 166,950 (47%) citations are currently in UniProtKB. • The additional 188,679 (53%) unique citations are taken from sources such as GeneRIF, SGD, MGI. Bibliography report: • curated citations • user submitted • computationally mapped
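The citation counts on this slide are internally consistent, which a short Python check makes explicit. The variable names are chosen for this example; the numbers are the ones quoted above.

```python
# Counts quoted on the slide.
total_citations = 355_629      # unique PMIDs in iProClass
in_uniprotkb = 166_950         # already present in UniProtKB
additional = 188_679           # from other sources (e.g. GeneRIF, SGD, MGI)


def percent(part, whole):
    """Share of `whole` as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# The two subsets partition the total.
print(in_uniprotkb + additional == total_citations)
print(percent(in_uniprotkb, total_citations), percent(additional, total_citations))
```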

BioThesaurus report BioThesaurus: comprehensive collection of gene/protein names from multiple sources and their associations with database entities. Applications of BioThesaurus • Gene/protein names mapping • Search synonyms • Resolve name ambiguity • Database annotation • Error detection: conflicting names in UniProtKB • Literature mining • Query expansion: synonyms and text-variants allow for expanded search results Example: IAPP named in 18 entries
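Two of the applications listed here, resolving name ambiguity and query expansion, can be sketched with a simple name-to-entry index in Python. The record layout, the case-insensitive normalization, and the entry IDs E1 to E3 are assumptions for illustration and say nothing about how BioThesaurus is actually built.

```python
from collections import defaultdict


def build_index(name_records):
    """Map each normalized name to the set of entry IDs that use it."""
    index = defaultdict(set)
    for entry_id, name in name_records:
        index[name.strip().lower()].add(entry_id)
    return index


def ambiguous_names(index):
    """Names attached to more than one entry."""
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


def expand_query(query, name_records):
    """All names used by any entry that carries the query name."""
    index = build_index(name_records)
    entries = index.get(query.strip().lower(), set())
    return sorted(
        {name.strip().lower() for entry_id, name in name_records if entry_id in entries}
    )


# Invented toy records (entry IDs are placeholders).
records = [
    ("E1", "IAPP"), ("E1", "Amylin"),
    ("E2", "IAPP"), ("E2", "Islet amyloid polypeptide"),
    ("E3", "Insulin"),
]
print(ambiguous_names(build_index(records)))
print(expand_query("IAPP", records))
```

A name shared by many entries, as in the IAPP example on the slide, shows up in `ambiguous_names`, while `expand_query` returns the synonyms that would broaden a literature search.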

RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation [Figure: RLIMS-P report for PMID:1939059. Labels: MEDLINE abstract (PubMed ID); RLIMS-P phosphorylation feature extraction (kinase, substrate, sites); PMID mapping; UniProtKB entry mapping (P12957); UniProtKB site feature annotation & evidence attribution] • 1876 UniProtKB entries are currently annotated with 4042 phosphorylation sites. • 105K unique citations (PMID) are in UniProtKB/Swiss-Prot • Batch processing by RLIMS-P yielded 4690 abstracts with phosphorylation information, 913 of them with site information, including 214 in UniProtKB entries with no annotated phosphorylation features.
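To give a feel for rule-based extraction of the three phosphorylation objects named on this slide (kinase, substrate, site), here is a single hand-written pattern in Python. It is a toy: the regular expression and function name are invented for this example and are not RLIMS-P's rules, which the slide does not describe.

```python
import re

# One illustrative pattern: "<kinase> phosphorylates <substrate> on/at <site>".
PHOSPHO_PATTERN = re.compile(
    r"(?P<kinase>[\w-]+) phosphorylates (?P<substrate>[\w-]+) "
    r"(?:on|at) (?P<site>(?:Ser|Thr|Tyr)-?\d+)"
)


def extract_phosphorylation(abstract):
    """Return one dict per match with 'kinase', 'substrate' and 'site' keys."""
    return [match.groupdict() for match in PHOSPHO_PATTERN.finditer(abstract)]


sentence = "We show that PKA phosphorylates CREB at Ser-133."
print(extract_phosphorylation(sentence))
```

A real system needs many such patterns plus handling of passive voice, nominalizations, and entity recognition; a single template like this one only covers the most direct phrasing.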

NIAID Biodefense Proteomics Program Peter McGarvey, Ph.D.

NIAID Biodefense Proteomics Program • 7 Proteomics Research Centers: Identifying Targets for Therapeutic Interventions “..discovering targets for potential candidates for the next generation of vaccines, therapeutics, and diagnostics” • Administrative Resource Center: Support research centers, public distribution of results and protocols ..establish a Scientific Working Group, Interoperability Working Group, Data infrastructure and promote awareness of the project so scientists worldwide can utilize these resources.

Administrative Resource • Project Management - Social & Scientific Systems (SSS) • Meetings and Communications • Web Portal • NIAID Annual Meeting at PIR May 2006 • Scientific Coordination - PIR & VBI • Scientific Advisory Working Group (SWG) • Interoperability Working Group (IWG) • Data Infrastructure – PIR & VBI • Proteomic Database: Storage and Retrieval (VBI) • Data Management and Analysis Tools (PIR/VBI) • Integrated Protein Knowledge System (PIR)
