Comprehensive Automation and Analysis of Rice Genome Annotation

Rice Sequence and Map Analysis Leonid Teytelman

Rice Genome Annotation • Sequence Alignments • Automation • Comparative Maps • Genetic Marker Correspondences • FPC Map • FPC I-Map • EnsEMBL Pipeline • Automated Annotation • Compute Farms

Rice Genome Annotation

Aligned Data Sets:
• Rice Coding Sequences
  • Rice Complete CDSs
  • Rice TIGR GIs
  • Rice BGI EST Clusters
  • Rice dbEST ESTs
  • Rice BGI ESTs
• Non-Rice Coding Sequences
  • Maize Unigene Clusters
  • Maize TIGR GIs
  • Maize dbEST ESTs
  • Barley dbEST ESTs
  • Wheat dbEST ESTs
  • Sorghum dbEST ESTs
• Rice CUGI BAC ends
• Rice JRGP/Cornell RFLP Markers
• Rice Cornell SSRs

Alignment Tools:
• BLAT: search & alignment
• pslReps: filtering of low-quality matches
• e-PCR: matches based on near-identity to the PCR primers and their correct order


Alignment Methods:
• Rice Coding Sequences:
  • BLAT search & alignment
  • pslReps filtering of repetitive matches
  • Accept based on percent of EST length matched
• Non-Rice Coding Sequences:
  • BLAT search & alignment
  • pslReps filtering of repetitive matches
  • Accept based on hit length and hit frequency
• Rice BAC ends:
  • BLAT search & alignment
  • Accept based on gap length, percent of BAC end length matched, percent identity, and hit frequency
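The "percent of EST length matched" criterion can be sketched as a filter over BLAT's PSL output. This is a minimal illustration, not the author's script: it assumes the standard 21-column tab-separated PSL layout, and the 0.9 threshold and the function names are hypothetical, since the slides do not state the cutoffs used.

```python
from typing import NamedTuple


class PslHit(NamedTuple):
    matches: int
    mismatches: int
    rep_matches: int
    q_name: str
    q_size: int
    t_name: str


def parse_psl_line(line: str) -> PslHit:
    """Parse one data line of a PSL file (21 tab-separated columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 21:
        raise ValueError(f"expected 21 PSL columns, got {len(fields)}")
    return PslHit(
        matches=int(fields[0]),
        mismatches=int(fields[1]),
        rep_matches=int(fields[2]),
        q_name=fields[9],
        q_size=int(fields[10]),
        t_name=fields[13],
    )


def query_coverage(hit: PslHit) -> float:
    """Fraction of the query (e.g. an EST) that is aligned to the target."""
    aligned = hit.matches + hit.mismatches + hit.rep_matches
    return aligned / hit.q_size


def accept(hit: PslHit, min_coverage: float = 0.9) -> bool:
    """Keep a hit only if enough of the query is aligned.

    min_coverage is a hypothetical threshold chosen for illustration.
    """
    return query_coverage(hit) >= min_coverage
```

The same structure would extend to the other criteria on the slide (gap length, percent identity, hit frequency) by reading further PSL columns and counting hits per query.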

Alignment Methods:
• Rice Markers:
  • BLAT search & alignment
  • Accept based on percent of marker length matched and, for genomic markers, the gap length
  • Use genetic map information: accept markers whose genetic and physical chromosome assignments are concordant
• Rice SSRs:
  • e-PCR with default parameters, allowing 0 mismatches in the primers
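The idea behind the SSR step (exact primer matches, in the correct order, at a plausible spacing) can be illustrated with a small sketch. This is not e-PCR itself and does not reproduce its algorithm or options; it searches only the forward strand, and `find_products`, `expected_size`, and `margin` are hypothetical names introduced for the example.

```python
from typing import List, Tuple

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def find_all(text: str, pattern: str) -> List[int]:
    """Return the start positions of every exact occurrence of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def find_products(sequence: str, forward: str, reverse: str,
                  expected_size: int, margin: int = 50) -> List[Tuple[int, int]]:
    """Find (start, product_size) pairs for a primer pair, 0 mismatches.

    The forward primer must occur upstream of the reverse complement of
    the reverse primer, and the product size must be within `margin`
    of `expected_size`.
    """
    seq = sequence.upper()
    fwd = forward.upper()
    rev_rc = reverse_complement(reverse)
    hits = []
    for f_pos in find_all(seq, fwd):
        for r_pos in find_all(seq, rev_rc):
            if r_pos < f_pos + len(fwd):
                continue  # wrong order, or primers overlap
            size = r_pos + len(rev_rc) - f_pos
            if abs(size - expected_size) <= margin:
                hits.append((f_pos, size))
    return hits
```

A fuller treatment would also search the reverse strand and handle ambiguity codes; those are left out to keep the sketch short.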

February 2002 BAC/PAC Dataset
• Total BACs/PACs: 1,847
• Total bp: 250,879,896 (about 250 Mb)
• Phase 1: 78
• Phase 2: 1,238
• Phase 3: 531
• Annotated Phase 3: 330
• Annotated Genes: 8,034

Alignment Totals

Automating Alignments:
• For each group of data sets, there is a script to automatically:
  • Run pslReps
  • Load results into the database
  • Discard low-quality matches
  • Update documentation
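A per-group driver for the four steps above might look like the following sketch. It is an assumption about structure, not the author's script: the pslReps call uses the tool's basic `in.psl out.psl out.psr` argument form with no options, and `process_group`, `load`, `discard`, and `notes` are hypothetical names. The database and command runner are passed in as callables so the flow can be exercised without the real tools.

```python
from pathlib import Path
from typing import Callable, List


def pslreps_command(raw_psl: Path, out_psl: Path, out_psr: Path) -> List[str]:
    """Build a basic pslReps invocation: pslReps in.psl out.psl out.psr."""
    return ["pslReps", str(raw_psl), str(out_psl), str(out_psr)]


def process_group(
    group: str,
    raw_psl: Path,
    run: Callable[[List[str]], None],
    load: Callable[[str, Path], int],
    discard: Callable[[str], int],
    notes: List[str],
) -> None:
    """Run the four steps for one group of data sets.

    run     -- executes a command line (e.g. a subprocess wrapper)
    load    -- hypothetical loader; returns number of alignments stored
    discard -- hypothetical filter; returns number of alignments removed
    notes   -- list used here as a stand-in for the documentation update
    """
    filtered = raw_psl.with_suffix(".filtered.psl")
    psr = raw_psl.with_suffix(".psr")
    run(pslreps_command(raw_psl, filtered, psr))   # 1. run pslReps
    loaded = load(group, filtered)                 # 2. load into database
    dropped = discard(group)                       # 3. discard low quality
    notes.append(f"{group}: loaded {loaded}, discarded {dropped}")  # 4. document
```

In real use, `run` could be something like `lambda cmd: subprocess.run(cmd, check=True)`, so that a failing step stops the group instead of loading partial results.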

Comparative Maps

Map Correspondences
The same marker in multiple mapping studies, identified by:
• Name identity
• Curated evidence
• Sequence-based correspondences for JRGP and Cornell markers:
  • BLAT search & alignment
  • Use genetic mapping information, accepting matches on the same chromosome and less than 30 cM apart
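The acceptance rule for sequence-based correspondences (same chromosome, less than 30 cM apart) is simple enough to state as code. The sketch below assumes each marker carries a chromosome label and a cM position on its own map; the `MapPosition` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPosition:
    marker: str
    map_name: str     # e.g. "JRGP" or "Cornell"
    chromosome: str
    cm: float         # genetic position in centimorgans


def is_correspondence(a: MapPosition, b: MapPosition,
                      max_distance_cm: float = 30.0) -> bool:
    """Accept a sequence match between two markers as a correspondence
    only if both map to the same chromosome and their genetic positions
    differ by less than max_distance_cm."""
    same_chromosome = a.chromosome == b.chromosome
    return same_chromosome and abs(a.cm - b.cm) < max_distance_cm
```

For example, a JRGP marker at 10.0 cM and a Cornell marker at 35.5 cM on the same chromosome would be accepted (25.5 cM apart), while the same pair on different chromosomes would not.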


FPC data from CUGI, synchronized with the latest release.


Cornell/JRGP markers mapped to sequenced clones were assigned positions on the FPC contigs.

Total: 2,272 4,417

EnsEMBL Pipeline in a Nutshell

EnsEMBL Pipeline Overview
[Diagram labels: RepeatMasker, Genscan, Blast, GenomeBuilder, Hmmer; RepeatMasker, BLAT, GeneWise, Hmmer]
• System for automated genome annotation
• Executes and keeps track of computational jobs
• Analysis job execution is serial, allowing stage dependencies
• Jobs are user-defined
• Can take advantage of a compute farm

Organization
• Uses and extends the EnsEMBL-core modules and database schema
• Database stores:
  • analysis program names and parameters
  • analysis results
  • rules for job dependencies
  • progress status for each job
• Perl modules:
  • access the database
  • execute specified analysis programs
  • parse the analysis results and load them into the database
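To make the "rules for job dependencies" and "progress status" ideas concrete, here is a small sketch of how stored rules and per-contig status could determine which jobs are ready to run. It is written in Python for illustration and is not the EnsEMBL Perl API or schema; the `RULES` table, the dependency order in it, and `next_jobs` are all assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical dependency rules: analysis -> analyses that must finish first.
RULES: Dict[str, List[str]] = {
    "RepeatMasker": [],
    "Genscan": ["RepeatMasker"],
    "Blast": ["Genscan"],
}


def next_jobs(rules: Dict[str, List[str]],
              completed: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (contig, analysis) pairs whose prerequisites are all done.

    `completed` maps each contig to the set of analyses already finished
    for it, standing in for the progress status kept in the database.
    """
    jobs = []
    for contig in sorted(completed):
        done = completed[contig]
        for analysis in sorted(rules):
            if analysis in done:
                continue  # already finished for this contig
            if all(dep in done for dep in rules[analysis]):
                jobs.append((contig, analysis))
    return jobs
```

Because status is tracked per contig, one contig can be at the Genscan stage while another is still waiting for RepeatMasker, which is what allows stage dependencies and parallel execution to coexist.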

Cluster Utilization
• How to split up tasks?
  • Contig-by-contig approach
• How to execute jobs on slave nodes?
  • Load management and scheduling (LSF, PBS, etc.)
• Management of management:
  • Automatic job submission
  • Error/completion checking
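The "management of management" layer can be sketched as two small helpers: one that builds a scheduler submission per contig and analysis, and one that decides which failed jobs to resubmit. The example assumes LSF's `bsub` with its `-J`, `-o`, and `-e` options; `run_analysis` and its flags are a hypothetical wrapper script, and the retry limit is an illustrative choice, not something the slides specify.

```python
from typing import Dict, List


def bsub_command(contig: str, analysis: str, log_dir: str) -> List[str]:
    """Build an LSF submission for one contig/analysis pair.

    `run_analysis` is a hypothetical wrapper that runs a single analysis
    on a single contig; it is not part of LSF or EnsEMBL.
    """
    tag = f"{contig}.{analysis}"
    return [
        "bsub",
        "-J", tag,                        # job name
        "-o", f"{log_dir}/{tag}.out",     # standard output file
        "-e", f"{log_dir}/{tag}.err",     # standard error file
        "run_analysis", "--contig", contig, "--analysis", analysis,
    ]


def jobs_to_resubmit(exit_codes: Dict[str, int],
                     attempts: Dict[str, int],
                     max_attempts: int = 3) -> List[str]:
    """Return names of finished jobs that failed and may be retried.

    exit_codes -- job name -> exit status (0 means success)
    attempts   -- job name -> number of times the job has been submitted
    """
    return sorted(
        job for job, code in exit_codes.items()
        if code != 0 and attempts.get(job, 1) < max_attempts
    )
```

Splitting work contig by contig keeps each job small and independent, so a failure affects only one contig and one analysis and can be resubmitted without redoing the rest.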
