Comprehensive Automation and Analysis of Rice Genome Annotation
This study presents an in-depth approach to rice genome annotation, focusing on sequence alignment, automation techniques, and comparative mapping methods. We detail the EnsEMBL pipeline utilized for efficient annotation processes, including handling of various rice coding sequences, EST clusters, and markers. The automation of alignment workflows is discussed, along with methods for filtering low-quality matches and enhancing alignment accuracy through scripts. The results contribute to improved genetic marker correspondences, facilitating future genomic research and crop improvement efforts.
Rice Sequence and Map Analysis Leonid Teytelman
Rice Genome Annotation • Sequence Alignments • Automation • Comparative Maps • Genetic Marker Correspondences • FPC Map • FPC I-Map • EnsEMBL Pipeline • Automated Annotation • Compute Farms
Aligned Data Sets: • Rice Coding Sequences • Rice Complete CDSs • Rice TIGR GIs • Rice BGI EST Clusters • Rice dbEST ESTs • Rice BGI ESTs • Non-Rice Coding Sequences • Maize Unigene Clusters • Maize TIGR GIs • Maize dbEST ESTs • Barley dbEST ESTs • Wheat dbEST ESTs • Sorghum dbEST ESTs • Rice CUGI BAC ends • Rice JRGP/Cornell RFLP Markers • Rice Cornell SSRs
Alignment Tools: • BLAT: search & alignment • pslReps: filtering of low-quality matches • e-PCR: matches based on near-identity to the PCR primers, and correct order
Alignment Methods: • Rice Coding Sequences: • BLAT search & alignment • pslReps filtering of repetitive matches • Accept based on percent of EST length matched • Non-Rice Coding Sequences: • BLAT search & alignment • pslReps filtering of repetitive matches • Accept based on hit length and hit frequency • Rice BAC ends: • BLAT search & alignment • Accept based on gap length, percent of BAC end length matched, percent identity, and hit frequency
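The "accept based on percent of EST length matched" rule can be sketched in Python as a filter over BLAT's PSL output. This is only an illustration: the 0.9 cutoff, the `PslHit` class, and the sample records are assumptions, since the slides do not state the thresholds actually used.

```python
from dataclasses import dataclass


@dataclass
class PslHit:
    matches: int
    mismatches: int
    rep_matches: int
    q_name: str
    q_size: int
    t_name: str

    @property
    def coverage(self) -> float:
        """Fraction of the query (e.g. an EST) covered by aligned bases."""
        return (self.matches + self.mismatches + self.rep_matches) / self.q_size


def parse_psl_line(line: str) -> PslHit:
    """Parse one headerless, tab-separated PSL record (21 columns)."""
    f = line.rstrip("\n").split("\t")
    return PslHit(
        matches=int(f[0]),
        mismatches=int(f[1]),
        rep_matches=int(f[2]),
        q_name=f[9],
        q_size=int(f[10]),
        t_name=f[13],
    )


def accept_hits(lines, min_coverage=0.9):
    """Keep hits whose query coverage meets the threshold (0.9 is an assumed value)."""
    hits = (parse_psl_line(line) for line in lines if line.strip())
    return [h for h in hits if h.coverage >= min_coverage]


# Two made-up PSL records: EST_A aligns over 485 of 500 bases, EST_B over 210 of 600.
SAMPLE = [
    "480\t5\t0\t0\t0\t0\t1\t120\t+\tEST_A\t500\t0\t485\tBAC_1\t150000\t1000\t1605\t2\t200,285,\t0,200,\t1000,1320,",
    "200\t10\t0\t0\t0\t0\t0\t0\t-\tEST_B\t600\t50\t260\tBAC_2\t120000\t500\t710\t1\t210,\t340,\t500,",
]

accepted = accept_hits(SAMPLE)
```

The same shape would extend to the other acceptance criteria on this slide (hit length, hit frequency, gap length, percent identity) by adding further properties and predicates.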
Alignment Methods: • Rice Markers: • BLAT search & alignment • Accept based on percent of marker length matched and the gap length in case of genomic markers. • Utilize genetic map information; accept those whose genetic & physical chromosome assignment is concordant. • Rice SSRs: • e-PCR with default parameters, allowing 0 mismatches in the primers
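To make the SSR step concrete, here is a minimal Python sketch of the idea behind an in-silico PCR search with 0 mismatches: find the forward primer, then the reverse-complemented reverse primer downstream of it, and report products within a size window. It is not e-PCR's actual algorithm; the function names, the size limits, and the toy sequence are assumptions, and only the forward strand is searched.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_products(target, fwd_primer, rev_primer, min_size=20, max_size=500):
    """Return (start, end, size) for each exact-match product on the forward strand."""
    target = target.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer)
    products = []
    start = target.find(fwd)
    while start != -1:
        # The reverse primer site must lie downstream of the forward primer
        # (this is the "correct order" requirement).
        pos = target.find(rev_site, start + len(fwd))
        while pos != -1:
            end = pos + len(rev_site)
            size = end - start
            if size > max_size:
                break
            if size >= min_size:
                products.append((start, end, size))
            pos = target.find(rev_site, pos + 1)
        start = target.find(fwd, start + 1)
    return products


# Toy target: forward primer site, 30 bp spacer, reverse primer site.
TARGET = "CCCC" + "ATGGCTAGCT" + "A" * 30 + "GGATCCTTAG" + "CCCC"
hits = find_products(TARGET, "ATGGCTAGCT", "CTAAGGATCC")
```

Allowing a small number of mismatches would mean replacing `str.find` with an approximate matcher; the exact-match version corresponds to the 0-mismatch setting named on the slide.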
February 2002 BAC/PAC Dataset: • Total BACs/PACs: 1,847 • Total bp: 250,879,896 (about 251 Mb) • Phase 1: 78 • Phase 2: 1,238 • Phase 3: 531 • Annotated Phase 3: 330 • Annotated Genes: 8,034
Automating Alignments: • For each group of data sets, there is a script to automatically: • Run pslReps • Load results into the database • Discard low-quality matches • Update documentation
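A per-group driver script of the kind listed above might look like the following Python sketch. Only the `pslReps in.psl out.psl out.psr` argument order comes from the real tool; the group names, file names, thresholds, and the `load_alignments`, `discard_low_quality`, and `update_docs` commands are hypothetical stand-ins for whatever the original scripts called.

```python
# Hypothetical per-group settings; names, paths and thresholds are illustrative.
DATASET_GROUPS = {
    "rice_coding": {"psl": "rice_coding.psl", "min_coverage": 0.9},
    "non_rice_coding": {"psl": "non_rice_coding.psl", "min_coverage": 0.5},
}


def plan_steps(group):
    """Build the ordered list of (step name, command) pairs for one data set group."""
    cfg = DATASET_GROUPS[group]
    raw = cfg["psl"]
    filtered = raw.replace(".psl", ".filtered.psl")
    psr = raw.replace(".psl", ".psr")
    return [
        # pslReps takes: input PSL, output PSL, output PSR.
        ("filter", ["pslReps", raw, filtered, psr]),
        # The three commands below are hypothetical placeholders.
        ("load", ["load_alignments", "--group", group, filtered]),
        ("discard", ["discard_low_quality", "--group", group,
                     "--min-coverage", str(cfg["min_coverage"])]),
        ("document", ["update_docs", "--group", group]),
    ]


def run_group(group, execute):
    """Run each step in order and stop at the first non-zero exit code.

    `execute` is any callable that takes a command list and returns an exit code.
    """
    log = []
    for name, cmd in plan_steps(group):
        code = execute(cmd)
        log.append((name, code))
        if code != 0:
            break
    return log


# Dry run: print each command instead of executing it.
def dry_run(cmd):
    print(" ".join(cmd))
    return 0


log = run_group("rice_coding", dry_run)
```

For real execution, `execute` could be `lambda cmd: subprocess.run(cmd).returncode`; injecting it keeps the planning logic testable without the tools installed.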
Map Correspondences: • Same marker on multiple mapping studies: • Name-identity • Curated evidence • Sequence-based correspondences for JRGP and Cornell markers: • BLAT search & alignment • Utilize genetic mapping information, accepting matches on the same chromosome and less than 30 cM apart
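The sequence-based acceptance rule (same chromosome, less than 30 cM apart) can be sketched in Python as follows. The `MapPosition` class, the marker names, and the positions are invented for illustration; the input pairs are assumed to be markers already linked by a BLAT alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPosition:
    marker: str
    map_name: str
    chromosome: int
    cm: float  # genetic position in centimorgans


def is_concordant(a: MapPosition, b: MapPosition, max_distance_cm: float = 30.0) -> bool:
    """True if both markers are on the same chromosome and less than 30 cM apart."""
    return a.chromosome == b.chromosome and abs(a.cm - b.cm) < max_distance_cm


def sequence_correspondences(aligned_pairs, positions, max_distance_cm=30.0):
    """Filter BLAT-aligned (JRGP, Cornell) marker pairs by genetic-map concordance.

    aligned_pairs: iterable of (jrgp_marker, cornell_marker) names
    positions: dict mapping marker name to MapPosition
    """
    return [
        (j, c) for j, c in aligned_pairs
        if is_concordant(positions[j], positions[c], max_distance_cm)
    ]


# Invented example data.
POSITIONS = {
    "J1": MapPosition("J1", "JRGP", 1, 42.0),
    "C1": MapPosition("C1", "Cornell", 1, 55.5),   # same chromosome, 13.5 cM away
    "J2": MapPosition("J2", "JRGP", 3, 10.0),
    "C2": MapPosition("C2", "Cornell", 5, 12.0),   # different chromosome
    "J3": MapPosition("J3", "JRGP", 7, 5.0),
    "C3": MapPosition("C3", "Cornell", 7, 80.0),   # same chromosome, 75 cM away
}
PAIRS = [("J1", "C1"), ("J2", "C2"), ("J3", "C3")]

kept = sequence_correspondences(PAIRS, POSITIONS)
```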
[Figure: marker correspondences between maps, labeled by evidence type: curator, same name, sequence-based]
Cornell/JRGP markers mapped to sequenced clones were assigned positions on the FPC contigs.
Total: 2,272 4,417
EnsEMBL Pipeline Overview • System for automated genome annotation • Executes and keeps track of computational jobs • Analysis job execution is serial, allowing stage dependencies • Jobs are user-defined • Can take advantage of a compute farm • Analyses shown in the slide diagram: RepeatMasker, Genscan, Blast, GenomeBuilder, Hmmer, RepeatMasker, BLAT, GeneWise, Hmmer
Organization • Utilizes and expands on the EnsEMBL-core modules and database schema • Database stores: • analysis program names and parameters • analysis results • rules for job dependencies • and progress status for each job • Perl modules: • access the database • execute specified analysis programs • parse and load into the database the analysis results
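The idea of storing dependency rules and per-job status, then running analyses serially in an order that respects the rules, can be sketched in Python with the standard-library `graphlib` module (Python 3.9+). This is not the EnsEMBL pipeline's Perl code or schema: the specific dependencies in `RULES`, the status strings, and the function names are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Assumed rules: each analysis maps to the analyses that must finish first.
RULES = {
    "RepeatMasker": set(),
    "Genscan": {"RepeatMasker"},
    "Blast": {"Genscan"},
    "BLAT": {"RepeatMasker"},
    "GeneWise": {"Blast"},
}


def run_pipeline(contig, rules, runners, status):
    """Run every analysis for one contig serially, honoring the dependency rules.

    status maps (contig, analysis) to "running" or "done" and plays the role of
    the progress table; analyses already marked "done" are skipped.
    """
    for analysis in TopologicalSorter(rules).static_order():
        key = (contig, analysis)
        if status.get(key) == "done":
            continue
        status[key] = "running"
        runners[analysis](contig)
        status[key] = "done"
    return status


# Stand-in runners that only record the order in which they were called.
executed = []
RUNNERS = {name: (lambda contig, n=name: executed.append(n)) for name in RULES}

status = run_pipeline("contig_1", RULES, RUNNERS, {("contig_1", "RepeatMasker"): "done"})
```

Because the status table is consulted before each step, rerunning `run_pipeline` after a crash would resume from the first analysis that is not yet marked done.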
Cluster Utilization • How to split up tasks? • Contig-by-contig approach • How to execute jobs on slave nodes? • Load management and scheduling (LSF, PBS, etc.) • Management of management: • Automatic job submission • Error/completion checking
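The contig-by-contig split with automatic submission and error/completion checking can be sketched in Python as below. A local thread pool stands in for a real scheduler such as LSF or PBS, and the function names, retry count, and simulated failures are all assumptions for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def _safe(run_job, job):
    """Treat an exception in a job as a failure rather than crashing the manager."""
    try:
        return bool(run_job(job))
    except Exception:
        return False


def run_with_retries(jobs, run_job, max_attempts=3, workers=4):
    """Submit one job per contig, check completion, and resubmit failures.

    Returns (completed, failed) lists of jobs.
    """
    pending = list(jobs)
    completed = []
    for _ in range(max_attempts):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda job: _safe(run_job, job), pending))
        completed += [job for job, ok in zip(pending, results) if ok]
        pending = [job for job, ok in zip(pending, results) if not ok]
    return completed, pending


# Simulated worker: c2 fails once and then succeeds, c3 always raises.
attempts = Counter()


def fake_analysis(contig):
    attempts[contig] += 1
    if contig == "c3":
        raise RuntimeError("node crashed")
    return not (contig == "c2" and attempts[contig] == 1)


completed, failed = run_with_retries(["c1", "c2", "c3"], fake_analysis)
```

With a real scheduler, `run_job` would submit the contig's job and report whether it finished cleanly; the retry loop and the completed/failed bookkeeping would stay the same.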