150 likes | 249 Vues
This guide presents an efficient approach to formatting gene sequences containing multiple exons and their corresponding amino acid translations. It highlights the limitations of current manual methods using GCG's publish command, advocating for a more automated solution through Perl and Bioperl. The objective is to create an extendable formatting program that neatly aligns amino acids with codons. This document includes practical examples and demonstrates the power of Perl's text manipulation capabilities, making it an essential resource for bioinformaticians.
E N D
Publish.pl Steven Githens FW4089 Spring 2003
The Problem • Task: Format a sequence with multiple exons and their translations so that the amino acids line up with their corresponding codons. It should look nice too. • Current Solution: Use GCG’s publish command, over and over again, manually adjusting exon boundaries, and then manually fix phase mismatches. • Natural Conclusion: This is what computers are for. • Objective: Write a formatting program to do all this and have it be easily extendable. • Tools for the Job: Perl and Bioperl
Perl – Practical Extraction and Report Language • Makes manipulating text really easy. Example: @files = split(/,/, “one.txt,two.txt,three.txt”) • Good Regular Expression support Like DOS wildcards(*,?) and GCG’s findpatterns, with more power Examples: /(Fred|Wilma|Pebbles) Flinstone/ s/^([^ ]+) +([^ ]+)/$2 $1/ swap first two words /.{80,90}/ find line of length 80 to 90 /\b(\w\S+)(\s+\1) +\b/xig find repeat words(Is is this OK?) • Genetic information is itself a language and easily expressed as text. It makes sense that Perl would be a good choice to use for working with sequences.
Bioperl.org • Makes manipulating sequence data really easy. • Contains modules for • Sequence formats • Sequences and annotations • Restriction Enzymes • Remote Databases • Using external programs like clustalW • Using online queries like blast • Much much more
Converting sequence formats use Bio::SeqIO; my $in = Bio::SeqIO->newFh ( -file => '<seqs.sp', -format => ‘genbank' ); my $out = Bio::SeqIO->newFh ( -file => '>seqs.fasta', -format => ‘gcg' ); print $out $_ while <$in>;
Statistics: $seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj); $seq_stats->count_monomers(); $seq_stats->count_codons(); $weight = $seq_stats->get_mol_wt($seqobj); Restriction Enzymes: $re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI'); @fragments = $re1->cut_seq($seqobj); $re2 = new Bio::Tools::RestrictionEnzyme( -NAME =>'EcoRV--GAT^ATC', -MAKE =>'custom');
publish.pl Usage: perl publish.pl Options: --infile: sequence to publish in genbank format --outfile: filename for output, default is publish.output --source: Where to get exon information: all, manual, genbank, or genscan --exons: actual exon information, depends on value for –source --codontable: changes the codon table used (can vary between species and organelles) defaults to the standard codon table
Using all perl publish.pl --infile af027172.genbank --source all --outfile publish_all.output publish_all.output: . . . . . . 1 CCTGAGCCGGAGTCCTATTCAATTATCTAGAAGAAGTCTGAGCCGGAGTCCCACTCGATT 60 1 P E P E S Y S I I * K K S E P E S H S I 20 . . . . . . 61 GTCTAGGAGAAGCCTAAGCCGGAGTCCCATTCGATCACCTAGGAAGAGTGTGAGCAGGAG 120 21 V * E K P K P E S H S I T * E E C E Q E 40 . . . . . . 121 TCCAGTCCGATCATCTAGGAAGAGTGTGAGCAGAAGTCCGGTTCGTTCATCCAGGAGACG 180 41 S S P I I * E E C E Q K S G S F I Q E T 60 . . . . . . 181 TATCAGCAGGAGTCCAGTCCGATCATCTAGGAAGAGTGTGAGCAGGAGTCCTATTCGATT 240 61 Y Q Q E S S P I I * E E C E Q E S Y S I 80 . . . . . . 241 GTCCAGAAGAAGTATCAGCAGGAGTCCTATTCGATTGTCCAGGAGAAGTATCAGCAGGAG 300 81 V Q K K Y Q Q E S Y S I V Q E K Y Q Q E 100 . . . . . . 301 TCCTGTTAGAGGAAGAAGAAGAATTAGCAGAAGTCCAGTTCCGGCAAGGAGAAGGAGTGT 360 101 S C * R K K K N * Q K S S S G K E K E C 120 . . . . . . 361 GCGGCCTAGATCTCCTCCTCCTGACCGCAGAAGAAGTTTGTCAAGAAGTGCTTCTCCTAA 420 121 A A * I S S S * P Q K K F V K K C F S * 140 . . . . . . 421 TGGGCGCATAAGGAGAGGGAGAGGATTTAGCCAAAGATTCTCATACGCCCGTCGATACAG 480 141 W A H K E R E R I * P K I L I R P S I Q 160 . . . . . . 481 AACTAGTCCATCTCCTGATCGATCTCCTTATCGCTTTAGTGATAGGAGTGACCGTGACAG 540 161 N * S I S * S I S L S L * * * E * P * Q 180 . . . . . . 541 GTGAATAGCCCACACATAATATAACTCCCCCTTTCTGTTACACACTCTCGTACTGAACCG 600 181 V N S P H I I * L P L S V T H S R T E P 200
Using manual perl publish.pl --infile af027172.genbank --source manual –exons 2286..2366,2894..3089,3188..3360,3584..3698,3814..4003,4171..4437,4773..5118,5197..5334,5416..5541,5693..5905,6012..6276,6364..6560,6645..6995,7078..7665 publish.output: . . . . . . 2881 GGTCAAAGTGCAGACCAAACCTTTGAAGAATATGAATGGCCAGATATGTCAGATCTGTGG 2940 28 T K P L K N M N G Q I C Q I C G 43 . . . . . . 2941 TGATGATGTTGGACTCGCTGAAACTGGAGATGTCTTTGTCGCGTGTAATGAATGTGCCTT 3000 44 D D V G L A E T G D V F V A C N E C A F 63 . . . . . . 3001 CCCTGTGTGTCGGCCTTGCTATGAGTACGAGAGGAAAGATGGAACTCAGTGTTGCCCTCA 3060 64 P V C R P C Y E Y E R K D G T Q C C P Q 83 . . . . . . 3061 ATGCAAGACTAGATTCAGACGACACAGGGGTCAGTTGTCTTTTTCTTTTTGTTGGCAATT 3120 84 C K T R F R R H R G 93 . . . . . . 3121 GCTATATATGGATTTTCTCTTTTTGTTTCTTTGCTGTTGTGTTGAACAATTTTTTGGAAT 3180 . . . . . . 3181 TTTCCAGGGAGTCCTCGTGTTGAAGGAGATGAAGATGAGGATGATGTTGATGATATCGAG 3240 94 S P R V E G D E D E D D V D D I E 110 . . . . . . 3241 AATGAGTTCAATTACGCCCAGGGAGCTAACAAGGCGAGACACCAACGCCATGGCGAAGAG 3300 111 N E F N Y A Q G A N K A R H Q R H G E E 130 . . . . . . 3301 TTTTCTTCTTCCTCTAGACATGAATCTCAACCAATTCCTCTTCTCACCCATGGCCATACG 3360 131 F S S S S R H E S Q P I P L L T H G H T 150 . . . . . . 3361 GTAGGGACCTACATTTTCCCTTTAGACTCTAGAGTGATTTGTATTACTCAATAAATCCCT 3420 . . . . . . 3421 AGAGTGGTCATTTATTACTTACTATTCACGTTAATGTTATATGTGAACAAATCTTAACAG 3480 . . . . . . 3481 AATTTTTTTCTGATAGTACATGGTCATCCAAATTAAGAAATAATAATAGATGTTGTTAGT 3540 . . . . . . 3541 TGTGTCTGTTTTCAATAGATTCATGACCTTTTTCTATACACAGGTTTCTGGAGAGATTCG 3600 151 V S G E I R 156 . . . . . . 3601 CACGCCTGATACACAATCTGTGCGAACTACATCAGGTCCTTTGGGTCCTTCTGACAGGAA 3660 157 T P D T Q S V R T T S G P L G P S D R N 176 . . . . . . 3661 TGCTATTTCATCTCCATATATTGATCCACGGCAACCTGGTATTCATATGTTTTTCCCTTG 3720 177 A I S S P Y I D P R Q P V 189 . . . . . . 3721 TGCACGTGGTCTTTGTTAAATGTGATTCCTATTCATTTTTACAACATATATATTTTGTGT
Using Genbank perl publish.pl --infile af027172.genbank --source genbank Gets exons from: FEATURES Location/Qualifiers source 1..8401 /organism="Arabidopsis thaliana" /mol_type="genomic DNA" /cultivar="Columbia" /db_xref="taxon:3702" /chromosome="4" /map="4; between g8300 and AtKC1/mi431" gene <2286..>7665 /gene="RSW1" mRNA join(<2286..2366,2894..3089,3188..3360,3584..3698, 3814..4003,4171..4437,4773..5118,5197..5334,5416..5541, 5693..5905,6012..6276,6364..6560,6645..6995,7078..>7665) /gene="RSW1" /product="cellulose synthase catalytic subunit" CDS join(2286..2366,2894..3089,3188..3360,3584..3698, 3814..4003,4171..4437,4773..5118,5197..5334,5416..5541, 5693..5905,6012..6276,6364..6560,6645..6995,7078..7665) /gene="RSW1" /codon_start=1 /product="cellulose synthase catalytic subunit" /protein_id="AAC39334.1" /db_xref="GI:2827139" /translation="MEASAGLVAGSYRRNELVRIRHESDGGTKPLKNMNGQICQICGD
Using Genscan Output perl publish.pl --infile af027172.genbank --source genscan --exons genscan_selected.output genscan_selected.output: GENSCAN 1.0 Date run: 27-Apr-103 Time: 20:35:16Sequence 20:35:16 : 8401 bp : 40.02% C+G : Isochore 1 ( 0 - 43 C+G%)Parameter matrix: Arabidopsis.smatPredicted genes/exons:Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------ 2.01 Init + 2286 2366 81 2 0 89 9 213 0.998 19.52 2.02 Intr + 2894 3089 196 1 1 9 44 218 0.932 12.57 2.03 Intr + 3188 3360 173 0 2 66 66 202 0.999 19.54 2.04 Intr + 3584 3698 115 1 1 79 26 124 0.942 9.40 2.05 Intr + 3814 4003 190 2 1 41 101 137 0.990 13.02 2.06 Intr + 4171 4437 267 1 0 99 81 241 0.985 25.22 2.07 Intr + 4773 5118 346 0 1 80 36 278 0.981 21.27 2.08 Intr + 5197 5334 138 0 0 80 92 83 0.999 12.64 2.09 Intr + 5416 5541 126 0 0 84 10 73 0.943 4.06 2.10 Intr + 5693 5905 213 1 0 82 76 180 0.999 19.19 2.11 Intr + 6012 6276 265 2 1 84 42 187 0.933 14.86 2.12 Intr + 6364 6560 197 2 2 70 78 194 0.999 19.81 2.13 Intr + 6645 6995 351 2 0 50 68 152 0.873 8.99 2.14 Term + 7078 7665 588 0 0 75 41 573 0.964 49.83Predicted peptide sequence(s):Predicted coding sequence(s):>20:35:16|GENSCAN_predicted_peptide_1|248_aaMTVATNPFFPPGLLSEDVESDFDGDLALDGDLDTDLAFPVPMGDFEYDFDLDLDLDGLRLWLADGDLLFLGLVNRLGENLLRLLNLQRRRYKKDGSVRECVTERGSYIMCGLFTCHGEALLDKLLLRSGGGDLGRTLLLLAGTGLLLILLLPLTGLLLILLLDNRIGLLLILLLDNRIGLLLTLFLDDRTGLLLIRLLDERTGLLLTLFLDDRTGLLLTLFLGDRMGLRLRLLLDNRVGLRLRLLLDN>20:35:16|GENSCAN_predicted_CDS_1|747_bpatgacggtggctactaatccctttttcccacctgggctattatccgaggatgtcgaatccgacttcgatggagaccttgccttggatggtgatcttgacactgatcttgctttccccgtcccaatgggagattttgaatatgactttgatcttgatcttgacctcgatggactcaggctgtggctagcggatggtgaccttctctttcttgggcttgtgaaccgacttggcgagaaccttctgcgacttctaaatctacagagaagacgttataaaaaagacggttcagtacgagagtgtgtaacagaaagggggagttatattatgtgtgggctattcacctgtcacggagaagcacttcttgacaaacttcttctgcggtcaggaggaggagatctaggccgcacactccttctccttgccggaactggacttctgctaattcttcttcttcctctaacaggactcctgctgatacttctcctggacaatcgaataggactcctgctgatacttcttctggacaatcgaataggactcctgctcacactcttcctagatgatcggactggactcctgctgatacgtctcctggatgaacgaaccggacttctgctcacactcttcctagatgatcggactggactcctgctcacactcttcctaggtgatcgaatgggactccggcttaggcttctcctagacaatcgagtgggactccggctcagacttcttctagataattga>20:35:16|GENSCAN_predicted_peptide_2|1081_aaMEASAGLVAGSYRRNELVRIRHESDGGTKPLKNMNGQICQICGDDVGLAETGDVFVACNECAFPVCRPCYEYERKDGTQCCPQCKTRFRRHRGSPRVEGDEDEDDVDDIENEFNYAQGANKARHQRHGEEFSSSSRHESQPIPLLTHGHTVSGEIRTPDTQSVRTTSGPLGPSDRNAISSPYIDPRQPVPVRIVDPSKDLNSYGLGNVDWKERVEGWKLKQEKNMLQMTGKYHEGKGGEIEGTGSNGEELQMADDTRLPMSRVVPIPSSRLTPYRVVIILRLIILCFFLQYRTTHPVKNAYPLWLTSVICEIWFAFSWLLDQFPKWYPINRETYLDRLAIRYDRDGEPSQLVPVDVFVSTVDPLKEPPLVTANTVLSILSVDYPVDKVACYVSDDGSAMLTFESLSETAEFAKKWVPFCKKFNIEPRAPEFYFAQKIDYLKDKIQPSFVKERRAMKREYEEFKVRINALVAKAQKIPEEGWTMQDGTPWPGNNTRDHPGMIQVFLGHSGGLDTDGNELPRLIYVSREKRPGFQHHKKAGAMNALIRVSAVLTNGAYLLNVDCDHYFNNSKAIKEAMCFMMDPAIGKKCCYVQFPQRFDGIDLHDRYANRNIVFFDINMKGLDGIQGPVYVGTGCCFNRQALYGYDPVLTEEDLEPNIIVKSCCGSRKKGKSSKKYNYEKRRGINRSDSNAPLFNMEDIDEGFEGYDDERSILMSQRSVEKRFGQSPVFIAATFMEQGGIPPTTNPATLLKEAIHVISCGYEDKTEWGKEIGWIYGSVTEDILTGFKMHARGWISIYCNPPRPAFKGSAPINLSDRLNQVLRWALGSIEILLSRHCPIWYGYHGRLRLLERIAYINTIVYPITSIPLIAYCILPAFCLITDRFIIPEISNYASIWFILLFISIAVTGILELRWSGVSIEDWWRNEQFWVIGGTSAHLFAVFQGLLKVLAGIDTNFTVTSKATDEDGDFAELYIFKWTALLIPPTTVLLVNLIGIVAGVSYAVNSGYQSWGPLFGKLFFALWVIAHLYPFLKGLLGRQNRTPTIVIVWSVLLASIFSLLWVRINPFVDANPNANNFNGKGGVF>20:35:16|GENSCAN_predicted_CDS_2|3246_bpatggaggccagtgccggcttggttgctggatcctaccggagaaacgagctcgttcggatccgacatgaatctgatggcgggaccaaacctttgaagaatatgaatggccagatatgtcagatctgtggtgatgatgttggactcgctgaaactggagatgtctttgtcgcgtgtaatgaatgtgccttccctgtgtgtcggccttgctatgagtacgagaggaaagatggaactcagtgttgccctcaatgcaagactagattcagacgacacagggggagtcctcgtgttgaaggagatgaagatgaggatgatgttgatgatatcgagaatgagttcaattacgcccagggagctaacaaggcgagacaccaacgccatggcgaagagttttcttcttcctctagacatgaatctcaaccaattcctcttctcacccatggccatacggtttctggagagattcgcacgcctgatacacaatctgtgcgaactacatcaggtcctttgggtccttctgacaggaatgctatttcatctccatatattgatccacggcaacctgtccctgtaagaatcgtggacccgtcaaaagacttgaactcttatgggcttggtaatgttgactggaaagaaagagttgaaggctggaagctgaagcaggagaaaaatatgttacagatgactggtaaataccatgaagggaaaggaggagaaattgaagggactggttccaatggcgaagaactccaaatggctgatgatacacgtcttcctatgagtcgtgtggtgcctatcccatcttctcgcctaaccccttatcgggttgtgattattctccggcttatcatcttgtgtttcttcttgcaatatcgtacaactcaccctgtgaaaaatgcatatcctttgtggttgacctcggttatctgtgagatctggtttgcattttcttggcttcttgatcagtttcccaaatggtaccccattaacagggagacttatcttgaccgtctcgctataagatatgatcgagacggtgaaccatcacagctcgttcctgttgatgtgtttgttagtacagtggacccattgaaagagcctccccttgttacagcaaacacagttctctcgattctttctgtggactacccggtagataaagtagcctgttatgtttcagatgatggttcagctatgcttacctttgaatccctttctgaaaccgctgagtttgcaaagaaatgggtaccattttgcaagaaattcaacattgaacctagggcccctgaattctattttgcccagaagatagattacttgaaggacaagatccaaccgtcttttgttaaagagcgacgagctatgaagagagagtatgaagagtttaaagtgaggataaatgctcttgttgccaaagcacagaaaatccctgaagaaggctggacaatgcaggatggtactccctggcctggtaacaacactagagatcatcctggaatgatacaggtgttcttaggccatagtgggggtctggataccgatggaaatgagctgcctagactcatctatgtttctcgtgaaaagcggcctggatttcaacaccacaaaaaggctggagctatgaatgcattgatccgtgtatctgctgttcttaccaatggagcatatcttttgaacgtggattgtgatcattactttaataacagtaaggctattaaagaagctatgtgtttcatgatggacccggctattggaaagaagtgctgctatgtccagttccctcaacgttttgacggtattgatttgcacgatcgatatgccaacaggaatatagtctttttcgatattaacatgaaggggttggatggtatccagggtccagtatatgtgggtactggttgttgttttaataggcaggctctatatgggtatgatcctgttttgacggaagaagatttagaaccaaatattattgtcaagagctgttgcgggtcaaggaagaaaggtaaaagtagcaagaagtataactacgaaaagaggagaggcatcaacagaagtgactccaatgctccacttttcaatatggaggacatcgatgagggttttgaaggttatgatgatgagaggtctattctaatgtcccagaggagtgtagagaagcgttttggtcagtcgccggtatttattgcggcaaccttcatggaacaaggcggcattccaccaacaaccaatcccgctactcttctgaaggaggctattcatgttataagctgtggttacgaagacaagactgaatggggcaaagagattggttggatctatggttccgtgacggaagatattcttactgggttcaagatgcatgcccggggttggatatcgatctactgcaatcctccacgccctgcgttcaagggatctgcaccaatcaatctttctgatcgtttgaaccaagttcttcgatgggctttgggatctatcgagattcttcttagcagacattgtcctatctggtatggttaccatggaaggttgagacttttggagaggatcgcttatatcaacaccatcgtctatcctattacatccatccctcttattgcgtattgtattcttcccgctttttgtctcatcaccgacagattcatcatacccgagataagcaactacgcgagtatttggttcattctactcttcatctcaattgctgtgactggaatcctggagctgagatggagcggtgtgagcattgaggattggtggaggaacgagcagttctgggtcattggtggcacatccgcccatctttttgctgtcttccaaggtctacttaaggttcttgctggtatcgacaccaacttcaccgttacatctaaagccacagacgaagatggggattttgcagaactctacatcttcaaatggacagctcttctcattccaccaaccaccgtcctacttgtgaacctcataggcattgtggctggtgtctcttatgctgtaaacagtggctaccagtcgtggggtccgcttttcgggaagctcttcttcgccttatgggttattgcccatctctaccctttcttgaaaggtctgttgggaagacaaaaccgaacaccaaccatcgtcattgtctggtctgttcttctcgcctccatcttctcgttgctttgggtcaggatcaatccctttgtggacgccaatcccaatgccaacaacttcaatggcaaaggaggtgtcttttag
Overall Algorithm • Determine the exon source • Assemble the nucleotide base string. • Assemble a list of exons. • Assemble the amino acid string in proportion to the base string using the exon list. • Format and print the strings.
#Translate Exons and match them with Genomic Sequence my $leftover_text = ""; my $leftover_amount = 0; my $leftover_counter = 0; foreach my $single_exon (@all_exons) { my @exon_bounds = split /\.\./, $single_exon; my $exon_start = shift @exon_bounds; my $exon_end = shift @exon_bounds; my $counter = $exon_start; my $pos; if ($leftover_amount > 0) { if ($leftover_amount == 1) { my $split_codon = $leftover_text . substr($seqtext,$counter-1,2); my $split_aa = $codon_table->translate($split_codon); $pos = $leftover_counter-1; $proteintext =~ s/(.{$pos})(.)/$1$split_aa/; $counter = $counter + 2; $leftover_amount = 0; } elsif ($leftover_amount == 2) { my $split_codon = $leftover_text . substr($seqtext,$counter-1,1); my $split_aa = $codon_table->translate($split_codon); $pos = $leftover_counter-1; $proteintext =~ s/(.{$pos})(.)/$1$split_aa/; $counter = $counter + 1; $leftover_amount = 0; }} while ($counter <= $exon_end) { if ($counter + 2 <= $exon_end) { my $amino_acid = $codon_table->translate(substr($seqtext,$counter-1,3)); $pos = $counter-1; $proteintext =~ s/(.{$pos})(.)/$1$amino_acid/; $counter = $counter + 3; $leftover_amount = 0; } else { $leftover_amount = $exon_end-$counter+1; $leftover_text = substr($seqtext,$counter-1,$exon_end-$counter+1); $leftover_counter = $counter; $counter = $counter + 3; } } } } Phase Algorithm (for each exon)
Room For Improvement • More exon sources • Other input formats, gcg, fasta, etc. • Various Formating options • uppercase/lowercase • linelength, numbering schemes • other formatting options like those in GCG’s publish
Sources • http://www.bioperl.org • Bioperl Course • http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/ • Programming Perl, Wall, Christiansen, Schwartz • GCG Documentation • http://forestry.mtu.edu/manuals/gcg/index.htm • AF027172, Arioli,T., Peng,L., Betzner,A.S., Burn,J., Wittke,W., Herth,W., Camilleri,C., Hofte,H., Plazinski,J., Birch,R., Cork,A., Glover,J., Redmond,J. and Williamson,R.E.