
Large scale genomes comparisons Practical sessions

Large scale genomes comparisons Practical sessions. Fredj Tekaia Institut Pasteur tekaia@pasteur.fr. EMBO Bioinformatic and Comparative Genome Analysis Course Stazione Zoologica Anton Dohrn, Naples, Italy May 7 - 19, 2012. Plan for the practical sessions


Presentation Transcript


  1. Large scale genomes comparisons Practical sessions Fredj Tekaia Institut Pasteur tekaia@pasteur.fr EMBO Bioinformatic and Comparative Genome Analysis Course Stazione Zoologica Anton Dohrn, Naples, Italy May 7 - 19, 2012

  2. Plan for the practical sessions
     • Saccharomyces cerevisiae (SACE: 5863 protein sequences)
     • Candida glabrata (CAGL: 5202 protein sequences)
     • Zygosaccharomyces rouxii (ZYRO: 4991 protein sequences)
     • data from ftp.ncbi.nlm.nih.gov/genomes/Fungi

  3. For each proteome we will perform the following:
     Data preparation:
     - Transform the protein identifiers into simpler ones;
     - Split the whole protein sequence database into single protein sequences;
     Intra-species comparisons:
     - Compare the proteome to itself, using blastp (with adequate options);
     - Get for each protein its best significant match (presented in table form);
     - Get for each protein all its significant matches (presented in table form);
     - For each protein, calculate the number of its significant matches;
     Interspecies comparisons:
     - Perform all pairwise proteome comparisons. For each pair:
     - Get for each protein its best significant hit in the other proteome;
     - Get for each protein all its significant hits in the other proteome;
     - For each protein, calculate the number of its significant matches;
     Multiple comparisons:
     - Extract all pairs of proteins that are Reciprocally Best Hits (Venn diagram);
     CIRCOS:
     - Prepare a table describing the relationships between genomes, to be used with circos.
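The "split the whole protein sequence database into single protein sequences" step can be sketched in sh with awk. This is only one possible approach, not the course's own script: the function name split_fasta and the output directory argument are hypothetical, and the ".prt" extension follows the notation used in these slides. It assumes identifiers have already been simplified, so the first word after ">" is a valid file name.

```sh
#!/bin/sh
# Sketch: write each sequence of a multi-FASTA file to its own file.
# split_fasta is a hypothetical helper name, not part of the course material.
# Usage: split_fasta DB.pep OUTDIR   ->   OUTDIR/<identifier>.prt
split_fasta() {
    db=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            if (out != "") close(out)    # close the previous file to limit open handles
            id = substr($1, 2)           # identifier = first word after ">"
            out = dir "/" id ".prt"
        }
        out != "" { print > out }        # header and sequence lines go to the current file
    ' "$db"
}
```

For example, `split_fasta GSACE.pep SACE_prt` would be expected to create one .prt file per protein; the directory name SACE_prt is illustrative.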

  4. Plan for the practical sessions:
     • Use 3 yeast species (SACE, CAGL, ZYRO).
     • Data from ftp.ncbi.nlm.nih.gov/genomes/Fungi
     • Prepare the protein sequence data in adequate fasta format (needs data transformation)
     • Compare each proteome to itself (duplication - paralogs)
     • Compare each proteome to another proteome (RBH - orthologs)
     • Prepare a file for the visualization of protein similarities (circos).
     This requires writing sh and perl (or xx) scripts.
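The two comparison steps (a proteome against itself, or against another proteome) could be driven by a small wrapper like the one below. It is a hedged sketch that assumes the NCBI BLAST+ programs makeblastdb and blastp are installed; the slides do not say which BLAST version or options the course used. The function name compare_proteomes, the DRYRUN switch, the file name GCAGL.pep and the default E-value cutoff of 1e-9 are all illustrative assumptions.

```sh
#!/bin/sh
# Sketch: compare a query proteome against a database proteome with BLAST+.
# Assumptions (not from the slides): BLAST+ is installed, tabular output
# (-outfmt 6) is wanted, and 1e-9 is only an example E-value cutoff.
# Usage: compare_proteomes QUERY.pep DB.pep OUT.tab [EVALUE]
compare_proteomes() {
    query=$1
    db=$2
    out=$3
    evalue=${4:-1e-9}
    # With DRYRUN set to a non-empty value the commands are printed, not run.
    run=${DRYRUN:+echo}
    $run makeblastdb -in "$db" -dbtype prot -out "$db"
    $run blastp -query "$query" -db "$db" -evalue "$evalue" -outfmt 6 -out "$out"
}
```

Calling `compare_proteomes GSACE.pep GSACE.pep SACE_SACE.tab` would give the intra-species comparison, and `compare_proteomes GSACE.pep GCAGL.pep SACE_CAGL.tab` one direction of an interspecies comparison. Setting `DRYRUN=1` first lets you inspect the commands before running them.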

  5. Saccharomyces cerevisiae (SACE): 16 chromosomes

  6. Candida glabrata (CAGL): 13 chromosomes
     Zygosaccharomyces rouxii (ZYRO): 7 chromosomes

  7. SACE
     -rw-r----- 1 tekaia staff 54600 Apr 25 11:54 NC_001133.faa
     -rw-r----- 1 tekaia staff 5273 Apr 25 11:57 NC_001133.ptt
     -rw-r----- 1 tekaia staff 233863 Apr 25 11:54 NC_001134.faa
     -rw-r----- 1 tekaia staff 22362 Apr 25 11:57 NC_001134.ptt
     -rw-r----- 1 tekaia staff 85632 Apr 25 11:54 NC_001135.faa
     -rw-r----- 1 tekaia staff 9043 Apr 25 11:57 NC_001135.ptt
     -rw-r----- 1 tekaia staff 436412 Apr 25 11:54 NC_001136.faa
     -rw-r----- 1 tekaia staff 41636 Apr 25 11:57 NC_001136.ptt
     -rw-r----- 1 tekaia staff 152613 Apr 25 11:54 NC_001137.faa
     -rw-r----- 1 tekaia staff 15210 Apr 25 11:57 NC_001137.ptt
     -rw-r----- 1 tekaia staff 71415 Apr 25 11:54 NC_001138.faa
     -rw-r----- 1 tekaia staff 7115 Apr 25 11:57 NC_001138.ptt
     -rw-r----- 1 tekaia staff 303249 Apr 25 11:54 NC_001139.faa
     -rw-r----- 1 tekaia staff 28954 Apr 25 11:57 NC_001139.ptt
     -rw-r----- 1 tekaia staff 156585 Apr 25 11:53 NC_001140.faa
     -rw-r----- 1 tekaia staff 15544 Apr 25 11:57 NC_001140.ptt
     -rw-r----- 1 tekaia staff 119694 Apr 25 11:53 NC_001141.faa
     -rw-r----- 1 tekaia staff 11384 Apr 25 11:57 NC_001141.ptt
     -rw-r----- 1 tekaia staff 213993 Apr 25 11:53 NC_001142.faa
     -rw-r----- 1 tekaia staff 19732 Apr 25 11:57 NC_001142.ptt
     -rw-r----- 1 tekaia staff 184175 Apr 25 11:53 NC_001143.faa
     -rw-r----- 1 tekaia staff 17048 Apr 25 11:57 NC_001143.ptt
     -rw-r----- 1 tekaia staff 302218 Apr 25 11:53 NC_001144.faa
     -rw-r----- 1 tekaia staff 28180 Apr 25 11:57 NC_001144.ptt
     -rw-r----- 1 tekaia staff 267545 Apr 25 11:53 NC_001145.faa
     -rw-r----- 1 tekaia staff 25329 Apr 25 11:57 NC_001145.ptt
     -rw-r----- 1 tekaia staff 223148 Apr 25 11:53 NC_001146.faa
     -rw-r----- 1 tekaia staff 21558 Apr 25 11:57 NC_001146.ptt
     -rw-r----- 1 tekaia staff 304338 Apr 25 11:53 NC_001147.faa
     -rw-r----- 1 tekaia staff 29393 Apr 25 11:57 NC_001147.ptt
     -rw-r----- 1 tekaia staff 266238 Apr 25 11:53 NC_001148.faa
     -rw-r----- 1 tekaia staff 25450 Apr 25 11:57 NC_001148.ptt
     A B C D …..

  8. NC_001133.ptt
     SACE S288c chromosome I, complete sequence. - 1..230218
     94 proteins
     Location      Strand Length PID       Gene Synonym   Code COG Product
     1807..2169    -      120    6319249   PAU8 YAL068C   -    -   Pau8p
     2480..2707    +      75     33438754  -    YAL067W-A -    -   hypothetical protein
     7235..9016    -      593    6319250   SEO1 YAL067C   -    -   Seo1p
     11565..11951  -      128    6319252   -    YAL065C   -    -   hypothetical protein
     12046..12426  +      126    6319253   -    YAL064W-B -    -   hypothetical protein
     13363..13743  -      126    7839146   -    YAL064C-A -    -   hypothetical protein
     21566..21850  +      94     330443360 -    YAL064W   -    -   hypothetical protein
     …..

     NC_001133.faa
     >gi|6319249|ref|NP_009332.1| Pau8p
     MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAV
     FNYGDFTTMLTGIAPDQVTRMITGVPWYSSRLKPAISSALSKDGIYTIAN
     >gi|33438754|ref|NP_878038.1| hypothetical protein YAL067W-A
     MPIIGVPRCLIKPFSVPVTFPFSVKKNIRILDLDPRTEAYCLSLNSVCFKRLPRRKYFHLLNSYNIKRVL
     GVVYC
     >gi|6319250|ref|NP_009333.1| Seo1p
     MYSIVKEIIVDPYKRLKWGFIPVKRQVEDLPDDLNSTEIVTISNSIQSHETAENFITTTSEKDQLHFETS
     SYSEHKDNVNVTRSYEYRDEADRPWWRFFDEQEYRINEKERSHNKWYSWFKQGTSFKEKKLLIKLDVLLA
     FYSCIAYWVKYLDTVNINNAYVSGMKEDLGFQGNDLVHTQVMYTVGNIIFQLPFLIYLNKLPLNYVLPSL
     DLCWSLLTVGAAYVNSVPHLKAIRFFIGAFEAPSYLAYQYLFGSFYKHDEMVRRSAFYYLGQYIGILSAG
     GIQSAVYSSLNGVNGLEGWRWNFIIDAIVSVVVGLIGFYSLPGDPYNCYSIFLTDDEIRLARKRLKENQT
     GKSDFETKVFDIKLWKTIFSDWKIYILTLWNIFCWNDSNVSSGAYLLWLKSLKRYSIPKLNQLSMITPGL
     GMVYLMLTGIIADKLHSRWFAIIFTQVFNIIGNSILAAWDVAEGAKWFAFMLQCFGWAMAPVLYSWQNDI
     CRRDAQTRAITLVTMNIMAQSSTAWISVLVWKTEEAPRYLKGFTFTACSAFCLSIWTFVVLYFYKRDERN
     NAKKNGIVLYNSKHGVEKPTSKDVETLSVSDEK
     >gi|6319252|ref|NP_009335.1| hypothetical protein YAL065C
     MNSATSETTTNTGAAETTTSTGAAETKTVVTSSISRFNHAETQTASATDVIGHSSSVVSVSETGNTKSLI
     TSGLSTMSQQPRSTPASSIIGSSTASLEISTYVGIANGLLTNNGISVFISTVLLAIVW
     ……….

  9. NC_001133.ptt and NC_001133.faa: same excerpts as on slide 8.

  10. Final sequence format
      NC_001133.ptt
      SACE S288c chromosome I, complete sequence. - 1..230218
      94 proteins
      Location      Strand Length PID       Gene Synonym   Code COG Product
      1807..2169    -      120    6319249   PAU8 YAL068C   -    -   Pau8p
      2480..2707    +      75     33438754  -    YAL067W-A -    -   hypothetical protein
      7235..9016    -      593    6319250   SEO1 YAL067C   -    -   Seo1p
      11565..11951  -      128    6319252   -    YAL065C   -    -   hypothetical protein
      12046..12426  +      126    6319253   -    YAL064W-B -    -   hypothetical protein
      13363..13743  -      126    7839146   -    YAL064C-A -    -   hypothetical protein
      21566..21850  +      94     330443360 -    YAL064W   -    -   hypothetical protein
      …..

      NC_001133.faa
      >YAL068C Pau8p
      MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAV
      FNYGDFTTMLTGIAPDQVTRMITGVPWYSSRLKPAISSALSKDGIYTIAN
      >YAL067W-A hypothetical protein YAL067W-A
      MPIIGVPRCLIKPFSVPVTFPFSVKKNIRILDLDPRTEAYCLSLNSVCFKRLPRRKYFHLLNSYNIKRVL
      GVVYC
      >YAL067C Seo1p
      MYSIVKEIIVDPYKRLKWGFIPVKRQVEDLPDDLNSTEIVTISNSIQSHETAENFITTTSEKDQLHFETS
      SYSEHKDNVNVTRSYEYRDEADRPWWRFFDEQEYRINEKERSHNKWYSWFKQGTSFKEKKLLIKLDVLLA
      FYSCIAYWVKYLDTVNINNAYVSGMKEDLGFQGNDLVHTQVMYTVGNIIFQLPFLIYLNKLPLNYVLPSL
      DLCWSLLTVGAAYVNSVPHLKAIRFFIGAFEAPSYLAYQYLFGSFYKHDEMVRRSAFYYLGQYIGILSAG
      GIQSAVYSSLNGVNGLEGWRWNFIIDAIVSVVVGLIGFYSLPGDPYNCYSIFLTDDEIRLARKRLKENQT
      GKSDFETKVFDIKLWKTIFSDWKIYILTLWNIFCWNDSNVSSGAYLLWLKSLKRYSIPKLNQLSMITPGL
      GMVYLMLTGIIADKLHSRWFAIIFTQVFNIIGNSILAAWDVAEGAKWFAFMLQCFGWAMAPVLYSWQNDI
      CRRDAQTRAITLVTMNIMAQSSTAWISVLVWKTEEAPRYLKGFTFTACSAFCLSIWTFVVLYFYKRDERN
      NAKKNGIVLYNSKHGVEKPTSKDVETLSVSDEK
      >YAL065C hypothetical protein YAL065C
      MNSATSETTTNTGAAETTTSTGAAETKTVVTSSISRFNHAETQTASATDVIGHSSSVVSVSETGNTKSLI
      TSGLSTMSQQPRSTPASSIIGSSTASLEISTYVGIANGLLTNNGISVFISTVLLAIVW
      ……….

  11. Write a perl/sh script to systematically transform the sequence identifiers. Follow the instructions in the PS document.

  12. Notations:
      Sequence and genome files: we consider sequences and databases in “fasta” format.
      - DB.pep (extension “.pep” for protein databases); e.g. GSACE.pep, for the Saccharomyces cerevisiae protein db.
      - seq.prt (extension “.prt” for protein sequences); e.g. YAL063C.prt
      Scripts:
      - script.pl (extension “.pl” for perl scripts);
      - script.scr (extension “.scr” for unix shell scripts);

  13. replaceid.pl (uses an associative array, %PID, mapping each PID to its Synonym)
      #!/usr/bin/perl
      # Use: replaceid.pl NC_000962.ptt NC_000962.faa
      # Output in NC_xx.pep
      $PTT = $ARGV[0];    # ncbi ptt file
      $FAA = $ARGV[1];    # ncbi faa file
      $CHR = substr($PTT, 0, length($PTT) - 4);    # file name without ".ptt"
      open(OUT, ">$CHR.pep") || die "can't write $CHR.pep";
      open(IN, "$PTT") || die "can't find $PTT";
      while (<IN>) {
          @tab = split(/\s+/, $_);
          $PID{$tab[3]} = "$tab[5]";    # associative array: PID => Synonym
      } # while
      close(IN);
      open(IN2, "$FAA") || die "can't open $FAA";
      while (<IN2>) {
          print OUT $_ if ( !m/^>/ );    # sequence lines are copied unchanged
          if ( m/^>/ ) {
              @tab = split( /[\|]/, $_ );    # list of values: gi, PID, ref, accession, product
              print OUT ">$PID{$tab[1]}$tab[4]";    # $tab[4] already starts with a space
          } # if
      } # while
      close(IN2);
      close(OUT);

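  14. #!/bin/sh
      # Run replaceid.pl on every chromosome (one .ptt/.faa pair per chromosome)
      for file in *.ptt
      do
          NC=`echo $file | sed -e "s/\..*//g"`    # strip the extension, e.g. NC_001133
          replaceid.pl $NC.ptt $NC.faa
      done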
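Once every chromosome has its .pep file, the files need to be gathered into one proteome database such as GSACE.pep. The sketch below is one way to do this and to check the result by counting the header lines. The function name build_proteome is hypothetical, and the slides do not show how the course performs this step.

```sh
#!/bin/sh
# Sketch: concatenate per-chromosome .pep files into one proteome database
# and report the number of sequences (one ">" header line per sequence).
# build_proteome is a hypothetical helper name.
# Usage: build_proteome GSACE.pep NC_*.pep
build_proteome() {
    db=$1
    shift                      # the remaining arguments are the input files
    cat "$@" > "$db"
    grep -c '^>' "$db"         # prints the sequence count
}
```

For S. cerevisiae the printed count can be compared with the 5863 protein sequences quoted in the plan. A different number may simply reflect a different data release.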

  15. Comparing one proteome vs itself
      All hits (multiple matches, if any)
      YAL005C YLL024C 98.19 607 11 0 … 0.0 1041
      YAL005C YER103W 83.58 609 97 2 … 0.0 889
      YAL005C YBL075C 81.94 609 107 2 … 0.0 888
      YAL005C YJL034W 64.74 604 209 3 … 0.0 702
      YAL005C YDL229W 64.02 567 198 4 … 2e-176 613
      YAL005C YNL209W 63.84 567 199 4 … 4e-176 613
      YAL005C YJR045C 51.06 611 281 9 … 1e-136 481
      YAL005C YEL030W 49.43 615 285 11 … 8e-130 459
      YAL005C YLR369W 49.27 548 254 8 … 1e-125 445
      YAL005C YPL106C 35.85 371 230 4 … 9e-63 236
      YAL005C YBR169C 35.85 371 230 4 … 1e-57 219
      YAL005C YHR064C 31.90 373 242 5 … 2e-48 188
      YAL005C YKL073W 24.55 501 343 10 … 2e-28 122
      YAL007C YOR016C 75.00 180 42 1 … 8e-61 227
      YAL012W YGL184C 32.69 413 240 13 … 2e-46 181
      YAL012W YLR303W 30.37 438 243 12 … 6e-34 139
      YAL012W YHR112C 29.90 398 236 14 … 2e-29 125
      YAL012W YFR055W 27.99 293 199 7 … 2e-27 117
      YAL015C YOL043C 50.88 285 140 0 … 9e-82 298
      YAL017W YOL045W 62.10 694 224 6 … 0.0 771
      ……….
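Counting the significant matches of each protein from an "all hits" table can be sketched as follows. The slides elide some columns ("…"), so this sketch assumes the full file is standard 12-column tabular BLAST output, with the query in column 1, the subject in column 2 and the E-value in column 11. The function name count_matches and the default cutoff of 1e-9 are illustrative assumptions. Self-hits are skipped and each query/subject pair is counted once.

```sh
#!/bin/sh
# Sketch: number of significant matches per query protein.
# Assumes 12-column tabular BLAST output (col 1 query, col 2 subject,
# col 11 E-value); count_matches and the 1e-9 default are illustrative.
# Usage: count_matches HITS.tab [EVALUE]
count_matches() {
    awk -v cut="${2:-1e-9}" '
        # keep non-self hits under the cutoff; count each pair only once
        $1 != $2 && ($11 + 0) <= (cut + 0) && !seen[$1 SUBSEP $2]++ { n[$1]++ }
        END { for (q in n) print q "\t" n[q] }
    ' "$1" | sort
}
```

The output is a two-column table (protein identifier, number of matches). Proteins without any significant non-self match do not appear in it.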

  16. Comparing one proteome vs itself
      Best hits
      YAL005C YLL024C 98.19 607 11 0 … 0.0 1041
      YAL005C YER103W 83.58 609 97 2 … 0.0 889
      YAL005C YBL075C 81.94 609 107 2 … 0.0 888
      YAL005C YJL034W 64.74 604 209 3 … 0.0 702
      YAL005C YDL229W 64.02 567 198 4 … 2e-176 613
      YAL005C YNL209W 63.84 567 199 4 … 4e-176 613
      YAL005C YJR045C 51.06 611 281 9 … 1e-136 481
      YAL005C YEL030W 49.43 615 285 11 … 8e-130 459
      YAL005C YLR369W 49.27 548 254 8 … 1e-125 445
      YAL005C YPL106C 35.85 371 230 4 … 9e-63 236
      YAL005C YBR169C 35.85 371 230 4 … 1e-57 219
      YAL005C YHR064C 31.90 373 242 5 … 2e-48 188
      YAL005C YKL073W 24.55 501 343 10 … 2e-28 122
      YAL007C YOR016C 75.00 180 42 1 … 8e-61 227
      YAL012W YGL184C 32.69 413 240 13 … 2e-46 181
      YAL012W YLR303W 30.37 438 243 12 … 6e-34 139
      YAL012W YHR112C 29.90 398 236 14 … 2e-29 125
      YAL012W YFR055W 27.99 293 199 7 … 2e-27 117
      YAL015C YOL043C 50.88 285 140 0 … 9e-82 298
      YAL017W YOL045W 62.10 694 224 6 … 0.0 771
      ……….
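Reducing such a table to one best hit per protein could look like the sketch below. It takes "best" to mean the highest bit score among significant non-self hits, which is an assumption because the slides do not define the criterion. It also assumes standard 12-column tabular BLAST output (E-value in column 11, bit score in column 12). The name best_hits and the default cutoff are hypothetical.

```sh
#!/bin/sh
# Sketch: keep, for each query, the significant non-self hit with the
# highest bit score. Assumes 12-column tabular BLAST output
# (col 11 E-value, col 12 bit score); best_hits is an illustrative name.
# Usage: best_hits HITS.tab [EVALUE]
best_hits() {
    awk -v cut="${2:-1e-9}" '
        $1 == $2 || ($11 + 0) > (cut + 0) { next }     # drop self-hits and weak hits
        !($1 in best) || ($12 + 0) > score[$1] {       # first hit, or a better score
            best[$1] = $0
            score[$1] = $12 + 0
        }
        END { for (q in best) print best[q] }
    ' "$1" | sort
}
```

Because identifiers from two different proteomes never coincide, the same function can also be applied to an interspecies table. When two hits tie on bit score, the one listed first is kept.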

  17. Comparing one proteome vs a different proteome
      All hits (multiple hits)
      YAL001C CAGL0A00803g 42.26 1188 623 20 … 0.0 823
      YAL002W CAGL0A00781g 31.31 1217 798 20 … 3e-167 584
      YAL003W CAGL0F08547g 74.52 208 50 2 … 2e-59 223
      YAL005C CAGL0G03795g 93.41 607 40 0 … 0.0 993
      YAL005C CAGL0G03289g 85.39 609 86 2 … 0.0 899
      YAL005C CAGL0D02948g 64.24 604 212 3 … 0.0 684
      YAL005C CAGL0K04741g 64.90 567 193 4 … 2e-179 624
      YAL005C CAGL0C05379g 64.90 567 193 4 … 2e-179 624
      YAL005C CAGL0I03322g 50.90 613 283 9 … 2e-135 477
      YAL005C CAGL0I01496g 50.08 613 288 9 … 1e-134 475
      YAL005C CAGL0G04917g 46.07 573 291 8 … 6e-121 429
      YAL005C CAGL0M06083g 35.31 371 232 4 … 4e-58 220
      YAL005C CAGL0L10560g 32.26 372 241 4 … 5e-51 197
      YAL005C CAGL0F06369g 22.37 599 406 16 … 4e-20 94.7
      YAL007C CAGL0C02761g 70.17 181 51 2 … 3e-58 219
      YAL009W CAGL0C02717g 70.10 204 61 0 … 1e-81 296
      YAL010C CAGL0C02695g 47.37 494 225 5 … 1e-111 398
      YAL011W CAGL0H06391g 38.42 596 318 9 … 1e-74 275
      YAL012W CAGL0H06369g 85.24 393 55 2 … 0.0 659
      YAL012W CAGL0L06094g 35.20 392 226 13 … 4e-54 206
      ……….

  18. Comparing one proteome vs a different proteome
      Best hits
      YAL001C CAGL0A00803g 42.26 1188 623 20 … 0.0 823
      YAL002W CAGL0A00781g 31.31 1217 798 20 … 3e-167 584
      YAL003W CAGL0F08547g 74.52 208 50 2 … 2e-59 223
      YAL005C CAGL0G03795g 93.41 607 40 0 … 0.0 993
      YAL005C CAGL0G03289g 85.39 609 86 2 … 0.0 899
      YAL005C CAGL0D02948g 64.24 604 212 3 … 0.0 684
      YAL005C CAGL0K04741g 64.90 567 193 4 … 2e-179 624
      YAL005C CAGL0C05379g 64.90 567 193 4 … 2e-179 624
      YAL005C CAGL0I03322g 50.90 613 283 9 … 2e-135 477
      YAL005C CAGL0I01496g 50.08 613 288 9 … 1e-134 475
      YAL005C CAGL0G04917g 46.07 573 291 8 … 6e-121 429
      YAL005C CAGL0M06083g 35.31 371 232 4 … 4e-58 220
      YAL005C CAGL0L10560g 32.26 372 241 4 … 5e-51 197
      YAL005C CAGL0F06369g 22.37 599 406 16 … 4e-20 94.7
      YAL007C CAGL0C02761g 70.17 181 51 2 … 3e-58 219
      YAL009W CAGL0C02717g 70.10 204 61 0 … 1e-81 296
      YAL010C CAGL0C02695g 47.37 494 225 5 … 1e-111 398
      YAL011W CAGL0H06391g 38.42 596 318 9 … 1e-74 275
      YAL012W CAGL0H06369g 85.24 393 55 2 … 0.0 659
      YAL012W CAGL0L06094g 35.20 392 226 13 … 4e-54 206
      ……….
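Reciprocally Best Hits (RBH) can then be extracted from two best-hit tables, one for each direction of the comparison. The sketch below assumes each table holds exactly one line per query, with the query in column 1 and its best hit in column 2, and that the first table is not empty. The function name rbh and the file names in the usage line are hypothetical.

```sh
#!/bin/sh
# Sketch: pairs (a, b) such that b is the best hit of a and a is the best
# hit of b. Inputs: best hits of proteome A in B, then of B in A, each with
# one line per query (col 1 query, col 2 best hit). rbh is an illustrative name.
# Usage: rbh SACE_CAGL.best CAGL_SACE.best
rbh() {
    awk '
        NR == FNR { ab[$1] = $2; next }                   # first file: A -> B
        ($2 in ab) && ab[$2] == $1 { print $2 "\t" $1 }   # second file: B -> A
    ' "$1" "$2" | sort
}
```

Each output line gives the protein from the first proteome followed by its reciprocal partner in the second. Such pairs are candidate orthologs, as indicated in the plan (RBH - orthologs).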

  19. Follow the document: Tekaia_EMBO2012_PS.pdf
