Genomic approaches - Practical session (TP), L3 - BCP

Genomic approaches - Practical session (TP), L3 - BCP. Loïs Maignien, MCF, IUEM, lois.maignien@univ-brest.fr

Instructor: Loïs Maignien, MCF, EcoGenomique. Lois.maignien@univ-brest.fr, 0290915380, IUEM A223. Slides at http://pagesperso.univ-brest.fr/~maignien • Microbial ecology • Bioinformatics • Molecular ecology

TP outline • Sequencing methods (NGS) • What is this sequence? • BLAST and NCBI • How are several sequences related? • Alignments and phylogeny with MEGA • Using NGS in microbial ecology • NGS analysis tools: introduction to Galaxy

Sequencing methods • Sanger http://www.youtube.com/watch?v=bEFLBf5WEtc

Sequencing methods • Sanger
• At most 96 sequences of 2 x 800 bp, run in parallel
• Two 800 bp reads overlapping by ~100 bp cover 1500 bp
• This sequence length matches the length of the 16S rDNA!
Source: Appliedbiosystems.com

Sequencing methods • 454 (aka pyrosequencing)
• Adapter ligation
• Emulsion PCR (in vitro cloning)
• DNA denaturation and distribution of the microbeads onto a microplate (PicoTiter plate)
• Immobilized DNA polymerase for PCR
• Successive flows of A, T, C and G; light is emitted at each incorporation
Jonathan M Rothberg & John H Leamon, Nature Biotechnology 26, 1117-1124 (2008)

Sequencing methods • 454: 1 x 500 bp
• Read layout: MID + 500 bp read
• 1,800,000 sequences in parallel
• Several libraries on the same plate (multiplexing)
• In silico demultiplexing using the MIDs
http://www.youtube.com/watch?v=nFfgWGFe0aA
454 GS FLX Titanium, www.roche.com
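The in silico demultiplexing step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the MID tags and sample names below are hypothetical, the MID is assumed to sit at the very start of each read, and only exact matches are accepted (real pipelines usually tolerate sequencing errors in the tag).

```python
from collections import defaultdict

# Hypothetical MID tags and sample names, for illustration only.
MIDS = {"ACGAGTGCGT": "sample_A", "ACGCTCGACA": "sample_B"}


def demultiplex(reads, mids):
    """Group reads by the MID found at their 5' end and trim the MID off."""
    bins = defaultdict(list)
    for read in reads:
        for tag, sample in mids.items():
            if read.startswith(tag):
                bins[sample].append(read[len(tag):])
                break
        else:
            # No known MID at the start of the read.
            bins["unassigned"].append(read)
    return dict(bins)


# Toy reads (hypothetical): two carry a known MID, one does not.
reads = ["ACGAGTGCGTTTGACC", "ACGCTCGACAGGGTTA", "TTTTTTTTTTACGT"]
result = demultiplex(reads, MIDS)
print(result)
```

Each library pooled on the plate ends up in its own bin, which is what makes multiplexing several samples in one run possible.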

Sequencing methods • 454
• Flowgram: .sff file (Standard Flowgram Format)
• .fastq file
Lysholm et al. BMC Bioinformatics 2011 12:293
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
http://fr.wikipedia.org/wiki/FASTQ

Sequencing methods • 454
• The homopolymer problem: 2 or 3 G?
TCAGATCGTG-GTG
TCAGATCGTGGGTG
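One way to see that the two reads above differ only in the length of a homopolymer is to run-length encode them. The sketch below uses the two reads from the slide (with the gap removed from the first); the function names are illustrative and not part of any 454 software.

```python
from itertools import groupby


def run_lengths(seq):
    """Run-length encode a sequence: 'GGGT' -> [('G', 3), ('T', 1)]."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def homopolymer_differences(a, b):
    """List the runs whose length differs, or None if the reads differ elsewhere."""
    runs_a, runs_b = run_lengths(a), run_lengths(b)
    if [base for base, _ in runs_a] != [base for base, _ in runs_b]:
        return None
    return [(base, na, nb) for (base, na), (_, nb) in zip(runs_a, runs_b) if na != nb]


read_a = "TCAGATCGTGGTG"   # first read from the slide, gap removed
read_b = "TCAGATCGTGGGTG"  # second read from the slide
differences = homopolymer_differences(read_a, read_b)
print(differences)
```

The only difference reported is a run of G that is 2 bases long in one read and 3 in the other, which is exactly the ambiguity that pyrosequencing light intensities leave open.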

Sequencing methods • Illumina http://www.youtube.com/watch?v=l99aKKHcxC4 http://seqanswers.com/forums/showthread.php?t=21


Sequencing methods • Illumina MiSeq
• Paired-end reads R1 (250 bp) and R2 (250 bp); figure label: 400 bp
• 30,000,000 sequences in parallel
• Several libraries on the same flow cell (multiplexing)
• In silico demultiplexing with MIDs and barcodes

Sequencing methods • Illumina HiSeq
• Paired-end reads R1 (100 bp) and R2 (100 bp); figure labels: ~100 bp, 100 bp
• 150,000,000 sequences in parallel
• Several libraries on the same flow cell (multiplexing)
• In silico demultiplexing with MIDs and barcodes

Sequencing methods • PacBio
Eid et al. Science 2009 Vol. 323 no. 5910 pp. 133-138
• Sequencing of a single 5000 bp molecule in 10^-21 litre
• Multiplexing of 50,000 molecules
http://www.youtube.com/watch?v=v8p4ph2MAvI

How the cost of sequencing has changed
• Census of the 10^5 microbes in 1 mL of seawater
• 2005: 50,000 euros (Sanger)
• 2013: 5 euros (MiSeq)
• New possibilities!
• Large-scale comparison of: • Genes • Genomes • Transcriptomes • Populations • Communities

Sequence file formats • FASTA
• Stored in a plain text file (WordPad, Notepad, TextEdit)
• No word processor! (Word, LibreOffice...)
>Defline
ATCTGGCCGGCC (on a single line)

Sequence file formats • FASTA example: simple defline
>My_Sequence
GAAGTCATTTCGTCAGTGCTGAGAATTTTGAAAAAGAAGGAAATAATGGAGGAGAAAATATGGCATACAAACCCCAGTACGGTCCCGGCCAGACGCACATCGCCGAGAACAGGCGTCAGCAGATGGACCCCAACCACAA
GCTGGAAAAGCTTCGGGATGTTACTGACGAGGACGTTGTCCTCGTCATGGGACACCGTGCACCCGGCTCG
GCATACCCATCCTGTCACCCGCCGCTCTCTGAGCAGCAGGAACCAGCCTGCCCGATCCGCAAGCTTGTGA
CCCCGACCGACGGCGCAAAGGCAGGCGACCGTGTCCGGTACATCCAGTTCACCGACTCGATGTACAACGC
ACCCTGCCAGCCCTACCAGAGAAGCTGGCTTGAGTCCTACCGCTTCCGCGGTATTGACCCAGGTACACTC

Sequence file formats • FASTA example: GenBank defline
>gi|385654574|gb|JQ404495.1| Uncultured archaeon clone 6 methyl coenzyme M reductase subunit C (mcrC) gene, partial cds; methyl coenzyme M reductase gamma subunit (mcrG) gene, complete cds; and methyl coenzyme M reductase alpha subunit (mcrA) gene, partial cds
GAAGTCATTTCGTCAGTGCTGAGAATTTTGAAAAAGAAGGAAATAATGGAGTGAGAAAATATGGCATACA
AACCCCAGTACGGTCCCGGCCAGACGCACATCGCCGAGAACAGGCGTCAGCAGATGGACCCCAACCACAA
GCTGGAAAAGCTTCGGGATGTTACTGACGAGGACGTTGTCCTCGTCATGGGACACCGTGCACCCGGCTCG
GCATACCCATCCTGTCACCCGCCGCTCTCTGAGCAGCAGGAACCAGCCTGCCCGATCCGCAAGCTTGTGA
CCCCGACCGACGGCGCAAAGGCAGGCGACCGTGTCCGGTACATCCAGTTCACCGACTCGATGTACAACGC
ACCCTGCCAGCCCTACCAGAGAAGCTGGCTTGAGTCCTACCGCTTCCGCGGTATTGACCCAGGTACACTC
TCGGGACGTCAGATCGTCGAATGCCGTGAGCGTGACCTCGAAAAGTACGCAAAGGAACTCATCAACACCG
AGCTCTTCGATGCGGCACTGACCGGCATCCGTGGCTGCACGGTGCACGGGCACTCTCTCCGTCTCGATGA
GAACGGCATGATGTTCGACATGCTCCAGCGCTTTGTCATGGACAAGAAGGCAGGCGTCGTGAAGTATGTC
AAGGACCAGGTCGGTGTACCACTGGACGCTGAAGTCAAAGTCGGCAAGCCGGCAGACGCAAAGTGGCTCA
AGGCACACACGACGATGTACCACTCTGTCCAAGGCACCGGATTCCGGGATGACCCTGAATACGTTGAGTA

Sequence file formats • FASTA example: GenBank defline, protein sequence
>gi|147919725|ref|YP_686529.1| methyl-coenzyme M reductase, gamma subunit [Methanocella arvoryzae MRE50]
MAYKPQFYPGKTSVAQNRKKFMDPSYKMEKLRSLSDDDIVIMLGHRAPGSAYKTIHPPLTESNEPDCPIR
KLVEPTPGAKAGDRIRYNQYADSMYFAPMVPYLRSWMAVTRYRGVDPGTLSGRQIIEARERDLEKITKET
FETEMFDPARTSLRGCTVHGHSLRLNENGMMFDMLQRQVLDKDGTVKAVKDQVGDPLDRKVNLGKPMSEA
ELKKRTTIYRIDGVSFRSDDEVVGWVQRIFTLRTKCGFYPKV
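Because FASTA is plain text, it is easy to read with a short script. The sketch below is a minimal parser, not a replacement for a dedicated library: `parse_fasta` and `split_defline` are illustrative names, and the defline splitting assumes the pipe-separated GenBank style shown on the slides. The example record is shortened from the protein slide.

```python
def parse_fasta(text):
    """Return a list of (defline, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated, so wrapped sequences work too.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_defline(defline):
    """Split 'gi|..|ref|..| description' into (identifier fields, description)."""
    ident, _, description = defline.partition(" ")
    return ident.strip("|").split("|"), description


example = """>gi|147919725|ref|YP_686529.1| methyl-coenzyme M reductase, gamma subunit
MAYKPQFYPG
KTSVAQNRKK
>My_Sequence
GAAGTC
"""
records = parse_fasta(example)
fields, description = split_defline(records[0][0])
print(fields, description, records[0][1])
```

A simple defline such as `My_Sequence` yields a single identifier field and an empty description.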

Multiple sequences and alignments • Phylip format
12 270
methyl_co MAYKPQFYPGKTSVAQNRKKFMDPSYKMEKLRSLSDDDIVIMLGHRAPGSA
RecName__ MA---QFYPGSTKIAENRRKFMNPDAELEKLREISDEDVVRILGHRAPGEE
RecName__ MA---QYYPGTTKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEE
RecName__ MA---QYYPGTSKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEE
RecName__ MAYERQYYPGATSVAANRRKHMSG--KLEKLREISDEDLTAVLGHRAPGSD
RecName__ MAYKPQFYPGATKVAENRRNHLNPNYELEKLREIPDEDVVKIMGHRQPGED
RecName__ MAYKPQFYPGQTKIAQNRRDHMNPDVQLEKLRDIPDDDVVKIMGHRQPGED
RecName__ MAYEPQFNPGETKIAENRRKHMNPNYELKKLREIADEDIVRVLGHRSPGES
RecName__ MSYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGES
RecName__ MTYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGES
RecName__ MAYKPQFYPGNTLIAENRRKHMNPEVELKKLRDIPDDEIVKILGHRNPGES
RecName__ MAYKPQFYPSATKVAENRRNHINPAFELEKLREIPDEDVVKIMGHRQPSED
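The Phylip layout above is a header line (number of sequences, alignment length) followed by one fixed-width name and one sequence per row. As a rough sketch, a sequential Phylip file could be written as follows; `to_phylip` is an illustrative helper, and the two short aligned records are truncated from the slide. Names are cut to 9 characters plus a space, which is also why the slide shows truncated names such as `methyl_co` and `RecName__`.

```python
def to_phylip(records, name_width=10):
    """Format (name, aligned_sequence) pairs as sequential Phylip text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Truncate the name, then pad it to the fixed column width.
        lines.append(f"{name[:name_width - 1]:<{name_width}}{seq}")
    return "\n".join(lines)


# First 10 alignment columns of two sequences from the slide.
records = [("methyl_coenzyme", "MAYKPQFYPG"), ("RecName__Full", "MA---QFYPG")]
phylip_text = to_phylip(records)
print(phylip_text)
```

Truncation can make several names identical, as with the repeated `RecName__` rows above, so renaming sequences before export is often worthwhile.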

Multiple sequences and alignments • Clustal format
CLUSTAL W (1.83) multiple sequence alignment
ref|YP_686529.1| MAYKPQFYPGKTSVAQNRKKFMDPSYKMEKLRSLSDDDIVIMLGHRAPGSAYKTIHPPLT
gi|126877        MA---QFYPGSTKIAENRRKFMNPDAELEKLREISDEDVVRILGHRAPGEEYPSVHPPLE
gi|126879        MA---QYYPGTTKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEEYPSVHPPLE
gi|3334251       MA---QYYPGTSKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEEYPSVHPPLE
gi|126876        MAYERQYYPGATSVAANRRKHMSG--KLEKLREISDEDLTAVLGHRAPGSDYPSTHPPLA
gi|126880        MAYKPQFYPGATKVAENRRNHLNPNYELEKLREIPDEDVVKIMGHRQPGEDYKTVHPPLE
gi|2842572       MAYKPQFYPGQTKIAQNRRDHMNPDVQLEKLRDIPDDDVVKIMGHRQPGEDYKTVHPPLE
gi|33301226      MAYEPQFNPGETKIAENRRKHMNPNYELKKLREIADEDIVRVLGHRSPGESFKTVHPPLE
gi|313104216     MSYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGESYKSVHPPLD
gi|20532398      MTYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGESYKSVHPPLD
gi|2497838       MAYKPQFYPGNTLIAENRRKHMNPEVELKKLRDIPDDEIVKILGHRNPGESYKTVHPPLE
gi|126881        MAYKPQFYPSATKVAENRRNHINPAFELEKLREIPDEDVVKIMGHRQPSEDYKTVHPPLE
                 *    *  *     * **           ***   *       *** *       ****
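The last line of a Clustal block marks conserved columns with `*`. The sketch below recomputes those marks for the 12 sequences of the slide. It is a simplification: Clustal W can also print `:` and `.` for columns with similar residues, which this illustrative function ignores.

```python
def conservation_line(sequences):
    """Return '*' for columns where every sequence has the same residue, else ' '."""
    marks = []
    for column in zip(*sequences):
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# The 12 aligned sequences shown on the Clustal slide (60 columns each).
alignment = [
    "MAYKPQFYPGKTSVAQNRKKFMDPSYKMEKLRSLSDDDIVIMLGHRAPGSAYKTIHPPLT",
    "MA---QFYPGSTKIAENRRKFMNPDAELEKLREISDEDVVRILGHRAPGEEYPSVHPPLE",
    "MA---QYYPGTTKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEEYPSVHPPLE",
    "MA---QYYPGTSKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEEYPSVHPPLE",
    "MAYERQYYPGATSVAANRRKHMSG--KLEKLREISDEDLTAVLGHRAPGSDYPSTHPPLA",
    "MAYKPQFYPGATKVAENRRNHLNPNYELEKLREIPDEDVVKIMGHRQPGEDYKTVHPPLE",
    "MAYKPQFYPGQTKIAQNRRDHMNPDVQLEKLRDIPDDDVVKIMGHRQPGEDYKTVHPPLE",
    "MAYEPQFNPGETKIAENRRKHMNPNYELKKLREIADEDIVRVLGHRSPGESFKTVHPPLE",
    "MSYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGESYKSVHPPLD",
    "MTYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGESYKSVHPPLD",
    "MAYKPQFYPGNTLIAENRRKHMNPEVELKKLRDIPDDEIVKILGHRNPGESYKTVHPPLE",
    "MAYKPQFYPSATKVAENRRNHINPAFELEKLREIPDEDVVKIMGHRQPSEDYKTVHPPLE",
]
marks = conservation_line(alignment)
conserved_columns = [i + 1 for i, mark in enumerate(marks) if mark == "*"]
print(marks)
print(conserved_columns)
```

Runs of stars, such as the `GHR` and `HPPL` motifs, point to positions that are identical in all 12 proteins.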

Multiple sequences and alignments • NEXUS format
#NEXUS
BEGIN DATA;
DIMENSIONS ntax=12 nchar=270;
FORMAT datatype=protein gap=- interleave;
MATRIX
YP_686529 MAYKPQFYPGKTSVAQNRKKFMDPSYKMEKLRSLSDDDIVIMLGHRAPGSAYKTIHPPLTE
126877    MA---QFYPGSTKIAENRRKFMNPDAELEKLREISDEDVVRILGHRAPGEEYPSVHPPLEE
126879    MA---QYYPGTTKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEEYPSVHPPLEE
3334251   MA---QYYPGTSKVAQNRRNFCNPEYELEKLREISDEDVVKILGHRAPGEEYPSVHPPLEE
126876    MAYERQYYPGATSVAANRRKHMSG--KLEKLREISDEDLTAVLGHRAPGSDYPSTHPPLAE
126880    MAYKPQFYPGATKVAENRRNHLNPNYELEKLREIPDEDVVKIMGHRQPGEDYKTVHPPLEE
2842572   MAYKPQFYPGQTKIAQNRRDHMNPDVQLEKLRDIPDDDVVKIMGHRQPGEDYKTVHPPLEE
33301226  MAYEPQFNPGETKIAENRRKHMNPNYELKKLREIADEDIVRVLGHRSPGESFKTVHPPLEE
313104216 MSYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGESYKSVHPPLDE
20532398  MTYKAQYTPGETQIAENRRKHMDPDYEFRKLREVSDEDLVKVLGHRNPGESYKSVHPPLDE
2497838   MAYKPQFYPGNTLIAENRRKHMNPEVELKKLRDIPDDEIVKILGHRNPGESYKTVHPPLEE
126881    MAYKPQFYPSATKVAENRRNHINPAFELEKLREIPDEDVVKIMGHRQPSEDYKTVHPPLEE

Multiple sequences and alignments • View them with an alignment editor (MEGA, SeaView, ebiotools, ...)

FastQ: sequence + quality score • See http://fr.wikipedia.org/wiki/FASTQ
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAG
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
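Each character of the fourth FASTQ line encodes the quality of the base at the same position. The sketch below decodes it, assuming the common Sanger encoding (ASCII code minus 33) and the usual Phred relation between quality Q and error probability, P = 10^(-Q/10). The record used here is a hypothetical toy read, not the one from the slide.

```python
def parse_fastq(text):
    """Return a list of (identifier, sequence, quality) from 4-line FASTQ records."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = []
    for i in range(0, len(lines) - 3, 4):
        identifier, sequence, _plus, quality = lines[i:i + 4]
        records.append((identifier.lstrip("@"), sequence, quality))
    return records


def phred_scores(quality, offset=33):
    """Convert quality characters to Phred scores (Sanger offset assumed)."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Hypothetical toy record, for illustration only.
toy_fastq = "@read_1\nGATTACA\n+\n!5I5I5I\n"
name, sequence, quality = parse_fastq(toy_fastq)[0]
scores = phred_scores(quality)
print(name, list(zip(sequence, scores)))
print(error_probability(20))
```

With this encoding, `!` is the lowest quality (Q = 0), while Q = 20 corresponds to a 1 in 100 chance of a wrong base call.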

TP: Using BLAST • BLAST tutorial on NCBI: http://www.ncbi.nlm.nih.gov/books/NBK1734/ Read chapters 1, 2 and 3. The exercises are on the BLAST_quickstart page: http://www.ncbi.nlm.nih.gov/Class/minicourses/quickblast.html

TP1: Using BLAST • BLAST is a sequence alignment program. It finds the sequences similar to a short query in a reference database (ref. db) of DNA or proteins. GOOGLE for molecular biology!

Question 1
• What does the BLAST program do?
• Input / output formats?
• What is a "BLAST score"?
• What is the "E-value"?
• How does it vary with the length of the query?
• Is it comparable for the same query in two different databases?
• What are the different types of BLAST? • BLASTn • BLASTp • tBLASTn • BLASTx
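As a starting point for the E-value questions, the standard Karlin-Altschul relation can be evaluated numerically: E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below is a simplification, since BLAST actually uses corrected "effective" lengths; the function name and the numbers are illustrative.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative numbers: a 100-residue query and a bit score of 40.
small_db = expect_value(40, 100, 1_000_000)
large_db = expect_value(40, 100, 2_000_000)
long_query = expect_value(40, 200, 1_000_000)
print(small_db, large_db, long_query)
```

Comparing the three values shows how the same bit score translates into different E-values when the search space changes.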

Problem 1: detecting the target sequences of PCR primers • 3.2.1 Problem 1
Click on the link indicated by “P” next to the “Nucleotide-nucleotide BLAST (blastn)” to access the problem. This problem demonstrates how to use BLAST to find human sequences in GenBank that can be amplified with a particular primer pair. Access the nucleotide–nucleotide BLAST page (by clicking on the Nucleotide–nucleotide BLAST link). Paste both the forward and reverse primers into the BLAST input box. Insert a string of about 30 N’s after the first primer sequence to separate the two sequences to be found in separate, not overlapping alignments. Limit your search to human sequences by selecting “Homo sapiens” from the “All organisms” pull down menu under the Options for advanced blasting and click the BLAST! link. Retrieve results by clicking on the “Format” button. Look for two hits to the same database sequence.
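The query described above (forward primer, about 30 N's, reverse primer) can be assembled by hand or with a small helper. The sketch below is only an illustration: the two primer sequences are hypothetical placeholders, not the ones given in the NCBI exercise.

```python
def primer_pair_query(forward, reverse, spacer_length=30):
    """Join two primers with a run of N so that BLAST aligns them separately."""
    for primer in (forward, reverse):
        if not primer or set(primer.upper()) - set("ACGT"):
            raise ValueError(f"unexpected characters in primer: {primer!r}")
    return forward.upper() + "N" * spacer_length + reverse.upper()


# Hypothetical primers, for illustration only.
forward_primer = "ACGTTGCAAGCTTGACCTGA"
reverse_primer = "TTGACCGGTAACGTCAGGTC"
query = primer_pair_query(forward_primer, reverse_primer)
print(query)
```

The resulting string is pasted as a single query; two hits to the same database sequence, one per primer, suggest a region that the pair could amplify.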

Problem 1: detecting the target sequences of PCR primers • 3.2.1 Question 1
• How many results do you find that could be amplified by PCR with these primers?
• View the result in a genome and describe it.

Problem 2: SNP detection • 3.2.2 Problem 2
Click on the link indicated by “H” next to the Nucleotide–nucleotide BLAST (blastn) to access the problem. This problem describes how to obtain single-nucleotide polymorphism (SNP) information in similar sequences in the database. Hermankova et al. (8) studied the HIV-1 drug resistance profiles in children and adults receiving combination drug therapy. To identify the SNPs in the HIV-1 isolates from these patients, or other similar sequences in the database, use the sequence from one of the patients given next and run a nucleotide–nucleotide BLAST search as described in the problem previously listed. Format the results using the “Flat Query with Identities” option from the “Alignment View” pull down menu under the “Format” options (see Note 3). Identify the SNP observed at alignment position 6 (query nucleotide number 10) in Fig. 3. There is an A/G SNP in many of the database sequences.
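The relation between an alignment position and a query coordinate (position 6 of the alignment is query nucleotide 10 when the alignment starts at query nucleotide 5) can be reproduced with a short script. The aligned sequences below are hypothetical toy data chosen to mimic that A/G example; they are not the HIV-1 sequences of the exercise.

```python
def find_snps(query_aln, subject_aln, query_start=1):
    """Return (alignment position, query coordinate, query base, subject base) for mismatches."""
    snps = []
    query_pos = query_start - 1
    for aln_pos, (q, s) in enumerate(zip(query_aln, subject_aln), start=1):
        if q != "-":
            # Only real query bases advance the query coordinate.
            query_pos += 1
        if q != "-" and s != "-" and q != s:
            snps.append((aln_pos, query_pos, q, s))
    return snps


# Hypothetical aligned fragments; the alignment starts at query nucleotide 5.
query_aligned = "ACGTTAGGCT"
subject_aligned = "ACGTTGGGCT"
snps = find_snps(query_aligned, subject_aligned, query_start=5)
print(snps)
```

Gap columns are skipped rather than reported, since an insertion or deletion is not a single-nucleotide substitution.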

Problem 2: SNP detection • 3.2.2 Question 2
• Describe the first SNP (nucleotide / position).
• Build a phylogenetic tree of all the HIV virus sequences returned by BLAST. The trees can be downloaded and opened with FigTree: http://tree.bio.ed.ac.uk/software/figtree/
• Set Max Seq Difference to 0.1 and Sequence Label to Sequence ID, then download the tree and save it as a .pdf with FigTree.

BLASTing protein sequences • 4.2.1 Problem 1
Click on the link indicated by “P” next to “Protein–protein BLAST (blastp)” to access the problem. It describes how to use blastp to determine the type of protein. For this purpose, we will choose the database containing the curated and annotated protein sequences, such as RefSeq or Swissprot. Use the query sequence provided in the problem. This sequence was generated by translating a 5 exon gene from Drosophila. To determine the nature of this protein, run a blastp search. Access the “Protein–protein BLAST (blastp)” page by clicking on the link, paste in the query sequence, select the Swissprot database from the “Choose database” pull down menu and click on the BLAST! link. For each protein–protein search, the query is also searched against the Conserved Domain Database (see Note 5). Retrieve results by clicking on the “top Image”. The protein is similar to a number of aspartate aminotransferases.

BLASTing protein sequences • 4.2.1 Question 3
• What is the main difference between the RefSeq or SwissProt databases and “non redundant protein sequence nr”?
• Which protein family does this sequence belong to?
• From the conserved domain results, which superfamily does this sequence belong to in the Pfam and COG databases?

BLASTing proteins (2) • 4.2.2 Problem 2
Click on the link indicated by “H” next to the “Protein–protein BLAST (blastp)” to access a similar problem to determine the type of protein. Use the query sequence provided in the problem. This sequence was generated by translating a 4 exon gene from Drosophila. To determine the nature of this protein, run a blastp search against the Swissprot database as described in Subheading 2.

BLASTing proteins (2) • 4.2.2 Question 4
• What is this protein?
• According to the “conserved domains”, which reaction does it catalyze?

Translated BLAST • 5.1 Available Translated Searches
There are three varieties of translated BLAST search; “tblastn,” “blastx,” and “tblastx.” In the first variant, “tblastn,” a protein sequence query is compared to the six-frame translations of the sequences in a nucleotide database. In the second variant, “blastx,” a nucleotide sequence query is translated in six reading frames, and the resulting six-protein sequences are compared, in turn, to those in a protein sequence database. In the third variant, “tblastx,” both the “query” and database “subject” nucleotide sequences are translated in six reading frames, after which 36 (6 × 6) protein “blastp” comparisons are made.
Protein sequences are better conserved than their corresponding nucleotide sequences. Because the translated searches make their comparisons at the level of protein sequences, they are more sensitive than direct nucleotide sequence searches. A common use of the “tblastn” and “blastx” programs is to help annotate coding regions on a nucleotide sequence; they are also useful in detecting frame-shifts in these coding regions. The “tblastx” program provides a sensitive way to compare transcripts to genomic sequences without the knowledge of any protein translation, however, it is very computationally intensive. MegaBLAST can often achieve sufficient sensitivity at a much greater speed in searches between the sequences of closely related species and is preferred for batch analysis of short transcript sequences such as expressed sequence tags.
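The six-frame translation that blastx, tblastn and tblastx rely on can be sketched as follows: three reading frames on the given strand and three on its reverse complement. The sketch assumes the standard genetic code and plain A/C/G/T input (any other codon is rendered as `X`); the function names are illustrative and the short DNA fragment is taken from the GenBank FASTA example earlier in these slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return the six translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Fragment of the mcrG example sequence shown earlier in the slides.
frames = six_frame_translations("ATGGCATACAAACCCCAG")
for frame, protein in frames.items():
    print(frame, protein)
```

Frame +1 of this fragment reads `MAYKPQ`, the same start as the gamma subunit protein shown on the protein FASTA slide, while the other five frames give unrelated peptides.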

Translated BLAST • PROBLEM 5
Click on the link indicated by “P” next to the “Translated query vs protein database (blastx)” to access the problem. This problem describes how to identify a frame shift in a nucleotide sequence by comparing its translated amino acid sequence to a similar protein in the database. Access the Blastx page by clicking on the link “Translated query vs protein database (blastx),” paste the nucleotide sequence provided in the problem in the query box and run the Blast search. The translation of the query sequence is similar to the sequences of envelope glycoproteins in the database. Compared to the similar proteins in the results, there appears to be a frame shift around nucleotide 268 as seen in Fig. 4. The query when translated in reading frame 2 (as indicated by a rectangle) up to nucleotide 268 is similar to only the first 89 amino acids of the database protein AAL71647.1. The translation of the query needs to be shifted to reading frame 1 (as indicated by an oval) to find similarity to the rest of the protein sequence. To discover the nucleotide difference around 268, refer to Note 6.

Translated BLAST • QUESTION 5
• How many searches does BLASTx run in parallel?
• What is the best hit (acc. number)?
• How many fragments did BLASTx find on the first hit?
• On the first hit, how do the two fragments differ (size, position, frame, % sim.)?

BLAST on a genome • 6.2.1 Problem 1
Click on the link indicated by “P” next to mouse genome BLAST to access the problem. This problem describes how to use mouse genome blast to identify the Hoxb homologues encoded by the mouse genomic assembly sequence. As described in Subheading 5.1., translated searches or protein–protein searches are more sensitive for identifying similarity in the coding regions than the nucleotide–nucleotide searches. Within the translated or protein–protein searches, tblastn will be more appropriate than blastx or blastp for this problem. Both latter programs will use protein databases consisting of already identified protein sequences whereas tblastn will be useful for identifying unannotated coding regions as well.

BLAST on a genome • 6.2.1 Problem 1
We will demonstrate the sensitivity of tblastn as compared to the nucleotide–nucleotide search to identify a similarity to a coding region by running two searches: (1) MegaBLAST the query mRNA sequence, NM_008268, against the mouse genomic sequence and (2) tblastn the query protein sequence, NP_032294, against the mouse genomic sequence.
1/ Access the mouse genome BLAST page, by clicking on the “mouse” link under the Genomes panel. For the first search, paste the accession number NM_008268 into the query box, accept the default MegaBLAST option, and select the “genome (reference only)” as the database. The results, shown in Figs. 6 and 7, contain only four hits, two to the two Hoxb5 coding exons and one each to the Hoxb3 and Hoxd3 genes. Pay attention to the “Refer to Features in this part of subject sequence.” Three of these hits, two to the Hoxb5 and one to the Hoxb3 genes, are on the Contig NT_096135.3 placed on chromosome 11.
2/ For the second search, paste the protein accession number NP_032294 into the mouse genome search page, select “genome (reference only)” as the database and tblastn as the program. The result should appear similar to that shown in Fig. 8. This search gives several more hits than the earlier MegaBLAST search. Pay attention to the “Refer to Features in this part of subject sequence.” There is a complete hit to the homeobox B5 protein, shown in Fig. 9, and to the homeodomains of the other members of the homeobox B family, seen in Fig. 10 (corresponding to the amino acids 195..253 in the query), such as B6, B4, B3, B2, B13, and so on, on chromosome 11, homeobox A family members on chromosome 6, and homeobox C family members on chromosome 15 (refer to Note 8 for the locations of conserved domain).

BLAST on a genome • QUESTION 6
• Why does the second BLAST return more results than the first?
