Extracting homoeologous genomic sequences – the challenge of the wheat genome. Ivan Popov, ABI, 2011. Why a challenge? Wheat is hexaploid – it carries three times as many chromosome sets as a diploid organism. Wheat DNA isn’t just copied 3 times – it contains 3 different genomes!
Extracting homoeologous genomic sequences – the challenge of the wheat genome Ivan Popov ABI, 2011
Why a challenge? • Wheat is hexaploid – it carries three times as many chromosome sets as a diploid organism • Wheat DNA isn’t just copied 3 times – it contains 3 different genomes! • The A, B and D genomes can contain different variants of a single gene • Then what do we get from sequencing?
NGS gives us a consensus • Sequencing reads are assembled together and small variations are lost • Output is a consensus sequence… Unless we separate the reads before assembly or use the assembly to guide us while we read the three separate sequences.
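To make the loss of variation concrete, here is a minimal sketch of a per-column majority vote, assuming the reads are already aligned and of equal length. The reads are hypothetical toy data, and real assemblers build their consensus in a more elaborate way; the point is only that a base carried by a minority of reads disappears from the output.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over equal-length, pre-aligned reads."""
    columns = zip(*aligned_reads)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

# Hypothetical toy reads: the 3rd and 4th read each carry one variant base.
reads = ["ACGTAC", "ACGTAC", "ACTTAC", "ACGTAA"]
print(consensus(reads))  # ACGTAC
```

Both variant bases (the T in column 3 and the A in column 6) are outvoted, so the consensus looks as if there were only one version of the sequence.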
What we need: • A wheat sequencing database with BLAST functionality • An assembly program (CAP3, Cortex) • A gene of interest • A programming language and someone to use it (probably us again)
Resources explained: Database • Contains sequencing “reads” from wheat genomic research – character strings of varying length (why?) • Usually available online • May provide its own BLAST and assembly tools • Example: www.cerealsdb.uk.net
Resources explained: Software • BLAST: Basic Local Alignment Search Tool • Sample alignment (“-” marks a gap): ATGCTGGGACCTAT-GAT aligned against ATGCTC-GACCAATCGAT • Matches reads to the gene even though their lengths differ • Returns the best-matching reads, which probably belong to our gene • Assembler software overlaps the reads and produces the longest possible sequences: contigs
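As a small illustration of how to read the sample alignment, the sketch below walks the two aligned rows column by column and counts matches, mismatches and gaps. It only scores an alignment that already exists; it is not how BLAST finds alignments, and the function name is made up for this example.

```python
def score_alignment(top, bottom, gap="-"):
    """Count matches, mismatches and gap columns in a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            counts["gap"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts

# The two rows of the sample alignment from the slide.
print(score_alignment("ATGCTGGGACCTAT-GAT", "ATGCTC-GACCAATCGAT"))
# {'match': 14, 'mismatch': 2, 'gap': 2}
```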
Workflow • BLAST the database with our gene • Assemble the reads • Look at the result… and see the errors:
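The "assemble the reads" step can be pictured with a toy exact-overlap merge. This is a simplified sketch under the assumption of error-free reads; assemblers such as CAP3 tolerate mismatches and do considerably more work, and the two fragments below are hypothetical pieces chosen for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Join `a` and `b` on their overlap, or return None if there is none."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

# Two hypothetical fragments that share the six bases ACGCGG.
left, right = "TGTGGCCACGCGG", "ACGCGGCTCACC"
print(overlap(left, right))  # 6
print(merge(left, right))    # TGTGGCCACGCGGCTCACC
```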
An assembly example
                         .    :    .    :    .    :    .    :    .    :    .    :
lcl|GKU3SMK03END4X-  TGTGGCCACGCGGCTCACCTGCTCCACTGCGGAGGATGAGACCACCGGGTTCATCACCGG
lcl|GHILVEO01C7S9I-  TGTGGCCACGCGGCTCACCTGCTCCACTGCGGA
lcl|GJVZXJB02IXON4+  TGTGGCCACGCGGCTCACCTGCTCCACTGCGGAGGATGAGACCACCGGGTTCATCACCGG
UserQuery-           TGCGGCTACGCGGCTCACCTGCTCCACTGCGGAGGACGAGACCACCGGGTTCATCACCGG
lcl|GKJD2EX01B6242+  TGTGGCCACGCGGCTCACCTGCTCCACTGCGGAGGATGAGACCACCGGGTTCATCACCGG
lcl|GBFXLNY02GE3WG+  TGTGGCCACGCGGCTCACCTGCTCCACTGCGGAGGATGAGACCACCGGGTTCATCACCGG
lcl|GKJD2EX01EJ18K+  TGTGGCCACGCGGCTCACCTGCTCCACTGCGGAGGATGAGACCACCGGGTTCATCACCGG
lcl|GINEZBA01ALFTD-  TGCGGCCACGCGGCTCACCTGCTCCACTGCAGAGGATGAGACCACCGGGTTCATCACTGG
lcl|GHILVEO02HKGDP-                                               CGGGTTCATCACCGG
lcl|GJVZZDM02G6UTI-  TGCGGCCACGCGGCTCACCTGCTCCACTGCAGAGGATGAGACCACCGGGTTCATCACTGG
                     ____________________________________________________________
consensus            TGTGGCCACGCGGCTCACCTGCTCCACTGCGGAGGATGAGACCACCGGGTTCATCACCGG
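The "errors" in this alignment can be located programmatically. The sketch below takes five of the reads shown on the slide and reports every column where at least two different bases are each supported by two or more reads. The `min_count` threshold (meant to ignore one-off sequencing errors) is an assumption of this example, and so is the 45-column offset of the short read, which is inferred from where its sequence matches the consensus.

```python
from collections import Counter

# Five reads from the slide; a read that starts mid-window is left-padded.
READS = {
    "GKU3SMK03END4X": "TGTGGCCACGCGGCTCACCTGCTCCACTGCGGAGGATGAGACCACCGGGTTCATCACCGG",
    "GHILVEO01C7S9I": "TGTGGCCACGCGGCTCACCTGCTCCACTGCGGA",
    "GINEZBA01ALFTD": "TGCGGCCACGCGGCTCACCTGCTCCACTGCAGAGGATGAGACCACCGGGTTCATCACTGG",
    "GHILVEO02HKGDP": " " * 45 + "CGGGTTCATCACCGG",  # offset is an assumption
    "GJVZZDM02G6UTI": "TGCGGCCACGCGGCTCACCTGCTCCACTGCAGAGGATGAGACCACCGGGTTCATCACTGG",
}

def variable_columns(rows, min_count=2):
    """Return (1-based column, bases) where >1 base has >= min_count reads."""
    rows = list(rows)
    width = max(len(r) for r in rows)
    result = []
    for i in range(width):
        # Only count real bases; padding and uncovered columns are skipped.
        counts = Counter(r[i] for r in rows if i < len(r) and r[i] in "ACGT")
        supported = sorted(b for b, n in counts.items() if n >= min_count)
        if len(supported) > 1:
            result.append((i + 1, supported))
    return result

print(variable_columns(READS.values()))
# [(3, ['C', 'T']), (31, ['A', 'G']), (58, ['C', 'T'])]
```

Columns 3, 31 and 58 are exactly where reads GINEZBA01ALFTD and GJVZZDM02G6UTI disagree with the consensus line.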
How do we unravel the genomes? • We select the variable points of interest • We separate the reads by these points • Then we stack the reads in the same order while preserving the different sequences.
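Separating reads by the variable points can be sketched as grouping reads by the bases they carry at those positions. The reads, their names and the chosen positions are hypothetical toy data, and the sketch assumes every read starts at column 1 of the alignment.

```python
from collections import defaultdict

def signature(read, positions):
    """Bases of `read` at the 1-based variable positions; '?' if not covered."""
    return "".join(read[p - 1] if p <= len(read) else "?" for p in positions)

def group_by_signature(reads, positions):
    """Map each signature to the names of the reads that carry it."""
    groups = defaultdict(list)
    for name, seq in reads.items():
        groups[signature(seq, positions)].append(name)
    return dict(groups)

# Hypothetical reads that differ at columns 2 and 5.
toy_reads = {"r1": "ATGCA", "r2": "ATGCA", "r3": "ACGCG", "r4": "ACGCG", "r5": "AGGCT"}
print(group_by_signature(toy_reads, [2, 5]))
# {'TA': ['r1', 'r2'], 'CG': ['r3', 'r4'], 'GT': ['r5']}
```

Each group could then be assembled on its own, which keeps the variants apart instead of averaging them into one consensus.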
A stacking example (only variable points are shown!) Reads: TGACA TGACAA Genome variants: AAGGGG TGACAA AAGGGG ACATC ACATC AAGGGGCT GGGGCT AGCCGT AGCCGT
Workflow part II • Define the variable positions • Separate the variants by co-occurrence • Hope we get a meaningful result
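One possible way to "separate the variants by co-occurrence" is a greedy pass that chains reads agreeing on the variable positions they share. This is a sketch under simplifying assumptions (no sequencing errors, hypothetical toy reads), not the method used in the talk, and its result can depend on the order of the reads. Each read is reduced to a mapping from variable position to the base observed there.

```python
def phase(reads):
    """Greedily chain reads that agree on every variable position they share."""
    haplotypes = []
    for read in reads:
        for hap in haplotypes:
            shared = hap.keys() & read.keys()
            if shared and all(hap[p] == read[p] for p in shared):
                hap.update(read)  # extend this variant with the new positions
                break
        else:
            haplotypes.append(dict(read))  # no compatible variant: start one
    return haplotypes

# Hypothetical reads, each covering two of three variable positions.
toy = [
    {1: "T", 2: "G"}, {2: "G", 3: "A"},
    {1: "A", 2: "A"}, {2: "A", 3: "G"},
    {1: "A", 2: "C"}, {2: "C", 3: "A"},
]
for hap in phase(toy):
    print("".join(hap[p] for p in sorted(hap)))
# TGA
# AAG
# ACA
```

With this toy input the three chains come out cleanly; with real reads the number of chains may well differ from three.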
Problems • Sometimes more than 3 variants emerge! • Is it because there are different alleles of the gene? • There are variable points that do not share any reads • Deeper sequencing needed (more reads at each point) • Use of other contigs from the assembly?
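The problem of variable points that share no reads can be detected before any separation is attempted. The sketch below checks, for each pair of neighbouring variable positions, whether at least one read spans both. Read spans are given as hypothetical (start, end) column pairs, 1-based and inclusive.

```python
def unlinked_pairs(read_spans, positions):
    """Neighbouring variable positions that no single read covers together."""
    pairs = []
    for left, right in zip(positions, positions[1:]):
        covering = sum(1 for start, end in read_spans if start <= left and right <= end)
        if covering == 0:
            pairs.append((left, right))
    return pairs

# Hypothetical spans: one read covers columns 1-33, another covers 46-60.
print(unlinked_pairs([(1, 33), (46, 60)], [3, 31, 58]))  # [(31, 58)]
# Adding a read that spans the whole window links all three positions.
print(unlinked_pairs([(1, 33), (46, 60), (1, 60)], [3, 31, 58]))  # []
```

Pairs reported here are the places where deeper sequencing, or reads from other contigs, would be needed to connect the variants.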
What do we gain? • Specific primers • Isolation/amplification of a specific genome (A, B or D) • Connecting phenotypic traits to gene variants • Combining specific gene variants to get new traits • Better wheat = more food!
Questions? And thank you for your attention!