Developing Accessible Application Software for Individual de novo Genome Projects
Vince Forgetta, PhD Candidate
Ken Dewar, PhD, Supervisor
Department of Human Genetics, McGill University, Montreal, Quebec, Canada
December 8th, 2011
Next-Gen Gap
“Unfortunately, the software and computer hardware demands on these analyses are not much less than those of the large Genome Centers. From this perspective, the gap between large-scale genome centers and individual investigators may seem to be growing, not shrinking, as the next-generation platforms’ apparent promise of a ‘Genome Center in a box’ may have only been half delivered, providing data without a full suite of tools.” (Nature Methods 6, S2 - S5 (2009))
• Download Data
• Learn *NIX
• Install Software and Dependencies
• Run Software … Wait? … Problems?
Bacterial genome in < 1 week for ~ $3000 (Genome Assembly)+
Three Common Methodologies in de novo Genome Analysis
• Display and analysis of genome annotations
• Quality assessment of a genome assembly
• Comparison and mining of genomic data from public repositories
• One or more methodologies used to address needs in three specific projects; projects used as a vehicle to develop software:
Assembly Analysis
• Researchers should have easy access to their assembly so they can assess its quality and perform simple analyses.
[Figure labels: DNA Sequencing Centre, Assembly, Researcher]
• Delays and limits on data access exist:
- Viewers must be installed and have specific software (e.g. Linux) or hardware (e.g. RAM) requirements.
- Assembly data (multiple GBs) must be downloaded.
Objective
• Develop a simple assembly viewer that operates within a web browser, allowing researchers to rapidly access and analyze their data.
Method
Parser/Converter: Python is used to parse, analyze, and convert assembly data into web-accessible formats (HTML, JSON, JPG images), which are stored on sequencing centre servers.
Interface: A browser-based interface (HTML) dynamically accesses the data on the servers (JavaScript). It incorporates pre-existing web technologies (jQuery, Seadragon Deep Zoom, AJAX).
Usage:
- After genome assembly, the parser/converter is run on the sequencing centre servers.
- The researcher accesses the interface over the internet using a modern web browser.
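As a rough illustration of the parser/converter step, the sketch below reads contig sequences, computes a few assembly statistics, and emits JSON that a browser-based interface could fetch. It is not the tool's actual code: the FASTA input format, the function names (`read_fasta`, `n50`, `summarize`), and the JSON field names are all assumptions made for the example.

```python
import json


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Return the contig length at which the running total of the
    longest-first contig lengths first reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lines):
    """Build a JSON-serializable summary (hypothetical field names)."""
    contigs = []
    for name, seq in read_fasta(lines):
        upper = seq.upper()
        gc = (upper.count("G") + upper.count("C")) / len(seq) if seq else 0.0
        contigs.append({"name": name, "length": len(seq), "gc": round(gc, 3)})
    lengths = [c["length"] for c in contigs]
    return {
        "num_contigs": len(contigs),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "contigs": contigs,
    }


# Tiny made-up assembly, used only to exercise the functions.
example = [">contig1 demo", "ACGTACGTAC", ">contig2", "GGGC", ">contig3", "AT", "AT"]
print(json.dumps(summarize(example), indent=2))
```

Because `read_fasta` is a generator, only one contig is held in memory at a time while parsing, which is one simple way to keep memory use modest on large assemblies.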
Performance
Parser/Converter:
• Runs on multiple platforms (Windows/OS X/Linux).
• Multi-processor support.
• Low memory usage (< 250 MB of memory per processor).
User interface:
• Client-side programming decreases server load.
• Data is downloaded on demand, which helps users with limited bandwidth.
• Sole system requirement is a modern web browser (Firefox, Opera, Google Chrome), so installation is easy.
• Low memory usage (peaks at ~ 250 MB).
The Interface
• Dynamic Charts:
- Toggle axis value
- Identify points
- Summarize regions
• Assembly statistics, batch download of sequence and statistical data.
• Table of contig/scaffold statistics:
- Sort/filter by column
- Access to contig sequence/quality and read sequences
• Contig Assembly:
- Pan/Zoom
- Identify position, read names, mismatches
blip.codeplex.com
BLAST Pivot
Microsoft Research Summer Internship
Microsoft Biology Foundation
Redmond, Washington, USA
Mentor - Simon Mercer
BLAST
[Figure: a query sequence (ACGTCACTGACTGACTAGCTAGCTAGCTAGCATCGATCGATCGATCGATCGATCGACGTAACTAGCACGACTGACTCT) is searched with BLAST, at NCBI or locally, to answer: Species, Function, …?]
Limitation
[Figure labels: Scientist + (one BLAST report) = …; Programmer + (~5000 genes, E. coli) = …]
Example BLAST report:
>gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
Length=321
Score = 583.563 bits (1503), Expect = 8.65371E-165
Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
Frame = 0
Query 1    MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC 60
           MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC
Sbjct 41   MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC 100
Query 61   PAKKVNRKLAGSALLQYPDVVKSILTEVVNAVDVPVTLKIRTGWAPEHRNCEEIAQLAED 120
           PAKKVNRKLAGSALLQYPDVVKSILTEVVN VDVPVTLKIRTGWAPEHRNCEEIAQLAED
Sbjct 101  PAKKVNRKLAGSALLQYPDVVKSILTEVVNTVDVPVTLKIRTGWAPEHRNCEEIAQLAED 160
Query 121  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA 180
           CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA
Sbjct 161  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA 220
Query 181  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR 240
           LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR
Sbjct 221  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR 280
Query 241  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA 281
           KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA
Sbjct 281  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA 321
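To make the limitation concrete: reading one report like the one on this slide is easy by eye, but repeating it for thousands of genes usually means writing a parser. The sketch below is only an illustration of that kind of script, not part of BL!P. It pulls the summary fields out of the header lines copied from the slide using regular expressions; the function name `parse_hit_summary` and the returned field names are hypothetical.

```python
import re

# Header lines copied from the BLAST report shown on the slide.
HIT_TEXT = """>gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
Length=321
Score = 583.563 bits (1503), Expect = 8.65371E-165
Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
"""


def parse_hit_summary(text):
    """Extract species, bit score, E-value and identity from one text report.

    Assumes the plain-text layout shown on the slide; other BLAST output
    formats would need different patterns.
    """
    title = re.search(r"^>(.+)$", text, re.MULTILINE).group(1)
    species = re.search(r"\[([^\]]+)\]", title)
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits", text)
    expect = re.search(r"Expect\s*=\s*([\d.eE+-]+)", text)
    ident = re.search(r"Identities\s*=\s*(\d+)/(\d+)", text)
    matched, aligned = int(ident.group(1)), int(ident.group(2))
    return {
        "species": species.group(1) if species else None,
        "bit_score": float(score.group(1)),
        "e_value": float(expect.group(1)),
        "matched": matched,
        "aligned": aligned,
        "percent_identity": 100.0 * matched / aligned,
    }


summary = parse_hit_summary(HIT_TEXT)
print(summary["species"], round(summary["percent_identity"], 2))
```

Note that the report prints the identity as a rounded "100%", while the underlying counts (280 of 281) reflect the single mismatch visible in the second alignment block.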
Blast in Pivot
[Figure: steps numbered 1 to 3; several query sequences (ACGTCACTGACTGACTAGCTAGCTAGCTAGCATCGATCGATCGATCGATCGATCGACGTAACTAGCACGACTGACTCT), each of unknown identity ("?"); labels: BLAST, Pivot]
E. coli ECD-227
• Divergent Strain: Species? Function?
• E. coli ECD-227: Antibiotic Resistant!
Acknowledgement: Moussa Diarra, Heidi Rempel
Conclusions • ContiGo: used by clients of the Genome Centre at McGill (release soon). • BL!P: >500 downloads (blip.codeplex.com).
Acknowledgements
E. coli ECD-227: H. Rempel, Andrew Metcalfe, M. S. Diarra
BL!P/Microsoft: Simon Mercer, Xin-Yi Chua, Mauro Luigi Drago, Beatriz Diaz Acosta, Vivek Kumar, Bob Davidson, Mike Zyskowski, Xiaoji Chen, Bob Silverstein, Vikram Bapat, Jared Jackson, Wei Lu, The Pivot Team
Ophiostoma novo-ulmi: Jan Kieleczawa, Michael Zianni, Robert Steen, Deborah Grove, Anoja Perera, Robert Lyons Jr., Sushmita Singh, Doug Bintzler, Scottie Adams, Gregory Grove, Suzanne Genik, Chris Wright, Alvaro Hernandez, Sharon Bachman, Lorie Hetrick, Nichole Peterson, Gary Leveque, Joana Dias, Clotilde Teiling, Tim Harkins
C. difficile: Ken Dewar, Andre Dascal, Matthew Oughton, Joana Dias, Gary Leveque, Pascale Marquis, Corina Nagy, Amelie Villeneuve, Ivan Brukner, Mark Miller, Vivian Loo, Mike Mulvey, Dale Gerding, Maya Rupnik, Elaine Mardis, V. Magrini, M. Hickenbotham, K. Haub, C. Markovic, J. Nelson