
Searching for Sense in the Genome

Searching for Sense in the Genome. NKx 2.6. NKx 2.9. Jim Kent Genome Bioinformatics Group University of California Santa Cruz. Mouse mRNA in-situ’s of various transcription factors courtesy of Paul Gray. Hmx1. Mllt7. The Paradox of the Genome.


Presentation Transcript


  1. Searching for Sense in the Genome. Jim Kent, Genome Bioinformatics Group, University of California Santa Cruz. Mouse mRNA in situs of various transcription factors (NKx 2.6, NKx 2.9, Hmx1, Mllt7) courtesy of Paul Gray.

  2. The Paradox of the Genome How does a long, static, one-dimensional string of DNA turn into the remarkably complex, dynamic, and three-dimensional human body? GTTTGCCATCTTTTGCTGCTCTAGGGAATCCAGCAGCTGTCACCATGTAAACAAGCCCAGGCTAGACCAGTTACCCTCATCATCTTAGCTGATAGCCAGCCAGCCACCACAGGCATGAGT

  3. Textbook flow of information from genome

  4. More Complex In Real Life Image from UCSC Genome Browser with “Known Genes” and RepeatMasker tracks. The two genes are TPARL on the forward strand, and CLOCK on the reverse strand. CLOCK regulates sleeping. The function of TPARL is unknown. Note bulk of genome is repeats & introns.

  5. Transposons • Similar to retroviruses like HIV. • “Selfish DNA” or “molecular parasite” • The Alu transposon is a parasite on the LINE transposon. Things grow on things. • ~50% of the human genome is relics of transposons. • Transposons and other duplications, along with the genome's sheer size, made sequencing and assembling the human genome a challenge.

  6. Introns • Human genes are interrupted by introns. • Introns typically are much larger than the rest of the gene, often 100,000 bases or more. • It’s possible that the first introns, perhaps a billion years ago, were a particular type of transposon. We do see transposons creating introns today. • Introns make finding genes in the human genome an ongoing challenge.

  7. Lines of Evidence for Genes • Browser shot with computational gene finders, ESTs, comparative genomics, full length mRNA. UCSC Genome Browser presenting tracks of evidence for 2 genes

  8. Hints of a Gene Very suggestive evidence for unknown gene.

  9. Genome Progress • Genome DNA sequence: • 85% complete 2000 • 99% complete 2004 • Human gene set: • 85% complete 2004 • 99% complete 2008?

  10. The Paradox of Genes How do 25,000 genes, each in the end just a one-dimensional string of DNA, turn into the remarkably complex, dynamic, and three-dimensional human body? CLOCK TPARL FLJ13352 PDCL2 NMU KDR SEC3L1 KIAA0635 AK126014 KIT NRPS998 PPAT CR749824 SRP72 ARL9 PDGFRA HOP GSH-2 AF090902 SPINK2 CHIC2 BC057822 REST C4orf14 POLR2B IGFBP7 LNX SCFD2 RASL11B AY189288 USP46 AK021912 LOC132671 SGCB …

  11. How to Understand Incredibly Complex Systems? • DNA is popularly considered the code of life. • Computer programs are complex systems that ultimately are built up of 0’s and 1’s. Perhaps they are a model for a genome built of A, C, G and T? BUT… • The human genome lacks documentation, has accumulated 3 billion years of cruft, and does not believe in local variables. • Therefore we must look to less than straightforward software programs as guides.

  12. Bioperl CORBA module

      sub new {
          my ( $class, @args) = @_;
          my $self = $class->SUPER::new(@args);
          my ( $idl, $ior, $orbname ) =
              $self->_rearrange( [ qw(IDL IOR ORBNAME)], @args);
          $self->{'_ior'} = $ior || 'biocorba.ior';
          $self->{'_idl'} = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
          $self->{'_orbname'} = $orbname || 'orbit-local-orb';
          $CORBA::ORBit::IDL_PATH = $self->{'_idl'};
          my $orb = CORBA::ORB_init($orbname);
          my $root_poa = $orb->resolve_initial_references("RootPOA");
          $self->{'_orb'} = $orb;
          $self->{'_rootpoa'} = $root_poa;
          return $self;
      }

  13. Obfuscated C #define c(n,s)case n:s;continue char x[]="((((((((((((((((((((((",w[]= "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b";char r[]={92,124,47},l[]={2,3,1 ,0};char*T[]={" |"," |","%\\|/%"," %%%",""};char d=1,p=40,o=40,k=0,*a,y,z,g= -1,G,X,**P=&T[4],f=0;unsigned int s=0;void u(int i){int n;printf( "\233;%uH\233L%c\233;%uH%c\233;%uH%s\23322;%uH@\23323;%uH \n",*x-*w,r[d],*x+*w ,r[d],X,*P,p+=k,o);if(abs(p-x[21])>=w[21])exit(0);if(g!=G){struct itimerval t= {0,0,0,0};g+=((g<G)<<1)-1;t.it_interval.tv_usec=t.it_value.tv_usec=72000/((g>> 3)+1);setitimer(0,&t,0);f&&printf("\e[10;%u]",g+24);}f&&putchar(7);s+=(9-w[21] )*((g>>3)+1);o=p;m(x);m(w);(n=rand())&255||--*w||++*w;if(!(**P&&P++||n&7936)){ while(abs((X=rand()%76)-*x+2)-*w<6);++X;P=T;}(n=rand()&31)<3&&(d=n);!d&&--*x<= *w&&(++*x,++d)||d==2&&++*x+*w>79&&(--*x,--d);signal(i,u);}void e(){signal(14, SIG_IGN);printf("\e[0q\ecScore: %u\n",s);system("stty echo -cbreak");}int main (int C,char**V){atexit(e);(C<2||*V[1]!=113)&&(f=(C=*(int*)getenv("TERM"))==( int)0x756E696C||C==(int)0x6C696E75);srand(getpid());system("stty -echo cbreak" );h(0);u(14);for(;;)switch(getchar()){case 113:return 0;case 91:case 98:c(44,k =-1);case 32:case 110:c(46,k=0);case 93:case 109:c(47,k=1);c(49,h(0));c(50,h(1 ));c(51,h(2));c(52,h(3));}}

  14. Reverse Engineering Microsoft (diagram labels: mouse, keyboard, network, Windows XP, elaborate proprietary process, blue screen of death)

  15. Looks like ‘code’ not enough, must study actual cells & DNA

  16. Textbook Gene Regulation: Promoter Tells Where to Begin Different promoters activate different genes in different parts of the body.

  17. Computing Baldness Idealized promoter for a gene involved in making hair. Proteins that bind to specific DNA sequences in the promoter region together turn a gene on or off. These proteins are themselves regulated by their own promoters leading to a gene regulatory network with many of the same properties as a neural network.
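The neural-network analogy on this slide can be made concrete with a toy model. The sketch below treats a promoter as a weighted threshold unit: each bound protein contributes a positive (activator) or negative (repressor) weight, and the gene switches on when the total reaches a threshold. The factor names, weights, and threshold are hypothetical, chosen only for illustration. They are not taken from the talk or from any real promoter.

```python
def promoter_active(bound_factors, weights, threshold):
    """Return True when the summed influence of the bound factors reaches the threshold."""
    total = sum(weights.get(factor, 0.0) for factor in bound_factors)
    return total >= threshold


# Hypothetical weights: positive values act as activators, negative as repressors.
HAIR_GENE_WEIGHTS = {
    "activator_A": 1.0,
    "activator_B": 1.0,
    "repressor_R": -2.0,
}


def hair_gene_on(bound_factors):
    """Toy promoter for a hair gene: needs both activators and no repressor."""
    return promoter_active(bound_factors, HAIR_GENE_WEIGHTS, threshold=1.5)


if __name__ == "__main__":
    print(hair_gene_on({"activator_A", "activator_B"}))
    print(hair_gene_on({"activator_A", "activator_B", "repressor_R"}))
```

Because the proteins that bind a promoter are themselves products of regulated genes, chaining such units together gives the kind of regulatory network the slide describes.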

  18. Genes can be transcription factors that activate or repress other genes, leading to regulatory networks such as this one from the development of the central nervous system. (Image from D’Haeseleer Somogyi 1999)

  19. The Decisions of a Cell • When to reproduce? • When to migrate and where? • What to differentiate into? • When to secrete something? • When to make an electrical signal? The more rapid decisions usually are via the cell membrane and 2nd messengers. The longer acting decisions are usually made in the nucleus.

  20. Nucleus Used to Appear Simple • Cheek cells stained with basic dyes. Nuclei are readily visible.

  21. Mammalian Nuclei Stained in Various Ways Image from Tom Misteli lab

  22. Artist’s rendition of nucleus Image from nuclear protein database

  23. Chromatin

  24. Turning on a gene: • Get DNA into right part of nucleus. • Unpack chromatin. • Attract RNA polymerase enzyme to transcribe gene from DNA into RNA.

  25. Methods for Studying Gene Regulation • Genetics in model organisms. • Microarrays. • In situs. • Promoters hooked to reporter genes. • Phylogenetic footprinting.

  26. Drosophila Genetics (image labels: Antennapedia mutant, normal)

  27. UCSC Gene Sorter showing GNF Gene Atlas 2 microarray (gene chip) data on several transcription factors.

  28. In Situs Mouse mRNA in situs of various transcription factors (NKx 2.6, NKx 2.9, Hmx1, Mllt7) courtesy of Paul Gray.

  29. Reporter Gene Constructs (diagram: promoter to study, joined to an easily seen gene) Drosophila embryo transfected with the ftz promoter hooked up to a lacZ reporter gene, creating stripes where the ftz promoter is active.

  30. Comparative Genomics Scott Schwartz

  31. Human/Mouse blastz at BMP10

  32. Conservation of Gene Features Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.
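The sampling step on this slide can be sketched as follows. The idea, as read here, is that regions of very different lengths (a short 5’ UTR, a long intron) are each reduced to the same number of evenly spaced sample points so that many genes can be averaged position by position. The function names and the use of per-base score lists are assumptions made for illustration. The original program is not shown in the talk.

```python
def evenly_spaced_indices(region_length, n_samples=200):
    """Pick n_samples indices spread evenly from the first to the last base of a region."""
    if region_length <= 0 or n_samples <= 0:
        raise ValueError("region_length and n_samples must be positive")
    if n_samples == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic; short regions repeat indices.
    return [(i * (region_length - 1)) // (n_samples - 1) for i in range(n_samples)]


def mean_profile(score_tracks, n_samples=200):
    """Average per-base conservation scores over many regions of differing length."""
    if not score_tracks:
        raise ValueError("need at least one score track")
    totals = [0.0] * n_samples
    for track in score_tracks:
        for k, idx in enumerate(evenly_spaced_indices(len(track), n_samples)):
            totals[k] += track[idx]
    return [total / len(score_tracks) for total in totals]


if __name__ == "__main__":
    # Two hypothetical regions of different lengths, reduced to 3 samples each.
    print(mean_profile([[0.0, 0.5, 1.0], [1.0, 1.0, 0.0, 0.0, 0.0]], n_samples=3))
```

Running `mean_profile` once per region type (upstream, 5’ UTR, first coding exon, and so on) and concatenating the results would give a single averaged curve of the kind the slide describes.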

  33. Normalized eScores

  34. Conservation in Multiple Alignments • As you add more species the phylogenetic footprint gets sharper. • Currently genome.ucsc.edu shows multiple alignments between 8 species. • Alignment and conservation scoring algorithms are interesting and involve dynamic programming.
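For readers unfamiliar with dynamic programming in alignment, here is a minimal sketch of the textbook global alignment recurrence (Needleman-Wunsch) for two sequences. It is not the algorithm behind BLASTZ or the conservation tracks at genome.ucsc.edu, which are considerably more elaborate. The scoring values (match, mismatch, gap) are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept at a time, so memory grows with the shorter dimension rather than the product of the lengths. Recovering the alignment itself, not just its score, would require keeping the full table or a traceback.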

  35. PhyloHMM on Drosophila • Drosophila proteasome alpha 7-1. In many genes like this one, the phylogenetic footprint suggests the promoter actually is downstream of the transcription start site.

  36. Other tools to cybernetically enhance your mind at genome.ucsc.edu

  37. UCSC Gene Sorter • Swiss army knife for dealing with gene sets. • Presents functional data on genes including microarray expression information. • Highlights relationships and connections between genes. • Powerful data mining tool.

  38. Up in Testes, Down in Brain

  39. A Big Bioinformatics Web Site • genome.ucsc.edu gets > 100,000 hits by > 5000 scientists each day. • Involves 600,000 lines of C code, plus bits of awk, Perl, bash, tcsh, Java, R and Tcl. • 1200 CPUs and 12 terabytes of disk. • 12 full-time staff, 18 part-time, grad student and post-doc.

  40. Site Architecture • 8 web servers running Apache and MySQL • CGI’s written in C access genome data and user interface settings in MySQL. • Genome database is bottleneck, and is replicated on each server. • Cluster of 1000 CPUs, and smaller clusters of faster CPUs create annotation files which are loaded into database.

  41. Site Sociology • 1/3 of group telecommutes. • Thursdays are devoted to reading and testing each other’s code and, if necessary, a one- or two-hour meeting. • We develop very incrementally, and do a new release once a week. • 1/4 of group is dedicated to quality assurance; I want to increase this to 1/3. • User support is shared by everyone.

  42. Parasol and Kilo Cluster • UCSC cluster has 1000 CPUs running Linux • 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment • We wrote Parasol job scheduler to keep up. • Very fast and free. • Jobs are organized into batches. • Error checking at job and at batch level.
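The throughput figures on this slide imply an average job size, which helps explain why a fast scheduler was needed. The back-of-envelope calculation below assumes all 1000 CPUs were busy for the full 25 hours, which the slide does not state, so the result is an upper bound on the mean CPU time per job. The function name is made up for this illustration and has nothing to do with Parasol's own interface.

```python
def mean_cpu_seconds_per_job(n_jobs, n_cpus, wall_hours):
    """Average CPU-seconds available per job if every CPU is busy for the whole run."""
    return n_cpus * wall_hours * 3600 / n_jobs


if __name__ == "__main__":
    # Figures from the slide: 1,000,000 BLASTZ jobs, 1000 CPUs, 25 hours.
    print(mean_cpu_seconds_per_job(1_000_000, 1000, 25))
```

Under that assumption each job averages about a minute and a half of CPU time, and the scheduler must start and reap on the order of eleven jobs per second across the cluster (1,000,000 jobs in 90,000 seconds).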

  43. Conclusions • Spaghetti code is not so helpful in understanding the genome. • The human genome suggests that trial-and-error development is likely to yield a robust version of Windows within 3 billion years. • Understanding the flow of control in the genome is a problem that fascinates biologists and computer scientists alike.

  44. Acknowledgements Individuals: Chuck Sugnet, Angie Hinrichs, Fan Hsu, Terry Furey, Heather Trumbower, Kate Rosenbloom, Hiram Clawson, Brian Raney, Rachel Harte, Bob Kuhn, Andy Pohl, Mathieu Blanchette, Donna Karolchik, David Haussler, Bob Waterston, John Sulston, Eric Lander, Richard Gibbs, Francis Collins, Michael Brent, Olivier Jaillon, David Kulp, Ewan Birney, Greg Schuler, Deanna Church, Scott Schwartz, Ross Hardison, Webb Miller and everyone else! Institutions: NHGRI, NCI, HHMI, The Wellcome Trust, taxpayers in the US and worldwide. Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Vancouver GSC, UW and the international sequencing centers. UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.

  45. THE END
