The club: resident JD Watson back2back with DJ Venter • 2001: The Human Genome • International Human Genome Sequencing Consortium, Nature 409:860-921 (2001) • Venter et al., Science 291:1304-1351 (2001)
Prologue: RNA world, the dark matter of genomics • How many coding genes are in the human genome? • The Bet of 2000: • Mean: 61,710 • Range: 30,000 to 150,000 • By the end of the genome project the estimated number of human protein-coding genes had declined to only ~25,000 • What is the source of that discrepancy? • EST-based estimation vs. whole-genome annotation
RNA revolution • The majority of the transcriptional output comes from non-coding RNA • An average of 10% of the human genome (compared with ~1.5% exonic sequence) gives rise to transcripts [Cheng et al. 2005] • Or even more... 62% of the mouse genome is transcribed [FANTOM3: Science 2005]
Various RNAs – A partial list… • messenger RNA (mRNA) • Ribosomal RNA (rRNA) • Transfer RNA (tRNA) • Small nuclear RNA (snRNA) • Small nucleolar RNA (snoRNA) • Short interfering RNA (siRNA) • Micro RNA (miRNA)
The central dogma of molecular biology revisited • RNAs are not merely the intermediary cousins of proteins [Figure labels: Genome, Transcriptome, Proteome; Transcription, Translation; RNA, miRNA, Protein; Regulation by proteins, Regulation by RNA]
Research in biology is complex… • Deciphering biological systems • The advantage (what makes this quest feasible) and the hindrance (what makes this quest inherently difficult) are both explained by evolution.
The hindrance: topological entanglement of functional interconnections • The difficulties in our research fundamentally owe their complexity to the designer: natural selection. • What is it: a “Robot” or a “UFO”? • The reason lies in the profound difference between systems “designed” by natural selection and those designed by intelligent engineers [Langton 1989, Artificial Life].
Bottom line: we investigate an outrageously complex weave of interconnections • The “textbook networks” represent only the tip of the iceberg. • miRNAs and “Regulomics” • microRNAs are expected to represent ~1% of predicted genes [Lim et al., 2003] • Lewis et al. (2003) estimate an average of five targets per miRNA • Many targets are transcription factors: miRNAs regulate the regulators
The advantage: universal homology, which enables comparative biology. • Bottom line: research in biology advances through a reductionist approach, using simple model organisms to infer the functionality of homologous systems.
Human genome statistics • 2.91 billion base pairs • 24,000 protein-coding genes (>30,000 non-coding genes ???) • 1.5% exons (127 nucleotides) • 24% introns (~3,000 nucleotides) • 75% intergenic (no genes) • Repetitive elements rule (~45% dispersed repeats) • Average size of a gene is 27,894 bases • A gene contains an average of 8.8 exons (titin contains 234 exons) • Average of 4 different proteins per gene (alternative splicing)
Detecting genes in the human genome • Gene finding methods: • Ab initio: use general knowledge of gene structure (rules and statistics). The challenge: small exons in a sea of introns • Homology-based. The problem: will not detect novel genes
Genscan (ab initio) • Based on a probabilistic model of gene structure • Takes into account: promoters; gene composition (exons/introns); GC content; splice signals • Goes over all 6 reading frames • Burge and Karlin, 1997, Prediction of complete gene structure in human genomic DNA, J. Mol. Biol. 268
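To make "all 6 reading frames" concrete, here is a minimal sketch that enumerates the three forward and three reverse-complement frames of a DNA string and reports the longest ATG-to-stop run. It is a plain open-reading-frame scan, not Genscan's probabilistic model; the function names `six_frames` and `longest_orf` are hypothetical and chosen only for this illustration.

```python
# Illustrative ORF scan over six reading frames.
# Assumption: this is NOT Genscan's model, only a demonstration of frames.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Yield (strand, offset, codons) for 3 forward and 3 reverse frames."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield strand, offset, codons


def longest_orf(seq):
    """Return (length_in_codons, strand, offset) of the longest ATG..stop run.

    The length counts codons from the ATG up to, but not including, the stop.
    Returns (0, None, None) when no complete ORF is found.
    """
    best = (0, None, None)
    for strand, offset, codons in six_frames(seq):
        start = None
        for i, codon in enumerate(codons):
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                length = i - start
                if length > best[0]:
                    best = (length, strand, offset)
                start = None
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A scan like this also shows why the slide on gene finding calls small exons a challenge: in eukaryotes an exon need not contain a full ATG-to-stop run, so a real gene finder has to combine frame information with splice signals and composition statistics.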
Eukaryotic splice sites [Figure label: poly-pyrimidine tract]
CpG islands: another signal • CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base-pairs!) than the rest of the genome • CpG islands often occur near the beginning of genes, which may be related to the binding of the transcription factor Sp1
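One common way to quantify "a higher frequency of CG dinucleotides" is the observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G), computed together with the GC fraction. The sketch below computes both for a sequence window. The function name `cpg_stats` is hypothetical, and the slide itself does not give thresholds; any cut-off applied to these numbers would be an additional assumption.

```python
def cpg_stats(seq):
    """Return (gc_fraction, observed_over_expected_cpg) for a DNA string.

    observed/expected = (#CG dinucleotides * length) / (#C * #G).
    Returns 0.0 for the ratio when the window has no C or no G.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


if __name__ == "__main__":
    for window in ("CGCGCGCG", "CCGG", "ATATATAT"):
        print(window, cpg_stats(window))
```

In practice the statistic would be evaluated over sliding windows along a chromosome, and windows that score well above the genome-wide background would be flagged as candidate islands.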
Gene Ontology (GO) • GO describes proteins in terms of: • biological process (e.g. induction of apoptosis by external signals) • cellular component (e.g. membrane fraction) • molecular function (e.g. protein kinase) [Figure labels: cell nucleus, nuclear chromosome]
Comparative proteome analysis Functional categories based on GO
Comparative proteome analysis • Humans have more proteins involved in cytoskeleton, immune defense, and transcription
Horizontal (lateral) gene transfer • Lateral Gene Transfer (LGT) is any process in which an organism transfers genetic material to another organism that is not its offspring
Mechanisms: • Transformation • Transduction (phages/viruses) • Conjugation
Bacteria to vertebrate LGT detection • E-value of bacterial homolog X9 better than eukaryal homolog
Human query:
Hit              E-value
Frog             4e-180
Mouse            1e-164
E. coli          7e-124
Streptococcus    9e-71
Worm             0.1
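The detection rule can be sketched as a filter over a hit list. The slide's "X9" is ambiguous; the sketch below assumes it means the best bacterial E-value must be at least 9 orders of magnitude better than the best non-vertebrate eukaryotic E-value, and exposes that as the `orders` parameter so other readings can be tried. The organism groupings and the function name `is_lgt_candidate` are hypothetical and exist only for this illustration.

```python
import math

# Assumption: hand-made groupings for the organisms on the slide.
VERTEBRATES = {"Frog", "Mouse"}
BACTERIA = {"E. coli", "Streptococcus"}
E_FLOOR = 1e-300  # avoid log10(0) for E-values reported as 0


def is_lgt_candidate(hits, orders=9.0):
    """Return True if the best bacterial hit beats the best non-vertebrate
    eukaryotic hit by at least `orders` orders of magnitude in E-value.

    hits: iterable of (organism, e_value) pairs for one vertebrate query.
    Vertebrate hits are ignored; no bacterial hit means no candidate.
    """
    hits = list(hits)
    bacterial = [e for org, e in hits if org in BACTERIA]
    other_euk = [e for org, e in hits
                 if org not in BACTERIA and org not in VERTEBRATES]
    if not bacterial:
        return False
    if not other_euk:
        return True  # no non-vertebrate eukaryotic homolog at all
    best_bact = math.log10(max(min(bacterial), E_FLOOR))
    best_euk = math.log10(max(min(other_euk), E_FLOOR))
    return best_euk - best_bact >= orders


if __name__ == "__main__":
    slide_hits = [("Frog", 4e-180), ("Mouse", 1e-164), ("E. coli", 7e-124),
                  ("Streptococcus", 9e-71), ("Worm", 0.1)]
    print(is_lgt_candidate(slide_hits))
```

On the slide's example the worm hit (0.1) is far weaker than the E. coli hit (7e-124), so the query is flagged. A flag like this is only a candidate: a missing eukaryotic homolog can reflect sparse genome sampling rather than real transfer.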
Bacteria to vertebrate LGT [Figure labels: non-vertebrates, bacteria, vertebrates]
Bacteria to vertebrate LGT?? • Hundreds of sequenced bacterial genomes vs. a handful of eukaryotes • Gene finding in bacteria is much easier than in eukaryotes • On the practical side: rigid mechanical barriers to LGT in eukaryotes (nucleus, germ line)
Repeat statistics • The human genome is ~45% dispersed repeats • 20% LINEs (AT-rich) • 13% SINEs (11% Alu) (GC-rich) • 8% LTR elements (retrovirus-like) • 2% DNA transposons • Another 3% is tandem simple sequence repeats (e.g. triplets) • Another 3-5% is segmentally duplicated at high similarity (over 1 kb, over 90% identity) • Identifying and screening these out is essential to avoid spurious matches
LINEs and SINEs • Highly successful elements in eukaryotes • LINE: Long Interspersed Nuclear Element (>5,000 bp) • SINE: Short Interspersed Nuclear Element (<500 bp) • SINEs are free riders on the backs of LINEs; they encode no proteins
The C-value paradox • Genome size does not correlate with organism complexity
Repetitive elements • The C-value mystery was partially resolved when it was found that large portions of genomes contain repetitive elements
Are Alus functional?? • SINEs are transcribed under stress • SINE RNAs may bind a protein kinase and promote translation under stress • Need to be in regions that are highly transcribed • Role in alternative splicing
Segmental duplications • 1,077 segmental duplications detected • Several genes in the duplicated regions are associated with diseases (may be related to homologous recombination) • Most are recent duplications (conservation of the entire segment, versus conservation of coding sequences only)
481 segments > 200 bp absolutely conserved (100% identity) between human, rat and mouse
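The definition above (at least 200 bp, 100% identity in all three species) can be sketched as a scan for gap-free runs of identical columns in a multiple alignment. The input format (equal-length aligned upper-case strings with "-" for gaps) and the function name `ultraconserved_runs` are assumptions made for this illustration, not the pipeline used in the original study.

```python
def ultraconserved_runs(aligned, min_len=200):
    """Return half-open (start, end) column ranges where every aligned
    sequence carries the same non-gap base for at least min_len columns.

    aligned: list of equal-length strings from one multiple alignment.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    start = None
    # Iterate one column past the end so a run touching the end is closed.
    for i in range(length + 1):
        identical = (
            i < length
            and aligned[0][i] != "-"
            and all(s[i] == aligned[0][i] for s in aligned)
        )
        if identical:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


if __name__ == "__main__":
    human = "ACGTACGTAC"
    mouse = "ACGTTCGTAC"
    rat = "ACGTTCGTAC"
    # Toy threshold of 5 columns instead of 200, for a short example.
    print(ultraconserved_runs([human, mouse, rat], min_len=5))
```

With real data the threshold would stay at 200 columns and the coordinates would be mapped back from alignment columns to genome positions.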
Comparison with a neutral substitution rate • Compare the substitution rate in any 1 Mb region • Probability of 10^-22 of obtaining 1 ultraconserved element (UE) by chance
481 UEs • 111 overlap a known mRNA (exonic UEs) • 256 have no overlap (non-exonic): 100 intronic, 156 intergenic • 114 inconclusive
Who are the genes? • Type 1: exonic • Type 2: genes which are near non-exonic UEs (???)
Intergenic UEs • Genes which flank intergenic UEs are enriched for early developmental genes • Are UEs distal enhancers of these genes?
Gene enhancer • A short region of DNA that binds an activator; it is usually quite distant from the gene and can act at that distance because of complex chromatin folding • The activator recruits transcription factors to the gene
Experimental studies of UEs • Tested 167 UEs (both mouse-human UEs and fish-human UEs) for enhancer activity: each was cloned upstream of a reporter gene to test its activity • 45% functioned as enhancers
A bioinformatic success • Ultraconservation can predict highly important function!
BUT … Ahituv et al., PLoS Biol. 2007 Sep;5(9):e234 • Chose 4 UEs near specific genes: genes that show a specific phenotype when knocked out • Performed complete deletion of these UEs • … the mice were viable and did not show any altered phenotype
Conclusions… • Ultraconservation can be indicative of important function • … • And sometimes not: gene redundancy; long-range phenotypes; laboratories cannot mimic life