IMG/M and metagenome analysis

Natalia Ivanova, MGM Workshop, February 5, 2009

Outline
1. Problems of metagenomic data
2. IMG/M features
3. Analysing metagenomic data: flowcharts

1. Problems of metagenomic data (metagenomic data is the problem) (see IMG/M -> Using IMG/M -> About IMG/M -> Background for definitions)

Metagenomic data are noisy
Definition of a high-quality genome sequence: in "finished" JGI genomes, for example, each base is covered by at least two Sanger reads in each direction with a quality of at least Q20.
Definition of a "high-quality" metagenome? Too many variables:
• species composition/abundance
• amount of DNA available
• average GC content of each species (applies to 454 Titanium as well)
• "clonability" of the DNA of each species (or biases of 454 libraries)
• amount of sequence allocated
• no clear sequencing goal
• …
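As a reminder of what the Q20 threshold means: a Phred quality score Q corresponds to an estimated base-call error probability p = 10^(-Q/10), so Q20 is roughly one expected error per 100 bases. A minimal sketch of the conversion follows; the function names are illustrative and not part of any JGI or IMG/M tool.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Error probability at the Q20 threshold used for finished genomes.
print(phred_to_error_prob(20))
```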

Metagenomic data are noisy
• Sequence coverage of metagenomes is low
• Rate of sequencing artifacts is high
• Frameshifts are the most unpleasant artifacts; they lead to errors in gene prediction
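To illustrate why frameshifts are so damaging, the sketch below translates a short made-up coding sequence before and after a single-base insertion. The sequence is invented for the example; the codon table is the standard genetic code. One extra base shifts the reading frame and here produces a premature stop, so a gene caller would see a truncated or wrong protein.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate frame 1 of an upper-case ACGT string, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

original = "ATGGCTAAAGGTCTGTAA"                  # made-up example gene
shifted = original[:5] + "A" + original[5:]       # one inserted base (sequencing artifact)

print(translate(original))
print(translate(shifted))
```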

Metagenomic data are highly fragmented
• Median scaffold length in 56 GEBA genomes: 28,179 bp
• Median scaffold length in US Sludge, Phrap assembly: 1,157 bp
• Many more gene fragments in metagenomes (median protein size in GEBA genomes: 252 aa; in US Sludge, Phrap assembly: 195 aa)
• Problems with assignment to protein families and functional annotation
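Median lengths like those quoted above can be computed directly from a FASTA file of scaffolds or predicted proteins. The sketch below is one simple way to do it, assuming plain FASTA input; it is not the procedure used to produce the numbers on the slide, and the helper name is illustrative.

```python
import statistics

def fasta_lengths(lines):
    """Return the length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

# Tiny in-memory example; for a real file use: with open(path) as fh: fasta_lengths(fh)
example = [">scaffold_1", "ACGT", "AC", ">scaffold_2", "A", ">scaffold_3", "ACGTACGT"]
print(statistics.median(fasta_lengths(example)))
```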

Metagenomic datasets are large (or huge)
• No manual annotation (functional annotations in metagenomes should be taken with a grain of salt)
• "Divide and conquer" approach

2. IMG/M features (see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)

IMG/M User Interface Map

Dividing the genes phylogenetically
• Bins: Microbiome Details -> Microbiome Information -> Bins (of scaffolds)
• Phylogenetic Distribution of Genes: Microbiome Details -> Phylogenetic Distribution of Genes
Components:
• histograms
• Protein Recruitment Plots
• summary statistics tables
• lists of genes
[Diagram labels: histogram (phylum/class), histogram (family), histogram (species), recruitment plots; gene counts, gene lists, summary statistics]

Dividing the genes by abundance/by function
• Abundance Profiles: Compare Genomes -> Abundance Profiles Tools
Components:
• Abundance Profile Overview
• Abundance Profile Search
• Function Comparisons
• Function Category Comparisons
Common parameters:
• Normalization (none/scale for size)
• Type of count (raw counts/estimated gene copies)
• Type of protein family (COG, Pfam, Enzyme, TIGRfam)
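The reason for the "scale for size" option is that raw counts from metagenomes of different sizes are not directly comparable. The sketch below shows one generic way to scale counts by dataset size. The exact formula IMG/M applies is not given in these slides, so the function, the scaling constant, and the family identifiers are all assumptions made for illustration.

```python
def scale_for_size(counts, total_genes, scale=10000):
    """Express per-family gene counts per `scale` genes in the dataset.

    counts: dict mapping a protein family ID to its raw gene count
    total_genes: total number of genes in the metagenome
    """
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    return {family: n * scale / total_genes for family, n in counts.items()}

# Two hypothetical metagenomes of very different size, same placeholder family.
small = scale_for_size({"FAMILY_X": 5}, total_genes=1000)
large = scale_for_size({"FAMILY_X": 50}, total_genes=100000)
print(small, large)
```

In this toy example the larger dataset has ten times more raw hits, yet after scaling the family is ten times less abundant there than in the small dataset.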

3. Analysing metagenomic data: flowcharts

Sanger metagenomes
[Flowchart: Sanger library -> 10 plate QC -> Full sequence]
10 plate QC:
• raw read QC: GC content, insert-less clones, contamination, 16S sequences
• loading to IMG/M-ER (upon request)
• taxonomic analysis (MEGAN)
• manual analysis (protein families, etc.)
Full sequence:
• vector and quality trimming
• loading to IMG/M-ER
• assembly
• annotation
• binning
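The GC-content part of raw read QC can be sketched in a few lines. This is only an illustration of the calculation, not the JGI QC pipeline; ambiguous bases such as N are ignored, and the example reads are made up.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous (A, C, G, T) bases of a read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

reads = ["GGCCAATT", "ggccggcc", "ATATNNAT"]   # made-up reads
for read in reads:
    print(read, gc_content(read))
```

A histogram of these per-read values is a quick way to spot an unexpected GC peak, which may point to contamination.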

454 Titanium metagenomes
[Flowchart: Titanium library -> ¼ run QC (100 Mb) -> Full sequence (1 run, ~500 Mb); steps marked "?" were shown as open questions]
¼ run QC (100 Mb):
• raw read QC; initial assembly?
• 16S pyrotags
• loading to IMG/M-ER (upon request)
• taxonomic analysis (MEGAN)
• manual analysis (protein families, etc.)
Full sequence (1 run, ~500 Mb):
• dereplication
• quality trimming?
• annotation?
• binning?
• loading to IMG/M-ER
• assembly?
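Dereplication removes artificial duplicate reads, which inflate abundance estimates. The sketch below is a deliberately simplified version that treats reads sharing the same starting prefix as replicates and keeps the longest one. Dedicated tools cluster reads with identity thresholds; the prefix rule and the parameter name here are assumptions made for illustration, not the method used in the workflow above.

```python
def dereplicate(reads, prefix_len=20):
    """Keep the longest read among those sharing the same first prefix_len bases."""
    best = {}
    for read in reads:
        key = read[:prefix_len]
        if key not in best or len(read) > len(best[key]):
            best[key] = read
    return list(best.values())

reads = ["ACGTACGTAA", "ACGTACGTAACC", "TTTTACGTAA"]   # made-up reads
print(dereplicate(reads, prefix_len=8))
```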

Sanger/Titanium metagenomes: unassembled data
Analyses for unassembled metagenomes:
taxonomic analysis using Phylogenetic Distribution of genes
• gross counts of hits to taxa
• hits to housekeeping genes at different % identity
• compare to 16S and MEGAN results
abundance analysis using Function Comparisons and Function Category Comparisons
• compare to relevant metagenomes (ecology/taxonomy)
• compare to relevant genomes (ecology/taxonomy)
• check "Genes in internal clusters"
abundance analysis of custom function categories using Function Profiles
• find the relevant genes and reference sequences in the literature
• identify relevant protein families
• add them to Function Cart, run Function Profiles, compare sums of counts
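The last step, comparing sums of counts for a custom function category, amounts to adding up the per-family counts for each metagenome. A minimal sketch, assuming the Function Profile results have been exported into dictionaries; the family IDs, sample names, and counts are placeholders, not real data.

```python
def category_total(profile, families):
    """Sum the gene counts of the given protein families in one metagenome profile."""
    return sum(profile.get(family, 0) for family in families)

# Placeholder family IDs standing in for the families chosen from the literature.
custom_category = ["FAM_A", "FAM_B", "FAM_C"]

# Placeholder per-metagenome counts (family ID -> gene count).
profiles = {
    "sample_1": {"FAM_A": 12, "FAM_B": 3, "FAM_Z": 40},
    "sample_2": {"FAM_A": 1, "FAM_C": 2},
}

totals = {name: category_total(p, custom_category) for name, p in profiles.items()}
print(totals)
```

If the datasets differ in size, compare normalized rather than raw totals.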

Sanger/Titanium metagenomes: assembled data
Analyses for assembled metagenomes:
taxonomic analysis using Phylogenetic Distribution of genes; binning
• look for reference genomes
• try to select a training set for binning
abundance analysis using Function Comparisons and Function Category Comparisons
• compare to relevant metagenomes (ecology/taxonomy)
• compare to relevant genomes (ecology/taxonomy)
• check "Genes in internal clusters"
abundance analysis of custom function categories using Function Profiles
• find the relevant genes and reference sequences in the literature
• identify relevant protein families
• add them to Function Cart, run Function Profiles, compare sums of counts

Sanger/Titanium metagenomes: assembled and binned data
Analyses for assembled and binned metagenomes:
QC analysis of bins
• check the genes on the scaffolds with lowest confidence
• analysis of bin coverage: check the presence of COGs in biosynthetic pathways, ribosomal proteins, etc.
metabolic reconstruction on bins
• COG Pathways and Functional Categories
• KEGG maps
• custom pathways
compare bin content using Phylogenetic Profiles
• keep in mind bin coverage
• analyze gene presence/absence in pathway context
• be careful with unique proteins: they may be errors of gene prediction
analyze recombination within populations using SNP VISTA
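The bin coverage check can be approximated by asking what fraction of a set of expected families (for example, ribosomal protein COGs) is present in a bin. The sketch below assumes you supply your own marker list; the IDs shown are placeholders, and this is a rough estimate rather than an IMG/M feature.

```python
def bin_coverage(bin_families, marker_families):
    """Return (fraction of marker families found in the bin, sorted list of missing markers)."""
    markers = set(marker_families)
    if not markers:
        raise ValueError("marker_families must not be empty")
    found = markers & set(bin_families)
    return len(found) / len(markers), sorted(markers - found)

# Placeholder IDs: four expected marker families, three of them observed in the bin.
markers = ["MARKER_1", "MARKER_2", "MARKER_3", "MARKER_4"]
bin_families = ["MARKER_1", "MARKER_3", "MARKER_4", "OTHER_9"]

fraction, missing = bin_coverage(bin_families, markers)
print(fraction, missing)
```

A low fraction is a warning that genes absent from the bin may simply not have been sequenced or binned, which matters when interpreting presence/absence in pathways.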
