M-GCAT: A Tool for Efficient Multiple Genome Alignment and Comparison

Todd Treangen1 , Xavier Messeguer1,2 1 Algorithmics and Genetics group, Technical University of Catalonia, Spain 2Barcelona Supercomputing Center

Outline • Part 1: Introduction and Overview • Part 2: M-GCAT Specifications and Availability • Part 3: M-GCAT user interface • Part 4: Live Demo • Part 5: Experimental Results

M-GCAT Tutorial Part 1: Introduction and Overview

Motivation • Comparative genomics continues to provide important information about the evolution of species and about genetic diseases. • As new genomes are published, more variations of multiple genome alignment become possible, so tools able to deal with extensive comparison of large genomes have become essential. • Several new tools have recently emerged to handle these variations, but multiple large genome comparison still poses several challenges to current tools.

Comparing and aligning genomes • Earlier attempts at efficient and accurate alignment used a global dynamic programming algorithm. • However, for sequences longer than 10,000 bp, global alignment is computationally expensive, so a different approach is needed to align multiple large sequences.
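To illustrate why global dynamic programming becomes expensive, here is a minimal sketch of a Needleman-Wunsch style score computation for two sequences. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration and are not taken from M-GCAT. The nested loops visit every pair of positions, so two sequences of 10,000 bp each already require about 100 million cell updates, and the cost grows further with more sequences.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b (illustrative scoring values).

    Keeps only two rows of the table, but still does len(a) * len(b) work.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


print(global_alignment_score("ACGT", "ACT"))
```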

An anchor-based alternative • Anchor-based sequence alignment is widely recognized as a valid alternative for efficient alignment, and is used in tools such as MGA, MUMmer, and MAUVE. • The idea is that any optimal global alignment will very likely contain significant matching regions, so the alignment can be built around these anchors. • Most existing tools strictly enforce collinearity and require that all anchors be non-overlapping. M-GCAT can consider all anchors (MUMs) when processing a multiple alignment.

What is a MUM? • Given a set of genomes G = {G1, G2, ... }, a string is a Maximal Unique Match (MUM) if and only if: • It is a substring of each genome, i.e. it is a match • It appears once and only once in each genome, i.e. it is unique • It is not a substring of another unique match, i.e. it is maximal
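The three conditions above can be checked directly on very small inputs. The following is a brute-force sketch only, with hypothetical function names; it is not how M-GCAT finds MUMs, and tools for real genomes rely on indexed data structures such as suffix trees rather than enumerating substrings.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(genomes, min_length=1):
    """Return the MUMs shared by all genomes (brute force, tiny inputs only)."""
    first = genomes[0]
    unique = set()
    for i in range(len(first)):
        for j in range(i + min_length, len(first) + 1):
            candidate = first[i:j]
            # Match + unique: exactly one occurrence in every genome.
            if all(count_occurrences(g, candidate) == 1 for g in genomes):
                unique.add(candidate)
    # Maximal: not a substring of another unique match.
    return sorted(
        m for m in unique
        if not any(m != other and m in other for other in unique)
    )


print(find_mums(["CACGTA", "GACGTT"]))
```

In this toy input, shorter shared strings such as "AC" or "CGT" are also unique matches, but only "ACGT" is maximal.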

And MUM Cluster • Additionally, a cluster can be defined as follows: • Given a set of MUMs M = {M1, M2, ... }, a cluster is a subset of M, where: • Consecutive MUMs in the cluster are within some threshold distance of each other • All MUMs in the cluster are collinear
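One way to picture this definition is a greedy grouping pass over MUMs sorted by their position in the first genome. This is a simplified sketch under stated assumptions: each MUM is represented as a hypothetical pair (start positions per genome, length), only forward-strand matches are handled, and the function name `cluster_mums` is invented for illustration. M-GCAT's actual clustering procedure is not described in these slides.

```python
def cluster_mums(mums, max_gap):
    """Group MUMs into clusters.

    mums: list of (starts, length), where starts is a tuple holding the
          MUM's start position in each genome.
    max_gap: threshold distance allowed between consecutive MUMs.
    """
    clusters = []
    for starts, length in sorted(mums):  # ordered by position in genome 1
        if clusters:
            prev_starts, prev_length = clusters[-1][-1]
            # Distance from the end of the previous MUM, in every genome.
            gaps = [s - (p + prev_length) for s, p in zip(starts, prev_starts)]
            # Non-negative gaps keep the same order in all genomes (collinear);
            # the upper bound enforces the distance threshold.
            if all(0 <= g <= max_gap for g in gaps):
                clusters[-1].append((starts, length))
                continue
        clusters.append([(starts, length)])
    return clusters


example = [((0, 0), 10), ((15, 14), 8), ((500, 40), 12), ((520, 58), 5)]
print([len(c) for c in cluster_mums(example, max_gap=20)])
```

A MUM that appears in a different order in one genome, as after a rearrangement, produces a negative gap and therefore starts a new cluster in this sketch.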

Existing Tools • A selection of existing genome comparison and alignment tools • M-GCAT was designed for genome comparisons involving large, multiple, closely related genomes, while being able to detect rearrangements such as inversions and transpositions.

M-GCAT Tutorial Part 2: M-GCAT Specifications

M-GCAT Specifications • M-GCAT is written in C++ and Python. It has been compiled on Windows, Linux, Mac OS X, and Solaris, and should be readily portable to other platforms. • M-GCAT requires a minimum of 30 MB of disk space for installation, and at least 512 MB of available RAM is recommended for large genome comparisons.

M-GCAT Availability • M-GCAT is publicly available for non-commercial use at: • http://alggen.lsi.upc.es/ • Required software for running M-GCAT: • PYTHON 2.3 or 2.4: http://www.python.org • Other software • MUSCLE: http://www.drive5.com/muscle

M-GCAT Tutorial Part 3: M-GCAT user interface

First, select the M-GCAT Parameters page

Then, to add FASTA format DNA sequences, click on Add Sequences

The added sequences will appear here once selected

We can view the added sequences by clicking here

Double-clicking a sequence toggles its Reverse Complement.
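For readers unfamiliar with the term, the reverse complement reads the opposite DNA strand: the sequence is reversed and each base is swapped with its pairing partner. The helper below is a minimal sketch with a hypothetical name, assuming only the bases A, C, G, T and the ambiguity code N.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("ATGC"))
```

Applying the function twice returns the original sequence, which matches the toggle behaviour described above.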

To configure parameters before running, there are three parameter sections: MUM, Cluster, and Alignment.

In MUM Parameters, there are three values you can set: Min MUM length, Min Anchor Length, and Random MUM Length.

Min Anchor length is the minimum allowable size for the initial set of MUM anchors found among all genomes.

Min MUM length is the minimum allowable size for a MUM within a genome region during the coarsening process. As the searchable genome regions between the initial MUM anchors become smaller and smaller, so should this value.

Random MUM length is the maximum length of MUMs that can be considered insignificant with respect to the genomes being compared. All MUMs shorter than this length that meet the random criteria will be removed.

In Cluster Parameters, there are three values you can set: the Q value, the D value, and the Partition value (P).

Q is the minimum allowable length of a region where M-GCAT will perform a search for new MUMs. Decreasing this value will generally generate more MUMs.
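The effect of Q can be pictured as a filter on the gaps between existing anchors within one genome. The sketch below is an assumption about how such a filter might look, not M-GCAT's implementation; the function name and the (start, length) anchor representation are hypothetical.

```python
def searchable_regions(anchors, genome_length, q):
    """Return (start, end) gaps between anchors whose length is at least q.

    anchors: list of (start, length) positions of anchors in one genome.
    """
    regions = []
    cursor = 0  # end of the region covered by anchors so far
    for start, length in sorted(anchors):
        if start - cursor >= q:
            regions.append((cursor, start))
        cursor = max(cursor, start + length)
    # Tail of the genome after the last anchor.
    if genome_length - cursor >= q:
        regions.append((cursor, genome_length))
    return regions


anchors = [(100, 50), (160, 20), (400, 30)]
print(searchable_regions(anchors, genome_length=1000, q=100))
print(searchable_regions(anchors, genome_length=1000, q=10))
```

With the smaller Q, the short gap between the first two anchors also qualifies, so more regions are searched and more MUMs can be found.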

D is the maximum allowable distance between any two MUMs in a cluster. Increasing this value will generally decrease the number of clusters and increase the percentage of the genomes that can be aligned with ClustalW.

P is short for Partition, and it is used to partition large genomes in order to reduce memory usage. For example, a Partition value of 1000 would split a genome of 10,000 bp into 10 parts, and thus reduce memory usage by a factor of 9 or 10. Decreasing this value generally decreases memory usage and increases running time.
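The arithmetic in the example can be reproduced with a simple chunking helper. This is an illustrative sketch with a hypothetical function name; how M-GCAT treats matches that span a partition boundary is not described here, and the sketch ignores that question.

```python
def partition_genome(genome, partition_size):
    """Split a genome string into consecutive parts of at most partition_size bases."""
    return [
        genome[i:i + partition_size]
        for i in range(0, len(genome), partition_size)
    ]


parts = partition_genome("A" * 10000, 1000)
print(len(parts), len(parts[0]))
```

Only one part needs to be held in the working structures at a time, which is where the memory saving comes from; the extra passes over the parts account for the longer running time.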

M-GCAT Parameter Description

After configuring the M-GCAT parameters, we are ready to compare these two sequences. To start a new analysis, click on Run M-GCAT.

While M-GCAT is running, the shell window tracks its progress.

When the results are ready, the viewer automatically loads the M-GCAT summary page, which summarizes the analysis.

Clicking a Cluster with the left mouse button displays the length, start and end positions, and bases of this cluster.

The highlighted area shows a mapping of the clusters in each sequence, and is used to store alignment information for each region among the genomes.

Choose Select->Align select regions to align the currently selected cluster.

When MUSCLE finishes calculating the alignment, the results are displayed just below the Cluster window. The alignment is offered in two formats: the first is MUM-oriented, marking the *MUMs* in each sequence; the second is a standard alignment format showing the resulting linear arrangement and the positions.

Additionally, an alignment score is calculated, and based on this value the cluster alignment color is updated.

The regions between any two clusters can also be selected for alignment.

The alignment of this inter-cluster region returned a lower alignment score, as indicated by its lighter color. Since inter-cluster regions have no MUMs, they can take an extremely long time to align with either alignment program.

Select->Align ALL clusters will iteratively align ALL clusters found among all genomes.

Simply by surveying the color landscape of the genomes, we can get an approximate idea of similarity between these sequences.

This last alignment option aligns the entire length of the genomes. It will align ALL regions in the genomes, both clusters and inter-cluster regions.

Now that the entire genome is aligned, we can save the alignment for future reference. Choose Select->Save M-GCAT alignment data as.. to store the alignment in a file. *Note: The files can be quite large due to the amount of information generated from the alignment.

The BLAST results show that the region in Mycoplasma pneumoniae is more similar to itself than to any other sequence.

Another feature is the ability to view the gene information for each cluster. To view this information, select the Gene viewer workspace in the View menu.

When a cluster is selected, the genes will be displayed in the lower left window. Each horizontal rectangle represents a gene, an arrow its orientation, and the vertical lines represent MUMs within the selected cluster. The color is specific to gene function and follows the scheme in the provided legend.

A gene in the first genome can be selected by clicking on it with the left mouse button. When selected, relevant information is displayed in the lower right window. This information has been obtained from PTT files provided by NCBI.

Selecting the gene in the same position in the second sequence reveals that the two genes share the same function, and indeed are related.

To view a MUM which links these two genes, select it and the MUM information will be displayed in the lower right window.
