
Hiroshi Dozono Saga University

Visualization and Classification of DNA Sequences Using Pareto Learning Self Organizing Maps Based on Frequency and Correlation Coefficient


Presentation Transcript


  1. Visualization and Classification of DNA sequences using Pareto learning Self Organizing Maps based on Frequency and Correlation Coefficient • Hiroshi Dozono, Saga University

  2. Introduction (1) • The first step of genome analysis is DNA sequencing, which identifies the order of the nucleotides in DNA. • About 10 years ago, DNA sequencing required high cost and a long time. • Recently, Next Generation Sequencing (NGS) can read sequences very rapidly at low cost. • $100 to $1000 in 1 hour. • NGS produces large amounts of sequence data at once. • Gbytes to Tbytes.

  3. Introduction (2) • After the sequences are read, further analyses are conducted. • Identify the organisms • Identify the functions of the genome • Remap the sequences onto reference sequences • Compare genomes among organisms • Comparing genomes precisely requires a large amount of computation. • Sequence alignment is generally used. • Sequence alignment is effective for pairwise comparison or for comparing a small number of sequences. • It requires heavy computation when a large number of sequences are compared. • Statistical information about the sequences can serve as an indicator of the similarity among sequences.

  4. DNA sequencing • DNA sequence • Sequence of 4 types of nucleotide: A, G, T, C • Complementary nucleotides hybridize with each other: A-T, G-C
       AGTCTTATCGATTAG
       |||||||||||||||
       TCAGAATAGCTAATC
     • DNA sequencing - Genome analysis • Next generation sequencers can read all DNA sequences of one organism, or of several organisms, at once. • A large amount of sequencing data (from several Gbytes to Tbytes) is produced. • The result of sequencing is obtained as a collection of short fragments of the nucleotides A, G, T and C. • An effective method for identifying the features of the sequences is required.
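
The pairing rule on this slide (A-T, G-C) can be illustrated with a minimal Python sketch. The function name `complement` is a hypothetical helper, not something from the presentation; it returns the base-by-base complement without reversing the strand, matching the diagram above.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("AGTC", "TCAG")


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA sequence (no reversal)."""
    return seq.upper().translate(COMPLEMENT)


print(complement("AGTCTTATCGATTAG"))  # TCAGAATAGCTAATC
```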

  5. Conventional DNA analysis • Sequencing • Reconstruction of the sequence • Identification of the coding regions that code for genes • Identification of the functions of the genes • These steps require large computational costs after sequencing. • Our approach aims to extract global features of the sequencing data without precise analysis.

  6. Frequency based SOM • A SOM that uses the frequencies of N-tuples in DNA sequences as the input vector is proposed in T. Abe, T. Ikemura, et al., Informatics for unveiling hidden genome signatures, Genome Res., vol.13, p.693-702 • For N-tuples, the dimension of the input vector is 4^N
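
A minimal sketch of how an N-tuple frequency vector of dimension 4^N could be built from a sequence. The function name `tuple_frequencies`, the use of overlapping windows, and the choice to skip windows containing symbols other than A, C, G, T are assumptions for illustration; the cited paper may normalize differently.

```python
from itertools import product


def tuple_frequencies(seq: str, n: int) -> list[float]:
    """Relative frequencies of all 4**n N-tuples, counted over overlapping windows."""
    alphabet = "ACGT"
    # Fixed ordering of all possible N-tuples, e.g. AA, AC, AG, AT, CA, ... for n = 2.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=n))}
    counts = [0] * len(index)
    total = 0
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])
        if j is not None:  # assumption: ignore windows with non-ACGT symbols
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


vec = tuple_frequencies("ACGCTACTAG", 2)
print(len(vec))  # 16
```

With `n = 4` the vector has 256 entries, which is why the input dimension grows quickly with the tuple length.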

  7. SOM based on correlation coefficients of nucleotides • Correlation coefficients (CC) of a DNA sequence:
       Sequence  ACGCTACTAG
       A         1000010010
       C         0101001000
       G         0010000001
       T         0000100100
     • ρAA(n): CC between A and n-shifted A • ρAC(n): CC between A and n-shifted C • ... • ρTT(n): CC between T and n-shifted T • For all combinations of A, G, T, C and for shifts from 1 to n, 4x4xn correlation coefficients are calculated and used as the input vector of the SOM. • Compared with the dimension of the N-tuple frequencies (4^N), the dimension of the CC vector is much smaller.
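
A minimal sketch of the 4x4xn input vector, assuming the coefficient is the Pearson correlation between the 0/1 indicator sequence of one nucleotide and the shifted indicator sequence of another. The names `pearson` and `cc_vector`, the ordering of the components, and the choice to return 0 when a nucleotide is absent are assumptions, not details taken from the slides.

```python
from math import sqrt

BASES = "AGTC"


def pearson(x: list[int], y: list[int]) -> float:
    """Pearson correlation of two equal-length sequences; 0.0 if undefined."""
    n = len(x)
    if n == 0:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat a constant indicator as uncorrelated
    return cov / sqrt(vx * vy)


def cc_vector(seq: str, max_shift: int) -> list[float]:
    """Correlation coefficients for all base pairs and shifts 1..max_shift."""
    # Binary indicator sequence per nucleotide, as in the table on the slide.
    ind = {b: [1 if s == b else 0 for s in seq] for b in BASES}
    vec = []
    for n in range(1, max_shift + 1):
        for x in BASES:
            for y in BASES:
                # Pair position i of X with position i + n of Y.
                vec.append(pearson(ind[x][:-n], ind[y][n:]))
    return vec


vec = cc_vector("ACGCTACTAG", 4)
print(len(vec))  # 64
```

For shifts 1 to 4 the vector has 4 x 4 x 4 = 64 components, compared with 256 for 4-tuple frequencies.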

  8. Using these equations, correlation coefficients can be calculated without converting DNA sequences to binary sequences.
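
The equations themselves are not part of this transcript. As a hedged illustration only, one standard count-based form follows if the coefficient is assumed to be the Pearson correlation of 0/1 indicator sequences; the symbols $p_X$, $q_Y$ and $p_{XY}(n)$ are introduced here and may differ from the notation on the slide. For a sequence $s_1 \dots s_L$:

```latex
\rho_{XY}(n) = \frac{p_{XY}(n) - p_X\, q_Y}{\sqrt{p_X (1 - p_X)\; q_Y (1 - q_Y)}},
\qquad
p_{XY}(n) = \frac{\#\{\, i : s_i = X,\ s_{i+n} = Y \,\}}{L - n}
```

Here $p_X$ is the fraction of positions $1 \dots L-n$ that hold nucleotide $X$, and $q_Y$ is the fraction of positions $n+1 \dots L$ that hold nucleotide $Y$. Only counts of single nucleotides and of nucleotide pairs at distance $n$ are needed, so no binary sequences have to be built. For ACGCTACTAG with $X = \mathrm{A}$, $Y = \mathrm{C}$, $n = 1$: $p_{XY} = 2/9$, $p_X = q_Y = 3/9$, giving $\rho_{AC}(1) = 0.5$.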

  9. Experimental results of SOM based on correlation coefficients • Settings of the experiments • Set 1: genes from amino acid metabolism of 6 species • Set 2: genes from 7 metabolic pathways of Homo sapiens • The sequences are segmented into fragments of 1000 bases.
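
A minimal sketch of the segmentation step. The slides do not say whether fragments overlap or how a short trailing piece is handled; this sketch assumes non-overlapping fragments and drops any remainder shorter than the fragment size. The function name `segment` is hypothetical.

```python
def segment(seq: str, size: int = 1000) -> list[str]:
    """Split a sequence into non-overlapping fragments of exactly `size` bases."""
    # Assumption: a trailing piece shorter than `size` is discarded.
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


fragments = segment("ACGT" * 625)  # 2500 bases
print(len(fragments))  # 2
```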

  10. Experimental results of Set 1 (1) • Map of frequencies of 4-tuples from 6 species (L=256) • Map of CC of 1-4 shifts from 6 species (L=64) • The resolution and topology of these maps are almost comparable.

  11. Experimental results of Set 1(2) • For small dimensions, CC shows better separation.

  12. Experimental results of Set 2 • The genes from the metabolic pathways of Homo sapiens cannot be clearly clustered.

  13. Experimental results of virus genome

  14. Experiments on identification of sequences • 70% of the sequence fragments are used for learning, and the remainder are used for testing. • The experiments are conducted using the conventional SOM and the Supervised Pareto learning SOM, proposed by Dozono, which combines the integration of multi-modal vectors, visualization, and supervised learning.
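
A minimal sketch of a 70/30 split of the fragments. The slides do not state how the fragments were assigned to each part; a seeded random shuffle is assumed here, and `split_fragments` is a hypothetical name.

```python
import random


def split_fragments(fragments, train_ratio=0.7, seed=0):
    """Shuffle the fragments and split them into learning and test parts."""
    shuffled = list(fragments)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_fragments([f"frag{i}" for i in range(10)])
print(len(train), len(test))  # 7 3
```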

  15. Winner and updated units • Conventional SOM • Pareto learning SOM • Overlapped neighbors are updated more strongly. • This plays an important role in the integration of multi-modal vectors.
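
A minimal sketch of how a set of winner units could be chosen for a multi-modal input, in contrast to the single best matching unit of a conventional SOM. It assumes that each unit stores one reference vector per modality, that the objective for each modality is the Euclidean distance, and that the winners are the units not dominated by any other unit. The name `pareto_winners` and the data layout are hypothetical.

```python
from math import dist


def pareto_winners(x, units):
    """Indices of units that are Pareto optimal for the multi-modal input x.

    x:     list of modality vectors, e.g. [cc_vector, frequency_vector]
    units: list of units, each a list of reference vectors (one per modality)
    """
    # One distance per modality for every unit.
    d = [[dist(xk, mk) for xk, mk in zip(x, u)] for u in units]

    def dominates(a, b):
        # a dominates b if it is no worse in every modality and better in one.
        return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))

    return [i for i, di in enumerate(d)
            if not any(dominates(dj, di) for j, dj in enumerate(d) if j != i)]


x = [[0.0, 0.0], [1.0]]
units = [
    [[0.0, 0.0], [0.0]],  # closest in modality 1
    [[1.0, 0.0], [1.0]],  # closest in modality 2
    [[1.0, 1.0], [0.0]],  # worse than unit 0 in modality 1, equal in modality 2
]
print(pareto_winners(x, units))  # [0, 1]
```

When several winners exist, their neighborhoods can overlap on the map, which is where the stronger update mentioned on the slide would apply.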

  16. Supervised Pareto learning SOM (SP-SOM) • In P-SOM, the category vector can be attached to each input vector as an independent vector. • The category vector, acting cooperatively with the other input vectors, draws input vectors of the same category close together on the map. • The P-SOM learning algorithm thus becomes supervised. • The category of a test vector xt is determined as follows, • where P(xt) is the Pareto optimal set of units for xt
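
The decision rule itself is not included in this transcript. The sketch below shows one plausible rule, stated purely as an assumption: sum the learned category vectors of the units in the Pareto optimal set P(xt) and pick the category with the largest total. The name `classify` and the data layout are hypothetical and may not match the rule used by SP-SOM.

```python
def classify(pareto_set, category_vectors):
    """Pick the category with the largest summed weight over the Pareto set.

    pareto_set:       indices of the Pareto optimal units for the test vector
    category_vectors: per-unit category vectors learned during training
    """
    n_cat = len(category_vectors[0])
    # Assumption: every unit in the Pareto set contributes equally.
    score = [sum(category_vectors[u][c] for u in pareto_set) for c in range(n_cat)]
    return max(range(n_cat), key=score.__getitem__)


category_vectors = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
print(classify([0, 2], category_vectors))  # 0
```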

  17. Mapping results using Supervised Pareto-learning SOM

  18. Experimental results of identification

  19. Conclusions (1) • We proposed a preprocessing method for DNA sequences that uses correlation coefficients of the occurrence of the nucleotides. • With this method, the clustering results were nearly comparable with those obtained using the frequencies of the N-tuples, despite the difference in the length of the input vectors.

  20. Conclusions (2) • The Pareto learning SOM method is applied to the classification of DNA sequences using correlation coefficients and frequencies as input vectors. • Pareto learning SOM using CC as the input vector shows good classification performance compared with conventional SOMs and with frequencies as the input vector.

  21. Future work • Application of this method to additional types of sequence data, such as coding and non-coding regions, and to large data sets such as whole genomes. • Reduction of the computational cost of P-SOMs, which is 5 times that of conventional SOMs.

  22. Acknowledgements • This work was supported by JSPS KAKENHI Grant Number 24500279.
