Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang

Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang CISC 841 Bioinformatics Nehar

Background: miRNAs • Single-stranded RNAs, ~20-25 nucleotides long, that play a regulatory role in gene expression. • Transcribed as a long primary miRNA (pri-miRNA) with a hairpin structure. • The pri-miRNA is processed by the nuclear RNase III Drosha into a ~60-70 nt pre-miRNA. • The pre-miRNA is actively transported from the nucleus to the cytoplasm by Exportin-5. • It is then cleaved into the ~20-25 nt mature miRNA.

Background: The ‘hairpin loop’ • A sequence of nucleotides in which two segments can base-pair with each other, while the segment between them cannot.

Background: The ‘hairpin loop’ The sequence ---CCTGCXXXXXXXGCAGG--- forms the hairpin structure:

    ---C G---
       C G
       T A
       G C
       C G
      X   X
      X   X
       X X
        X
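The pairing in this example can be checked mechanically. The following is a minimal sketch, not anything from the paper: the function name `stem_pairs` and the `COMPLEMENT` table are hypothetical, and it only handles strict Watson-Crick pairs in the DNA alphabet used on the slide (no G-U wobble pairs).

```python
# Illustrative sketch: verify that the two arms of the slide's example pair up.
# COMPLEMENT and stem_pairs are hypothetical names, not taken from the paper.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def stem_pairs(five_prime_arm, three_prime_arm):
    """Pair the 5' arm with the 3' arm read backwards.

    Returns the list of base pairs, or None if any position does not pair.
    """
    if len(five_prime_arm) != len(three_prime_arm):
        raise ValueError("arms must have equal length")
    pairs = list(zip(five_prime_arm, reversed(three_prime_arm)))
    if any(COMPLEMENT.get(a) != b for a, b in pairs):
        return None
    return pairs

sequence = "CCTGC" + "X" * 7 + "GCAGG"   # the example sequence from the slide
pairs = stem_pairs(sequence[:5], sequence[-5:])
print(pairs)  # [('C', 'G'), ('C', 'G'), ('T', 'A'), ('G', 'C'), ('C', 'G')]
```

The seven X positions between the arms have no partner, so they form the loop.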

Background: The ‘hairpin loop’ • The pre-miRNA hairpin is an important secondary structure for identifying miRNAs. • Because mature miRNAs are very short (~20 nt), sequence alignment is not very useful for identifying them. • The solution is to make use of the hairpin structure of the pre-miRNA.

The problem • Many sequence segments fold into similar stem-loop hairpin structures. • Existing methods for identifying miRNAs must therefore use comparative genomics information in addition to structural features, for example by filtering out hairpins that are not conserved in related species. • This implies an inability to identify miRNAs without close known homologues. • Furthermore, comparative genomics approaches cannot be applied to species that have no closely related sequenced species.

Proposed solution • Ab initio (from first principles) classification of real pre-miRNAs versus "pseudo" pre-miRNAs, i.e. non-pre-miRNA sequences that have a hairpin structure. • Derive a set of novel features that combine local structure and sequence information of pre-miRNA stem-loops. • Use an SVM to classify hairpins as pre-miRNA or pseudo pre-miRNA.

The datasets • Sets of human pre-miRNA and pseudo-miRNA hairpins were collected to train the SVM and evaluate its performance. • Human pre-miRNAs were downloaded from the miRNA registry database. Only pre-miRNAs without multiple loops were considered (193, or ~93% of the database). • Pseudo and candidate miRNA hairpins: segments with a stem-loop structure similar to pre-miRNAs that are not pre-miRNAs. • These form the CODING dataset and the CONSERVED-HAIRPIN dataset.

The Coding dataset • Collected from protein-coding regions. • Used as negative samples for training and validating the classifier. • Length distribution kept identical to that of the pre-miRNAs. • Selection criteria: • A minimum of 18 base pairings on the stem of the hairpin. • A maximum free energy of -15 kcal/mol for the secondary structure. (These numbers correspond to the limits for genuine human pre-miRNAs.) • 8,494 pre-miRNA-like hairpins in this dataset.
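The two selection criteria amount to a simple threshold filter. Below is a rough sketch under stated assumptions: the secondary structure is given in dot-bracket notation and the free energy has already been computed by some external folding program; the names `passes_selection` and `count_base_pairs` are hypothetical, and the example hairpin and its energy value are made up for illustration.

```python
# Illustrative filter for pre-miRNA-like hairpins (names are hypothetical).
MIN_STEM_PAIRS = 18        # minimum number of base pairings on the stem
MAX_FREE_ENERGY = -15.0    # kcal/mol; the hairpin must be at or below this

def count_base_pairs(dot_bracket):
    """Each '(' in dot-bracket notation opens exactly one base pair."""
    return dot_bracket.count("(")

def passes_selection(dot_bracket, free_energy):
    """True if the hairpin meets both criteria listed on the slide."""
    return (count_base_pairs(dot_bracket) >= MIN_STEM_PAIRS
            and free_energy <= MAX_FREE_ENERGY)

# Made-up example: 20 stacked pairs around a 6-nt loop, assumed energy -22.4.
candidate = "(" * 20 + "." * 6 + ")" * 20
print(passes_selection(candidate, -22.4))  # True
print(passes_selection(candidate, -10.0))  # False: not stable enough
```

Note that "maximum of -15 kcal/mol" means more negative values pass, since lower free energy corresponds to a more stable structure.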

The Conserved-hairpin dataset • Extracted from positions 56,000,001-57,000,000 of human chromosome 19 (UCSC database). • Used as a candidate dataset to evaluate the classifier. • 2,444 hairpins from sequences conserved between human and mouse. • Most of these hairpins are likely to be pseudo-miRNAs; only 3 known miRNAs fall in this dataset.

Training and Test sets • For the classification experiments, one training set and two test sets were built from the 3 datasets. • TR-C: training set. • 163 human pre-miRNAs (positive samples) drawn from the 193 human pre-miRNAs. • 168 pseudo pre-miRNAs (negative samples) from the Coding dataset. • TE-C: test set 1. • The remaining 30 human pre-miRNAs and 1000 pseudo pre-miRNAs (avoiding those in TR-C). • Conserved-hairpin dataset: test set 2.

Two further test sets • The SVM trained on the previous sets is applied to two further test sets. • Cross-Species test set • 581 pre-miRNAs from 11 species. • Updated test set • A newly reported batch of human miRNAs. • Includes 39 non-redundant pre-miRNAs without multiple loops.

Local contiguous structure-sequence features • Local sequence features are important in pre-miRNAs. • The authors claim that the distribution of local sub-structures (i.e. continuously paired or unpaired structures) in pre-miRNAs is significantly distinct from that in pseudo pre-miRNAs. • Combine local structure with sequence information to classify real vs. pseudo miRNA hairpins. • Focus on the information in 3 adjacent nucleotides (triplet elements). • “(” marks a paired nucleotide on the 5’ side and “)” one on the 3’ side; “.” means unpaired. The paper does not distinguish “(” from “)” and uses “(” for both.

Structure-sequence features • 8 possible structure compositions for each triplet: “(((”, “((.”, “(..”, and so on. • 32 structure-sequence combinations (4 nucleotides U, C, G, A × 8 structures) once the middle nt is taken into account.
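The count of 8 structures and 32 combinations can be enumerated directly. This is a small sketch; the ordering of the features and the names `STRUCTURES` and `FEATURES` are arbitrary choices made here, not something specified on the slides.

```python
from itertools import product

NUCLEOTIDES = "ACGU"

# Every length-3 string over the two states "(" (paired) and "." (unpaired).
STRUCTURES = ["".join(p) for p in product("(.", repeat=3)]

# Middle nucleotide followed by the structure triplet, e.g. "U(((".
FEATURES = [nt + s for nt in NUCLEOTIDES for s in STRUCTURES]

print(len(STRUCTURES), len(FEATURES))  # 8 32
print(STRUCTURES)
```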

Structure-sequence features • E.g. “U(((” means the middle nt is U and all three nts are paired. • Count the occurrences of each triplet element to obtain a 32-dimensional feature vector (normalized).
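A possible implementation of the counting step is sketched below. Several details are assumptions, since the slides do not spell them out: the triplet is slid over every interior position of the hairpin, “)” is folded into “(”, and normalization divides each count by the total number of counted triplets. The function name `triplet_features` and the toy hairpin are hypothetical.

```python
from collections import Counter
from itertools import product

# Feature order is an arbitrary choice for this sketch.
FEATURES = [nt + "".join(s) for nt in "ACGU" for s in product("(.", repeat=3)]

def triplet_features(sequence, dot_bracket):
    """Return a 32-dimensional vector of normalized triplet-element counts."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    sequence = sequence.upper().replace("T", "U")
    structure = dot_bracket.replace(")", "(")   # no 5'/3' distinction
    counts = Counter()
    for i in range(1, len(sequence) - 1):
        # Middle nucleotide plus the pairing state of it and its two neighbours.
        key = sequence[i] + structure[i - 1:i + 2]
        if key in FEATURES:                     # skip ambiguous bases such as N
            counts[key] += 1
    total = sum(counts.values())
    return [counts[f] / total if total else 0.0 for f in FEATURES]

# Toy hairpin, made up for illustration: 3 pairs around a 3-nt loop.
vector = triplet_features("GGGAAACCC", "(((...)))")
print(len(vector))                         # 32
print(vector[FEATURES.index("G(((")])      # 1 of the 7 triplets
```

Normalizing makes vectors from hairpins of different lengths comparable, which matters because the resulting vector is the input to the SVM.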

SVM Classification • The SVM classifier is trained with TR-C and applied to the test sets. • On TE-C, 28/30 human pre-miRNAs and 881/1000 pseudo-miRNAs are correctly identified. • On the Conserved-hairpin set, 2174/2444 structures are classified as pseudo miRNAs.
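The counts on this slide translate into rates as follows. This is plain arithmetic on the numbers quoted above for the held-out test set (30 real and 1000 pseudo hairpins) and the Conserved-hairpin set; the overall accuracy figure is derived here and is not stated on the slide.

```python
def rate(correct, total):
    """Fraction of items classified correctly."""
    return correct / total

sensitivity = rate(28, 30)                  # real pre-miRNAs recognized
specificity = rate(881, 1000)               # pseudo hairpins rejected
accuracy = rate(28 + 881, 30 + 1000)        # derived, not quoted on the slide
conserved_rejected = rate(2174, 2444)       # Conserved-hairpin set

print(f"sensitivity:        {sensitivity:.1%}")         # 93.3%
print(f"specificity:        {specificity:.1%}")         # 88.1%
print(f"overall accuracy:   {accuracy:.1%}")            # 88.3%
print(f"conserved rejected: {conserved_rejected:.1%}")  # 89.0%
```

Because the test set holds far more pseudo hairpins than real ones, the overall accuracy is dominated by the specificity, so the two rates are better read separately.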

SVM Classification • The triplet elements reflect contiguous fine structures and sequence composition. For instance, “(((” corresponds to stacking of paired bases and “...” to bulge loops. • The success of the classifier shows that these features reflect intrinsic characteristics of pre-miRNAs. • “(((” appears at a higher frequency in pre-miRNAs, whereas “...” appears more often in pseudo miRNAs.

SVM Classification Average freq. of triplets in training dataset

SVM Classification • These observations can be linked to the stability of the secondary structure: stacking more continuously paired nts lowers the free energy, so pre-miRNAs are more stable.

SVM Classification • Sequence information • The frequency of a given triplet structure varies with the middle nt, both within real pre-miRNAs and between real and pseudo miRNAs.


SVM Classification across species • The classifier trained on human data was applied to other species (Cross-Species test set). • Good performance in identifying true pre-miRNAs: 90.9% overall accuracy on 581 known pre-miRNAs from 11 species.

SVM Classification across species

Conclusion • Ab initio methods for distinguishing true pre-miRNAs from pre-miRNA-like hairpin structures are very important. • The triplet-SVM classifier captures fine-grained sequence-structure characteristics. • 90% accuracy on human data. • Up to 90% accuracy on 11 other species (including plants and viruses) without using comparative genomics information. • The current specificity of about 89% is not enough for genome-wide applications.
