140 likes | 261 Vues
This project focuses on the extraction of features from DNA sequences using K-mer analysis, emphasizing the balance between feature size and classification accuracy. It examines various K-mer lengths, the impact of degeneracy and mismatches, and presents methods for selecting features to avoid overfitting in machine learning models. Using tools like MEME and MAST, the project demonstrates how to identify significant motifs leading to a more accurate DNA classification while maintaining computational efficiency. By analyzing sequences formatted for Weka, it provides insights into effective feature representation and classification strategies.
E N D
Feature extraction TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n CCGATACCCCCCACGCGAAACCCTACACATCAAAT p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n
K-mer based features • Exact k-mers: k = 1 to 10 • Larger k • More accurate representation: accurate classification • More features: overfitting => lower accuracy • Less efficient (both time and memory) • Feature selection may be useful • Smaller k • Less features: less likely to overfit, shorter runing time • Less accurate representation: inaccurate classification • More efficient (both time and memory) • K-mers with degenerate chars (wild cards) • K-mers with mismatches • Two-block kmers: e.g., ACGN{1-3}CGT
PWM-based features • Can represent all forms in the previous slides in some way • Key: how to get them? • Cluster top-ranked k-mers • Use existing tools (e.g., MEME or alignACE) • Find motifs from top positive sequences (for efficiency) • Use MAST (in MEME) or ScanACE (in alignACE) to find matches in all sequences. • HMMs for learning and classification
Submission data format TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n 0.005 AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n 0.345 AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p 0.985 GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n 0.489 CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p 0.523
Input format for weka @relation experiment @attribute seq string @attribute AAAA numeric @attribute AAAC numeric @attribute AAAG numeric … @attribute TTTT numeric @attribute class {n,p} @data TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC 0 1 1 0 0 …. p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT 1 0 0 1 1…. n
Output format from weka With option –p 0 –distribution (you could try –p 1 to get the seq printed) inst# actual predicted error distribution 1 2:p 1:n + *1,0 2 2:p 1:n + *0.868,0.132 3 1:n 1:n *1,0 … 12 1:n 1:n *0.996,0.004
More about “–p 1” • With the seq as the first attribute, you’ll get an UnsupportedAttributeTypeException: Cannot handle string attributes • Solution – use an unsupervised attribute filter to remove it: java -cp weka.3.6.6.jar weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.RemoveType -T string" -W weka.classifiers.trees.J48 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 1 -distribution -- -C 0.1 • This has the same effect of calling java -cp weka.3.6.6.jar weka.classifiers.trees.J48 -C 0.1-t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 0 -distribution (but without having the seq as the first attribute in the .arff file. And of course you won’t have the seq in your output file.)
About MEME • Input seqs in FASTA format: > seq1 ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC > seq2 CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA > seq3 CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA > seq4 CTTGGTTGCGTCATGTCCTGGGATGCATTTACTAT > seq5 GAGATCTCCCTCTTATGTTGACATCCGTAAGCCCA > seq6 GATCGTGATGCATTGACTTGCGTAATAGTGTTATG > seq7 GCTGCAGAGATGGGTCATTATGTAATGTTGCCCCC
Running MEME <ROOT>/meme/bin/meme tf_1_top_100p.fasta -dna -minw 6 -maxw 15 -minsites 10 -mod zoops -nmotifs 5 -text > meme.out.txt tf_1_top_100p.fasta: my input was the top 100 positive seqs for TF_1 (you should try something differently: e.g. randomly split the positive seqs into 10 subsets) -dna: input is dna seq -minw 6: minimum motif length = 6 -maxw 6: maximum motif length = 15 -minsites: motif occurs in at least 10 sequences -mod zoops: each seq contains zero or one copy of the motif -nmotifs 5: return top 5 motifs -text: output is in plain text format (as opposed to HTML format)
Output from MEME ******************************************************************************** MOTIF 1 width = 10 sites = 99 llr = 717 E-value = 3.8e-141 ******************************************************************************** -------------------------------------------------------------------------------- Motif 1 Description -------------------------------------------------------------------------------- Simplified A 5::2:12991 pos.-specific C 1:::6:51:3 probability G 4:16:9:::1 matrix T :a924:3::5 bits 2.2 2.0 1.7 * * 1.5 * * ** Information 1.3 * * ** content 1.1 ** ** ** (10.3 bits) 0.9 ** ** ** 0.7 ***** ** 0.4 ********* 0.2 ********** 0.0 ---------- Multilevel ATTGCGCAAT consensus G AT T C sequence A
Running MAST <ROOT>/meme/bin/mast meme.out.txt -d TF_1_allseq.fasta -text –norc –m <k> -ev 100 -mt 0.01 –b [-stdout > mast_out.txt] • -norc: do not search motif from the reverse-complementary strand • -m <k>: find matches to the k-th motif present in meme.out.txt (vary k from 1 to max number of motifs) • Easier to use a scripting language to loop through k and parse the output • -ev 100: use an E-value cutoff = 100. this allows sequences with poor matches to be output. (default is -ev 10) • -mt 0.01: similar to –ev 100. (default is –mt 0.0001) • -b: simplify output • -stdout: redirect the output into a given file instead of the default file (mast.<motiffilename>.<seqfilename>.<paraValues>)
Output from MAST • Only need Section 1 in output SEQUENCE NAME DESCRIPTION E-VALUE LENGTH ------------- ----------- -------- ------ seq22 0.11 35 seq13 0.11 35 seq72 0.3 35 seq19 0.3 35 seq29 0.38 35 … seq446 96 35 seq298 96 35 seq264 96 35 seq251 96 35 • Larger e-values correspond to lower scores (poorer matches). • Seq not in the output has lowest score. • Therefore, need to transform e-values to scores. • E.g.: if seq not in output: score = 0; else score = max_evalue – evalue (max_evalue = 100 was specified by the –ev parameter.) • Or score = 1 / evalue.
AlignACE and ScanACE • AlignACE: AlignACE –i tf_1_top_100p.fasta > tf_1_top_100p.ace • ScanACE: ScanACE -i tf_1_top_100p.ace -z tf_1_data_1.fasta -c 1 -o <result> -c: allows more motif matches to be output Results saved in: <result>_n1.scn, <result>_n2.scn, etc. Higher scores mean better match. No transformation needed.