Feature Extraction Techniques for DNA Sequence Classification Using K-mer Analysis

Some Ideas on Final Project

Feature extraction TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n CCGATACCCCCCACGCGAAACCCTACACATCAAAT p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n

K-mer based features • Exact k-mers: k = 1 to 10 • Larger k • More accurate representation: accurate classification • More features: overfitting => lower accuracy • Less efficient (both time and memory) • Feature selection may be useful • Smaller k • Less features: less likely to overfit, shorter runing time • Less accurate representation: inaccurate classification • More efficient (both time and memory) • K-mers with degenerate chars (wild cards) • K-mers with mismatches • Two-block kmers: e.g., ACGN{1-3}CGT

PWM-based features • Can represent all forms in the previous slides in some way • Key: how to get them? • Cluster top-ranked k-mers • Use existing tools (e.g., MEME or alignACE) • Find motifs from top positive sequences (for efficiency) • Use MAST (in MEME) or ScanACE (in alignACE) to find matches in all sequences. • HMMs for learning and classification

Submission data format TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n 0.005 AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n 0.345 AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p 0.985 GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n 0.489 CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p 0.523

Input format for weka @relation experiment @attribute seq string @attribute AAAA numeric @attribute AAAC numeric @attribute AAAG numeric … @attribute TTTT numeric @attribute class {n,p} @data TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC 0 1 1 0 0 …. p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT 1 0 0 1 1…. n

Output format from weka With option –p 0 –distribution (you could try –p 1 to get the seq printed) inst# actual predicted error distribution 1 2:p 1:n + *1,0 2 2:p 1:n + *0.868,0.132 3 1:n 1:n *1,0 … 12 1:n 1:n *0.996,0.004

More about “–p 1” • With the seq as the first attribute, you’ll get an UnsupportedAttributeTypeException: Cannot handle string attributes • Solution – use an unsupervised attribute filter to remove it: java -cp weka.3.6.6.jar weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.RemoveType -T string" -W weka.classifiers.trees.J48 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 1 -distribution -- -C 0.1 • This has the same effect of calling java -cp weka.3.6.6.jar weka.classifiers.trees.J48 -C 0.1-t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 0 -distribution (but without having the seq as the first attribute in the .arff file. And of course you won’t have the seq in your output file.)

About MEME • Input seqs in FASTA format: > seq1 ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC > seq2 CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA > seq3 CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA > seq4 CTTGGTTGCGTCATGTCCTGGGATGCATTTACTAT > seq5 GAGATCTCCCTCTTATGTTGACATCCGTAAGCCCA > seq6 GATCGTGATGCATTGACTTGCGTAATAGTGTTATG > seq7 GCTGCAGAGATGGGTCATTATGTAATGTTGCCCCC

Running MEME <ROOT>/meme/bin/meme tf_1_top_100p.fasta -dna -minw 6 -maxw 15 -minsites 10 -mod zoops -nmotifs 5 -text > meme.out.txt tf_1_top_100p.fasta: my input was the top 100 positive seqs for TF_1 (you should try something differently: e.g. randomly split the positive seqs into 10 subsets) -dna: input is dna seq -minw 6: minimum motif length = 6 -maxw 6: maximum motif length = 15 -minsites: motif occurs in at least 10 sequences -mod zoops: each seq contains zero or one copy of the motif -nmotifs 5: return top 5 motifs -text: output is in plain text format (as opposed to HTML format)

Output from MEME ******************************************************************************** MOTIF 1 width = 10 sites = 99 llr = 717 E-value = 3.8e-141 ******************************************************************************** -------------------------------------------------------------------------------- Motif 1 Description -------------------------------------------------------------------------------- Simplified A 5::2:12991 pos.-specific C 1:::6:51:3 probability G 4:16:9:::1 matrix T :a924:3::5 bits 2.2 2.0 1.7 * * 1.5 * * ** Information 1.3 * * ** content 1.1 ** ** ** (10.3 bits) 0.9 ** ** ** 0.7 ***** ** 0.4 ********* 0.2 ********** 0.0 ---------- Multilevel ATTGCGCAAT consensus G AT T C sequence A

Running MAST <ROOT>/meme/bin/mast meme.out.txt -d TF_1_allseq.fasta -text –norc –m <k> -ev 100 -mt 0.01 –b [-stdout > mast_out.txt] • -norc: do not search motif from the reverse-complementary strand • -m <k>: find matches to the k-th motif present in meme.out.txt (vary k from 1 to max number of motifs) • Easier to use a scripting language to loop through k and parse the output • -ev 100: use an E-value cutoff = 100. this allows sequences with poor matches to be output. (default is -ev 10) • -mt 0.01: similar to –ev 100. (default is –mt 0.0001) • -b: simplify output • -stdout: redirect the output into a given file instead of the default file (mast.<motiffilename>.<seqfilename>.<paraValues>)

Output from MAST • Only need Section 1 in output SEQUENCE NAME DESCRIPTION E-VALUE LENGTH ------------- ----------- -------- ------ seq22 0.11 35 seq13 0.11 35 seq72 0.3 35 seq19 0.3 35 seq29 0.38 35 … seq446 96 35 seq298 96 35 seq264 96 35 seq251 96 35 • Larger e-values correspond to lower scores (poorer matches). • Seq not in the output has lowest score. • Therefore, need to transform e-values to scores. • E.g.: if seq not in output: score = 0; else score = max_evalue – evalue (max_evalue = 100 was specified by the –ev parameter.) • Or score = 1 / evalue.

AlignACE and ScanACE • AlignACE: AlignACE –i tf_1_top_100p.fasta > tf_1_top_100p.ace • ScanACE: ScanACE -i tf_1_top_100p.ace -z tf_1_data_1.fasta -c 1 -o <result> -c: allows more motif matches to be output Results saved in: <result>_n1.scn, <result>_n2.scn, etc. Higher scores mean better match. No transformation needed.

Feature Extraction Techniques for DNA Sequence Classification Using K-mer Analysis

Feature Extraction Techniques for DNA Sequence Classification Using K-mer Analysis

Presentation Transcript

Some ideas on campaigning

Ideas on Project Columbia Management

final project ideas and FSMs

Some ideas….

Some final thoughts…

Final Project Ideas

final project ideas and FSMs

Some ideas for the project

Some Final Thoughts

Some ideas on water

FINAL YEAR PROJECT ON

Some ideas on innovations in implementation science

Some ideas on Gram Panchayat e-Gov

some final thoughts on other care

Brainstorm some ideas

Some Final Thoughts…

Some Project Ideas…

Some Ideas on the DAMPE Software Framework

Some Project Ideas

Final Project Ideas, Requirements, and Deadlines

Quality – Some ideas

Some XCON ideas