1 / 14

Some Ideas on Final Project

Some Ideas on Final Project. Feature extraction. TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n CCGATACCCCCCACGCGAAACCCTACACATCAAAT p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n

chet
Télécharger la présentation

Some Ideas on Final Project

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Some Ideas on Final Project

  2. Feature extraction TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n CCGATACCCCCCACGCGAAACCCTACACATCAAAT p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n

  3. K-mer based features • Exact k-mers: k = 1 to 10 • Larger k • More accurate representation: accurate classification • More features: overfitting => lower accuracy • Less efficient (both time and memory) • Feature selection may be useful • Smaller k • Less features: less likely to overfit, shorter runing time • Less accurate representation: inaccurate classification • More efficient (both time and memory) • K-mers with degenerate chars (wild cards) • K-mers with mismatches • Two-block kmers: e.g., ACGN{1-3}CGT

  4. PWM-based features • Can represent all forms in the previous slides in some way • Key: how to get them? • Cluster top-ranked k-mers • Use existing tools (e.g., MEME or alignACE) • Find motifs from top positive sequences (for efficiency) • Use MAST (in MEME) or ScanACE (in alignACE) to find matches in all sequences. • HMMs for learning and classification

  5. Submission data format TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n 0.005 AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n 0.345 AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p 0.985 GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n 0.489 CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p 0.523

  6. Input format for weka @relation experiment @attribute seq string @attribute AAAA numeric @attribute AAAC numeric @attribute AAAG numeric … @attribute TTTT numeric @attribute class {n,p} @data TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC 0 1 1 0 0 …. p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT 1 0 0 1 1…. n

  7. Output format from weka With option –p 0 –distribution (you could try –p 1 to get the seq printed) inst# actual predicted error distribution 1 2:p 1:n + *1,0 2 2:p 1:n + *0.868,0.132 3 1:n 1:n *1,0 … 12 1:n 1:n *0.996,0.004

  8. More about “–p 1” • With the seq as the first attribute, you’ll get an UnsupportedAttributeTypeException: Cannot handle string attributes • Solution – use an unsupervised attribute filter to remove it: java -cp weka.3.6.6.jar weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.RemoveType -T string" -W weka.classifiers.trees.J48 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 1 -distribution -- -C 0.1 • This has the same effect of calling java -cp weka.3.6.6.jar weka.classifiers.trees.J48 -C 0.1-t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 0 -distribution (but without having the seq as the first attribute in the .arff file. And of course you won’t have the seq in your output file.)

  9. About MEME • Input seqs in FASTA format: > seq1 ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC > seq2 CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA > seq3 CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA > seq4 CTTGGTTGCGTCATGTCCTGGGATGCATTTACTAT > seq5 GAGATCTCCCTCTTATGTTGACATCCGTAAGCCCA > seq6 GATCGTGATGCATTGACTTGCGTAATAGTGTTATG > seq7 GCTGCAGAGATGGGTCATTATGTAATGTTGCCCCC

  10. Running MEME <ROOT>/meme/bin/meme tf_1_top_100p.fasta -dna -minw 6 -maxw 15 -minsites 10 -mod zoops -nmotifs 5 -text > meme.out.txt tf_1_top_100p.fasta: my input was the top 100 positive seqs for TF_1 (you should try something differently: e.g. randomly split the positive seqs into 10 subsets) -dna: input is dna seq -minw 6: minimum motif length = 6 -maxw 6: maximum motif length = 15 -minsites: motif occurs in at least 10 sequences -mod zoops: each seq contains zero or one copy of the motif -nmotifs 5: return top 5 motifs -text: output is in plain text format (as opposed to HTML format)

  11. Output from MEME ******************************************************************************** MOTIF 1 width = 10 sites = 99 llr = 717 E-value = 3.8e-141 ******************************************************************************** -------------------------------------------------------------------------------- Motif 1 Description -------------------------------------------------------------------------------- Simplified A 5::2:12991 pos.-specific C 1:::6:51:3 probability G 4:16:9:::1 matrix T :a924:3::5 bits 2.2 2.0 1.7 * * 1.5 * * ** Information 1.3 * * ** content 1.1 ** ** ** (10.3 bits) 0.9 ** ** ** 0.7 ***** ** 0.4 ********* 0.2 ********** 0.0 ---------- Multilevel ATTGCGCAAT consensus G AT T C sequence A

  12. Running MAST <ROOT>/meme/bin/mast meme.out.txt -d TF_1_allseq.fasta -text –norc –m <k> -ev 100 -mt 0.01 –b [-stdout > mast_out.txt] • -norc: do not search motif from the reverse-complementary strand • -m <k>: find matches to the k-th motif present in meme.out.txt (vary k from 1 to max number of motifs) • Easier to use a scripting language to loop through k and parse the output • -ev 100: use an E-value cutoff = 100. this allows sequences with poor matches to be output. (default is -ev 10) • -mt 0.01: similar to –ev 100. (default is –mt 0.0001) • -b: simplify output • -stdout: redirect the output into a given file instead of the default file (mast.<motiffilename>.<seqfilename>.<paraValues>)

  13. Output from MAST • Only need Section 1 in output SEQUENCE NAME DESCRIPTION E-VALUE LENGTH ------------- ----------- -------- ------ seq22 0.11 35 seq13 0.11 35 seq72 0.3 35 seq19 0.3 35 seq29 0.38 35 … seq446 96 35 seq298 96 35 seq264 96 35 seq251 96 35 • Larger e-values correspond to lower scores (poorer matches). • Seq not in the output has lowest score. • Therefore, need to transform e-values to scores. • E.g.: if seq not in output: score = 0; else score = max_evalue – evalue (max_evalue = 100 was specified by the –ev parameter.) • Or score = 1 / evalue.

  14. AlignACE and ScanACE • AlignACE: AlignACE –i tf_1_top_100p.fasta > tf_1_top_100p.ace • ScanACE: ScanACE -i tf_1_top_100p.ace -z tf_1_data_1.fasta -c 1 -o <result> -c: allows more motif matches to be output Results saved in: <result>_n1.scn, <result>_n2.scn, etc. Higher scores mean better match. No transformation needed.

More Related