Automatic Transformation of Raw Clinical Data into Clean Data Using Decision Tree Learning
Automatic Transformation of Raw Clinical Data into Clean Data Using Decision Tree Learning • Jian Zhang • Supervised by: Karen Petrie
Background • Cancer research has become an extremely data-rich environment. • Many analysis packages are available for analyzing the data. • The raw data must be preprocessed before it can be analyzed.
Rich data environment • A number of factors are recorded for breast cancer.
Raw clinical data sample
• Yes-No data:
  yes: yes, Yes, Ye, yed, yef …
  no: No, n, not …
  null: don’t know, no data, waiting for lab
• Positive-Negative data:
  positive: +, ++, p, p++ …
  negative: -, n, neg, n--- …
  null: no data, ruined sample, waiting for lab
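To make the cleaning task concrete, the Yes-No variants listed above can be written as a hand-built lookup table. This is only a sketch of the manual baseline, not the method of the presentation; the names `YES_NO_TABLE` and `clean_yes_no` are hypothetical.

```python
# Hypothetical lookup table built from the Yes-No variants listed above.
# Keys are raw spellings as typed; values are the clean category.
YES_NO_TABLE = {
    "yes": "yes", "Yes": "yes", "Ye": "yes", "yed": "yes", "yef": "yes",
    "No": "no", "n": "no", "not": "no",
    "don't know": "null", "no data": "null", "waiting for lab": "null",
}


def clean_yes_no(raw):
    """Return the clean category for a raw entry, or None if it is unlisted."""
    return YES_NO_TABLE.get(raw.strip())


print(clean_yes_no("yed"))   # listed variant, maps to "yes"
print(clean_yes_no("yess"))  # unlisted variant, returns None
```

A table like this has to be extended by hand for every new misspelling, which is the effort an automated approach would try to remove.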
Question • Could we automate this preprocessing step?
Introduction • Decision Tree learning • Weka
Decision Tree Learning • Decision tree learning is a method for approximating discrete-valued functions, in which the learned function is represented as a decision tree. It is one of the most popular inductive learning methods.
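A minimal sketch of how such a tree can be learned, using an ID3-style split on information gain. The presentation does not say which algorithm or which attributes were used, so the two features in `features` (first character and a short-length flag) and the toy training pairs are assumptions made purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)


def build_tree(rows, labels, attributes):
    """Return a label (leaf) or a dict node with one branch per attribute value."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in set(row[attr] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[attr] == value]
        branches[value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining
        )
    return {"attribute": attr, "branches": branches, "default": majority}


def classify(tree, row):
    """Walk the tree; fall back to the node's majority label on an unseen value."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree


def features(raw):
    """Hypothetical features: lower-cased first character and a short-length flag."""
    return {"first": raw[0].lower(), "short": len(raw) <= 3}


# Toy training pairs taken from the Yes-No variants in the data sample.
training = [
    ("yes", "yes"), ("Ye", "yes"), ("yed", "yes"),
    ("No", "no"), ("n", "no"), ("not", "no"),
    ("no data", "null"), ("waiting for lab", "null"), ("don't know", "null"),
]
rows = [features(raw) for raw, _ in training]
labels = [label for _, label in training]
tree = build_tree(rows, labels, ["first", "short"])

print(classify(tree, features("yef")))  # an unseen spelling that starts with "y"
```

With these assumed features, an unseen spelling is classified by whichever branch its first character and length fall into, and a value whose first character never occurred in training falls back to the majority label.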
Weka • Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, which contains a collection of algorithms for data analysis and predictive modeling.
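Weka reads datasets in its ARFF text format, so raw entries have to be written out as attributes plus a class before a tree learner can be run on them. The sketch below assumes two hypothetical derived attributes, `first_char` and `length`, and a class attribute named `clean`; the presentation does not state which attributes were actually used.

```python
def to_arff(relation, rows):
    """Render (first_char, length, label) rows as ARFF text.

    Assumes first_char values contain no quote or backslash characters.
    """
    first_chars = sorted({first for first, _, _ in rows})
    labels = sorted({label for _, _, label in rows})

    def quote(text):
        # Single-quote nominal values so symbols such as "+" are read literally.
        return "'" + text + "'"

    lines = [
        "@relation " + relation,
        "@attribute first_char {" + ",".join(quote(c) for c in first_chars) + "}",
        "@attribute length numeric",
        "@attribute clean {" + ",".join(labels) + "}",
        "@data",
    ]
    for first, length, label in rows:
        lines.append(f"{quote(first)},{length},{label}")
    return "\n".join(lines) + "\n"


# Toy rows derived from the Yes-No variants: (first character, length, clean label).
sample = [("y", 3, "yes"), ("y", 2, "yes"), ("n", 2, "no"), ("w", 15, "null")]
print(to_arff("yes_no", sample))
```

The resulting text can be saved with an `.arff` extension and opened in Weka, where a decision tree learner such as J48 (Weka's C4.5 implementation) could be applied; the presentation does not name the classifier it used.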
Experiment • Data: a training dataset with 100 instances; a test dataset with 100 instances, containing 17 values that do not appear in the training dataset • Tool: Weka
Experiment • Experiment 1: training dataset • Experiment 2: training dataset and test dataset
Result • The decision tree classifies and predicts existing entries well, but its predictions for unknown entries are not as good as expected.
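One way to make the contrast between existing and unknown entries measurable is to report accuracy separately for test values that did and did not occur in training. The sketch below uses a hypothetical helper, `accuracy_by_novelty`, and a stand-in lookup classifier on toy data; the numbers it prints are not the results of the presentation.

```python
def accuracy_by_novelty(classify, train_values, test_pairs):
    """Return (accuracy on seen values, accuracy on unseen values).

    An accuracy is None when the test set has no entries of that kind.
    """
    seen = set(train_values)
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for raw, expected in test_pairs:
        known = raw in seen
        totals[known] += 1
        hits[known] += classify(raw) == expected

    def ratio(kind):
        return hits[kind] / totals[kind] if totals[kind] else None

    return ratio(True), ratio(False)


# Stand-in classifier (an assumption): exact lookup, "null" for anything unseen.
train = {"yes": "yes", "Ye": "yes", "No": "no", "n": "no"}


def lookup_classifier(raw):
    return train.get(raw, "null")


# Toy test set: two values seen in training, two unseen.
test = [("yes", "yes"), ("n", "no"), ("yef", "yes"), ("not", "no")]
seen_acc, unseen_acc = accuracy_by_novelty(lookup_classifier, train, test)
print(seen_acc, unseen_acc)  # 1.0 0.0
```

Any classifier that maps a raw string to a clean label can be passed in place of `lookup_classifier`, including a learned decision tree.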
Future work • Find and correct incorrect predictions during the process • Automate the transformation of unknown entries