MIS 696 Final Presentation, Fall 2008. Mary Burns, Katherine Carl, Jiesi Cheng, Soomi Cheong, Koren Elder, Li Fan, Chun-neng Huang, Brent Langhals, Matthew Pickard, Nathan Twyman, Shuo Zeng, Xinlei Zhao
What is MIS? As a discipline? As a field of research?
MIS: A Conventional Definition Management Information (Computer Science) Systems (Engineering)
The Quest: From the Seven Pillars to the Tree of Decision
• 1998: Seven Pillars
• 1999: A Simple Model and Key Researchers
• 2000: Additional Pillars
• 2001: Another 2D Model, A Timeline of Researchers
• 2002: Researchers, More of the Same
• 2003: A 3D Model, Timeline, EndNote Library
• 2004: A 2D Model, Research Institutions
• 2005: Another Model, Publication Trends
• 2006: Methodological Approach
• 2007: A Normative Approach, Decision Tree
• 2008: An IS approach to MIS?
The Brainstorm “Discovery consists of seeing what everybody has seen and thinking what nobody has thought.” –Albert Szent-Gyorgyi Nathan
The Ideagora • Journal Trends • Web of Science • Graphical Representation • Clustering • Validation of 2007 Decision Tree
The Realization We are a large and intelligent group of people, but can we deliver all of these analyses in a semester? We need a way to manage a large quantity of data.
Contribution: Database
Contribution: Database • Basic article info • Category
Contribution: Database • Web of Knowledge and Google Citations
Our Contribution: A Database • Article Dimensions • Rigor vs. Relevance • Theoretical vs. Applied • Innovation vs. Review • Behavioral vs. Technical
Purpose and Methodology
• Purpose
  • Classify MIS papers from a different perspective: the general attributes of the papers
  • Provide information that supports trend analysis and prediction for MIS research
• Methodology
  • Clustering: use the fuzzy k-means clustering algorithm
  • Validation: use the Partition Index (SC) to determine the best number of clusters
  • Cluster Evaluation: label the papers with cluster numbers
  • Analysis: analyze the clustering results
Attributes of Papers
• 8 attributes / 4 attribute pairs
  • Theoretical vs. Applied
  • Rigor vs. Relevance
  • Review vs. Innovation
  • Technical vs. Behavioral
• Scoring and Data Processing
  • Every attribute of a paper is given a score from 1 to 5
  • The score of one attribute counts as the reverse (negated) score of the other attribute in its pair (e.g. scoreTheoretical = 3 is equivalent to scoreApplied = -3)
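
The scoring rule can be sketched in a few lines of Python. This is only one reading of the slides: it assumes that the first attribute of each pair is the positive direction of its axis and that a paper's coordinate on that axis is the average of the first score and the negated second score. The names `PAIRS` and `paper_coordinates` and the example scores are hypothetical.

```python
# Attribute pairs from the slide. Treating the first name as the positive
# direction of each axis is an assumption made for this sketch.
PAIRS = [
    ("theoretical", "applied"),
    ("rigor", "relevance"),
    ("review", "innovation"),
    ("technical", "behavioral"),
]


def paper_coordinates(scores):
    """Map eight 1-5 attribute scores to one 4-dimensional point."""
    coords = []
    for positive, negative in PAIRS:
        # The second attribute counts as the negated score of the first,
        # so the pair average is (first - second) / 2.
        coords.append((scores[positive] - scores[negative]) / 2)
    return coords


# Made-up scores for a single paper, for illustration only.
example = {
    "theoretical": 4, "applied": 2,
    "rigor": 5, "relevance": 1,
    "review": 1, "innovation": 5,
    "technical": 3, "behavioral": 3,
}
print(paper_coordinates(example))  # [1.0, 2.0, -2.0, 0.0]
```

Under this reading each coordinate lies between -2 and 2, and a value of 0 means the paper is balanced between the two attributes of the pair.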
Fuzzy k-Means Clustering
• The average of the scores in each attribute pair gives one coordinate of the paper in the 4-dimensional MIS-Paper Space
• The coordinates of all papers are the raw data for the clustering procedure
• Because the best number of clusters is not known in advance, the clustering procedure is run repeatedly, with the number of clusters preset to each value from 3 to 15
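
For readers unfamiliar with fuzzy k-means (also called fuzzy c-means), a minimal textbook-style sketch follows. It is not the implementation used for the presentation: the fuzzifier `m = 2`, the stopping tolerance, the initialization, and the four toy points are all assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6,
                  init_centers=None, seed=0):
    """Return (centers, memberships); memberships[j][i] is the degree to
    which point j belongs to cluster i, and each row sums to 1."""
    dim = len(points[0])
    if init_centers is None:
        # Random data points as starting centers (one common choice).
        init_centers = random.Random(seed).sample(points, n_clusters)
    centers = [list(c) for c in init_centers]
    memberships = []
    for _ in range(n_iter):
        # Membership update: closer centers receive a larger share.
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Center update: mean of all points weighted by membership ** m.
        new_centers = []
        for i in range(n_clusters):
            weights = [u[i] ** m for u in memberships]
            wsum = sum(weights)
            new_centers.append([
                sum(w * p[k] for w, p in zip(weights, points)) / wsum
                for k in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Toy 4-dimensional points forming two obvious groups (made up).
points = [
    [0.0, 0.1, 0.0, 0.0], [0.2, 0.0, 0.1, 0.0],
    [2.0, 2.1, 1.9, 2.0], [2.1, 2.0, 2.0, 1.9],
]
centers, memberships = fuzzy_c_means(
    points, n_clusters=2, init_centers=[points[0], points[2]])
for row in memberships:
    print([round(u, 3) for u in row])
```

To mirror the procedure on the slide, the function would be called once for each cluster count from 3 to 15 and the resulting partitions compared with a validation index.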
Validation
• Goal of clustering
  • Group papers that are as similar as possible
  • Separate different groups as far from each other as possible
• Choice of validation index
  • Partition Index (SC): the sum, over all clusters, of the ratio of each cluster's compactness to its separation from the other clusters
  • The lower the index, the better the partition
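
A sketch of the Partition Index in its commonly cited form is given below. It assumes the usual definition: for each cluster, the membership-weighted squared distances of the points to the center (compactness) are divided by the cluster's fuzzy size times the summed squared distances to the other centers (separation), and the per-cluster ratios are added up. The fuzzifier `m = 2` and the toy data are assumptions.

```python
import math


def partition_index(points, centers, memberships, m=2.0):
    """Partition Index (SC); lower values indicate compact, well-separated
    clusters. memberships[j][i] is the degree of point j in cluster i."""
    total = 0.0
    for i, center in enumerate(centers):
        # Compactness: membership-weighted squared distances to the center.
        compactness = sum(
            (u[i] ** m) * math.dist(p, center) ** 2
            for p, u in zip(points, memberships)
        )
        # Fuzzy cardinality of cluster i (sum of its memberships).
        fuzzy_size = sum(u[i] for u in memberships)
        # Separation: squared distances from this center to all centers
        # (the distance to itself contributes 0).
        separation = sum(math.dist(other, center) ** 2 for other in centers)
        total += compactness / (fuzzy_size * separation)
    return total


# One-dimensional toy data with a crisp two-cluster partition (made up).
toy_points = [[0.0], [1.0], [10.0], [11.0]]
toy_centers = [[0.5], [10.5]]
toy_memberships = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sc = partition_index(toy_points, toy_centers, toy_memberships)
print(sc)
```

For the toy data each cluster contributes 0.5 / (2 × 100), so the index is about 0.005. The function needs at least two distinct centers, because the separation term is zero otherwise.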
Validation (Cont’d)
• Best number of clusters: 7
• Reasons
  • It is the “elbow” point: the index improves much less after 7 clusters than it does before 7
  • Although 12 clusters give the lowest index value, too many clusters (too few papers in each cluster) make it hard to generalize the characteristics of each cluster
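
The elbow choice was made by inspecting the index curve. One possible way to automate such a choice is sketched below; the rule (stop at the first cluster count after which the index improves by less than a fraction of the largest single-step improvement) and the threshold `min_gain_ratio` are assumptions, and the index values are made up for illustration, not the values from the study.

```python
def elbow_point(index_by_k, min_gain_ratio=0.1):
    """Return the smallest k after which adding one more cluster improves
    the index by less than min_gain_ratio times the largest single-step
    improvement. Lower index values are assumed to be better."""
    ks = sorted(index_by_k)
    # gains[i] is the improvement obtained by going from ks[i] to ks[i + 1].
    gains = [index_by_k[a] - index_by_k[b] for a, b in zip(ks, ks[1:])]
    largest = max(gains)
    for k, gain in zip(ks, gains):
        if gain < min_gain_ratio * largest:
            return k
    return ks[-1]


# Made-up index values: large improvements up to k = 5, small ones afterwards.
illustrative_index = {2: 0.90, 3: 0.60, 4: 0.40, 5: 0.25, 6: 0.24, 7: 0.23}
print(elbow_point(illustrative_index))  # 5
```

A rule like this deliberately ignores a slightly lower index at a much larger cluster count, which matches the reasoning on the slide for preferring the elbow over the global minimum.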
Cluster Evaluation
• Label each paper with the number of the cluster in which it has the largest membership value
• Report the center and the number of papers of every cluster
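
The labeling step amounts to taking, for each paper, the index of its largest membership value. A small sketch follows, with a made-up membership matrix and hypothetical names:

```python
from collections import Counter


def hard_labels(memberships):
    """For each paper, return the index of its largest membership value."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


# Made-up membership matrix: 4 papers, 3 clusters; each row sums to 1.
example_memberships = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
labels = hard_labels(example_memberships)
cluster_sizes = Counter(labels)
print(labels)         # [0, 1, 0, 2]
print(cluster_sizes)  # Counter({0: 2, 1: 1, 2: 1})
```

If two clusters tie for the largest membership, this sketch picks the one with the lower index.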
Possible Analysis Results
• By analyzing the distribution of papers across domains and clusters, we can generate
  • Authors’ research maps
  • Universities’ research maps
  • Journals’ preferences for paper types
• By analyzing the above results as a time series, we can generate
  • Trends and predictions for authors’ and universities’ research
  • Trends in journals’ preferences
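
Once every paper carries a cluster label, the distributions listed above reduce to counting. The sketch below uses made-up records; the field names (`journal`, `year`, `cluster`) and the journal names are hypothetical and are not the schema of the project database.

```python
from collections import Counter, defaultdict

# Made-up records; in practice these would come from the article database.
papers = [
    {"journal": "Journal A", "year": 2006, "cluster": 0},
    {"journal": "Journal A", "year": 2007, "cluster": 0},
    {"journal": "Journal A", "year": 2008, "cluster": 1},
    {"journal": "Journal B", "year": 2007, "cluster": 2},
    {"journal": "Journal B", "year": 2008, "cluster": 1},
]


def distribution(records, key):
    """Count papers per cluster for each distinct value of `key`."""
    table = defaultdict(Counter)
    for record in records:
        table[record[key]][record["cluster"]] += 1
    return table


by_journal = distribution(papers, "journal")  # journals' preferences
by_year = distribution(papers, "year")        # basis for a trend analysis
print(dict(by_journal))
print(dict(by_year))
```

Grouping by an author or university field instead of `journal` would give the corresponding research maps in the same way.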
Benefits
• Identify the latest research hotspots in every domain
• Follow changes in journals’ preferences
• Obtain up-to-date information about the changing roles of universities and professors in the MIS community
• Discover unexplored domains in the MIS area
Discussion & Future Work
• Two difficulties
  • Information from other perspectives is needed to explain the results reasonably
  • Attribute scores may be biased, which affects the performance of the clustering
• Future work
  • Select new attributes to evaluate papers
  • Examine the effect of score bias and design a better approach
  • Replace manual analysis with automatic processes, such as Text Mining and Social Network Analysis
Text Mining SQL 2005 Data Mining
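Data Mining Algorithms
• Algorithms: Association Rules, Sequence Clustering, Neural Network, Decision Trees, Naïve Bayes, Time Series, Clustering
• Tasks: Classification, Regression, Segmentation, Association Analysis, Anomaly Detection, Sequence Analysis, Time Series
• Legend: one √ marks the first-choice algorithm for a task, the other √ the second choice (the marks in the table itself are not recoverable)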
Naïve Bayes
• Based on Bayes’ theorem with the “naïve” assumption that attributes are conditionally independent given the class
• The fastest algorithm, with reasonable accuracy
• Best used for
  • Advanced data exploration (correlation, attribute discrimination, etc.)
  • Manual feature selection
  • Parallel correlation counting
• Parameters
  • MAXIMUM_INPUT_ATTRIBUTES
  • MAXIMUM_OUTPUT_ATTRIBUTES
  • MINIMUM_NODE_SCORE
  • MAXIMUM_STATES
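
To make the “naïve” assumption concrete, here is a minimal categorical Naïve Bayes classifier in plain Python. It is a generic textbook sketch with add-one (Laplace) smoothing, not the SQL Server 2005 implementation, and the training rows and labels are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> values
    values_seen = defaultdict(set)       # attribute index -> observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the class with the highest log posterior for `row`."""
    class_counts, value_counts, values_seen = model
    n = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n)  # log prior of the class
        for i, value in enumerate(row):
            # Naive assumption: attributes are independent given the class,
            # so their (smoothed) log likelihoods simply add up.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: (orientation, type) -> hypothetical category.
rows = [
    ("technical", "applied"), ("technical", "applied"),
    ("behavioral", "theoretical"), ("behavioral", "applied"),
]
labels = ["A", "A", "B", "B"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("technical", "applied")))       # A
print(predict(model, ("behavioral", "theoretical")))  # B
```

Training is a single counting pass over the data, which is why the algorithm is fast.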
Decision Trees
• Best accuracy for classification, regression, and association prediction in many cases
• Multiple internal algorithms
  • Bayesian with K2 prior, Uniform prior
  • Entropy-based
  • Bayesian Gaussian for regression trees
  • Complete/simple-binary splits
• Patent-pending technologies
  • Automatic feature selection
  • High cardinality attribute handling
  • Continuous attribute handling
• Parallel correlation counting
• Parameters
  • COMPLEXITY_PENALTY
  • MAXIMUM_INPUT_ATTRIBUTES
  • MAXIMUM_OUTPUT_ATTRIBUTES
  • MINIMUM_LEAF_CASES
  • FORCE_REGRESSORS
  • SCORE_METHOD
  • SPLIT_METHOD
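
As an illustration of an entropy-based split score, the sketch below computes the information gain of splitting a set of class labels by one categorical attribute. It shows the general idea only; how SQL Server scores splits internally is not described on the slide, and the example values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on `values`."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up example: the first attribute separates the classes perfectly,
# the second carries no information about them.
classes = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], classes))  # 1.0
print(information_gain(["a", "b", "a", "b"], classes))  # 0.0
```

A tree learner of this kind would evaluate every candidate attribute this way and split on the one with the highest gain.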
Clustering
• Segmentation, profiling
• Multiple internal algorithms
  • K-means
  • EM
• Automatic feature selection on input attributes, automatic high cardinality attribute handling
• Parameters
  • CLUSTER_COUNT
  • MAXIMUM_INPUT_ATTRIBUTES
  • CLUSTER_METHOD
  • MAXIMUM_STATES
  • MINIMUM_CLUSTER_CASES
  • MODELLING_CARDINALITY
  • STOPPING_TOLERANCE
Neural Network
• Classification, segmentation, association prediction
• Conjugate gradient method
• 0-1 hidden layer
• Early stopping criteria
• Automatic feature selection
• Parameters
  • MAXIMUM_INPUT_ATTRIBUTES
  • MAXIMUM_OUTPUT_ATTRIBUTES
  • MAXIMUM_STATES
  • HIDDEN_NODE_RATIO
  • HOLDOUT_PERCENTAGE