Explore the importance and methods of classifying syllabi for educational purposes, using SVM and NB models for evaluation. Future work includes enhancing features and expanding the syllabus library to create an educational online community.
Automatic Syllabus Classification • JCDL – Vancouver – 22 June 2007 • Edward A. Fox (presenting co-author), Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Perez-Quinones, William Cameron, GuoFang Teng, and Lillian (“Boots”) Cassel
Why Study the Syllabus Genre? • Educational resource • Importance to the educational community • Educators • Students • Self-learners • Thanks to NSF DUE grant 5328255 (personalization support for NSDL)
Where to look for a specific syllabus? • Non-standard publishing mechanisms: • Instructor’s website • CMSs (courseware management systems, e.g., Sakai) • Catalogs • Limited access outside the university • Search on the Web • Many non-relevant links in search results
Syllabus Library • Bootstrapping • Identify true syllabi from search results • Store in a repository • Develop tools & applications • Scaling up • Encourage contributions from educational communities
An Essential Step towards Syllabus Library: Classification • Classification Objects: • Potential syllabi in Computer Science: search on the Web, using syllabus keywords, only in the educational domains • Class Definition • Feature Selection • Model Selection • Training and Testing
Four Classes [figure not preserved; the only class label that survives from it is "Noise"]
Syllabus Components • course code • title • class time & location • offering institution • teaching staff • course description • objectives • web site • prerequisite • textbook • grading policy • schedule • assignment • exam • resources
Features • 84 Genre-specific Features • the occurrences of keywords • the positions of keywords, and • the co-occurrences of keywords and links • A series of keywords for each syllabus component
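The three feature families on this slide can be illustrated with a minimal sketch. The keyword lists below are hypothetical (the slides do not enumerate the actual keywords), and treating "co-occurrence of keywords and links" as "keyword appears inside anchor text" is an assumption, not the authors' stated definition.

```python
import re

# Hypothetical keyword lists for three syllabus components (not from the slides).
COMPONENT_KEYWORDS = {
    "prerequisite": ["prerequisite"],
    "textbook": ["textbook", "required text"],
    "grading": ["grading", "grade"],
}

# Captures the anchor text of each HTML link.
LINK_RE = re.compile(r"<a\s[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def extract_features(html: str) -> dict:
    """Return occurrence, position, and keyword-in-link features for one document."""
    text = html.lower()
    link_text = " ".join(m.lower() for m in LINK_RE.findall(html))
    features = {}
    for component, keywords in COMPONENT_KEYWORDS.items():
        # Occurrence: how often any keyword of this component appears.
        count = sum(text.count(kw) for kw in keywords)
        # Position: relative offset of the first hit in [0, 1); -1.0 when absent.
        positions = [text.find(kw) for kw in keywords if kw in text]
        first = min(positions) / max(len(text), 1) if positions else -1.0
        # Co-occurrence with links (assumed meaning): keyword inside anchor text.
        in_link = sum(link_text.count(kw) for kw in keywords)
        features[f"{component}_count"] = count
        features[f"{component}_first_pos"] = first
        features[f"{component}_in_link"] = in_link
    return features
```

With one keyword series per syllabus component, a scheme like this yields a fixed-length numeric vector per document, which is the form both SVM and NB classifiers expect.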
Classification Models • Discriminative Models • Support Vector Machines (SVM) • SMO-L: Sequential Minimal Optimization, accelerating the training process of SVM • SMO-P: SMO with a polynomial kernel • Generative Models • Naïve Bayes (NB) • NB-K: Applying kernel methods to estimate the distribution of numeric attributes in NB modeling
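The difference between NB and NB-K lies in how the class-conditional distribution of a numeric attribute is estimated. The sketch below contrasts a single fitted Gaussian with a Gaussian kernel density estimate; the function names and the fixed `bandwidth` parameter are illustrative assumptions, not the configuration used in the study.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def single_gaussian_density(x: float, samples: list) -> float:
    """Plain NB: fit one normal distribution to a class's attribute values."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    std = max(math.sqrt(var), 1e-6)  # guard against zero variance
    return gaussian_pdf(x, mean, std)


def kernel_density(x: float, samples: list, bandwidth: float) -> float:
    """Kernel variant: average one narrow Gaussian centred on each training value."""
    return sum(gaussian_pdf(x, s, bandwidth) for s in samples) / len(samples)
```

For a bimodal attribute such as `[0.0, 0.5, 1.0, 9.5, 10.0, 10.5]`, the single Gaussian places its peak near the empty middle of the range, while the kernel estimate concentrates density around the two observed clusters.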
Evaluation • Training corpus: 1020 out of the 8000+ potential syllabi • All in HTML, PDF, PostScript, or Text • Manual tagging on the training corpus • Unanimous agreement by three co-authors • Evaluation strategy: ten-fold cross validation • Metrics: F1 (an overall measure of classification performance)
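The evaluation setup can be sketched as follows: per-class F1 is the harmonic mean of precision and recall, and ten-fold cross validation tests every document exactly once. The interleaved fold assignment (no shuffling or stratification) is a simplifying assumption, and the helper names are hypothetical.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx
```

For the 1020 tagged documents, `k_fold_indices(1020)` produces ten splits of 918 training and 102 test documents each.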
Results with random set [results table not preserved] • Best items are in purple boxes. • Acc_tr: classification accuracy on the training set.
Results (Cont’d) • On average, SVM outperforms NB on our syllabus classification task. • All classifiers fail to identify the partial syllabus class. • The kernel settings for NB do not help in the syllabus classification task. • Classification accuracy on the training data is not high.
Future Work • Feature selection • Add general feature selection methods for text classification • e.g., Document Frequency, Information Gain, and Mutual Information • Hybrid: combine our genre-specific features with the general features
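Two of the general feature selection measures named here can be sketched in a few lines. This is an illustrative sketch that assumes each document is represented as a set of tokens; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def document_frequency(term, docs):
    """Number of documents (token sets) that contain the term."""
    return sum(1 for tokens in docs if term in tokens)


def information_gain(term, docs, labels):
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    with_term = [y for tokens, y in zip(docs, labels) if term in tokens]
    without_term = [y for tokens, y in zip(docs, labels) if term not in tokens]
    total = len(labels)
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    conditional = sum(
        (len(part) / total) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - conditional
```

Terms would then be ranked by score and the top-scoring ones kept as general features alongside the genre-specific ones.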
Future Work (Cont’d) • Syllabus Library • Welcome to http://doc.cs.vt.edu • Share your favorite course resources – not limited to the syllabus genre. • Information Extraction • Semantic search • Personalization
Summary • Towards a syllabus library • Starting from search results on the web • Classification of the search results for true syllabi • SVM is a better choice for our syllabus classification task. • Towards an educational on-line community around the syllabus library