
EFFORTS TO AUTOMATE LABELING OF LECTURES WITH COMPUTING ONTOLOGY TERMS



Presentation Transcript


  1. EFFORTS TO AUTOMATE LABELING OF LECTURES WITH COMPUTING ONTOLOGY TERMS Felicia Decker and Lois Delcambre Portland State University

  2. PREVIOUS WORK
     • Course
        • Intro to Databases
        • We found 6 courses – on the web – with all lectures
     • Lecture notes
        • ppt/pdf/html
     • Hand-labeled each lecture topic with Computing Ontology (CO) terms
        • used this to validate the CO
        • leaf CO terms correspond to lecture topics

  3. CURRENT WORK
     • Will the words that appear in these lecture notes help us choose CO terms? Are there “signature” words for each topic?
     • Tools
        • Lucene
        • Converter tools (ppt/pdf/html -> text)
        • Microsoft Excel

  4. LUCENE
     • Index lecture notes
        • text from one lecture = one document
        • documents/lectures from one course = one collection (with an index)
     • Provides us with
        • Term frequency (tf)
        • Inverse document frequency (idf)
        • Tf-idf
     • Currently using single words, just now introducing stemming
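The quantities listed on this slide can be illustrated with a minimal sketch. It assumes the textbook definitions (tf as a raw count, idf as ln(N / df)), which are not necessarily the exact formulas Lucene applies, and the lecture names and texts are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into single-word tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(collection):
    """Return {lecture_name: {word: tf-idf}} for the lectures of one course.

    Assumed textbook weighting (not necessarily Lucene's):
      tf  = raw count of the word in the lecture
      idf = ln(N / df), N = lectures in the course,
            df = lectures that contain the word
    """
    counts = {name: Counter(tokenize(text)) for name, text in collection.items()}
    n_docs = len(counts)
    df = Counter()
    for tf in counts.values():
        df.update(tf.keys())  # each lecture contributes 1 per distinct word
    return {
        name: {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
        for name, tf in counts.items()
    }


# Hypothetical miniature course: one document per lecture, made-up text.
course = {
    "lecture_normalization": "functional dependency decomposition normal form dependency",
    "lecture_sql": "select from where join query",
    "lecture_optimization": "query plan join cost query",
}
scores = tf_idf(course)

# Rank the words of one lecture by descending tf-idf (ties broken alphabetically).
ranked = sorted(scores["lecture_normalization"].items(), key=lambda kv: (-kv[1], kv[0]))
```

Under this weighting, a word that occurs in every lecture of the course gets idf = 0 and therefore drops to the bottom of each lecture's ranking.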

  5. CONVERTER TOOLS
     • Lecture notes come in different formats
     • PPT -> text
        • Apache POI
     • PDF -> text
        • TextMiningTool 1.1.42
        • Xpdf-3.02
     • HTML -> text
        • Copy/paste
        • Internet Explorer – save webpage as text
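The HTML -> text step was done by hand in this work. As one possible scripted alternative (an assumption, not the tool the authors used), the following sketch strips markup with Python's standard `html.parser` module; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of script/style elements currently open

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML lecture page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


page = ("<html><head><style>p{}</style></head>"
        "<body><h1>Normalization</h1><p>Third normal form</p></body></html>")
text = html_to_text(page)
```

The resulting string can then be indexed as one document per lecture, the same as text converted from PPT or PDF.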

  6. EXCEL
     • After using Lucene to get tf, idf and tf-idf data for each term in the given index…
        • Select a CO term: e.g., Normalization
        • Using CO-labeled lecture notes (previous work), choose the lectures labeled with Normalization
        • Compile tf/idf/tf-idf data into one spreadsheet

  7. HAND-LABEL WORDS FROM LECTURES AS “IMPORTANT”
     • Signature words were human-selected from Database Management Systems by Ramakrishnan and Gehrke, 3rd Ed.
     • Use Find All/Replace All function in Excel to highlight all signature words that identify Normalization
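The highlighting step was done in Excel. A rough programmatic equivalent might look like the sketch below, where the function names, the word list and the scores are all invented for illustration and are not the authors' data.

```python
def flag_signature_words(ranked_terms, signature_words):
    """Attach True/False to each (word, score) row: is it a signature word?"""
    signature = {w.lower() for w in signature_words}
    return [(word, score, word.lower() in signature) for word, score in ranked_terms]


def signature_hits_in_top(flagged, k):
    """Count how many of the k highest-ranked rows are signature words."""
    return sum(1 for _, _, is_sig in flagged[:k] if is_sig)


# Hypothetical tf-idf ranking for lectures labeled "Normalization".
ranked_terms = [
    ("dependency", 2.2),
    ("employee", 1.9),       # a word from a running example
    ("decomposition", 1.1),
    ("salary", 0.8),
]
# Hypothetical hand-selected signature words for Normalization.
signature_words = {"dependency", "decomposition", "lossless"}

flagged = flag_signature_words(ranked_terms, signature_words)
hits = signature_hits_in_top(flagged, 2)
```

Counting signature words near the top of the ranking gives a quick check of whether tf-idf alone surfaces them.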

  8. INITIAL EFFORT

  9. INITIAL EFFORT: RESULTS
     • Conclusions
        • Tf-idf is not a strong indicator
           • Cannot solely rely on tf-idf
        • ‘Running example’
           • While good for teaching
           • We don’t care about this data
        • Stemming is important
        • Use of phrases may help

  10. NEXT STEPS
     • Intersection of terms across all classes
        • May solve ‘running example’ problem
        • Compute average rank
        • Compute average tf-idf (?)
     • Union all documents with the same CO label (union text from all the lectures on normalization, union text from all lectures on query optimization, etc.)
        • Look at tf-idf
     • Consider various classification algorithms (looking to see if there are some implemented for Lucene)
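Two of these next steps can be sketched as follows. This is only an illustration of the idea under assumed inputs (per-course tf-idf scores for lectures that share a CO label), not the authors' implementation; the function names and all values are hypothetical.

```python
def rank_terms(scores):
    """Map each word to its 1-based rank by descending score (ties alphabetical)."""
    ordered = sorted(scores, key=lambda w: (-scores[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}


def average_rank_of_shared_terms(per_course_scores):
    """Keep only words present in every course and average their ranks."""
    ranks = [rank_terms(s) for s in per_course_scores.values()]
    shared = set.intersection(*(set(r) for r in ranks))
    return {w: sum(r[w] for r in ranks) / len(ranks) for w in shared}


def union_by_label(lectures):
    """Concatenate the text of all lectures that carry the same CO label."""
    merged = {}
    for label, text in lectures:
        merged.setdefault(label, []).append(text)
    return {label: " ".join(texts) for label, texts in merged.items()}


# Hypothetical tf-idf scores for the Normalization lectures of two courses.
per_course = {
    "course_a": {"dependency": 3.0, "employee": 2.5, "decomposition": 1.0},
    "course_b": {"supplier": 4.0, "dependency": 2.0, "decomposition": 1.5},
}
avg_rank = average_rank_of_shared_terms(per_course)

merged = union_by_label([
    ("Normalization", "a b"),
    ("Query Optimization", "c"),
    ("Normalization", "d"),
])
```

In this toy input the running-example words ("employee", "supplier") each occur in only one course, so the intersection drops them, which is the effect the slide hopes for.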
