
Enhancing Topic Modeling with Semantic and Network Structure: A Study on Olympic Articles

This project explores integrating semantic content and network structure in topic modeling. It builds on Probabilistic Latent Semantic Analysis (PLSA) and its network-aware extension, NetPLSA, and evaluates them on about 2,700 Machine Learning papers from the Cora citation corpus. The results support the claim that adding network structure can improve topic modeling, although the approach has higher time complexity, converges more slowly, and remains far from satisfactory. Future work includes modeling the network structure of blog articles, improving bag-of-words extraction, and making the algorithm more efficient.


Presentation Transcript


  1. Topic Modeling Using Semantic and Network Structure • Sophia (Xueyao) Liang • CPSC 503 Final Project

  2. Topic modeling (unsupervised) • Example with K = 3 topics: {Olympic, Vancouver}, {snow, cold}, {moonlight, Spider-Man} • Each document d gets a probability P(z | d) for each topic z

  3. PLSA

  4. PLSA: $z_k \in \{z_1, z_2, \ldots, z_N\}$
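
The slide's own equations did not survive in the transcript. For reference, the standard PLSA aspect model can be written as below; the symbols $n(d, w)$ (count of word $w$ in document $d$) and $L$ (log-likelihood) are assumptions that do not appear on the slide.

```latex
P(w \mid d) = \sum_{k=1}^{N} P(w \mid z_k)\, P(z_k \mid d),
\qquad
L = \sum_{d} \sum_{w} n(d, w)\, \log \sum_{k=1}^{N} P(w \mid z_k)\, P(z_k \mid d)
```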

  5. PLSA: parameter inference via EM • Expectation step • Maximization step
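
The two EM steps named on this slide can be sketched as follows. This is a minimal pure-Python illustration of textbook PLSA, not the project's actual code; the function name `plsa`, its parameters, and the dense `counts` matrix layout are assumptions made for the example.

```python
import math
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM. counts[d][w] is the count of word w in document d."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total == 0:  # empty document or unused topic: fall back to uniform
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Random initialisation of P(z|d) and P(w|z); every row sums to 1.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    log_likelihoods = []
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        ll = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                total = sum(joint)
                ll += n * math.log(total)
                for z in range(n_topics):
                    resp = joint[z] / total
                    # Accumulate expected counts for the M-step.
                    new_z_d[d][z] += n * resp
                    new_w_z[z][w] += n * resp
        # M-step: renormalise the expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        log_likelihoods.append(ll)  # likelihood under the pre-update parameters
    return p_z_d, p_w_z, log_likelihoods
```

Calling `plsa(counts, n_topics=2)` on a small count matrix returns the document-topic rows, the topic-word rows, and the log-likelihood trace, which should not decrease from one iteration to the next.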

  6. PHITS

  7. Semantic + network

  8. NetPLSA

  9. NetPLSA

  10. NetPLSA

  11. NetPLSA parameter inference: no closed-form solution for the expectation step • Efficient algorithm: • Expectation (PLSA) • Maximization (PLSA) • The result of the previous steps may not yield a better value of O
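
One plausible reading of the efficient algorithm is: run the ordinary PLSA steps, then smooth each document's topic distribution toward its network neighbours, and keep the smoothed version only if the objective O improves. The sketch below illustrates that check. The form of O (log-likelihood minus a weighted graph penalty), the names `gamma`, `lam`, and `log_lik_fn`, and the `(u, v, weight)` edge format are all assumptions, since the slide's formulas are not preserved.

```python
def regularizer(p_z_d, edges):
    """Graph penalty: 0.5 * sum over edges of w * ||P(.|u) - P(.|v)||^2."""
    return 0.5 * sum(
        w * sum((a - b) ** 2 for a, b in zip(p_z_d[u], p_z_d[v]))
        for u, v, w in edges
    )


def smooth_over_network(p_z_d, edges, gamma=0.5):
    """Move each document's topics toward the weighted mean of its neighbours."""
    n_docs, n_topics = len(p_z_d), len(p_z_d[0])
    neighbour_sum = [[0.0] * n_topics for _ in range(n_docs)]
    degree = [0.0] * n_docs
    for u, v, w in edges:  # undirected: each edge counts in both directions
        for a, b in ((u, v), (v, u)):
            degree[a] += w
            for z in range(n_topics):
                neighbour_sum[a][z] += w * p_z_d[b][z]
    smoothed = []
    for d in range(n_docs):
        if degree[d] == 0:  # isolated document: nothing to smooth against
            smoothed.append(list(p_z_d[d]))
            continue
        smoothed.append([
            (1 - gamma) * p_z_d[d][z] + gamma * neighbour_sum[d][z] / degree[d]
            for z in range(n_topics)
        ])
    return smoothed


def objective(log_likelihood, p_z_d, edges, lam):
    """Assumed objective to maximise: (1 - lam) * L - lam * R."""
    return (1 - lam) * log_likelihood - lam * regularizer(p_z_d, edges)


def accept_if_better(log_lik_fn, p_z_d, edges, lam, gamma=0.5):
    """Keep the smoothed distributions only when they raise the objective."""
    candidate = smooth_over_network(p_z_d, edges, gamma)
    old = objective(log_lik_fn(p_z_d), p_z_d, edges, lam)
    new = objective(log_lik_fn(candidate), candidate, edges, lam)
    return candidate if new > old else p_z_d
```

Here `log_lik_fn` is a hypothetical callable that returns the PLSA log-likelihood for a given set of document-topic distributions. The explicit comparison in `accept_if_better` reflects the slide's caveat that a step is not guaranteed to improve O.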

  12. NetPLSA • Potential problems of the model • Parameter inference • Higher time complexity and slower convergence

  13. Corpus • Cora Data version 1.0: 30,000 scientific papers with citation information • Important files: papers (ID-name, link, author, …), citations (ID-cited ID), classifications (link-category) • Directory: extractions (PostScript form of the papers) • Problems: cited papers not in the corpus; no abstract for some PostScript files; too many categories; duplicated or isolated papers
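
Two of the corpus problems listed here (citations pointing outside the corpus, and isolated papers) can be handled by filtering the citation pairs before building the network. The sketch below assumes the citations have already been loaded as `(citing_id, cited_id)` pairs; the function name and that in-memory format are hypothetical, and collapsing repeated IDs is only a simplification of real duplicate detection.

```python
def clean_citation_graph(paper_ids, citations):
    """Return (kept_ids, edges) restricted to in-corpus, non-isolated papers."""
    corpus = set(paper_ids)  # a set also collapses repeated IDs
    edges = set()
    for citing, cited in citations:
        # Drop citations that leave the corpus, and self-citations.
        if citing in corpus and cited in corpus and citing != cited:
            edges.add((citing, cited))
    connected = {paper for edge in edges for paper in edge}
    kept = sorted(corpus & connected)  # isolated papers have no remaining edge
    return kept, sorted(edges)
```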

  14. Corpus • Cora Data version 1.0 • Papers in the Machine Learning category • About 2,700 papers • 1,400 frequent words (stop words removed, stemmed)
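
The preprocessing summarized on this slide (stop-word removal, stemming, keeping the most frequent words) can be sketched as below. The stop-word list is a tiny illustrative sample, and `crude_stem` is a deliberately rough stand-in for a real stemmer such as Porter's; neither is taken from the project.

```python
import re
from collections import Counter

# Illustrative stop-word sample only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with", "we"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into alphabetic tokens, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def bag_of_words(documents, vocab_size):
    """Return (vocabulary, count matrix) over the vocab_size most frequent stems."""
    tokenized = [tokenize(doc) for doc in documents]
    totals = Counter(tok for doc in tokenized for tok in doc)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok in doc:
            if tok in index:  # words outside the vocabulary are ignored
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix
```

With `vocab_size=1400` this would produce a matrix of the shape the slide describes, one row per paper and one column per frequent word.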

  15. Results

  16. Results • Overall accuracy • (A) Accuracy and (B) recall for each category
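
The figures on this slide are not preserved, but the metrics it names can be computed as in the sketch below. It assumes each paper has one true category and one predicted category, and it treats per-category "accuracy" as precision; that interpretation and the function name `evaluate` are assumptions.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return overall accuracy plus per-category (precision, recall)."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = Counter(t for t, p in pairs if t == p)
    true_totals = Counter(true_labels)
    predicted_totals = Counter(predicted_labels)
    overall = sum(correct.values()) / len(pairs)
    per_category = {}
    for category in sorted(set(true_labels) | set(predicted_labels)):
        hits = correct[category]
        # Precision: share of papers predicted as this category that are right.
        precision = hits / predicted_totals[category] if predicted_totals[category] else 0.0
        # Recall: share of papers truly in this category that were found.
        recall = hits / true_totals[category] if true_totals[category] else 0.0
        per_category[category] = (precision, recall)
    return overall, per_category
```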

  17. Evaluation • Supported the claim that adding network structure to the model can improve topic modeling results • Modeled the network at the level of individual articles • Inherent problems exist in the chosen framework • The results are still far from satisfactory

  18. Future work • How to model the network structure of blog articles, especially at the level of individual articles • Bag-of-words matrix extraction • A better integrated model, perhaps LDA-based • Efficiency of the algorithm • Recommendation based on topic community discovery
