
Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information




Presentation Transcript


  1. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information Authors: Yutaka Matsuo & Mitsuru Ishizuka Designed by CProDM Team

  2. Outline Introduction Study Algorithm Algorithm Implementation Evaluation

  3. Introduction Preprocessing: discard stop words, stem, extract frequency. Processing: select frequent terms, clustering, expected probability, calculate χ² value. Output keywords.

  4. Study Algorithm • Preprocessing Goal: - Remove unnecessary words in the document. - Get terms that are candidate keywords. Stop words: function words such as and, the, and of, or other words with minimal lexical meaning. Stem: remove suffixes from words.

  5. Study Algorithm • Preprocessing Original text: It might be urged that when playing the “imitation game" the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind. In any case there is no intention to investigate here the theory of the game, and it will be assumed that the best strategy is to try to provide answers that would naturally be given by a man. After discarding stop words: urged playing “imitation game" best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man.

  6. Study Algorithm • Preprocessing urged playing “imitation game" best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man. After stemming: urge play “imitation game" best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man.

  7. Study Algorithm • Preprocessing urge play “imitation game" best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man. Extract frequency: count the occurrences of each stem (e.g. imitation, best, strategi, man).
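The three preprocessing stages above (stop-word removal, stemming, frequency extraction) can be sketched as follows. This is illustrative only: the stop-word list is a small stand-in for a full list, and `crude_stem` is a toy replacement for the Porter stemmer the examples imply.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real implementation would use a full one.
STOP_WORDS = {"it", "might", "be", "that", "when", "the", "of", "a", "an",
              "and", "is", "to", "in", "this", "i", "there", "any", "no", "by"}

def crude_stem(word):
    # Toy suffix stripping for illustration; the slides' output suggests a
    # proper stemmer such as Porter's (e.g. nltk.stem.PorterStemmer).
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())            # tokenize
    stems = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)                                  # extract frequency

freq = preprocess("The best strategy is to try the best strategy of imitation.")
```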

  8. Study Algorithm • Term Co-occurrence and Importance The top ten frequent terms (denoted as G) and their probabilities of occurrence, normalized so that the sum is 1.

  9. Study Algorithm • Term Co-occurrence and Importance Two terms in a sentence are considered to co-occur once.

  10. Study Algorithm • Term Co-occurrence and Importance Co-occurrence probability distributions of some terms with the frequent terms.

  11. Study Algorithm • Term Co-occurrence and Importance The statistical value of χ² is defined as χ²(w) = Σg∈G (freq(w, g) − nw·pg)² / (nw·pg), where: pg — unconditional probability of a frequent term g ∈ G (the expected probability); nw — the total number of co-occurrences of term w with the frequent terms G; freq(w, g) — frequency of co-occurrence of term w and term g.
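Under these definitions the χ² score can be computed from sentence-level co-occurrence counts. A minimal sketch, assuming sentences arrive as lists of preprocessed terms and taking the size of G as a parameter:

```python
from collections import Counter
from itertools import product

def chi_squared_scores(sentences, top_k=3):
    """sentences: list of lists of preprocessed terms."""
    term_freq = Counter(t for s in sentences for t in s)
    G = [g for g, _ in term_freq.most_common(top_k)]      # frequent terms
    total = sum(term_freq[g] for g in G)
    p = {g: term_freq[g] / total for g in G}              # expected probability p_g

    # Two terms in a sentence are considered to co-occur once.
    cooc = Counter()
    for s in sentences:
        uniq = set(s)
        for w, g in product(uniq, uniq & set(G)):
            if w != g:
                cooc[(w, g)] += 1

    scores = {}
    for w in term_freq:
        n_w = sum(cooc[(w, g)] for g in G)                # co-occurrences of w with G
        if n_w == 0:
            continue
        # chi2(w) = sum over g of (freq(w, g) - n_w * p_g)^2 / (n_w * p_g)
        scores[w] = sum((cooc[(w, g)] - n_w * p[g]) ** 2 / (n_w * p[g]) for g in G)
    return scores
```

Terms whose co-occurrence with the frequent terms deviates most from the expected distribution receive the highest scores.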

  12. Study Algorithm • Term Co-occurrence and Importance

  13. Study Algorithm • Algorithm improvement If a term appears in a long sentence, it is likely to co-occur with many terms; if a term appears in a short sentence, it is less likely to co-occur with other terms. We therefore consider the length of each sentence and revise the definitions: pg — (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document); nw — the total number of terms in the sentences where w appears, including w.

  14. Study Algorithm • Algorithm improvement The following function measures the robustness of bias values: χ′²(w) = χ²(w) − maxg∈G (freq(w, g) − nw·pg)² / (nw·pg).
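The idea of this robustness adjustment is that a term whose χ² score is owed entirely to one frequent term should be discounted, so the single largest per-term contribution is subtracted. A sketch, assuming co-occurrence counts `cooc`, probabilities `p`, and `n_w` defined as in the χ² computation:

```python
def chi_squared_robust(cooc, p, n_w, w, G):
    """cooc: dict mapping (term, frequent-term) pairs to co-occurrence counts;
    p: expected probability p_g per frequent term; n_w: co-occurrence total of w."""
    # Per-g contributions to the chi-squared sum.
    terms = [(cooc.get((w, g), 0) - n_w * p[g]) ** 2 / (n_w * p[g]) for g in G]
    # chi'2(w) = chi2(w) - max_g contribution: drop the single largest bias
    # term so one dominant co-occurrence cannot carry the whole score.
    return sum(terms) - max(terms)
```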

  15. Study Algorithm • Algorithm improvement To improve extracted keyword quality, we cluster terms. Two major approaches (Hofmann & Puzicha 1998) are: • Similarity-based clustering: if terms w1 and w2 have similar distributions of co-occurrence with other terms, w1 and w2 are placed in the same cluster. • Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are placed in the same cluster.

  16. Study Algorithm • Algorithm improvement Similarity-based clustering centers on the red circles; pairwise clustering focuses on the yellow circles.

  17. Study Algorithm • Algorithm improvement Similarity-based clustering: cluster a pair of terms whose co-occurrence distributions are close under the Jensen–Shannon divergence (closer than a fixed threshold).

  18. Study Algorithm • Algorithm improvement Pairwise clustering: cluster a pair of terms whose mutual information exceeds a fixed threshold.
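Both clustering criteria can be sketched directly. Here `js_divergence` takes two co-occurrence distributions as dicts and `pmi` takes joint and marginal probabilities; the function names and the thresholds mentioned in the comments are illustrative, not the paper's exact constants.

```python
import math

def js_divergence(p1, p2):
    """Jensen-Shannon divergence of two discrete distributions (dicts)."""
    def entropy(p):
        return -sum(v * math.log(v) for v in p.values() if v > 0)
    keys = set(p1) | set(p2)
    avg = {k: (p1.get(k, 0.0) + p2.get(k, 0.0)) / 2 for k in keys}
    return entropy(avg) - (entropy(p1) + entropy(p2)) / 2

def pmi(p_joint, p_w1, p_w2):
    """Pointwise mutual information: log P(w1, w2) / (P(w1) * P(w2))."""
    return math.log(p_joint / (p_w1 * p_w2))

# Similarity-based: merge w1, w2 when js_divergence(...) is below a threshold.
# Pairwise: merge w1, w2 when pmi(...) is above a threshold.
```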

  19. Study Algorithm • Algorithm improvement

  20. Algorithm Implementation Step 1: Preprocessing Step 2: Selection of frequent terms Step 3: Clustering frequent terms Step 4: Calculation of expected probability Step 5: Calculation of χ′² value Step 6: Output keywords

  21. Algorithm Implementation • Step 1: Preprocessing Discard stop words Stem Extract frequency

  22. Algorithm Implementation • Step 2: Selection of frequent terms Select the top frequent terms, up to 30% of the number of running terms, as a standard set of terms. Count the number of terms in the document (Ntotal).

  23. Algorithm Implementation • Step 3: Clustering frequent terms • Similarity-based clustering • Pairwise clustering

  24. Algorithm Implementation • Step 4: Calculate expected probability Count the number of terms co-occurring with c ∈ C, denoted as nc, to yield the expected probability pc = nc / Ntotal.

  25. Algorithm Implementation • Step 5: Calculate χ′² value Where: freq(w, c) — the co-occurrence frequency of w with c ∈ C; nw — the total number of terms in the sentences including w.
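Steps 4 and 5 can be sketched together. The sentence representation and the reading of freq(w, c) as a sentence-length-weighted count are assumptions based on the definitions above, not the paper's verbatim code:

```python
def cluster_chi_squared(sentences, clusters):
    """sentences: list of term lists; clusters: list of sets of frequent terms."""
    n_total = sum(len(s) for s in sentences)              # N_total from Step 2

    # Step 4: expected probability p_c = n_c / N_total, where n_c counts
    # terms in sentences that co-occur with cluster c.
    p = []
    for c in clusters:
        n_c = sum(len(s) for s in sentences if c & set(s))
        p.append(n_c / n_total)

    # Step 5: chi'2 value per term, using sentence-length-corrected counts.
    scores = {}
    for w in {t for s in sentences for t in s}:
        n_w = sum(len(s) for s in sentences if w in s)    # terms in sentences with w
        chi2 = 0.0
        for c, p_c in zip(clusters, p):
            if p_c == 0:
                continue
            freq_wc = sum(len(s) for s in sentences if w in s and c & set(s))
            chi2 += (freq_wc - n_w * p_c) ** 2 / (n_w * p_c)
        scores[w] = chi2
    return scores

# Step 6: output the top-ranked terms as keywords, e.g.
# sorted(scores, key=scores.get, reverse=True)[:10]
```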

  26. Algorithm Implementation • Step 6: Output keywords

  27. Evaluation

  28. Evaluation In this paper, we developed an algorithm to extract keywords from a single document. The main advantages of our method are its simplicity, requiring no corpus, and its high performance, comparable to the tf-idf algorithm. As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction.

  29. Thank you for your attention
