
An Automatic Construction of Arabic Similarity Thesaurus

Abdulaziz Al-Qabbany, AbdulMalik Al-Salman, Abdulrahman Almuhareb. CITALA 2009.



Presentation Transcript


  1. An Automatic Construction of Arabic Similarity Thesaurus • Abdulaziz Al-Qabbany, AbdulMalik Al-Salman, Abdulrahman Almuhareb • CITALA 2009

  2. Outline • Introduction • Thesauruses • Similarity Thesaurus • Proposed Improvement • The Experiment • Evaluation • Discussion • Conclusions and Future Work

  3. Introduction • Thesaurus importance • Effective Information Retrieval systems • Vocabulary mismatch problem • Query Expansion

  4. Thesauruses • Arabic thesauruses • Manual construction drawbacks: • cost • time • subjectivity • Automatic construction approaches

  5. Similarity Thesaurus • Qiu and Frei (1993) presented a query expansion model that uses a similarity thesaurus. • Zazo et al. (2005) used the same approach to construct a Spanish similarity thesaurus. • Queries are expanded with terms that are similar to the query concept as a whole rather than to its individual terms.

  6. Similarity Thesaurus (cont.) • Using a similarity thesaurus is analogous to translating from one language to another. • Example: the term قرص (disc, tablet) appears in قرص الشمس (the sun's disc), قرص الدواء (medicine tablet) and قرص ضوئي (optical disc).

  7. Similarity Thesaurus construction • The similarity thesaurus is a matrix that represents term-to-term similarities. • Each term is represented by a vector that describes its relation to each document. • The matrix is generated by calculating the similarities between the term vectors.
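The construction steps on this slide can be sketched roughly as follows. The slides do not state the term weighting or the similarity measure, so raw term counts and cosine similarity are assumptions made here for illustration only. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def term_vectors(docs):
    """Map each term to a vector of its raw counts, one entry per document.

    Raw counts are an assumption; the slides do not specify the weighting.
    """
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    return {term: [c[term] for c in counts] for term in vocab}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_thesaurus(docs):
    """Build the term-by-term similarity matrix as a nested dict."""
    vectors = term_vectors(docs)
    return {
        t1: {t2: cosine(v1, v2) for t2, v2 in vectors.items()}
        for t1, v1 in vectors.items()
    }

# Hypothetical tokenized documents, used only to exercise the functions.
docs = [
    ["sun", "disc", "light"],
    ["disc", "optical", "light"],
    ["tablet", "medicine"],
]
thesaurus = similarity_thesaurus(docs)
```

Terms that never share a document, such as "sun" and "tablet" in the toy data, get a similarity of zero under this sketch.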

  8. Query Expansion using Similarity Thesaurus • The similarity between a query q and any term t is computed as the sum of the similarity values between each query term and t: $\mathrm{SIM\_QT}(q, t) = \sum_{t_i \in q} \mathrm{SIM}(t_i, t)$ • In response to any query, the terms can be ranked in descending order of their SIM_QT values.
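A minimal sketch of the sum-based scoring and ranking described on this slide. It follows the verbal definition (an unweighted sum over the query terms). The nested-dict layout, the function names and the similarity values are all hypothetical and serve only as illustration.

```python
def sim_qt_sum(query_terms, term, sim):
    """SUM score: add up the similarity of `term` to every query term."""
    return sum(sim[q].get(term, 0.0) for q in query_terms)

def expand_query_sum(query_terms, sim, top_k=10):
    """Rank candidate expansion terms by descending SUM score."""
    candidates = {t for q in query_terms for t in sim[q]} - set(query_terms)
    scored = [(t, sim_qt_sum(query_terms, t, sim)) for t in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical similarity values: sim[query_term][candidate_term].
toy_sim = {
    "jacques": {"president": 0.6, "france": 0.5, "brel": 0.95},
    "chirac": {"president": 0.7, "france": 0.6, "brel": 0.2},
}
ranking = expand_query_sum(["jacques", "chirac"], toy_sim)
```

In this toy data "brel" is strongly related to only one of the two query terms, yet its total still exceeds that of "france", which is moderately related to both.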

  9. “SUM” method • The “SUM” method is appropriate when the similarity values between the query terms and the indexed term fall consistently within the same range. • When the similarity values are inconsistent, the differences between them are not reflected in the total sum. • Similarity values are considered inconsistent when they contain outliers.

  10. Outliers • An outlier is a value that is considerably dissimilar to, or inconsistent with, the majority of the data. [Figure: plot with axes X and Y, one point marked as an outlier]

  11. Proposed Improvement • A given term should have a high similarity value with each individual term in the query in order to be considered related. • The dispersion of the similarity values is one of the factors that need to be considered in query expansion. • The total similarity value should remain the main factor in query expansion.

  12. Proposed Improvement (cont.) • Instead of the sum of the similarity values, we use the mean of the values minus the standard error of the mean (SE): $\mathrm{SIM\_QT}(q, t) = \frac{1}{n} \sum_{t_i \in q} \mathrm{SIM}(t_i, t) - SE$ • The standard error of the mean is a measure of data dispersion: $SE = \frac{\sigma}{\sqrt{n}}$, where σ is the standard deviation and n is the number of values.
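The mean-minus-SE score can be sketched as below. The slides do not say whether the standard deviation is the sample or the population version; the sample version (`statistics.stdev`) is assumed here, and SE is taken as zero for a one-term query. The function name and the similarity values are hypothetical.

```python
import math
import statistics

def sim_qt_mean(values):
    """MEAN score: mean of the similarity values minus their standard error.

    Assumes the sample standard deviation; SE is 0 for a single value.
    """
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return mean - se

# Hypothetical similarities of each candidate to the two query terms.
toy_values = {
    "president": [0.6, 0.7],
    "france": [0.5, 0.6],
    "brel": [0.95, 0.2],  # strongly related to only one query term
}

by_sum = sorted(toy_values, key=lambda t: sum(toy_values[t]), reverse=True)
by_mean = sorted(toy_values, key=lambda t: sim_qt_mean(toy_values[t]), reverse=True)
```

With these toy values the sum ranks "brel" above "france", while the mean-minus-SE score ranks it last, because the large spread between 0.95 and 0.2 is subtracted from its mean.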

  13. The Experiment • We used the Agence France-Presse (AFP) Arabic news of 2004, 2005 and 2006 as the document collection. • This document collection is part of the LDC Arabic Gigaword corpus (Third Edition). • After examining the high-frequency terms in the collection, we chose 150 stop words.
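One way to shortlist stop-word candidates from high-frequency terms is sketched below. The slides say the terms were examined before 150 were chosen, so this only produces a candidate list for manual review. The function name and the toy documents are hypothetical.

```python
from collections import Counter

def stop_word_candidates(docs, k=150):
    """Return the k most frequent terms in the collection, most frequent first."""
    freq = Counter(term for doc in docs for term in doc)
    return [term for term, _ in freq.most_common(k)]

# Hypothetical tokenized documents.
docs = [
    ["the", "president", "of", "the", "republic"],
    ["the", "visit", "of", "the", "president"],
    ["the", "news"],
]
candidates = stop_word_candidates(docs, k=3)
```

Frequency alone also surfaces content words that are common in news text ("president" in the toy data), which is one reason a manual pass over the list is useful.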

  14. Document collection characteristics

  15. Evaluation • The objective of the evaluation was to assess the relevance strength of the produced terms. • The evaluation process was applied to both the “SUM” and “MEAN” methods. • We selected twenty common topics from five different domains.

  16. Evaluation (cont.) • For each topic, the top ten related terms were presented to five expert evaluators. • Each evaluator was asked to study the twenty topics carefully and then specify whether each produced term was relevant. • Levels of relevance: • Relevant • Somewhat Relevant • Irrelevant

  17. Evaluation Results • The relevance strength of the standard “SUM” method was 95.0%, while the relevance strength of the “MEAN” method was 98.1%.

  18. Discussion • We believe the main reason the “MEAN” method performs better is its ability to detect and exclude outliers. • Adding a single term to the query may completely change the concept of the query. • A candidate related term should have consistent similarities with all of the query terms.

  19. Example • The response to a query about the former French president “جاك شيراك” (Jacques Chirac):

  20. Example (cont.)

  21. Conclusions • The relevance strength of the standard “SUM” method was 95.0%, while the relevance strength of the “MEAN” method was 98.1%. • The “MEAN” method shows a relative improvement of about 3.3% over the “SUM” method. • We conclude that the “MEAN” method is more accurate mainly because it can detect and exclude outliers.
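The 3.3% figure is consistent with a relative gain computed from the two reported relevance strengths, as the short check below suggests. The variable names are hypothetical; only the two percentages come from the slides.

```python
sum_strength = 95.0   # reported relevance strength of the SUM method (%)
mean_strength = 98.1  # reported relevance strength of the MEAN method (%)

# Difference in percentage points.
absolute_gain = mean_strength - sum_strength

# Gain relative to the SUM baseline, expressed as a percentage.
relative_gain = absolute_gain / sum_strength * 100
```

The absolute difference is 3.1 percentage points, while the gain relative to the SUM baseline rounds to 3.3%.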

  22. Future Work • Applying word stemming. • Producing collocations. • Constructing a single word-category thesaurus. • Using the similarity thesaurus in question answering.

  23. End
