
Modeling Variable Dependencies between Characters in Chinese Information Retrieval


Presentation Transcript


  1. Modeling Variable Dependencies between Characters in Chinese Information Retrieval Lixin Shi, Jian-Yun Nie DIRO, University of Montreal

  2. Outline 1. Motivation 2. Related Work 3. Variable Dependency Model 4. Parameter Estimation 5. Experiment and Discussion 6. Conclusion and Future Work

  3. Motivation • Two approaches to indexing Chinese texts: • Character n-grams (unigrams and bigrams) • Segmented words • Traditional approaches assume independence among terms • Bag-of-words models • Current approaches typically combine different models with fixed weights (e.g., a fixed linear interpolation of unigram and bigram scores).

  4. Motivation (continued) • In reality, terms are often dependent, and term dependencies do not have equal importance in IR. • Strong: “hot dog”, “black Monday” • These dependencies should play an important role in IR • Weak: “computer game”, “text printing” • These dependencies should be given less weight • Dependencies are even more important to consider in Chinese IR • Characters can be strongly dependent: 京 (capital, Beijing) + 九 (nine) → 京九 (北京九龙, Beijing-Kowloon) • Weak dependency: 房 + 屋 → 房屋 (house) • We try to capture these various dependencies in our model and use an SVM to determine their weights.

  5. 2. Related Work

  6. Combining Different Indexes • Previous studies often combine characters, bigrams, and words • In the LM approach, a general way is the following interpolation, where V_R is the vocabulary of type R (U, B, or W) and λ_R is a fixed weight.
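
The interpolation formula itself appeared on the slide as an image and is not in the transcript; a standard form consistent with the description above (an assumption, not a verbatim reproduction of the slide) is:

\log P(Q \mid D) = \sum_{R \in \{U, B, W\}} \lambda_R \sum_{t \in Q_R} \log P_R(t \mid D)

where Q_R is the query represented with units of type R drawn from the vocabulary V_R.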

  7. Related Work in English • Combining the unigram model with bigram and biterm models • Markov Random Fields (MRF): an undirected graphical model that captures the dependencies of terms within the same clique (fully connected nodes) • MRF-FD (Full Dependence model): assumes that all terms are connected to each other; this leads to a complexity problem for large cliques • MRF-SD (Sequential Dependence model): considers only adjacent terms to be connected • Both use fixed weights for combination: λ_T (unigram), λ_O (ordered bigram), λ_U (unordered bigram) • Weighted SD model (WSD): a recent extension of MRF that allows λ_O and λ_U to vary depending on individual term pairs • Limitations: • Adjacent terms only are considered • Ordered and unordered term pairs use the same weight (i.e., λ_O = λ_U)
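
For context, the MRF sequential dependence ranking function of Metzler and Croft is commonly written as follows (a standard form given here for reference, not taken from the slide):

P(D \mid Q) \stackrel{rank}{=} \lambda_T \sum_{q_i \in Q} f_T(q_i, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)

where f_T, f_O, and f_U are the unigram, ordered-window, and unordered-window feature functions.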

  8. 3. A New Variable Dependency Model

  9. Discriminative Models • The model is defined within the framework of discriminative models • This allows us to selectively consider dependencies between more distant characters, without increasing complexity to account for less useful dependencies • The discriminative function can be a posterior probability or simply a confidence score • A typical discriminative model has the form shown below.
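
The formula on the slide is not in the transcript; a typical discriminative ranking function is a weighted linear combination of feature functions (an assumed generic form):

Score(Q, D) = \sum_{i} \lambda_i \, f_i(Q, D)

where each f_i is a feature function and λ_i its weight.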

  10. Our Model • We integrate three types of features: • Unigrams • Ordered bigrams • Unordered co-occurrence dependencies within a distance (window) w • λ_B and λ_{C_w} are the importance weights for a particular dependency between a term pair (λ_C is fixed to 1); a sketch of the scoring function follows.
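
The feature formulas shown on the slide are not reproduced in the transcript; a sketch of the scoring function consistent with the description above (assumed notation, not the authors' exact formula) is:

Score(Q, D) = f_U(Q, D) + \sum_{(q_i, q_{i+1})} \lambda_B(q_i q_{i+1}) \, f_B(q_i q_{i+1}, D) + \sum_{w \in \{2,4,8\}} \sum_{(q_i, q_j)} \lambda_{C_w}(q_i, q_j) \, f_{C_w}(q_i, q_j, D)

where f_U, f_B, and f_{C_w} are the unigram, ordered-bigram, and windowed co-occurrence features, and the λ's are pair-specific weights.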

  11. • The discriminative function is defined by the cross-entropy of the query language model and the document language model • We simply use maximum likelihood (ML) estimation for the query model, and Dirichlet smoothing for the document language model (R is U, B, or C_w).
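
In outline (assumed standard forms, not copied from the slide), each representation R contributes a cross-entropy score between the query and document models, with Dirichlet smoothing on the document side:

Score_R(Q, D) = \sum_{t \in V_R} P_{ML}(t \mid \theta_Q^R) \, \log P(t \mid \theta_D^R), \qquad P(t \mid \theta_D^R) = \frac{c(t; D) + \mu_R P(t \mid C_R)}{|D|_R + \mu_R}

where c(t; D) is the count of t in D, |D|_R is the document length under representation R, and P(t | C_R) is the collection language model.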

  12. 4. Parameter Estimation

  13. Estimation: Dirichlet Prior (μ) • In the document language model estimation, we use different window sizes W = {2, 4, 8} • We therefore have the following priors: μ_U, μ_B, μ_C2, μ_C4, μ_C8 • Intuitively, and as confirmed by our preliminary experiments, a longer document representation (e.g., C8) leads to higher sparsity, which requires a larger μ • The μ's are set roughly proportional to the document length under each representation (U, B, C2, C4, C8): μ_U = 1000, μ_B = 1000, μ_C2 = 1000, μ_C4 = 3000, μ_C8 = 7000.
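
As a small illustration only, these priors can be kept as a simple lookup table (the dictionary form is ours, not the authors'):

# Dirichlet priors per representation; values taken from the slide above.
dirichlet_mu = {"U": 1000, "B": 1000, "C2": 1000, "C4": 3000, "C8": 7000}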

  14. Estimation: Dependency Strength (λ's) • Learning process: • For each bigram and co-occurrence x_i in the training queries, use a coordinate ascent search algorithm to find its best weight λ*(x_i) • Extract a group of features f(x_i) for each x_i, giving the training data {(f(x_i), λ*(x_i))} • Train SVM models for B, C2, C4, and C8 respectively • For a new bigram or co-occurrence y in a query, we create its list of features and determine its weight using the SVM (see the sketch after this slide)
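
A minimal sketch of the weight-learning step described above, using scikit-learn's epsilon-SVR as a stand-in for the SVM regression; the function names, feature arrays, and coordinate-ascent targets are assumptions for illustration, not the authors' implementation:

# Train one SVR model per dependency type (B, C2, C4, C8) to map a
# feature vector describing a character pair to its dependency weight.
import numpy as np
from sklearn.svm import SVR

def train_weight_model(features, target_weights):
    # features: (n_pairs, n_features) array, one row per bigram/co-occurrence.
    # target_weights: best lambda*(x_i) found by coordinate ascent on training queries.
    model = SVR(kernel="rbf", epsilon=0.1)  # epsilon-SVR regression
    model.fit(features, target_weights)
    return model

def predict_weight(model, feature_vector):
    # Predict the dependency weight lambda for a new term pair in a query.
    return float(model.predict(np.asarray(feature_vector).reshape(1, -1))[0])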

  15. • We use epsilon Support Vector Regression (ε-SVR) for SVM training • We use the following features: • Pointwise mutual information (PMI) in an independent text collection: PMI_all(x) • PMI in the current test collection: PMI_coll(x) • A binary value from the test PMI_all(x) > threshold? • A binary value from the test PMI_coll(x) > threshold? • idf(x) - idf(q_i) - idf(q_j) • (idf(x) - idf(q_i) - idf(q_j)) / (idf(q_i) + idf(q_j)) • Does x appear as a Chinese Wikipedia title? • The distance between q_i and q_j • … • In our experiments, we use 10-fold cross-validation: each 1/10 of the data is used in turn as test data while the remaining 9/10 serve as training data (a PMI sketch follows).
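
A minimal sketch of the PMI features listed above; the count arguments are placeholders for collection statistics and are assumptions, not the authors' code:

import math

def pmi(count_pair, count_qi, count_qj, total_tokens):
    # Pointwise mutual information: log P(qi, qj) / (P(qi) * P(qj)).
    p_pair = count_pair / total_tokens
    p_qi = count_qi / total_tokens
    p_qj = count_qj / total_tokens
    if p_pair == 0 or p_qi == 0 or p_qj == 0:
        return float("-inf")
    return math.log(p_pair / (p_qi * p_qj))

# The binary features are simple thresholds on this value, e.g.
# int(pmi(...) > threshold), computed on either the independent
# collection (PMI_all) or the test collection (PMI_coll).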

  16. 5. Experiment and Discussion

  17. Experimental Setting • We convert all characters into GB-encoded simplified Chinese • Chinese texts are segmented into words by the ICTCLAS and LDC segmentation programs • We use Indri to build the indexes U, W, B, WU, BU, W+U, and B+U.

  18. The baselines (MAP) of traditional Chinese IR models • U: unigram; B: bigram; W: words • BU: bigrams and unigrams mixed in a single index • B+U: the scores of B and U are interpolated • WU: words and unigrams mixed in a single index • W+U: the scores of W and U are interpolated

  19. The baselines of dependency models: MRF-SD and WSD (†: t-test p < .05; ‡: t-test p < .01)

  20. The results of our VDM • Our model (VDM) outperforms all the baseline methods except in two cases; many of the improvements are statistically significant • With ideal parameters, the model largely outperforms the existing models.

  21. Examples of ideal variable weights • The slide shows per-pair weights for bigram (bi) and co-occurrence (co2, co4, co8) dependencies on sample queries, e.g., “impact of the 1986 immigration law” and 中东和平会议 (Middle East peace conference); the original figure is not reproduced in this transcript.

  22. 6. Conclusion and Future Work

  23. Conclusion • We propose a model that takes into account the relationships between different types of index units • In our model, a pair of characters contributes to the retrieval model according to the strength and usefulness of its dependency for IR • The assignment of variable weights to pairs of characters has not been investigated in previous studies • Our experiments showed that integrating term dependencies with variable weights can lead to higher effectiveness • The model proposed in this paper points to an interesting direction for future research: the integration of dependencies according to their usefulness in IR.

  24. Future Work • We have not exploited all the potential of the model; several aspects could be further improved: • Dependencies between pairs of characters could be extended to larger groups of characters • A larger amount of training data (such as user click-through data) could be used to learn the weights more accurately.

  25. Questions? Thanks
