
Modeling Variable Dependencies between Characters in Chinese Information Retrieval


Presentation Transcript


  1. Modeling Variable Dependencies between Characters in Chinese Information Retrieval Lixin Shi, Jian-Yun Nie DIRO, University of Montreal

  2. Outline 1. Motivation 2. Related Work 3. Variable Dependency Model 4. Parameter Estimation 5. Experiment and Discussion 6. Conclusion and Future Work

  3. Motivation • Two approaches to indexing Chinese texts: • Character n-grams (unigrams and bigrams) • Segmented words • Traditional approaches assume independence among terms • Bag-of-words models • Current approaches typically combine different models with fixed weights (e.g., a fixed linear interpolation of unigram and bigram scores).

  4. Motivation (continued) • In reality, terms are often dependent, and term dependencies do not have equal importance in IR. • Strong: “hot dog”, “black Monday” • These dependencies should play an important role in IR • Weak: “computer game”, “text printing” • These dependencies should be given less weight • Dependencies are even more important to consider in Chinese IR • Characters can be strongly dependent: 京 (capital, Beijing) + 九 (nine) → 京九 (北京九龙, Beijing-Kowloon) • Weak dependency: 房 + 屋 → 房屋 (house) • We try to capture these various dependencies in our model and use an SVM to determine their weights.

  5. 2. Related Work

  6. Combining Different Indexes • Previous studies often combine characters, bigrams, and words • In the LM approach, a general way is the following interpolation, where V_R is the vocabulary of type R (U, B, or W) and λ_R is a fixed weight.
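
The interpolation formula itself appeared on the slide as an image and is not in the transcript; a standard form consistent with the description above (an assumption, not a verbatim reproduction of the slide) is:

\log P(Q \mid D) = \sum_{R \in \{U, B, W\}} \lambda_R \sum_{t \in Q_R} \log P_R(t \mid D)

where Q_R is the query represented with units of type R drawn from the vocabulary V_R.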

  7. Related Work in English • Combining the unigram model with bigram and biterm models • Markov Random Fields (MRF): an undirected graphical model that captures the dependencies of terms within the same clique (fully connected nodes) • MRF-FD (Full Dependence model): assumes that all terms are connected to each other; this leads to a complexity problem for large cliques • MRF-SD (Sequential Dependence model): considers only adjacent terms to be connected • Both use fixed weights for combination: λ_T (unigram), λ_O (ordered bigram), λ_U (unordered bigram) • Weighted SD model (WSD): a recent extension of MRF that allows λ_O and λ_U to vary depending on individual term pairs • Limitations: • Adjacent terms only are considered • Ordered and unordered term pairs use the same weight (i.e., λ_O = λ_U)
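
For context, the MRF sequential dependence ranking function of Metzler and Croft is commonly written as follows (a standard form given here for reference, not taken from the slide):

P(D \mid Q) \stackrel{rank}{=} \lambda_T \sum_{q_i \in Q} f_T(q_i, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)

where f_T, f_O, and f_U are the unigram, ordered-window, and unordered-window feature functions.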

  8. 3. A New Variable Dependency Model

  9. Discriminative Models • The model is defined within the framework of discriminative models • This allows us to selectively consider dependencies between more distant characters, without increasing complexity to account for less useful dependencies • The discriminative function can be a posterior probability or simply a confidence score • A typical discriminative model has the form shown below.
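
The formula on the slide is not in the transcript; a typical discriminative ranking function is a weighted linear combination of feature functions (an assumed generic form):

Score(Q, D) = \sum_{i} \lambda_i \, f_i(Q, D)

where each f_i is a feature function and λ_i its weight.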

  10. Our Model • We integrate three types of features: • Unigrams • Ordered bigrams • Unordered co-occurrence dependencies within a distance (window) w • λ_B and λ_{C_w} are the importance weights for a particular dependency between a term pair (λ_C is fixed to 1); a sketch of the scoring function follows.
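
The feature formulas shown on the slide are not reproduced in the transcript; a sketch of the scoring function consistent with the description above (assumed notation, not the authors' exact formula) is:

Score(Q, D) = f_U(Q, D) + \sum_{(q_i, q_{i+1})} \lambda_B(q_i q_{i+1}) \, f_B(q_i q_{i+1}, D) + \sum_{w \in \{2,4,8\}} \sum_{(q_i, q_j)} \lambda_{C_w}(q_i, q_j) \, f_{C_w}(q_i, q_j, D)

where f_U, f_B, and f_{C_w} are the unigram, ordered-bigram, and windowed co-occurrence features, and the λ's are pair-specific weights.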

  11. • The discriminative function is defined by the cross-entropy of the query language model and the document language model • We simply use maximum likelihood (ML) estimation for the query model, and Dirichlet smoothing for the document language model (R is U, B, or C_w).
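
In outline (assumed standard forms, not copied from the slide), each representation R contributes a cross-entropy score between the query and document models, with Dirichlet smoothing on the document side:

Score_R(Q, D) = \sum_{t \in V_R} P_{ML}(t \mid \theta_Q^R) \, \log P(t \mid \theta_D^R), \qquad P(t \mid \theta_D^R) = \frac{c(t; D) + \mu_R P(t \mid C_R)}{|D|_R + \mu_R}

where c(t; D) is the count of t in D, |D|_R is the document length under representation R, and P(t | C_R) is the collection language model.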

  12. 4. Parameter Estimation

  13. Estimation: Dirichlet Prior (μ) • In the document language model estimation, we use different window sizes W = {2, 4, 8} • We therefore have the following priors: μ_U, μ_B, μ_C2, μ_C4, μ_C8 • Intuitively, and as confirmed by our preliminary experiments, a longer document representation (e.g., C8) leads to higher sparsity, which requires a larger μ • The μ's are set roughly proportional to the document length under each representation (U, B, C2, C4, C8): μ_U = 1000, μ_B = 1000, μ_C2 = 1000, μ_C4 = 3000, μ_C8 = 7000.
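
As a small illustration only, these priors can be kept as a simple lookup table (the dictionary form is ours, not the authors'):

# Dirichlet priors per representation; values taken from the slide above.
dirichlet_mu = {"U": 1000, "B": 1000, "C2": 1000, "C4": 3000, "C8": 7000}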

  14. Estimation: Dependency Strength (λ's) • Learning process: • For each bigram and co-occurrence x_i in the training queries, use a coordinate ascent search algorithm to find its best weight λ*(x_i) • Extract a group of features f(x_i) for each x_i, giving the training data {(f(x_i), λ*(x_i))} • Train SVM models for B, C2, C4, and C8 respectively • For a new bigram or co-occurrence y in a query, we create its list of features and determine its weight using the SVM (see the sketch after this slide)
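
A minimal sketch of the weight-learning step described above, using scikit-learn's epsilon-SVR as a stand-in for the SVM regression; the function names, feature arrays, and coordinate-ascent targets are assumptions for illustration, not the authors' implementation:

# Train one SVR model per dependency type (B, C2, C4, C8) to map a
# feature vector describing a character pair to its dependency weight.
import numpy as np
from sklearn.svm import SVR

def train_weight_model(features, target_weights):
    # features: (n_pairs, n_features) array, one row per bigram/co-occurrence.
    # target_weights: best lambda*(x_i) found by coordinate ascent on training queries.
    model = SVR(kernel="rbf", epsilon=0.1)  # epsilon-SVR regression
    model.fit(features, target_weights)
    return model

def predict_weight(model, feature_vector):
    # Predict the dependency weight lambda for a new term pair in a query.
    return float(model.predict(np.asarray(feature_vector).reshape(1, -1))[0])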

  15. • We use epsilon Support Vector Regression (ε-SVR) for SVM training • We use the following features: • Pointwise mutual information (PMI) in an independent text collection: PMI_all(x) • PMI in the current test collection: PMI_coll(x) • A binary value from the test PMI_all(x) > threshold? • A binary value from the test PMI_coll(x) > threshold? • idf(x) - idf(q_i) - idf(q_j) • (idf(x) - idf(q_i) - idf(q_j)) / (idf(q_i) + idf(q_j)) • Does x appear as a Chinese Wikipedia title? • The distance between q_i and q_j • … • In our experiments, we use 10-fold cross-validation: each 1/10 of the data is used in turn as test data while the remaining 9/10 serve as training data (a PMI sketch follows).
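
A minimal sketch of the PMI features listed above; the count arguments are placeholders for collection statistics and are assumptions, not the authors' code:

import math

def pmi(count_pair, count_qi, count_qj, total_tokens):
    # Pointwise mutual information: log P(qi, qj) / (P(qi) * P(qj)).
    p_pair = count_pair / total_tokens
    p_qi = count_qi / total_tokens
    p_qj = count_qj / total_tokens
    if p_pair == 0 or p_qi == 0 or p_qj == 0:
        return float("-inf")
    return math.log(p_pair / (p_qi * p_qj))

# The binary features are simple thresholds on this value, e.g.
# int(pmi(...) > threshold), computed on either the independent
# collection (PMI_all) or the test collection (PMI_coll).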

  16. 5. Experiment and Discussion

  17. Experimental Setting • We convert all characters into GB-encoded simplified Chinese • Chinese texts are segmented into words by the ICTCLAS and LDC segmentation programs • We use Indri to build the indexes U, W, B, WU, BU, W+U, and B+U.

  18. The baselines (MAP) of traditional Chinese IR models • U: unigram; B: bigram; W: words • BU: bigrams and unigrams mixed in a single index • B+U: the scores of B and U are interpolated • WU: words and unigrams mixed in a single index • W+U: the scores of W and U are interpolated

  19. The baselines of dependency models: MRF-SD and WSD (†: t-test p < .05; ‡: t-test p < .01)

  20. The results of our VDM • Our model (VDM) outperforms all the baseline methods except in two cases; many of the improvements are statistically significant • With ideal parameters, the model largely outperforms the existing models.

  21. Examples of ideal variable weights • The slide shows per-pair weights for bigram (bi) and co-occurrence (co2, co4, co8) dependencies on sample queries, e.g., “impact of the 1986 immigration law” and 中东和平会议 (Middle East peace conference); the original figure is not reproduced in this transcript.

  22. 6. Conclusion and Future Work

  23. Conclusion • We propose a model that takes into account the relationships between different types of index units • In our model, a pair of characters contributes to the retrieval model according to the strength and usefulness of its dependency for IR • The assignment of variable weights to pairs of characters has not been investigated in previous studies • Our experiments showed that integrating term dependencies with variable weights can lead to higher effectiveness • The model proposed in this paper points to an interesting direction for future research: the integration of dependencies according to their usefulness in IR.

  24. Future Work • We have not exploited all the potential of the model; several aspects could be further improved: • Dependencies between pairs of characters could be extended to larger groups of characters • A larger amount of training data (such as user click-through data) could be used to learn the weights more accurately.

  25. Questions? Thanks
