1 / 15

Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese

Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese. Teruko Mitamura Mengqiu Wang Hideki Shima Frank Lin In CMU EACL 2006. Introduction. JAVELIN is a modular and extensible architecture for building QA systems.

hija
Télécharger la présentation

Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese Teruko Mitamura Mengqiu Wang Hideki Shima Frank Lin In CMU EACL 2006

  2. Introduction • JAVELIN is a modular and extensible architecture for building QA systems. • Since JAVELIN is language independent, this paper extend the original English version for CLQA in Chinese and Japanese.

  3. JAVELIN • The JAVELIN architecture: • Question Analyzer • Producing keywords • Retrieval Strategist • Finding documents • Information Extractor • Extracting answers • Answer Generator • Ranking answers

  4. JAVELIN Extension • Extension for Cross-Lingual QA:

  5. Question Analyzer • Input questions are processed using the RASP parser (Korhonen and Briscoe, 2004). • The module output contains three main components: • selected keywords • the answer type (e.g. numeric-expression, person-name, location) • the answer subtype (e.g. author, river, city) • The QA module is extended with a keyword translation sub-module.

  6. Translation Module • The TM selects the best combination of translated keywords from several sources: • Machine Readable Dictionaries (MRDs) • Machine Translation systems (MTs) • Web-mining-Based Keyword Translators (WBMTs) (Nagata et al., 2001, Li et al., 2003). • For E-to-J translation, two MRDs, eight MTs and one WBMT are used. • If none of them return a translation, the word is transliterated into kana for Japanese. • For E-to-C translation, one MRD, three MTs and one WBMT are used.

  7. Translation Module • After gathering all possible translations for every keyword, the TM uses a noisy channel model to select the best combination of translated keywords: • The TM estimates model statistics using the World Wide Web:

  8. TM: Smoothing • When the target keyword co-occurrence count with n keywords is below a set threshold, a moving window of size n-1 would “move” through the keywords in sequence. • If the co-occurrence count of all of these sets is above the threshold, return the product of the score of these sets as the language model score. • If not, decrease the window and repeat until either all of the split sets are above the threshold or n = 1. • Drawbacks: • "Moving-window smoothing" assumes that keywords that are next to each other are also more semantically related, which may not always be the case. • "Moving-window smoothing" tends to give the keywords near the middle of the question more weight, which may not be desirable.

  9. TM: Pruning • Early Pruning: • Possible translations of the individual keywords are pruned before being combined. • A very simple pruning heuristic via a word frequency list is used. Very rare translations produced by a resource are not considered. • Late Pruning: • Possible translation candidates of the entire set of keywords are pruned after calculating translation probabilities. • Since the calculation of the translation probabilities requires little access to the web, we can calculate only the language model score for the top N candidates with the highest translation score and prune the rest.

  10. Retrieval Strategist • Lemur 3.0 toolkit (Ogilvie and Callan, 2001) is used. • Lemur supports structured queries using operators such as Boolean AND, Synonym, Ordered/Un-Ordered Window and NOT. • For example: • The RS uses an incremental relaxation technique: • It starts from an initial query that is highly constrained. • The algorithm searches for all the keywords and data types in close proximity to each other. • The priority is based on a function of the likely answer type, keyword type (word, proper name, or phrase) and the inverse document frequency of each keyword. • The query is gradually relaxed until the desired number of relevant documents is retrieved.

  11. Information Extractor • The Light IX module: • The algorithm considers only those terms that are tagged as named entities which match the desired answer type. • E-C uses character-based tokenization and the linear Dist measure. • E-J uses word-based tokenization and logarithmic Dist measure.

  12. Answer Generator • The task of the AG module is to produce a ranked list of answer candidates from the IX output. • The AG is designed to resolve representational differences and combine answer candidates that differ only in surface form. • Even though the AG module plays an important role in JAVELIN, its full potential is not used in the E-C and E-J systems, since some language-specific resources required for multilingual answer merging are lacked.

  13. Experiment Settings • Three different runs were carried out for both the E-C and E-J systems. • The same 200-question test set and the document corpora provided by the NTCIR CLQA task is used. • The first run: • A fully automatic run using the original translation module in the CLQA system. • The second run: • The keywords selected by the Question Analyzer module are manually translated. • The third run: • The corresponding term for each English keyword in the translations for the English questions provided by NTCIR organizers is looked up.

  14. Experiment Results

  15. Discussions • Chinese: • Word sense disambiguation for keywords • e.g. “埋” and “葬” • Regional language difference • e.g. “软件” and “軟體” • Japanese: • Representational gaps • e.g. “ヴェルナー ・ シュピース” and “ヴェルナー シュピース”. • This problem can be resolved by Lemur. • Transliteration in WBMT • Japanese nouns written in romaji (e.g. Funabashi) are transliterated them into hiragana for a better result in WBMT. • Since there may be higher positive co-occurrence between kana and kanji (i.e. “ふなばし” and “船橋”) than between romaji and kanji (i.e. “funabashi” and “船橋”). • Document Retrieval in Kana • Transliteration from kana into kanji is more ambiguous than transliteration from romanji into kana. • Therefore, indexing kana readings in the corpus and querying in kana is sometimes a useful technique for CLQA.

More Related