Kernel Canonical Correlation Analysis: Cross-language information retrieval. Blaz Fortuna, JSI, Slovenia
Input Two different views of the same data: • Text documents written in different languages • Images with attached text • …
Goal Find pairs of features from both views with the highest correlation • Example: words that co-occur in a document and its translation • Auto, Fahrzeug, … ↔ car, vehicle, … • Fleisch, Hähnchen, Rindfleisch, Schweinerne, … ↔ meat, chicken, beef, pork, …
Theory behind CCA • Documents are represented by pairs of vectors, one for each view • The result of CCA is a set of basis vectors for each view such that the correlation between the projections of the variables onto these basis vectors is mutually maximized
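The idea above can be sketched for the linear case as follows. This is a minimal illustration assuming NumPy, not the implementation behind the slides; the function name `linear_cca` and the small ridge term `reg` are assumptions added for numerical stability.

```python
import numpy as np

def linear_cca(X, Y, reg=1e-6):
    """Return the first pair of CCA basis vectors and their correlation.

    X is (n, p) and Y is (n, q); row i of X and row i of Y describe
    the same document in the two views.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Within-view and cross-view covariance matrices (reg is an assumption).
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view with its Cholesky factor; the SVD of the whitened
    # cross-covariance then gives the canonical directions.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T)
    wx = np.linalg.solve(Lx.T, U[:, 0])   # basis vector for the first view
    wy = np.linalg.solve(Ly.T, Vt[0])     # basis vector for the second view
    return wx, wy, s[0]
```

Projecting the centred data onto `wx` and `wy` gives two one-dimensional variables whose correlation equals the returned value, up to the effect of `reg`.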
Kernelisation of CCA • The method can be rewritten so that feature vectors appear only inside inner products • A kernel can then be used to compute the inner products • Input documents do not need to be vectors (e.g. text documents together with a string kernel)
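A minimal sketch of the kernelised version, assuming NumPy. The slides do not say which formulation or regulariser was used; the code below follows one common regularised form, in which the dual coefficients `alpha` solve (Kx + kappa I)^-1 Ky (Ky + kappa I)^-1 Kx alpha = lambda^2 alpha. The names `center_kernel`, `kcca` and `kappa` are illustrative assumptions.

```python
import numpy as np

def center_kernel(K):
    """Centre a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=0.1, n_components=2):
    """Regularised kernel CCA on two (n, n) kernel matrices.

    Only kernel values are needed, so the inputs behind Kx and Ky
    do not have to be vectors. Returns dual coefficients alpha, beta
    (one column per component) and the corresponding lambda values.
    """
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    I = np.eye(n)
    # M = (Kx + kappa I)^-1 Ky (Ky + kappa I)^-1 Kx
    M = np.linalg.solve(Kx + kappa * I, Ky) @ np.linalg.solve(Ky + kappa * I, Kx)
    evals, evecs = np.linalg.eig(M)
    # Keep the components with the largest eigenvalues (lambda squared).
    order = np.argsort(-evals.real)[:n_components]
    lam = np.sqrt(np.clip(evals.real[order], 0.0, None))
    alpha = evecs.real[:, order]
    # beta follows from alpha in the same formulation.
    beta = np.linalg.solve(Ky + kappa * I, Kx @ alpha) / lam
    return alpha, beta, lam
```

A training document is then mapped into the shared space by multiplying its row of the centred kernel matrix with `alpha` (first view) or `beta` (second view).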
Cross-Language Text Mining • KCCA constructs a language-independent representation for text documents • Good part: documents from different languages can be compared using this representation • Bad part: a paired dataset is needed for training (this can be avoided by using machine translation tools)
KCCA and LSI • LSI discovers the statistically most significant co-occurrences of terms in documents • When a word appears in a document, which other words usually appear as well? • KCCA matches terms from the first language with terms from the second language based on co-occurrences • When a word appears in a document, which words appear in that document's translation?
Text document retrieval • Query databases of multilingual documents • Documents from the database and the query are transformed into the language-independent representation • Retrieval by nearest-neighbour search in that representation
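The nearest-neighbour step can be sketched as below, assuming NumPy and assuming that the query and the documents have already been projected into the shared representation. The slides do not name the distance measure; cosine similarity is used here as an assumption, and `retrieve` is a hypothetical helper.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=10):
    """Return indices and similarities of the top_k nearest documents.

    query_vec: (d,) projected query; doc_vecs: (n, d) projected documents.
    """
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = D @ q                       # cosine similarity to every document
    order = np.argsort(-sims)[:top_k]  # most similar first
    return order, sims[order]
```

Because both languages share one representation, the same function serves an English query over French documents and the reverse.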
Experiments • 36th Canadian Parliament proceedings corpus • Part of the documents was used for training • For testing, the 5 most relevant keywords were extracted from each document and used as a query • Table: retrieval accuracy for English queries over French documents (top-ranked / top-ten-ranked) [%]
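One plausible reading of the top-ranked / top-ten-ranked measure is the percentage of queries whose paired document appears first, or among the first ten, in the ranking. The sketch below computes that reading; it is an assumption about the metric, and `top_k_accuracy` is a hypothetical helper.

```python
def top_k_accuracy(rankings, targets, k):
    """Percentage of queries whose target document is within the top k.

    rankings: one ranked list of document ids per query (best first).
    targets:  the id of the document each query was extracted from.
    """
    hits = sum(1 for ranked, target in zip(rankings, targets)
               if target in ranked[:k])
    return 100.0 * hits / len(targets)
```

Calling it with `k=1` and `k=10` gives the two figures of such a table.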
Text categorization • Categorize multilingual documents • All documents are transformed into the language-independent representation • A classifier is trained on the transformed labelled documents
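The setup above can be illustrated with any classifier that works on the projected vectors. The sketch below uses a simple nearest-centroid classifier as a stand-in, assuming NumPy; it is not the classifier used in the experiments, and `CentroidClassifier` is a hypothetical name.

```python
import numpy as np

class CentroidClassifier:
    """Assign each document to the class with the closest mean vector."""

    def fit(self, Z, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per class in the language-independent space.
        self.centroids_ = np.stack([Z[y == c].mean(axis=0)
                                    for c in self.classes_])
        return self

    def predict(self, Z):
        # Squared Euclidean distance from every document to every centroid.
        d = ((Z[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]
```

Training on projected documents from one language and calling `predict` on projected documents from the other is possible because both live in the same space.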
Experiments • NTCIR-3 patent retrieval test collection • Japanese – English • SVM trained on English documents • Tested on both the Japanese and the English documents • Table: average precision [%]
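For reference, average precision over a ranked list can be computed as sketched below. This is the standard definition, assumed here to be the one behind the reported figures; `average_precision` is a hypothetical helper and the ranking would come from the classifier scores.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list.

    ranked_relevance: booleans, best-scored document first, True where
    the document belongs to the category.
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this hit
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Multiplying the result by 100 gives the value in percent.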
Image-Text Retrieval • Retrieval of images based on a text query • No labels associated with images • Paired dataset: an image retrieved from the internet and the text on the web page where the image appeared
Experiments • A database of images is queried with text queries • Images were split into three clusters • The 10 or 30 images that best match the query are retrieved • In the first test, success means the retrieved images have the same label as the query • In the second test, success means the image that was actually paired with the query text is retrieved
Images retrieved for the text query: "height: 6-11 weight: 235 lbs position: forward born: september 18, 1968, split, croatia college: none"
Images retrieved for the text query: "at phoenix sky harbor on july 6, 1997. 757-2s7, n907wa phoenix suns taxis past n902aw teamwork america west america west 757-2s7, n907wa phoenix suns taxis past n901aw arizona at phoenix sky harbor on july 6, 1997."
Future work • Use machine translation to build the paired dataset • Experiments with the SVEZ-IJS English-Slovene ACQUIS Corpus • Sparse version of KCCA