  1. Latent Semantic Mapping: Dimensionality Reduction via Globally Optimal Continuous Parameter Modeling Jerome R. Bellegarda

  2. Outline • Introduction • LSM • Applications • Conclusions

  3. Introduction • LSA in IR: • Words of queries and documents • Recall and precision • Assumption: There is some underlying latent semantic structure in the data • Latent structure is conveyed by correlation patterns • Documents: bag-of-words model • LSA improves separability among different topics

  4. Introduction

  5. Introduction • Success of LSA: • Word clustering • Document clustering • Language modeling • Automated call routing • Semantic Inference for spoken interface control • These solutions all leverage LSA’s ability to expose global relationships in context and meaning

  6. Introduction • Three unique factors of LSA: • The mapping of discrete entities • The dimensionality reduction • The intrinsically global outlook • The terminology is changed to latent semantic mapping (LSM) to convey increased reliance on these general properties

  7. Latent Semantic Mapping • LSA defines a mapping between the discrete sets M, N and a continuous vector space L: • M: an inventory of M individual units, such as words • N: a collection of N meaningful compositions of units, such as documents • L: a continuous vector space • ri: a unit in M • cj: a composition in N

  8. Feature Extraction • Construction of a matrix W of co-occurrences between units and compositions • The cell of W combines a local count with a global weight: wij = (1 − εi) · cij / nj, where cij is the count of unit ri in composition cj, nj is the total number of units in cj, and εi is the normalized entropy of ri

  9. Feature Extraction • The normalized entropy of ri: εi = −(1 / log N) Σj (cij / ti) log(cij / ti), with ti = Σj cij • A value of entropy close to 0 means that the unit is present only in a few specific compositions • The global weight 1 − εi is therefore a measure of the indexing power of the unit ri
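
For concreteness, here is a minimal sketch of this feature extraction step in Python with numpy. It assumes the standard LSM weighting described above (cell value wij = (1 − εi) · cij / nj, with εi the normalized entropy of unit ri); the toy corpus and function names are purely illustrative.

    import numpy as np

    def build_lsm_matrix(compositions, vocabulary):
        """Build the M x N unit-composition matrix W with entropy weighting."""
        M, N = len(vocabulary), len(compositions)
        index = {u: i for i, u in enumerate(vocabulary)}

        # Raw counts c_ij: occurrences of unit i in composition j
        counts = np.zeros((M, N))
        for j, comp in enumerate(compositions):
            for unit in comp:
                counts[index[unit], j] += 1

        n_j = counts.sum(axis=0)                  # length of each composition
        t_i = counts.sum(axis=1, keepdims=True)   # total count of each unit

        # Normalized entropy eps_i of each unit across the compositions
        p = np.divide(counts, t_i, out=np.zeros_like(counts), where=t_i > 0)
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        eps = -plogp.sum(axis=1) / np.log(N)

        # Cell value: global weight (1 - eps_i) times the normalized local count
        return (1.0 - eps)[:, None] * counts / n_j

    # Toy example (hypothetical data)
    docs = [["latent", "semantic", "mapping"],
            ["semantic", "classification", "of", "documents"],
            ["junk", "mail", "filtering"]]
    vocab = sorted({u for d in docs for u in d})
    W = build_lsm_matrix(docs, vocab)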

  10. Singular Value Decomposition • The MxN unit-composition matrix W defines two vector representations for the units and the compositions • ri: a row vector of dimension N • cj: a column vector of dimension M • Impractical: • M, N can be extremely large • The vectors ri, cj are typically sparse • The two spaces are distinct from each other

  11. Singular Value Decomposition • Employ the (truncated) SVD: W ≈ Ŵ = U S V^T • U: MxR left singular matrix with row vectors ui • S: RxR diagonal matrix of singular values • V: NxR right singular matrix with row vectors vj • U, V are column-orthonormal: U^T U = V^T V = I_R • R < min(M, N)
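
A sketch of this truncated decomposition with numpy; the retained rank R is an illustrative choice, not a value from the slides.

    import numpy as np

    def lsm_svd(W, R):
        """Rank-R SVD of the unit-composition matrix: W ~ U S V^T."""
        U_full, s_full, Vt_full = np.linalg.svd(W, full_matrices=False)
        U = U_full[:, :R]          # M x R, row vectors u_i
        S = np.diag(s_full[:R])    # R x R diagonal matrix of singular values
        V = Vt_full[:R, :].T       # N x R, row vectors v_j
        return U, S, V

    # U, S, V = lsm_svd(W, R=2)   # W from the feature-extraction sketch above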

  12. Singular Value Decomposition

  13. Singular Value Decomposition • Ŵ = U S V^T captures the major structural associations in W and ignores higher-order effects • The closeness of vectors in L supports three kinds of comparison: • Unit-unit comparison • Composition-composition comparison • Unit-composition comparison

  14. Closeness Measure • W W^T: co-occurrences between units • W^T W: co-occurrences between compositions • ri, rj close: units which have a similar pattern of occurrence across the compositions • ci, cj close: compositions which have a similar pattern of occurrence across the units

  15. Closeness Measure • Unit-Unit Comparisons: • Cosine measure: K(ri, rj) = cos(ui S, uj S) • Distance: arccos of the cosine, in [0, π]

  16. Unit-Unit Comparisons

  17. Closeness Measure • Composition-Composition Comparisons: • Cosine measure: K(ci, cj) = cos(vi S, vj S) • Distance: arccos of the cosine, in [0, π]

  18. Closeness Measure • Unit-Composition Comparisons: • Cosine measure: K(ri, cj) = cos(ui S^(1/2), vj S^(1/2)) • Distance: arccos of the cosine, in [0, π]
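
The three comparisons can be written compactly as below. This sketch assumes the scalings commonly used in LSA/LSM (ui S for units, vj S for compositions, S^(1/2) for the mixed case), since the exact expressions on the slides are not reproduced in the transcript.

    import numpy as np

    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def unit_unit(U, S, i, j):
        """Closeness of units r_i and r_j: cosine of u_i S and u_j S."""
        return _cos(U[i] @ S, U[j] @ S)

    def comp_comp(V, S, i, j):
        """Closeness of compositions c_i and c_j: cosine of v_i S and v_j S."""
        return _cos(V[i] @ S, V[j] @ S)

    def unit_comp(U, V, S, i, j):
        """Closeness of unit r_i and composition c_j: cosine of u_i S^(1/2) and v_j S^(1/2)."""
        S_half = np.sqrt(S)
        return _cos(U[i] @ S_half, V[j] @ S_half)

    # The associated distance is the arccosine of the measure, in [0, pi]:
    # dist = np.arccos(np.clip(unit_unit(U, S, 0, 1), -1.0, 1.0))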

  19. LSM Framework Extension • Observe a new composition c̃p, p > N; the tilde reflects the fact that the composition was not part of the original N • c̃p, a column vector of dimension M, can be thought of as an additional column of the matrix W • U, S do not change: ṽp = c̃p^T U S^(-1)

  20. LSM Framework Extension • c̃p: pseudo-composition • ṽp: pseudo-composition vector • If the addition of c̃p causes the major structural associations in W to shift in some substantial manner, the singular vectors will become inadequate.

  21. LSM Framework Extension • It would then be necessary to re-compute the SVD to find a proper representation for the updated collection
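
A sketch of folding a new composition into the existing space, assuming the usual projection ṽp = c̃p^T U S^(-1) with U and S held fixed.

    import numpy as np

    def fold_in(c_tilde, U, S):
        """Project a new composition into L without recomputing the SVD.

        c_tilde: M-dimensional vector of weighted counts, built exactly like
        a column of W (same entropy weights). The result plays the role of
        an extra row of V; U and S are left unchanged.
        """
        return c_tilde @ U @ np.linalg.inv(S)

    # If the new composition shifts the major structural associations in W,
    # this approximation degrades and the SVD must be recomputed.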

  22. Salient Characteristics of LSM • A single vector embedding for both units and compositions in the same continuous vector space L • A relatively low dimensionality, which makes operations such as clustering meaningful and practical • An underlying structure reflecting globally meaningful relationships, with natural similarity metrics to measure the distance between units, between compositions, or between units and compositions in L

  23. Applications • Semantic classification • Multi-span language modeling • Junk e-mail filtering • Pronunciation modeling • TTS Unit Selection

  24. Semantic Classification • Semantic classification refers to determining which one of several predefined topics a given document is most closely aligned with • The centroid of each cluster can be viewed as the semantic representation of that outcome in LSM space • Semantic anchor • A newly observed word sequence is classified by computing the distance between its document vector and each semantic anchor, and picking the minimum
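
A sketch of this procedure with hypothetical helper names: the anchors are centroids of each topic's training documents in LSM space (doc_vectors stands for the scaled composition vectors, e.g. the rows of V S), and a new document is folded in and assigned to the closest anchor (minimum distance = maximum cosine).

    import numpy as np

    def semantic_anchors(doc_vectors, labels):
        """Centroid of each topic's composition vectors (the semantic anchors)."""
        anchors = {}
        for topic in set(labels):
            rows = [v for v, t in zip(doc_vectors, labels) if t == topic]
            anchors[topic] = np.mean(rows, axis=0)
        return anchors

    def classify(doc_vector, anchors):
        """Assign a folded-in document vector to the closest semantic anchor."""
        def cosine(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return max(anchors, key=lambda topic: cosine(doc_vector, anchors[topic]))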

  25. Semantic Classification • Domain knowledge is automatically encapsulated in the LSM space in a data-driven fashion • For Desktop interface control: • Semantic inference

  26. Semantic Inference

  27. Multi-Span Language Modeling • In a standard n-gram, the history is the string of the n−1 preceding words • In LSM language modeling, the history is the current document up to word wq−1 • Pseudo-document d̃q−1: continually updated as q increases

  28. Multi-Span Language Modeling • An integrated n-gram + LSM formulation for the overall language model probability (see the sketch below) • Different syntactic constructs can be used to carry the same meaning (content words)

  29. Multi-Span Language Modeling • Assume that the probability of the document history given the current word is not affected by the immediate context preceding it
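
A sketch of the integrated probability, assuming the multi-span combination in which the n-gram prediction is modulated by P(d̃q−1 | wq) and renormalized over the vocabulary; both component models are passed in as hypothetical callables.

    def multispan_prob(word, ngram_history, pseudo_doc, vocab, ngram_prob, lsm_prob):
        """P(w_q | H_{q-1}) combining an n-gram model with the LSM component.

        ngram_prob(w, history) and lsm_prob(pseudo_doc, w) are hypothetical
        callables supplied by the n-gram model and the LSM space respectively.
        """
        def score(w):
            return ngram_prob(w, ngram_history) * lsm_prob(pseudo_doc, w)

        denom = sum(score(w) for w in vocab)   # renormalize over the vocabulary
        return score(word) / denom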

  30. Multi-Span Language Modeling

  31. Junk E-mail Filtering • It can be viewed as a degenerate case of semantic classification with two categories: • Legitimate • Junk • M: an inventory of words and symbols • N: a collection of e-mail messages labeled with the two categories • Two semantic anchors
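
Reusing the classification sketch above, filtering reduces to two anchors (all names hypothetical):

    # labels take only two values, e.g. "legitimate" and "junk"
    anchors = semantic_anchors(doc_vectors, labels)
    verdict = classify(fold_in(message_counts, U, S) @ S, anchors)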

  32. Pronunciation Modeling • Also called grapheme-to-phoneme conversion (GPC) • Orthographic anchors (one for each in-vocabulary word) • Orthographic neighborhood: the in-vocabulary words with high closeness to an out-of-vocabulary word
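
A sketch of the orthographic-neighborhood idea with hypothetical names: units are letter strings, compositions are in-vocabulary word spellings, and an out-of-vocabulary word is folded in to find its closest orthographic anchors.

    import numpy as np

    def orthographic_neighborhood(oov_vector, anchor_vectors, k=5):
        """Return the k in-vocabulary words closest to an out-of-vocabulary word.

        oov_vector: folded-in LSM vector for the OOV spelling
        anchor_vectors: dict mapping each in-vocabulary word to its orthographic anchor
        """
        def cosine(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        ranked = sorted(anchor_vectors,
                        key=lambda w: cosine(oov_vector, anchor_vectors[w]),
                        reverse=True)
        return ranked[:k]

    # The pronunciations of these neighbors can then be used to propose a
    # pronunciation for the out-of-vocabulary word.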

  33. Pronunciation Modeling

  34. Conclusions • Descriptive power: forgoing local constraints is not acceptable in some situations • Domain sensitivity: depends on the quality of the training data; polysemy • Updating the LSM space: SVD on the fly is not practical • The success of LSM stems from its three salient characteristics
