
Logistics


Presentation Transcript


  1. Logistics • Course reviews • Project report deadline: March 16 • Poster session guidelines: 2.5 minutes per poster (3 hrs / 55 posters, minus overhead) • presentations will be videotaped • food will be provided

  2. Task: Named-Entity Recognition in new corpus

  3. Named-Entity Recognition • Fragment of an example sentence, one label per word: Julian/PER Assange/PER accused/Other the/Other United/LOC

  4. NER as Machine Learning • Fragment of an example sentence: Julian/PER Assange/PER accused/Other the/Other United/LOC • Yi: word label ∈ {Other, LOC, PER, ORG} • Xi: some feature representation of the word

  5. Feature Vector: Three Choices • Words: current word • Context: current word, previous word, next word • Features: current word, previous word, next word; is the word capitalized?; "word shape" (compact summary of orthographic information, like internal digits and punctuation); prefixes up to length 5, suffixes up to length 5; any word in a +/- six word window (*not* differentiated by position the way previous word and next word are)
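
A minimal sketch of what the third ("Features") choice could look like for one word position. The function names and the exact shape-collapsing rule are illustrative assumptions, not the instructor's implementation:

```python
def word_shape(w):
    """Compact orthographic summary: uppercase -> X, lowercase -> x, digit -> d,
    other characters kept as-is; consecutive repeats collapsed ('Assange' -> 'Xx')."""
    shape = "".join("X" if c.isupper() else "x" if c.islower()
                    else "d" if c.isdigit() else c for c in w)
    out = []
    for c in shape:
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)

def features(words, i, window=6):
    """Feature dict for position i, following the 'Features' bullet on slide 5."""
    w = words[i]
    f = {
        "word=" + w: 1,
        "prev=" + (words[i - 1] if i > 0 else "<S>"): 1,
        "next=" + (words[i + 1] if i + 1 < len(words) else "</S>"): 1,
        "capitalized": int(w[:1].isupper()),
        "shape=" + word_shape(w): 1,
    }
    for k in range(1, 6):                      # prefixes and suffixes up to length 5
        f["prefix=" + w[:k]] = 1
        f["suffix=" + w[-k:]] = 1
    for u in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
        f["window_word=" + u] = 1              # +/- 6 window, not position-specific
    return f

print(features("Julian Assange accused the United".split(), 1))
```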

  6. Discriminative vs Generative I • [Diagram: two single-word models for labeling "Assange", one generative and one discriminative, each relating the label Y to the features Previous=Julian, POS=noun, Capitalized=1]

  7. Generative vs Discriminative I • 10K training words from CoNLL (British newswire), looking only for PERSON • Metric: F1 • [Chart: F1 scores 81.5, 70.8, 65.5, 59.1, 52.8, 51.3]

  8. Do More Features Always Help? • How do we evaluate multiple feature sets? • On validation set, not test set! • Detecting underfitting • Train & test performance similar and low • Detecting overfitting • Train performance high, test performance low • The same holds every time we want to consider models of varying complexity!
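
A small sketch of the evaluation discipline on slide 8: compare candidate feature sets on a validation set, and read train vs. validation scores to spot under- or overfitting. The sklearn classifier and the macro-F1 choice are our assumptions, not part of the lecture:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def evaluate_feature_set(train_X, train_y, val_X, val_y):
    """train_X / val_X: lists of per-word feature dicts; *_y: per-word labels."""
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train_X), train_y)
    train_f1 = f1_score(train_y, clf.predict(vec.transform(train_X)), average="macro")
    val_f1 = f1_score(val_y, clf.predict(vec.transform(val_X)), average="macro")
    # Both scores low            -> underfitting: features too weak.
    # Train high, validation low -> overfitting: too many / too specific features.
    return train_f1, val_f1
```

The winning feature set is the one with the best validation F1; the test set is touched only once, at the very end.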

  9. Sequential Modeling • Fragment of an example sentence: Julian/PER Assange/PER accused/Other the/Other United/LOC • Yi: random variable with domain {Other, LOC, PER, ORG} • Xi: random variable for the vector of features about the word

  10. Hidden Markov Model (HMM) • [Diagram: label chain Y1-Y5 with observations X1-X5 for the words "Julian Assange accused the United"]

  11. Hidden Markov Model (HMM) • [Diagram: the same HMM chain Y1-Y5 / X1-X5 over "Julian Assange accused the United"]

  12. Hidden Markov Model (HMM) • [Diagram: same chain, with X2 (the observation for "Assange") expanded into the features Previous=Julian, Capitalized=1, POS=noun]
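
To make the chain concrete, a minimal Viterbi decoder for an HMM over these label sequences; the log-probability tables are assumed to be estimated separately (e.g. by the closed-form HMM learning mentioned on slide 19), and the variable names are ours:

```python
import numpy as np

LABELS = ["Other", "LOC", "PER", "ORG"]

def viterbi(log_init, log_trans, log_emit):
    """MAP label sequence for one sentence.
    log_init[s]     : log P(Y1 = s)
    log_trans[s, t] : log P(Y_{i+1} = t | Y_i = s)
    log_emit[i, s]  : log P(X_i | Y_i = s), one row per word
    """
    n, S = log_emit.shape
    delta = np.full((n, S), -np.inf)   # best log-score of any path ending in state s at position i
    back = np.zeros((n, S), dtype=int)
    delta[0] = log_init + log_emit[0]
    for i in range(1, n):
        scores = delta[i - 1][:, None] + log_trans        # rows index the previous state
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + log_emit[i]
    path = [int(delta[-1].argmax())]
    for i in range(n - 1, 0, -1):                         # backtrack
        path.append(int(back[i, path[-1]]))
    return [LABELS[s] for s in reversed(path)]
```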

  13. Advantage of Sequential Modeling • [Chart: F1 scores 70.8, 70.8, 61.8, 59.1, 57.4, 51.3] • Reminder: Plain logistic regression gives us 81.5!

  14. Max Entropy Markov Model (MEMM) • Markov chain over the Yi's • Each Yi has a logistic-regression CPD given its parents Yi-1 and Xi • [Diagram: chain Y1-Y5 with observations X1-X5 for "Julian Assange accused the United"; X2 expanded into Previous=Julian, Capitalized=1, POS=noun]
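
A rough sketch of the MEMM as a single shared logistic regression for P(Yi | Yi-1, Xi), with the previous label injected as one more feature; greedy left-to-right decoding is used here for brevity (Viterbi over the same CPDs is the standard choice). The class and feature names are our own:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

class MEMM:
    def __init__(self):
        self.vec, self.clf = DictVectorizer(), LogisticRegression(max_iter=1000)

    def _augment(self, feats, prev_label):
        f = dict(feats)
        f["prev_label=" + prev_label] = 1        # Y_{i-1} enters the CPD as a feature
        return f

    def fit(self, sentences):
        """sentences: list of (list_of_feature_dicts, list_of_labels) pairs."""
        X, y = [], []
        for feats_seq, labels in sentences:
            prev = "<START>"
            for feats, label in zip(feats_seq, labels):
                X.append(self._augment(feats, prev))
                y.append(label)
                prev = label
        self.clf.fit(self.vec.fit_transform(X), y)

    def predict(self, feats_seq):
        prev, out = "<START>", []
        for feats in feats_seq:
            label = self.clf.predict(self.vec.transform([self._augment(feats, prev)]))[0]
            out.append(label)
            prev = label
        return out
```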

  15. Max Entropy Markov Model (MEMM) • Pro: uses features in a powerful way • Con: downstream evidence doesn’t help because of v-structures • [Diagram: the same MEMM chain Y1-Y5 / X1-X5 over "Julian Assange accused the United", with X2 expanded into Previous=Julian, Capitalized=1, POS=noun]

  16. MEMM vs HMM vs NB • [Chart: F1 scores 84.6, 68.3, 59.1] • Finally beat logistic regression!

  17. Conditional Random Field (CRF) • [Diagram: chain Y1-Y5 with observations X1-X5 for "Julian Assange accused the United"]
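
In contrast to the MEMM's locally normalized CPDs, the CRF is globally normalized. A small numpy sketch (our own notation: unary and trans are log-potentials built from the features) of the per-sentence conditional log-likelihood, using a log-space forward pass for the partition function:

```python
import numpy as np
from scipy.special import logsumexp

def crf_loglik(unary, trans, y):
    """log P(y | x) for a linear-chain CRF.
    unary[i, s] : score of label s at position i (a function of x)
    trans[s, t] : score of the transition s -> t
    y           : gold label indices, length n
    """
    n, S = unary.shape
    # Score of the gold labeling.
    gold = unary[np.arange(n), y].sum() + trans[y[:-1], y[1:]].sum()
    # Forward pass in log space for the partition function Z(x).
    alpha = unary[0]
    for i in range(1, n):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[i]
    return gold - logsumexp(alpha)
```

Gradient ascent on this quantity summed over sentences trains the CRF; note that the forward pass, i.e. inference, sits inside the objective, which is exactly the learning-cost tradeoff on slide 19.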

  18. Comparison: Sequence Models • [Chart: F1 scores 85.8, 84.6, 70.8, 70.2, 68.3, 61.8, 59.6, 59.1, 57.4]

  19. Tradeoffs in Learning I • HMM • Simple closed form solution • MEMM • Gradient ascent for parameters of logistic P(Yi | Xi) • But no inference required for learning • CRF • Gradient ascent for all parameters • Inference over entire graph required at each iteration
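
The HMM's "simple closed form solution" is just normalized counts; a minimal sketch with add-alpha smoothing (the smoothing is our own addition, not specified on the slide):

```python
from collections import Counter

def hmm_mle(sentences, labels, alpha=1.0):
    """sentences: list of (words, tags) pairs; labels: the tag set.
    Returns add-alpha smoothed transition and emission probabilities."""
    trans, emit, vocab = Counter(), Counter(), set()
    for words, tags in sentences:
        for prev, cur in zip(["<START>"] + list(tags[:-1]), tags):
            trans[(prev, cur)] += 1
        for w, t in zip(words, tags):
            emit[(t, w)] += 1
            vocab.add(w)
    states = ["<START>"] + list(labels)
    p_trans = {(s, t): (trans[(s, t)] + alpha) /
                       (sum(trans[(s, u)] for u in labels) + alpha * len(labels))
               for s in states for t in labels}
    p_emit = {(t, w): (emit[(t, w)] + alpha) /
                      (sum(emit[(t, v)] for v in vocab) + alpha * len(vocab))
              for t in labels for w in vocab}
    return p_trans, p_emit
```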

  20. Tradeoffs in Learning: II • Can we learn from unsupervised data? • HMM • Yes, using EM • MEMM/CRF • No • Discriminative objective: maximize log P(Y | X) • But if Y is not observed, we can’t maximize its probability

  21. PGMs and ML • PGMs deal well with predictions of structured objects (sequences, graphs, trees) • Exploit correlations between multiple parts of the prediction task • Can easily incorporate prior knowledge into model • Learned model can often be used for multiple prediction tasks • Useful framework for knowledge discovery

  22. Inference • Exact marginals? • Clique tree calibration gives all marginals • Final labeling might not be jointly consistent • Approximate marginals? • Doesn’t make sense in this context • MAP? • Gives single coherent solution • Hard to get ROC curves (tradeoff precision & recall)
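
A toy illustration (the numbers are ours) of the first bullet: labeling each variable by its own marginal can give an assignment that is jointly less probable than, and inconsistent with, the single MAP solution:

```python
import numpy as np

# Joint distribution over two binary labels (A, B); rows index A, columns index B.
P = np.array([[0.00, 0.40],
              [0.35, 0.25]])

a_marg, b_marg = P.sum(axis=1), P.sum(axis=0)
marg_label = (int(a_marg.argmax()), int(b_marg.argmax()))   # per-variable marginal maximizers
map_label = np.unravel_index(P.argmax(), P.shape)           # jointly most probable assignment

print(marg_label, P[marg_label])   # (1, 1) with probability 0.25
print(map_label, P[map_label])     # (0, 1) with probability 0.40
```

Here the marginal maximizers pick (1, 1), which has probability 0.25, while the MAP assignment (0, 1) has probability 0.40.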

  23. Mismatch of Objectives • MAP inference optimizes LL = log P(Y | X) • Actual performance metric is usually different (e.g., F1) • Performance is best if we can get these two metrics to be relatively well-aligned • If the MAP assignment gets significantly lower F1 than the ground truth, the model needs to be adjusted • Very useful for debugging approximate MAP: if LL(y*) >> LL(yMAP), the algorithm found a local optimum; if LL(y*) << LL(yMAP), LL is a bad surrogate for the objective
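
The debugging check on slide 23 compares two numbers per sentence; a self-contained sketch in our notation (unary and trans are the model's log-potentials, y_map the output of whatever approximate MAP decoder is being debugged):

```python
import numpy as np

def path_score(unary, trans, y):
    """Unnormalized log-score of one labeling; the partition function cancels
    when two labelings of the same sentence are compared."""
    y = np.asarray(y)
    return unary[np.arange(len(y)), y].sum() + trans[y[:-1], y[1:]].sum()

def diagnose(unary, trans, y_gold, y_map):
    """y_gold: ground-truth labels y*; y_map: labels from the approximate MAP decoder."""
    gap = path_score(unary, trans, y_gold) - path_score(unary, trans, y_map)
    if gap > 0:
        return "LL(y*) > LL(yMAP): the MAP algorithm likely found a local optimum"
    return "LL(y*) <= LL(yMAP): if F1 of y_map is still poor, LL is a bad surrogate"
```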

  24. Richer Models • [Diagram: two sequence-model fragments from the same document: Y1-Y5 / X1-X5 over "Julian Assange accused the United" and Y101-Y105 / X101-X105 over "said Stephen, Assange’s lawyer to"]

  25. Summary • Foundation I: Probabilistic model • Coherent treatment of uncertainty • Declarative representation: • separates model and inference • separates inference and learning • Foundation II: Graphical model • Encode and exploit structure for compact representation and efficient inference • Allows modularity in updating the model
