
11-761 Language and Statistics







  1. 11-761 Language and Statistics Spring 2016 Roni Rosenfeld http://www.cs.cmu.edu/~roni/11761-s16/

  2. Course Goals and Style
  • Teaching statistical foundations and techniques for language technologies
  • Plugging gaping holes in LTI/CS grad student education in probability, statistics and information theory

  3. Course Philosophy
  • Socratic method
    • will try to maintain it in spite of the large class size
    • participation strongly encouraged (please state your name)
  • Highly interactive
  • Highly adaptable
    • based on how fast we move
  • Lots of probability, statistics and information theory
    • not in the abstract, but rather as the need arises
  • Lectures emphasize intuition, not rigor or detail
    • background reading will have the rigor & detail

  4. Course Prerequisites & Mechanics
  • You need to be able to program, from scratch
    • largest program is O(100) lines
  • You need to be comfortable with probabilities
    • can you derive Bayes' equation in your sleep?
  • 11-661 (master's level): no final project
  • Hand in assignments via Blackboard
  • Vigorous enforcement of the collaboration & disclosure policy
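As a quick self-check on the probability prerequisite, here is a minimal sketch of Bayes' rule in action. The two-hypothesis setup and all numbers below are invented purely for illustration, not from the course:

```python
# Bayes' rule: P(H|D) = P(D|H) P(H) / P(D),
# with the evidence P(D) expanded by the law of total probability.

def posterior(prior, likelihood):
    """Return P(H_i | D) for each hypothesis H_i.

    prior[i]      = P(H_i)        (must sum to 1)
    likelihood[i] = P(D | H_i)
    """
    evidence = sum(p * l for p, l in zip(prior, likelihood))  # P(D)
    return [p * l / evidence for p, l in zip(prior, likelihood)]

# Toy example: did a word come from an English or a French document?
prior = [0.7, 0.3]          # P(English), P(French)
likelihood = [0.01, 0.04]   # P(word | English), P(word | French)
post = posterior(prior, likelihood)
# P(English | word) = 0.7*0.01 / (0.7*0.01 + 0.3*0.04) = 0.007 / 0.019
```

The same three-line derivation (definition of conditional probability, applied twice, plus total probability) is the one the slide expects you to produce "in your sleep".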

  5. Background Material
  No single book covers all the course material.
  • “Foundations of Statistical NLP”, Manning & Schütze
    • computational linguistics perspective
  • “Statistical Methods in Speech Recognition”, Jelinek
  • “Text Compression”, Bell, Cleary & Witten
    • first 4 chapters; the rest is mostly about text compression
  • “Probability and Statistics”, DeGroot
  • “All of Statistics” & “All of Nonparametric Statistics”, Wasserman
  • Lots of individual articles

  6. High-Level Syllabus (subject to change)
  • Language technology formalisms
    • source-channel formulation
    • Bayes classifier
  • Words, Words, Words
    • type vs. token, Zipf, Mandelbrot, heterogeneity of language
  • Modeling word distributions: the unigram
    • [estimators, ML, zero frequency, smoothing, shrinkage, Good-Turing]
  • N-grams
    • deleted interpolation model, backoff, toolkit
  • Measuring success: perplexity
    • [entropy, KL-divergence, mutual information], the entropy of English, alternatives
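The unigram thread on this slide (ML estimation, the zero-frequency problem, smoothing, and perplexity as the success measure) can be sketched end to end. The toy corpus, the choice of add-alpha (Laplace) smoothing, and the vocabulary are assumptions for illustration, not course materials:

```python
import math
from collections import Counter

def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram: P(w) = (c(w) + alpha) / (N + alpha*|V|).

    With alpha = 0 this is the maximum-likelihood estimate, which assigns
    probability zero to any word unseen in training.
    """
    counts = Counter(tokens)
    n, v = len(tokens), len(vocab)
    return {w: (counts[w] + alpha) / (n + alpha * v) for w in vocab}

def perplexity(model, test_tokens):
    """PP = 2^(-(1/N) * sum_i log2 P(w_i)); lower is better."""
    log_prob = sum(math.log2(model[w]) for w in test_tokens)
    return 2 ** (-log_prob / len(test_tokens))

train = "the cat sat on the mat".split()
vocab = set(train) | {"dog"}          # "dog" never seen: zero frequency
model = unigram_probs(train, vocab)
pp = perplexity(model, "the dog sat".split())
```

Without smoothing, the single unseen token "dog" would make the test-set probability zero and the perplexity infinite; that is the zero-frequency problem the bracketed topics address.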

  7. Syllabus (continued)
  • Clustering
    • class-based N-grams, hierarchical clustering
    • hard and soft clustering
  • Latent variable models, EM
  • Hidden Markov Models; revisiting interpolated and class N-grams
  • Part-of-speech tagging, word sense disambiguation
  • Decision & regression trees
    • particularly as applied to language
  • Stochastic grammars
    • (SCFG, inside-outside algorithm, Link Grammar)
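The HMM and part-of-speech bullets above can be illustrated with a minimal Viterbi decoder. The three-tag set and all transition/emission probabilities below are invented for illustration; a real tagger would estimate them from a tagged corpus:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state (tag) sequence for the observations."""
    # V[t][s] = (probability of the best path ending in s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r][0] * trans_p[r][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]], prev)
    # Trace backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = V[t][last][1]
        path.append(last)
    return list(reversed(path))

states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
           "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
           "VERB": {"DET": 0.5,  "NOUN": 0.3, "VERB": 0.2}}
emit_p = {"DET":  {"the": 0.9, "dog": 0.0, "barks": 0.1},
          "NOUN": {"the": 0.0, "dog": 0.7, "barks": 0.3},
          "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}
tags = viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
```

The dynamic program keeps, for each time step and tag, only the best-scoring path ending in that tag, which is what makes exact decoding tractable.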

  8. Syllabus (continued)
  • Maximum entropy modeling
    • exponential models, the ME principle, feature induction...
  • Language model adaptation
    • caches, backoff
  • Dimensionality reduction
    • latent semantic analysis, word2vec
  • Syntactic language models
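The cache bullet under language model adaptation can be sketched as a simple linear interpolation of a static unigram model with a unigram estimated from the recent history (the "cache"). The mixing weight and all probabilities below are assumed for illustration:

```python
from collections import Counter

def cache_adapted_prob(word, static_p, history, lam=0.2):
    """P_adapted(w) = (1 - lam) * P_static(w) + lam * P_cache(w),
    where P_cache is the ML unigram over the recent history."""
    cache = Counter(history)
    cache_p = cache[word] / len(history) if history else 0.0
    return (1 - lam) * static_p.get(word, 0.0) + lam * cache_p

static_p = {"the": 0.05, "statistics": 0.0001, "of": 0.03}
history = "the statistics of language statistics".split()
p = cache_adapted_prob("statistics", static_p, history)
# The cache gives "statistics" probability 2/5 = 0.4, so a rare word
# that was used recently gets a large boost over its static estimate.
```

This captures the intuition behind cache adaptation: words a speaker has just used are far more likely to recur than a static corpus-wide model predicts.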
