Natural Language Toolkit (NLTK) • April Corbet
Overview • What is NLTK? • NLTK Basic Functionalities • Part of Speech Tagging • Chunking and Trees • Example: Calculating WordNet Synset Similarity • Other Functionalities
What is NLTK? • A toolkit: a collection of Python libraries and programs that allows NLP processes to be customized and optimized • Downloading
What is NLTK? • NLP tools typically build on other NLP tools • Other tools include • WordNet • Stanford Dependency Parser • ConceptNet • DBpedia • Google Mate-Tools
NLTK Basic Functionalities • Sentence Tokenization • Word Tokenization • WordNet, Synsets, and Synonyms • Stemming Words and Lemmas
Sentence Tokenization • Basic Tokenization • Statistically Based Training Methodology • Tokenizing for Multiple Sentences • Pickle File • Tokenizing with Other Languages
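The idea behind a trained sentence tokenizer can be sketched in a few lines: a period ends a sentence unless it belongs to a known abbreviation. A statistically trained tokenizer learns such abbreviations from raw text; in this sketch the set `ABBREVIATIONS` and the function `split_sentences` are hypothetical, hand-written stand-ins and are not part of NLTK.

```python
# Hand-written stand-in for the abbreviation list that a trained
# sentence tokenizer would learn from text (illustrative only).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text into sentences on ., ! or ?, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith arrived. He sat down."))
# ['Dr. Smith arrived.', 'He sat down.']
```

A fixed list like this is brittle, which is the motivation for training the tokenizer on text from the target domain or language.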
Word Tokenization • Basic Word Tokenizer • Penn Treebank Project • Other Types of Word Tokenizers: • PunktWordTokenizer: splits on punctuation but keeps the punctuation with the associated word token • WordPunctTokenizer: splits all punctuation into separate tokens • Word Tokenizers and Regular Expressions • Match on tokens, or on token separators (gaps) • Stopwords and Filtering
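The two regular-expression styles listed above, matching the tokens themselves versus matching the gaps between them, can be sketched with Python's standard `re` module. This is a minimal illustration, not NLTK's implementation; the pattern names, helper functions, and the tiny `STOPWORDS` set are assumptions made for the example.

```python
import re

# Style 1: the pattern describes the tokens. Runs of word characters
# and runs of punctuation each become separate tokens.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

# Style 2: the pattern describes the gaps (separators) between tokens.
GAP_PATTERN = re.compile(r"\s+")

# Tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of"}


def tokenize_on_tokens(text):
    """Return every substring that matches the token pattern."""
    return TOKEN_PATTERN.findall(text)


def tokenize_on_gaps(text):
    """Split wherever the gap pattern matches, dropping empty pieces."""
    return [piece for piece in GAP_PATTERN.split(text) if piece]


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


sentence = "The cat can't sit."
print(tokenize_on_tokens(sentence))  # ['The', 'cat', 'can', "'", 't', 'sit', '.']
print(tokenize_on_gaps(sentence))    # ['The', 'cat', "can't", 'sit.']
print(remove_stopwords(tokenize_on_gaps(sentence)))  # ['cat', "can't", 'sit.']
```

Matching on tokens separates all punctuation, while matching on whitespace gaps keeps contractions such as "can't" intact but leaves the final period attached to "sit".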
WordNet, Synsets, and Synonyms • WordNet is a lexical database integrated into NLTK that lists relations between words • A grouping of synonymous words that express the same concept is a synset instance • Synsets are organized in a tree • Hypernyms and Hyponyms • Synonyms and Antonyms
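The tree structure behind hypernyms (more general concepts) and hyponyms (more specific concepts) can be sketched with a plain dictionary. The entries in `HYPERNYM` below are a made-up toy taxonomy, not data copied from WordNet, and the helper names are hypothetical.

```python
# Toy taxonomy: each concept maps to its more general parent (its hypernym).
# Illustrative entries only; the real WordNet hierarchy is far larger.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": None,  # root of the tree
}


def hypernym_path(concept):
    """Return the chain of concepts from the root down to `concept`."""
    path = [concept]
    while HYPERNYM[path[-1]] is not None:
        path.append(HYPERNYM[path[-1]])
    return path[::-1]


def hyponyms(concept):
    """Return the direct children (more specific concepts) of `concept`."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == concept)


print(hypernym_path("dog"))  # ['animal', 'carnivore', 'canine', 'dog']
print(hyponyms("canine"))    # ['dog', 'wolf']
```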
POS Tagging • String Representation for Tagged Tokens (tuples) • Default Tagging • Tagging Based on a Trained Corpus (Brown)
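The three ideas above can be sketched without any corpus download: a tagged token is a `(word, tag)` tuple, a default tagger assigns one tag to everything, and a corpus-trained tagger looks up each word's most frequent tag. The three-sentence `TRAIN` list is a hand-made stand-in for a real corpus such as Brown, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# A tagged token is a (word, tag) tuple; a tagged sentence is a list of them.
# Tiny hand-tagged training set standing in for a real tagged corpus.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("runs", "VBZ")],
]


def default_tag(words, tag="NN"):
    """Assign the same tag to every word."""
    return [(word, tag) for word in words]


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(words, model, backoff_tag="NN"):
    """Tag known words from the model; fall back to a default for unknown ones."""
    return [(word, model.get(word, backoff_tag)) for word in words]


model = train_unigram(TRAIN)
print(default_tag(["the", "bird", "runs"]))
# [('the', 'NN'), ('bird', 'NN'), ('runs', 'NN')]
print(unigram_tag(["the", "bird", "runs"], model))
# [('the', 'DT'), ('bird', 'NN'), ('runs', 'VBZ')]
```

The word "bird" never appears in the training data, so it receives the backoff tag; chaining a trained tagger to a default tagger in this way avoids leaving unknown words untagged.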
POS Tagging • Types of Tagging • Unigram/Bigram Tagger • Regexp Tagging • Brill: uses an initial tagger, then applies transformation rules learned from the training corpus using “rule templates”
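A Brill-style pipeline has two stages, sketched here under simplifying assumptions: a crude regular-expression tagger makes the initial guesses, then transformation rules of the form "change tag X to Y when the previous tag is Z" correct them. In a real Brill tagger the rules are learned from a training corpus; the `PATTERNS` and `RULES` below are written by hand purely for illustration.

```python
import re

# Initial tagger: the first pattern that fully matches wins.
# The final pattern is a catch-all, so every word receives a tag.
PATTERNS = [
    (re.compile(r"the|a|an"), "DT"),
    (re.compile(r"to"), "TO"),
    (re.compile(r".*ing"), "VBG"),
    (re.compile(r".*s"), "NNS"),
    (re.compile(r".*"), "NN"),
]

# Transformation rules: (from_tag, to_tag, required_previous_tag).
# Hand-written here; a Brill tagger would learn them from rule templates.
RULES = [
    ("NN", "VB", "TO"),    # "to run": a noun guess after TO becomes a verb
    ("NN", "VBP", "NNS"),  # "dogs want": a noun guess after a plural noun becomes a verb
]


def initial_tag(words):
    """Tag each word with the first regular expression that matches it."""
    tagged = []
    for word in words:
        for pattern, tag in PATTERNS:
            if pattern.fullmatch(word.lower()):
                tagged.append((word, tag))
                break
    return tagged


def apply_rules(tagged, rules):
    """Apply each transformation rule, in order, across the whole sentence."""
    result = list(tagged)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(result)):
            word, tag = result[i]
            if tag == from_tag and result[i - 1][1] == prev_tag:
                result[i] = (word, to_tag)
    return result


words = ["the", "dogs", "want", "to", "run"]
first_pass = initial_tag(words)
print(first_pass)
# [('the', 'DT'), ('dogs', 'NNS'), ('want', 'NN'), ('to', 'TO'), ('run', 'NN')]
print(apply_rules(first_pass, RULES))
# [('the', 'DT'), ('dogs', 'NNS'), ('want', 'VBP'), ('to', 'TO'), ('run', 'VB')]
```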
Chunking and Trees • Default Chunking • Trees and Parsing • Drawing Trees
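Chunking groups tagged tokens into phrases, and the result is naturally a shallow tree. The sketch below hard-codes a single noun-phrase rule (an optional determiner, any number of adjectives, then one or more nouns) and prints the tree in bracketed form. It is a simplified, hand-rolled illustration; the function names and the nested-tuple tree representation are assumptions, not NLTK's data structures.

```python
def chunk_noun_phrases(tagged):
    """Group tokens matching DT? JJ* NN+ into ("NP", [...]) subtrees under "S"."""
    children = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NN":  # one or more nouns
            k += 1
        if k > j:  # at least one noun was found, so this span is a chunk
            children.append(("NP", tagged[i:k]))
            i = k
        else:  # no chunk starts here; keep the token as a plain leaf
            children.append(tagged[i])
            i += 1
    return ("S", children)


def to_bracketed(tree):
    """Render the tree as a bracketed string with word/tag leaves."""
    label, children = tree
    parts = []
    for child in children:
        if isinstance(child[1], list):  # subtree: (label, [leaves])
            parts.append(to_bracketed(child))
        else:  # leaf: (word, tag)
            parts.append(f"{child[0]}/{child[1]}")
    return f"({label} {' '.join(parts)})"


sentence = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
            ("over", "IN"), ("the", "DT"), ("dog", "NN")]
print(to_bracketed(chunk_noun_phrases(sentence)))
# (S (NP the/DT quick/JJ fox/NN) jumped/VBD over/IN (NP the/DT dog/NN))
```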
Other Functionalities • Replacing and Correcting Words • Calculating WordNet Synset Similarity • Word Collections • Text Classification • Transforming Chunks and Trees • Processes for Distributed Processing and Handling Large Datasets • Parsing for Specific Data (Locations, Dates, and Times)
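One common way to score synset similarity is the Wu-Palmer measure: twice the depth of the deepest shared ancestor, divided by the sum of the two concepts' depths. The sketch below applies the textbook formula to a made-up toy taxonomy, counting the root as depth 1. The `PARENT` table and function names are hypothetical, and scores computed on real WordNet data will differ.

```python
# Toy taxonomy: each concept maps to its hypernym (parent).
# Illustrative entries only, not taken from WordNet.
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": None,  # root, treated as depth 1
}


def path_from_root(concept):
    """Return the list of concepts from the root down to `concept`."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path[::-1]


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(shared ancestor) / (depth(a) + depth(b))."""
    path_a, path_b = path_from_root(a), path_from_root(b)
    shared_depth = 0
    for x, y in zip(path_a, path_b):  # walk down while the paths agree
        if x != y:
            break
        shared_depth += 1
    return 2 * shared_depth / (len(path_a) + len(path_b))


print(wu_palmer("dog", "wolf"))  # 0.75 (shared ancestor: canine)
print(wu_palmer("dog", "cat"))   # 0.5  (shared ancestor: carnivore)
print(wu_palmer("dog", "dog"))   # 1.0
```

Concepts that share a deep, specific ancestor score close to 1, while concepts that only meet near the root score lower.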
Works Cited • Perkins, Jacob. Python Text Processing with NLTK 2.0 Cookbook. • http://wordnet.princeton.edu/ • http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html • http://nltk.org