CSA3180: Natural Language Processing
Text Processing 1
• Language Encoding Issues
• Common Corpora
• Handling Large Document Collections
• Applications: Anatomy of a Search Engine
• NLTK
CSA3180: Text Processing I
Language Encoding Issues
• Different encoding methods
• Different languages
• Unicode Standard
• Further information:
  • Unicode Consortium
  • Jukka Korpela Tutorial: http://www.cs.tut.fi/~jkorpela/chars.html
Language Encoding Issues
• Character Repertoire – a set of distinct characters
• Character Code – a mapping between the characters of a repertoire and non-negative integers (code numbers)
• Character Encoding – an algorithm for representing those code numbers as sequences of octets
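The three terms above can be illustrated with a short Python sketch. It assumes Unicode as the character code (Python's `ord` returns Unicode code points) and picks UTF-8 and UTF-16 as two example encodings; the helper names `code_points` and `octets` are made up for this illustration.

```python
def code_points(text):
    """Character code: map each character to its integer (Unicode code point)."""
    return [ord(ch) for ch in text]


def octets(text, encoding):
    """Character encoding: turn the code numbers into a sequence of octets."""
    return list(text.encode(encoding))


# A Maltese word chosen for this sketch, written with escapes for h-bar and z-dot.
word = "\u0127ob\u017c"

print(code_points(word))                            # one integer per character
print([hex(b) for b in octets(word, "utf-8")])      # variable-length octets
print([hex(b) for b in octets(word, "utf-16-be")])  # two octets per character here
```

The same four code numbers produce different octet sequences under the two encodings, which is the distinction between a character code and a character encoding.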
Language Encoding Issues
• Encoding using octets
• Common Encodings:
  • ASCII
  • ISO Latin 1 (ISO 8859-1)
  • ISO Latin 2 and Latin 3 extensions (Latin 3, ISO 8859-3, covers Maltese)
  • Unicode & UTF-8
  • ANSI
  • Cyrillic and Chinese Encodings
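To see how the repertoires of these encodings differ, the following sketch tries to encode one Maltese word under several of them. It relies on Python's built-in codecs (`ascii`, `iso-8859-1`, `iso-8859-3`, `utf-8`); the helper `try_encode` and the sample word are assumptions made for illustration only.

```python
def try_encode(text, encoding):
    """Return the encoded octets, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Maltese sample word containing h-bar (U+0127) and z-dot (U+017C).
word = "\u0127ob\u017c"

for name in ("ascii", "iso-8859-1", "iso-8859-3", "utf-8"):
    encoded = try_encode(word, name)
    if encoded is None:
        print(f"{name}: cannot represent this word")
    else:
        print(f"{name}: {len(encoded)} octets")
```

ASCII and Latin 1 lack the Maltese letters, Latin 3 stores each character in a single octet, and UTF-8 needs two octets for each of the two non-ASCII letters.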
Language Encoding Issues
• Text encoding on the Web
• MIME Standard
  • Content-Type: text/html; charset=iso-8859-1
• Used in Email and Web Servers
• Problems in implementation: few encodings properly supported
• UTF-8 recommended
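As a minimal sketch of how a client might use the `charset` parameter, the code below reads it from a Content-Type value with Python's standard `email.message.Message` class and decodes a body accordingly. The function names are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption of this sketch, not something the MIME standard mandates.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset returns the lowercased charset, or `default` if absent.
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode raw octets using the charset declared in the Content-Type header."""
    return body.decode(charset_from_content_type(content_type))


print(charset_from_content_type("text/html; charset=iso-8859-1"))
print(decode_body(b"caf\xe9", "text/html; charset=iso-8859-1"))
```

Decoding the same octets with the wrong charset either fails or yields garbled text, which is why the header matters.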
Common Corpora
• WordNet
• TREC/ACE/TIDES Corpora
• Linguistic Data Consortium (LDC)
• GigaWord (News)
• Tree Banks
• MUC (Message Understanding Conference)
• TIPSTER (Information Retrieval)
Handling Large Document Collections
• Special issues involved in processing
• Hierarchical directory structures
• File indexes
• Batch processing – start, resume, pause, end
• Job scheduling
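One possible way to combine a file index over a directory hierarchy with resumable batch processing is sketched below. It is only an illustration under stated assumptions: the names `list_documents` and `run_batch`, the JSON checkpoint file, and the `max_docs` limit (which stands in for a "pause") are all invented here, and the checkpoint is assumed to live outside the document tree.

```python
import json
from pathlib import Path


def list_documents(root):
    """Walk a hierarchical directory tree and return a sorted file index."""
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())


def run_batch(root, checkpoint_path, process, max_docs=None):
    """Process files under root, skipping those already recorded in the checkpoint.

    Stopping after max_docs documents acts as a pause; calling the function
    again with the same checkpoint resumes where the last run stopped.
    Returns the number of documents processed in this run.
    """
    checkpoint = Path(checkpoint_path)
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    handled = 0
    for path in list_documents(root):
        if path in done:
            continue  # finished in an earlier run
        if max_docs is not None and handled >= max_docs:
            break  # pause: leave the rest for a later run
        process(path)
        done.add(path)
        # Record progress after every document so an interrupted job can resume.
        checkpoint.write_text(json.dumps(sorted(done)))
        handled += 1
    return handled
```

A call such as `run_batch("corpus", "done.json", print, max_docs=1000)` would handle at most 1000 files per run; repeating the call continues with the remaining files. Rewriting the checkpoint after each document is simple but slow for very large collections, so a real job would likely batch these writes.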
Applications
• The Anatomy of a Large-Scale Hypertextual Web Search Engine (Sergey Brin and Larry Page)
• Describes the internals of Google
• NLP in everyday life!
Next Sessions…
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• Please download and install!