250 likes | 816 Vues
Text segmentation. Amany AlKhayat. Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation, numbers. This process is called tokenization and segmented units are called word tokens. Ex: In addition, she was there.
E N D
Text segmentation AmanyAlKhayat
Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation, numbers. • This process is called tokenization and segmented units are called word tokens. • Ex: In addition, she was there. • After segmentation: In addition , she was there .
Tokenization • Tokenization and sentence splitting can be described as ‘low-level’ segmentation which is performed at the initial level of text processing. The tasks are handled by reg. ex. Written in perlor any other programming language.
Tokenization II • High-level text segmentation or intrasenetential segmentation involves segmentation of linguistic groups such as named entities, segmentation of noun groups. • Inter-sentential segmentation involves grouping of sentences and paragraphs into discourse topics which are also called text tiles.
Word segmentation • Multiple occurrence of words in a text. • Word types are word of vocabulary. • Ex. If Shakespeare’s works included more than 8oo,ooo word tokens, it has 31,000 types of vocabulary
Tokenizing sentences • It is tiresome to tokenize sentences by adding white space. Moreover, if you tokenize sentences they cannot be put back to normal. • SGML or XML are cleaner strategies for tokenization to revert it easily to original text. • Ex. <w c=w> it</w> <w c=w> is </w> <w c=w> here </w> <w c=p>. </w>
Sentence segmentation • Important for many text processing apps: syntactic parsing, information extraction, text alignment, Machine translation…etc.
Accurate splitting is known as sentence boundary disambiguation (SBD) requires analysis of the local context around the periods and othe punctuations • Compare: • He stopped to see Dr. White. • He stopped at Meadows Dr. Whie falcon was still open. Which period is sentence internal and which one is sentence terminal?
Simplist algorithm for sentence boundary disambiguation • ‘period- space- capital letter’ • It marks all periods, exclamation marks and q marks that are followed by a space and a capital letter. • Regex: • [.?!][ ()”]+[A-Z]
Part of speech tagging • Criteria: • 1- syntactic distribution • 2- syntactic function • 3- morphological and syntactic classes that different parts of speech can be assigned to.
Applications • Preprocessors • Large tagged text corpora (see Mark Davies Corpus) • Info technology apps: text indexing and retrieval (nouns and adjectives are better candidates for good indexing than adverbs, verbs and pronouns
Parsing • See Stanford university parser online (http://nlp.stanford.edu:8080/parser/index.jsp) • Using grammar to assign syntactic analysis to a string of words. • Shallow parsing: partition of the input into chunks identifying the headword of each chunk.
CFP context free parsing • Context-free grammars are important in linguistics for describing the structure of sentences and words in natural language, and in computer science for describing the structure of programming languages and other formal languages. (wikipedia)