LingPipe http://www.alias-i.com/lingpipe/
Does a variety of tasks
• Tokenization
• Part of Speech Tagging
• Named Entity Detection
• Clustering
• Identifies Significant Phrases
• Other
  • Topic Classification
  • Database Text Mining
  • Spell Checker
  • Sentiment Analysis
  • Chinese Word Segmentation
Other Niceties
• It's free
• Plenty of documentation
• Tutorials for every subtask
• Highly configurable
• Source code
  • Very complex, but well written
  • Good comments
  • Gives examples of how to edit the code
• Can be trained in several languages
Tokenization
• Divides text into sentences and words using fairly sophisticated methods.
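To make the idea concrete, here is a minimal Python sketch of sentence and word splitting. It does not use LingPipe and is far simpler than LingPipe's own tokenizers; the abbreviation list and the function names `tokenize` and `split_sentences` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Inc."}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    """End a sentence after . ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences

text = "This is Mr. Bob Smith. Bob lives in Redmond. He works for Microsoft."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

The abbreviation check is what keeps "Mr." from ending the first sentence, which is the kind of case a naive split on periods gets wrong.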
Part of Speech Tagging
• You can output the N-best results.
• You can output a confidence score for each word.
• You can also retrain the Part of Speech Tagger.
• You can also edit how it runs.
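As an illustration of what N-best output and per-word confidence mean, the following Python sketch scores every tag sequence under a toy hidden Markov model. This is not LingPipe's API or its trained model: the tag set and all probabilities are made-up assumptions, and brute-force enumeration is used only because the example is tiny.

```python
from itertools import product

TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical toy parameters, not taken from any trained tagger.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.4, "barks": 0.1},
    "VERB": {"barks": 0.5, "dog": 0.05},
}

def joint(words, tags):
    """Joint probability of a word sequence and one tag sequence."""
    p = START[tags[0]] * EMIT[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        p *= TRANS[tags[i - 1]][tags[i]] * EMIT[tags[i]].get(words[i], 0.0)
    return p

def score_all(words):
    """Score every possible tag sequence (feasible only for short inputs)."""
    return [(joint(words, tags), tags) for tags in product(TAGS, repeat=len(words))]

def n_best(words, n):
    """The n highest-scoring tag sequences, best first."""
    return sorted(score_all(words), reverse=True)[:n]

def token_confidences(words):
    """Per-word posterior probability of each tag, summed over all sequences."""
    scored = score_all(words)
    total = sum(p for p, _ in scored)
    result = []
    for i in range(len(words)):
        marginal = {t: 0.0 for t in TAGS}
        for p, tags in scored:
            marginal[tags[i]] += p / total
        result.append(marginal)
    return result

words = ["the", "dog", "barks"]
best_two = n_best(words, 2)
confidences = token_confidences(words)
```

A production tagger would use dynamic programming (Viterbi and forward-backward) rather than enumeration, but the quantities reported are the same: ranked whole-sentence analyses and a per-word probability for each tag.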
Named Entity Detection
• The default detection distinguishes between three types of entities.
  • People (distinguishes male and female)
  • Place
  • Organization
• It can be trained to recognize any type of entity.
  • You can get corpora online.
  • You can annotate your own corpora using WordFreak, which also comes with LingPipe.
Sample Input/Output
Input:
<DOCUMENT><P>This is Mr. Bob Smith. Bob lives in Redmond. He works for Microsoft.</P></DOCUMENT>
Output:
<DOCUMENT><P>
<sent>This is Mr. <ENAMEX id="13" type="PERSON">Bob Smith.</ENAMEX> </sent>
<sent><ENAMEX id="13" type="PERSON">Bob</ENAMEX> lives in <ENAMEX id="14" type="LOCATION">Redmond</ENAMEX> . </sent>
<sent><ENAMEX id="13" type="MALE_PRONOUN">He</ENAMEX> works for <ENAMEX id="15" type="ORGANIZATION">Microsoft</ENAMEX> . </sent>
</P></DOCUMENT>
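In the sample output, mentions that share an id attribute refer to the same entity. A short Python sketch of how a downstream script might read that output, using only the standard library (not LingPipe itself); the function name `mentions_by_entity` is hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The sample output from the slide, copied as-is (including the period
# that appears inside the first ENAMEX element).
output = (
    '<DOCUMENT><P>'
    '<sent>This is Mr. <ENAMEX id="13" type="PERSON">Bob Smith.</ENAMEX> </sent>'
    '<sent><ENAMEX id="13" type="PERSON">Bob</ENAMEX> lives in '
    '<ENAMEX id="14" type="LOCATION">Redmond</ENAMEX> . </sent>'
    '<sent><ENAMEX id="13" type="MALE_PRONOUN">He</ENAMEX> '
    'works for <ENAMEX id="15" type="ORGANIZATION">Microsoft</ENAMEX> . </sent>'
    '</P></DOCUMENT>'
)

def mentions_by_entity(xml_text):
    """Group (type, text) mentions by their ENAMEX id, in document order."""
    root = ET.fromstring(xml_text)
    chains = defaultdict(list)
    for element in root.iter("ENAMEX"):
        chains[element.get("id")].append((element.get("type"), element.text))
    return dict(chains)

chains = mentions_by_entity(output)
```

Here entity 13 collects three mentions ("Bob Smith.", "Bob", and "He"), while entities 14 and 15 each have one.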
Dictionary
• To increase the accuracy of LingPipe, you can import a dictionary.
• A dictionary forces certain strings to be recognized as certain entity types.
• Common dictionaries include:
  • Gazetteer
  • List of people's names
  • Company names
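The effect of a dictionary can be sketched as a greedy longest-match lookup over tokens. This Python example is an illustration of the idea only, not LingPipe's dictionary chunker; the entries and the function name `dictionary_chunk` are assumptions.

```python
def dictionary_chunk(tokens, dictionary):
    """Greedy longest-match lookup; returns (start, end, type) token spans."""
    max_len = max(len(key) for key in dictionary)
    chunks, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in dictionary:
                chunks.append((i, i + n, dictionary[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return chunks

# Hypothetical entries: a tiny gazetteer, a name list, and a company list.
DICTIONARY = {
    ("Puget", "Sound"): "LOCATION",
    ("Redmond",): "LOCATION",
    ("Microsoft",): "ORGANIZATION",
    ("Bob", "Smith"): "PERSON",
}

tokens = "Bob Smith works for Microsoft near Puget Sound".split()
chunks = dictionary_chunk(tokens, DICTIONARY)
```

Because the lookup is exact, every listed string is tagged with its listed type regardless of context, which is both the strength and the risk of a dictionary.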
Coreference
• It identifies different references to the same entity, such as Bob Smith and Bob.
• It does not identify entities across documents.
• It links pronouns to their antecedents.
• It does not do other anaphora resolution, like “Jane was the woman who pulled the trigger.”
Clustering
• Single-link Clustering
  • Chops off the longest link
• Clustering with proximity bounds
  • Merges based on proximity
• Extract for K-clusters
  • You can specify how many clusters you want
• Complete-Link Clustering
  • A variant of single-link clustering that compares whole clusters
• Within-Cluster Point Scatter
  • You don’t need to specify the number of clusters.
  • It detects the best breaking point.
  • This is the method used to do NER across documents.
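The difference between single-link and complete-link clustering can be seen in a small agglomerative sketch. This Python example works on one-dimensional points and is not LingPipe's implementation; the function name `agglomerative` and its parameters are hypothetical. Single-link measures the distance between two clusters by their closest members (`min`), complete-link by their farthest members (`max`).

```python
def agglomerative(points, k, linkage, max_distance=None):
    """Merge the two closest clusters until k remain.

    linkage=min gives single-link, linkage=max gives complete-link.
    If max_distance is set, merging also stops once the closest pair of
    clusters is farther apart than that bound.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if max_distance is not None and d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

points = [0, 2, 4, 6, 9]
single = agglomerative(points, k=2, linkage=min)
complete = agglomerative(points, k=2, linkage=max)
bounded = agglomerative(points, k=1, linkage=min, max_distance=2)
```

On these points single-link chains 0, 2, 4, and 6 together and leaves 9 alone, while complete-link yields [0, 2] and [4, 6, 9] because it penalizes wide clusters. Stopping single-link merging at k clusters gives the same result as building all links and cutting the longest ones.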
Significant Phrases
• Determines phrases whose words are seen together more often than chance would predict
• Seems to be mostly named entities
  • e.g. Puget Sound, George Bush
• Helps tell the genre of an article
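One common way to measure "more often than chance" is to compare how often a word pair occurs with how often it would occur if the two words were independent. The Python sketch below uses pointwise mutual information for this; that choice of statistic is an assumption for illustration and is not claimed to be what LingPipe uses. The sample text and the function name `significant_bigrams` are also made up.

```python
import math
from collections import Counter

def significant_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by log2(observed count / expected count)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        expected = unigrams[a] * unigrams[b] / n  # count expected under independence
        scores[(a, b)] = math.log2(count / expected)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up sample text for illustration.
text = ("the boat crossed puget sound and the boat left puget sound "
        "when the tide turned on the sound")
ranked = significant_bigrams(text.split())
```

In this sample "puget sound" outranks "the boat": both pairs occur twice, but "the" is frequent on its own, so its pairings are less surprising.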