Named Entity Tagging with Conditional Random Fields • Ryan McDonald, Fernando Pereira and Fei Sha • Computer and Information Science, University of Pennsylvania
Goals • Improve on the results of the current NE tagger used by UPenn ACE • Accomplish this through a Conditional Random Field model (Lafferty et al. 2001) • Compare MaxEnt and CRFs in a controlled environment
ACE Definition • Find entities and classify them as Person, GPE, Organization, Location and/or Facility • “Bush took over the White House from the Clinton Administration” • Bush: Person • White House: Facility, GPE • The Clinton Administration: Organization • Clinton: Person
MaxEnt vs. CRFs • Ran an MEMM tagger and a CRF tagger with: • The exact same features • The exact same training algorithm (limited-memory quasi-Newton) • The exact same training data and test data • Have not used the Sept. test data yet, since more improvements are on the way
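For reference, the difference between the two models being compared can be written out as follows. The notation is an assumption, not taken from the slides: x is the token sequence, y the label sequence, f a feature vector and w the weight vector. The MEMM normalizes at each position given the previous label, while the CRF normalizes once over whole label sequences.

```latex
% MEMM: a product of locally normalized next-label distributions
p_{\mathrm{MEMM}}(y \mid x) = \prod_{i=1}^{n}
  \frac{\exp\bigl(w \cdot f(y_{i-1}, y_i, x, i)\bigr)}
       {\sum_{y'} \exp\bigl(w \cdot f(y_{i-1}, y', x, i)\bigr)}

% CRF: a single global normalizer Z(x) over all label sequences
p_{\mathrm{CRF}}(y \mid x) =
  \frac{\exp\bigl(\sum_{i=1}^{n} w \cdot f(y_{i-1}, y_i, x, i)\bigr)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \exp\Bigl(\sum_{i=1}^{n} w \cdot f(y'_{i-1}, y'_i, x, i)\Bigr)
```

Because both models can share the same f and the same optimizer, the comparison isolates the effect of local versus global normalization.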
Features • Word: Unigram* • 1-suffix, 2-suffix, 3-suffix and 4-suffix: Unigram and Bigram • Word length bins: Unigram and bigram • Word features defined by Tom's script: Caps, Numeric, etc.* * used in original ACE system
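A minimal sketch of how a feature set like the one above might be generated. Several things here are assumptions: "unigram" is read as a predicate paired with the current label and "bigram" as a predicate paired with the previous and current labels, the length-bin boundaries are made up, and the capitalization and digit checks are simple stand-ins for the word features from the script mentioned above, whose actual definitions are not given.

```python
def length_bin(word):
    # Hypothetical bin boundaries; the slides do not state the real ones.
    n = len(word)
    if n <= 2:
        return "1-2"
    if n <= 5:
        return "3-5"
    if n <= 9:
        return "6-9"
    return "10+"


def predicates(word):
    """Map each observation predicate to True if it is also used as a bigram feature."""
    preds = {"word=" + word.lower(): False}  # word identity: unigram only
    for k in (1, 2, 3, 4):
        if len(word) >= k:
            # k-suffix: unigram and bigram
            preds["suffix%d=%s" % (k, word[-k:].lower())] = True
    preds["len_bin=" + length_bin(word)] = True  # length bin: unigram and bigram
    # Stand-ins for the script-defined word features (assumed unigram only).
    if word[:1].isupper():
        preds["init_caps"] = False
    if any(ch.isdigit() for ch in word):
        preds["has_digit"] = False
    return preds


def features(word, prev_label, label):
    """Pair each predicate with the current label, and with the label pair if bigram."""
    feats = []
    for name, use_bigram in predicates(word).items():
        feats.append((name, label))
        if use_bigram:
            feats.append((name, prev_label, label))
    return feats
```

For example, `features("Bush", "O", "B-PER")` yields `("word=bush", "B-PER")` as a unigram feature and `("suffix2=sh", "O", "B-PER")` as a bigram feature. The label names are illustrative only.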
MEMM vs. CRF • Same feature set • Same training algorithm
ACE vs. CRF • Different feature sets (CRF is richer)
Summary • These results and (Sha 2002) show that CRFs perform slightly better than MEMMs • Richer feature set leads to larger improvement • Portable CRF, MEMM code • Conjugate Gradient, Limited-Memory Quasi-Newton, Perceptron
Future and Current Work • “Person” and “Organization” recall • Multilayer taggers • Name lists • Document class information
Multilayer Taggers • If entity information is known, it can lead to a 10-20% increase in F-Score • First layer of the tagger attempts to find generic entities • Can achieve an F-Score of around 0.87 • Second layer uses entity information as a feature for each category classifier • Leads to about a 2-5% increase in F-Score
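The two-layer arrangement described above could be wired together roughly as follows. This is only a sketch of the data flow: the function names and the toy taggers are hypothetical stand-ins, and in the actual system each layer would be a trained sequence model.

```python
def two_layer_tag(tokens, generic_tagger, category_classifiers, base_features):
    """Run a generic entity pass, then feed its output to per-category classifiers."""
    # Layer 1: one flag per token saying whether it is part of some entity.
    entity_flags = generic_tagger(tokens)
    # Layer 2 input: the base features plus the layer-1 decision as an extra feature.
    layer2_features = [
        base_features(tok) | {"generic_entity=%s" % flag}
        for tok, flag in zip(tokens, entity_flags)
    ]
    return {
        category: classify(layer2_features)
        for category, classify in category_classifiers.items()
    }


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_generic_tagger(tokens):
    return [tok[:1].isupper() for tok in tokens]


def toy_base_features(tok):
    return {"word=" + tok.lower()}


def toy_person_classifier(feature_sets):
    return [
        "generic_entity=True" in feats and "word=bush" in feats
        for feats in feature_sets
    ]


tags = two_layer_tag(
    ["Bush", "took", "over"],
    toy_generic_tagger,
    {"Person": toy_person_classifier},
    toy_base_features,
)
```

Keeping the first-layer output as an ordinary feature means the category classifiers can still override it when other evidence disagrees.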
Name Lists • Aim is to increase recall for the person and organization categories • Name list size: 80,000 • Organization list size: 30,000 • Binary feature: is token in name list? • Increases Person F-Score to 0.793 (from 0.755) • Binary feature: is token in organization list? • Increases Organization F-Score to 0.601 (from 0.569)
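The binary list-membership features can be sketched as below. The lists here are tiny made-up examples, and case-insensitive matching is an assumption; the slides do not say how tokens were normalized before lookup.

```python
def list_membership_features(tokens, name_list, org_list):
    """Return one feature set per token with binary in-list indicators."""
    # Lowercase once so each lookup is a constant-time set membership test.
    names = {n.lower() for n in name_list}
    orgs = {o.lower() for o in org_list}
    features = []
    for tok in tokens:
        feats = set()
        if tok.lower() in names:
            feats.add("in_name_list")
        if tok.lower() in orgs:
            feats.add("in_org_list")
        features.append(feats)
    return features


# Toy lists for illustration only.
example = list_membership_features(
    ["Clinton", "visited", "Microsoft"],
    name_list=["Clinton", "Bush"],
    org_list=["Microsoft"],
)
```

Here `example` marks "Clinton" with `in_name_list`, "Microsoft" with `in_org_list`, and leaves "visited" with no list features.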
Name Lists • Small name lists can lead to a substantial improvement in F-Score • Even though the features were simplistic • Investigating better name lists • MT name list of 500,000 names and 50,000 orgs • Investigating more sophisticated features • Frequency
Document Class Features • “Atlanta defeated Florida in extra innings ...” • Atlanta and Florida should be tagged as organizations • Mistakenly tagged as GPE • If the document is classified as SPORTS, the NE classifier may recognize that tokens normally tagged GPE should be organizations • Currently beginning to look at state-of-the-art document classification algorithms • Could provide a richer source of knowledge
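One way the document class could be exposed to the tagger is as an extra feature, both on its own and conjoined with the existing token features. This is a hypothetical sketch: the slides describe this as early-stage work and do not say how the class would be encoded, and the class label is assumed to come from a separate document classifier.

```python
def add_document_class(token_features, doc_class):
    """Augment each token's feature set with the document class and its conjunctions."""
    augmented = []
    for feats in token_features:
        new_feats = set(feats)
        # Plain document-class feature, shared by every token in the document.
        new_feats.add("doc_class=" + doc_class)
        # Conjunctions let the model learn, e.g., that "atlanta" in SPORTS behaves differently.
        new_feats.update("%s&doc_class=%s" % (f, doc_class) for f in feats)
        augmented.append(new_feats)
    return augmented


sports = add_document_class([{"word=atlanta"}, {"word=defeated"}], "SPORTS")
```

With the conjoined feature `word=atlanta&doc_class=SPORTS`, a model can assign "Atlanta" a different label preference in sports documents than it does elsewhere.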