Thoughts on Treebanks Christopher Manning Stanford University
Q1: What do you really care about when you're building a parser?
• Completeness of information
  • There's not much point in having a treebank if you just end up having to do unsupervised learning anyway
  • You want the annotation to deliver human value-add
  • Classic bad example: noun compound structure in the Penn English Treebank (internal NP structure is left flat)
• Consistency of information
  • If things are annotated inconsistently, you lose both in training (if the inconsistency is widespread) and in evaluation
  • Bad example: "long ago" constructions: as long ago as …; not so long ago
• Mutual information
  • Categories should be as mutually informative as possible
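A consistency check like the "long ago" case can be mechanized. The sketch below (a toy Python illustration with an assumed input format, not any actual treebank tool) groups identical token sequences and reports any that received more than one annotation:

```python
from collections import defaultdict

def find_inconsistent_annotations(annotated_spans):
    """Group annotated spans by their token sequence and report any
    sequence that received more than one distinct annotation.

    `annotated_spans` is a list of (tokens, annotation) pairs, where
    tokens is a sequence of words and annotation is any hashable
    label or bracketing (a hypothetical format for illustration).
    """
    by_tokens = defaultdict(set)
    for tokens, annotation in annotated_spans:
        by_tokens[tuple(tokens)].add(annotation)
    # Keep only the token sequences annotated in more than one way.
    return {toks: anns for toks, anns in by_tokens.items() if len(anns) > 1}

# Toy data modeled on the "long ago" case: the same word string
# bracketed two different ways in different sentences.
spans = [
    (("as", "long", "ago", "as"), "(ADVP (ADVP as long ago) as)"),
    (("as", "long", "ago", "as"), "(ADVP as (ADVP long ago) as)"),
    (("not", "so", "long", "ago"), "(ADVP not so long ago)"),
]
print(find_inconsistent_annotations(spans))
```

Run over full treebank spans rather than this toy list, the same grouping turns up exactly the widespread inconsistencies that hurt both training and evaluation.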
Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
• Information on function is definitely useful
  • We should move to always having typed dependencies
  • Clearest example in the Penn English Treebank: temporal NPs
• Empty categories don't necessarily give much value in the dumbed-down world of Penn English Treebank parsing work
  • Though it should be tried again/more
  • But they are definitely useful if you want to know this stuff!
    • Subcategorization/argument structure determination
    • Natural Language Understanding!!
  • Cf. the work of Johnson, Levy and Manning, etc. on long-distance dependencies
• I'm sceptical that there is a categorical argument/adjunct distinction to be made
  • Leave it to the real numbers
  • This means that subcategorization frames can only be statistical
  • Cf. Manning (2003)
  • I've got some more slides on this from another talk if you want…
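The claim that subcategorization frames can only be statistical can be made concrete: instead of assigning each verb one categorical frame, estimate a distribution over observed frames. A minimal sketch, assuming hypothetical (verb, frame) observations extracted from a typed-dependency treebank:

```python
from collections import Counter, defaultdict

def subcat_distribution(observations):
    """Estimate P(frame | verb) from (verb, frame) observations.

    Rather than a fixed categorical frame per verb, the result is a
    probability distribution over the frames each verb was seen with,
    reflecting the view that subcategorization is statistical.
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(c.values()) for frame, n in c.items()}
        for verb, c in counts.items()
    }

# Hypothetical observations; frame names here are illustrative only.
obs = [("give", "NP NP"), ("give", "NP PP"), ("give", "NP PP"), ("eat", "NP")]
dist = subcat_distribution(obs)
print(dist["give"])  # relative frequencies of the two frames for "give"
```

With real counts, the soft argument/adjunct boundary shows up directly: rare attachments get low but nonzero probability rather than being forced in or out of the frame.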
Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
• Do you prefer a more refined tagset for parsing?
  • Yes. I mightn't always use it, but I often do
• The transform-detransform framework:
  • RawInput → TransformedInput → Parser → TransformedOutput → DesiredOutput
  • I think everyone does this to some extent
  • Some (e.g., Johnson; Klein and Manning) have exploited it very explicitly: NN-TMP, IN^T, NP-Poss, VP-VBG, NP-v
  • Everyone else should think about it more
• It's easy to throw away too-precise information, or to move information around deterministically (tag to phrase or vice versa), if it's represented completely and consistently!
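The transform-detransform pipeline can be sketched in a few lines. The splitting rule below (marking an NP as NP-TMP when it ends in a temporal noun) is a toy stand-in for the real Klein-and-Manning-style splits, and the nested-list tree encoding is an assumption for illustration:

```python
def transform(tree):
    """RawInput -> TransformedInput: split NP into NP-TMP when it ends
    in a temporal noun (toy rule; real category splits are richer)."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [transform(c) for c in children]
    if label == "NP" and children and children[-1][-1] in {"yesterday", "today"}:
        label = "NP-TMP"
    return [label, *children]

def detransform(tree):
    """TransformedOutput -> DesiredOutput: strip the refinements so the
    output matches the original treebank annotation again."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return [label.split("-")[0], *(detransform(c) for c in children)]

raw = ["S", ["NP", ["PRP", "I"]],
            ["VP", ["VBD", "arrived"], ["NP", ["NN", "yesterday"]]]]
enriched = transform(raw)        # the parser trains and parses on this
restored = detransform(enriched) # evaluation sees the original annotation
assert restored == raw
```

The key property is that detransform is deterministic, so the refinement costs nothing at evaluation time; this only works if the underlying annotation is complete and consistent, as the slide says.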
Q4: How does grammar writing interact with treebanking?
• In practice, they often haven't interacted much
• I'm a great believer that they should
  • Having a grammar is a huge guide to how things should be parsed, and a check on parsing consistency
  • It also allows opportunities for analysis updating, etc.
  • Cf. the Redwoods Treebank and subsequent efforts
• The inability to automatically update treebanks is a growing problem
  • Current English treebanking isn't having much impact because of annotation differences with the original PTB
• Feedback from users has only rarely been harvested
Q5: What methodological lessons can be drawn for treebanking?
• Good guidelines (loosely, a grammar!)
• Good, trained people
• Annotator buy-in
  • Ann Bies said all this … I strongly agree!
• I think there has been a real underexploitation of technology for treebank validation
  • Doing vertical searches/checks almost always turns up inconsistencies
  • Either such searches or a grammar should provide vertical review
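One form of technology-assisted vertical review is easy to sketch: index every word type by the set of POS tags it received, and hand annotators the words tagged more than one way, with pointers back to the sentences involved. The corpus encoding below is an assumed toy format:

```python
from collections import defaultdict

def vertical_review(tagged_corpus):
    """Collect, for every word type, the tags it received and the
    sentences involved, so an annotator can review all uses of one
    word 'vertically' in a single pass.

    `tagged_corpus` is a list of sentences, each a list of
    (word, tag) pairs (a toy encoding for illustration).
    """
    index = defaultdict(lambda: defaultdict(list))
    for sent_id, sent in enumerate(tagged_corpus):
        for word, tag in sent:
            index[word.lower()][tag].append(sent_id)
    # Words tagged more than one way are the candidates for review.
    return {w: dict(tags) for w, tags in index.items() if len(tags) > 1}

corpus = [
    [("Frankly", "NNP"), ("I", "PRP"), ("agree", "VBP")],
    [("he", "PRP"), ("spoke", "VBD"), ("frankly", "RB")],
]
print(vertical_review(corpus))  # {'frankly': {'NNP': [0], 'RB': [1]}}
```

Many of the flagged words will be legitimately ambiguous, of course; the point is that scanning one word's uses side by side makes the genuine inconsistencies jump out.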
Q6: What are advantages and disadvantages of pre-processing the data to be treebanked with an automatic parser?
• The economics are clear
  • You reduce annotation costs
• The costs are clear
  • The parser places a large bias on the trees produced
  • Humans are lazy/reluctant to correct mistakes
  • A clear example: I think it is fair to say that many POS errors in the Penn English Treebank can be traced to the POS tagger
    • E.g., sentence-initial capitalized Separately, Frankly, Currently, Hopefully analyzed as NNP
    • Those don't look like a human being's mistakes to me
• The answer: more use of technology to validate and check the humans
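The "Frankly/NNP" pattern can itself be detected automatically. A sketch of such a validator (same assumed toy corpus encoding as above): it flags sentence-initial capitalized tokens tagged NNP whose lowercase form occurs elsewhere with a non-proper-noun tag:

```python
def suspicious_initial_nnps(tagged_corpus):
    """Flag sentence-initial capitalized tokens tagged NNP whose
    lowercase form occurs elsewhere with a non-proper-noun tag --
    the 'Frankly/NNP' pattern attributed above to tagger bias."""
    # Word types seen anywhere with a tag other than NNP/NNPS.
    non_proper = {
        word.lower()
        for sent in tagged_corpus
        for word, tag in sent
        if not tag.startswith("NNP")
    }
    flagged = []
    for sent_id, sent in enumerate(tagged_corpus):
        first_word, first_tag = sent[0]
        if (first_tag == "NNP" and first_word[0].isupper()
                and first_word.lower() in non_proper):
            flagged.append((sent_id, first_word))
    return flagged

corpus = [
    [("Frankly", "NNP"), (",", ","), ("it", "PRP"), ("failed", "VBD")],
    [("He", "PRP"), ("answered", "VBD"), ("frankly", "RB")],
    [("John", "NNP"), ("left", "VBD")],
]
print(suspicious_initial_nnps(corpus))  # [(0, 'Frankly')]
```

Note that genuine proper nouns like "John" pass through untouched, since they never occur with a non-proper tag; the check targets exactly the capitalization-driven tagger errors a human correcting parser output tends to let stand.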
Q7: What are the advantages of a phrase-structure and/or a dependency treebank for parsing?
• The current split in the literature between "phrase-structure" and "dependency" parsing is largely bogus (in my opinion)
  • The Collins/Bikel parser operates largely in the manner of a dependency parser
  • The Stanford parser contains a strict (untyped) dependency parser
• Phrase-structure parsers have the advantage of phrase-structure labels
  • A dependency parser is just a phrase-structure parser that cannot refer to phrasal types or condition on phrasal span
  • This extra info is useful; it's silly not to use it
• Labeling phrasal heads (= dependencies) is useful; silly not to do it
  • Automatic "head rules" should have had their day by now!!
  • Scoring based on dependencies is much better than Parseval!!!
• Labeling dependency types is useful
  • Especially in freer word order languages
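The head-rules point can be illustrated: given a phrase-structure tree and a per-category priority list of head children, dependencies fall out mechanically. The rules and tree encoding below are toy assumptions, not the actual Collins or Stanford head tables:

```python
# Toy head rules: for each phrasal category, child labels in priority
# order (a tiny stand-in for a real head-finding table).
HEAD_RULES = {
    "S": ["VP", "S"],
    "VP": ["VBD", "VBZ", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
}

def head_child(label, children):
    """Pick the head child using the priority list (first match wins)."""
    for candidate in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == candidate:
                return child
    return children[-1]  # fallback: rightmost child

def to_dependencies(tree, deps=None):
    """Return (head_word, deps): each non-head child's head word is
    attached to the phrase's head word, so labeling heads is the same
    thing as extracting (untyped) dependencies."""
    if deps is None:
        deps = []
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0], deps  # preterminal: the word heads itself
    head = head_child(label, children)
    head_word, _ = to_dependencies(head, deps)
    for child in children:
        if child is not head:
            child_word, _ = to_dependencies(child, deps)
            deps.append((head_word, child_word))
    return head_word, deps

tree = ["S",
        ["NP", ["NNP", "Sue"]],
        ["VP", ["VBD", "saw"], ["NP", ["NN", "snow"]]]]
root, deps = to_dependencies(tree)
print(root, deps)  # saw [('saw', 'snow'), ('saw', 'Sue')]
```

This is exactly why the phrase-structure/dependency split is largely bogus: once heads are labeled, the dependency structure is already present, and the phrase-structure parser simply gets to use phrasal labels and spans on top of it.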