Thoughts on Treebanks Christopher Manning Stanford University
Q1: What do you really care about when you're building a parser?
• Completeness of information
  • There's not much point in having a treebank if you just end up having to do unsupervised learning anyway
  • You want the annotation to deliver human value-add
  • Classic bad example: noun compound structure in the Penn English Treebank (internal NP structure is left flat)
• Consistency of information
  • If things are annotated inconsistently, you lose both in training (if the inconsistency is widespread) and in evaluation
  • Bad example: "long ago" constructions: as long ago as …; not so long ago
• Mutual information
  • Categories should be as mutually informative as possible
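A consistency check like the "long ago" case can be mechanized. The sketch below (a toy Python illustration with an assumed input format, not any actual treebank tool) groups identical token sequences and reports any that received more than one annotation:

```python
from collections import defaultdict

def find_inconsistent_annotations(annotated_spans):
    """Group annotated spans by their token sequence and report any
    sequence that received more than one distinct annotation.

    `annotated_spans` is a list of (tokens, annotation) pairs, where
    tokens is a sequence of words and annotation is any hashable
    label or bracketing (a hypothetical format for illustration).
    """
    by_tokens = defaultdict(set)
    for tokens, annotation in annotated_spans:
        by_tokens[tuple(tokens)].add(annotation)
    # Keep only the token sequences annotated in more than one way.
    return {toks: anns for toks, anns in by_tokens.items() if len(anns) > 1}

# Toy data modeled on the "long ago" case: the same word string
# bracketed two different ways in different sentences.
spans = [
    (("as", "long", "ago", "as"), "(ADVP (ADVP as long ago) as)"),
    (("as", "long", "ago", "as"), "(ADVP as (ADVP long ago) as)"),
    (("not", "so", "long", "ago"), "(ADVP not so long ago)"),
]
print(find_inconsistent_annotations(spans))
```

Run over full treebank spans rather than this toy list, the same grouping turns up exactly the widespread inconsistencies that hurt both training and evaluation.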
Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
• Information on function is definitely useful
  • We should move to always having typed dependencies
  • Clearest example in the Penn English Treebank: temporal NPs
• Empty categories don't necessarily give much value in the dumbed-down world of Penn English Treebank parsing work
  • Though it should be tried again/more
  • But they are definitely useful if you want to know this stuff!
    • Subcategorization/argument structure determination
    • Natural Language Understanding!!
  • Cf. the work of Johnson, Levy and Manning, etc. on long-distance dependencies
• I'm sceptical that there is a categorical argument/adjunct distinction to be made
  • Leave it to the real numbers
  • This means that subcategorization frames can only be statistical
  • Cf. Manning (2003)
  • I've got some more slides on this from another talk if you want…
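The claim that subcategorization frames can only be statistical can be made concrete: instead of assigning each verb one categorical frame, estimate a distribution over observed frames. A minimal sketch, assuming hypothetical (verb, frame) observations extracted from a typed-dependency treebank:

```python
from collections import Counter, defaultdict

def subcat_distribution(observations):
    """Estimate P(frame | verb) from (verb, frame) observations.

    Rather than a fixed categorical frame per verb, the result is a
    probability distribution over the frames each verb was seen with,
    reflecting the view that subcategorization is statistical.
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(c.values()) for frame, n in c.items()}
        for verb, c in counts.items()
    }

# Hypothetical observations; frame names here are illustrative only.
obs = [("give", "NP NP"), ("give", "NP PP"), ("give", "NP PP"), ("eat", "NP")]
dist = subcat_distribution(obs)
print(dist["give"])  # relative frequencies of the two frames for "give"
```

With real counts, the soft argument/adjunct boundary shows up directly: rare attachments get low but nonzero probability rather than being forced in or out of the frame.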
Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
• Do you prefer a more refined tagset for parsing?
  • Yes. I mightn't always use it, but I often do
• The transform-detransform framework:
  • RawInput → TransformedInput → Parser → TransformedOutput → DesiredOutput
  • I think everyone does this to some extent
  • Some (e.g., Johnson; Klein and Manning) have exploited it very explicitly: NN-TMP, IN^T, NP-Poss, VP-VBG, NP-v
  • Everyone else should think about it more
• It's easy to throw away too-precise information, or to move information around deterministically (tag to phrase or vice versa), if it's represented completely and consistently!
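The transform-detransform pipeline can be sketched in a few lines. The splitting rule below (marking an NP as NP-TMP when it ends in a temporal noun) is a toy stand-in for the real Klein-and-Manning-style splits, and the nested-list tree encoding is an assumption for illustration:

```python
def transform(tree):
    """RawInput -> TransformedInput: split NP into NP-TMP when it ends
    in a temporal noun (toy rule; real category splits are richer)."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [transform(c) for c in children]
    if label == "NP" and children and children[-1][-1] in {"yesterday", "today"}:
        label = "NP-TMP"
    return [label, *children]

def detransform(tree):
    """TransformedOutput -> DesiredOutput: strip the refinements so the
    output matches the original treebank annotation again."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return [label.split("-")[0], *(detransform(c) for c in children)]

raw = ["S", ["NP", ["PRP", "I"]],
            ["VP", ["VBD", "arrived"], ["NP", ["NN", "yesterday"]]]]
enriched = transform(raw)        # the parser trains and parses on this
restored = detransform(enriched) # evaluation sees the original annotation
assert restored == raw
```

The key property is that detransform is deterministic, so the refinement costs nothing at evaluation time; this only works if the underlying annotation is complete and consistent, as the slide says.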
Q4: How does grammar writing interact with treebanking?
• In practice, they often haven't interacted much
• I'm a great believer that they should
  • Having a grammar is a huge guide to how things should be parsed, and a check on parsing consistency
  • It also allows opportunities for analysis updating, etc.
  • Cf. the Redwoods Treebank and subsequent efforts
• The inability to automatically update treebanks is a growing problem
  • Current English treebanking isn't having much impact because of annotation differences with the original PTB
• Feedback from users has only rarely been harvested
Q5: What methodological lessons can be drawn for treebanking?
• Good guidelines (loosely, a grammar!)
• Good, trained people
• Annotator buy-in
  • Ann Bies said all this … I strongly agree!
• I think there has been a real underexploitation of technology for treebank validation
  • Doing vertical searches/checks almost always turns up inconsistencies
  • Either such searches or a grammar should provide vertical review
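One form of technology-assisted vertical review is easy to sketch: index every word type by the set of POS tags it received, and hand annotators the words tagged more than one way, with pointers back to the sentences involved. The corpus encoding below is an assumed toy format:

```python
from collections import defaultdict

def vertical_review(tagged_corpus):
    """Collect, for every word type, the tags it received and the
    sentences involved, so an annotator can review all uses of one
    word 'vertically' in a single pass.

    `tagged_corpus` is a list of sentences, each a list of
    (word, tag) pairs (a toy encoding for illustration).
    """
    index = defaultdict(lambda: defaultdict(list))
    for sent_id, sent in enumerate(tagged_corpus):
        for word, tag in sent:
            index[word.lower()][tag].append(sent_id)
    # Words tagged more than one way are the candidates for review.
    return {w: dict(tags) for w, tags in index.items() if len(tags) > 1}

corpus = [
    [("Frankly", "NNP"), ("I", "PRP"), ("agree", "VBP")],
    [("he", "PRP"), ("spoke", "VBD"), ("frankly", "RB")],
]
print(vertical_review(corpus))  # {'frankly': {'NNP': [0], 'RB': [1]}}
```

Many of the flagged words will be legitimately ambiguous, of course; the point is that scanning one word's uses side by side makes the genuine inconsistencies jump out.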
Q6: What are advantages and disadvantages of pre-processing the data to be treebanked with an automatic parser?
• The economics are clear
  • You reduce annotation costs
• The costs are clear
  • The parser places a large bias on the trees produced
  • Humans are lazy/reluctant to correct mistakes
  • A clear example: I think it is fair to say that many POS errors in the Penn English Treebank can be traced to the POS tagger
    • E.g., sentence-initial capitalized Separately, Frankly, Currently, Hopefully analyzed as NNP
    • Those don't look like a human being's mistakes to me
• The answer: more use of technology to validate and check the humans
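The "Frankly/NNP" pattern can itself be detected automatically. A sketch of such a validator (same assumed toy corpus encoding as above): it flags sentence-initial capitalized tokens tagged NNP whose lowercase form occurs elsewhere with a non-proper-noun tag:

```python
def suspicious_initial_nnps(tagged_corpus):
    """Flag sentence-initial capitalized tokens tagged NNP whose
    lowercase form occurs elsewhere with a non-proper-noun tag --
    the 'Frankly/NNP' pattern attributed above to tagger bias."""
    # Word types seen anywhere with a tag other than NNP/NNPS.
    non_proper = {
        word.lower()
        for sent in tagged_corpus
        for word, tag in sent
        if not tag.startswith("NNP")
    }
    flagged = []
    for sent_id, sent in enumerate(tagged_corpus):
        first_word, first_tag = sent[0]
        if (first_tag == "NNP" and first_word[0].isupper()
                and first_word.lower() in non_proper):
            flagged.append((sent_id, first_word))
    return flagged

corpus = [
    [("Frankly", "NNP"), (",", ","), ("it", "PRP"), ("failed", "VBD")],
    [("He", "PRP"), ("answered", "VBD"), ("frankly", "RB")],
    [("John", "NNP"), ("left", "VBD")],
]
print(suspicious_initial_nnps(corpus))  # [(0, 'Frankly')]
```

Note that genuine proper nouns like "John" pass through untouched, since they never occur with a non-proper tag; the check targets exactly the capitalization-driven tagger errors a human correcting parser output tends to let stand.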
Q7: What are the advantages of a phrase-structure and/or a dependency treebank for parsing?
• The current split in the literature between "phrase-structure" and "dependency" parsing is largely bogus (in my opinion)
  • The Collins/Bikel parser operates largely in the manner of a dependency parser
  • The Stanford parser contains a strict (untyped) dependency parser
• Phrase-structure parsers have the advantage of phrase-structure labels
  • A dependency parser is just a phrase-structure parser that cannot refer to phrasal types or condition on phrasal span
  • This extra info is useful; it's silly not to use it
• Labeling phrasal heads (= dependencies) is useful; silly not to do it
  • Automatic "head rules" should have had their day by now!!
  • Scoring based on dependencies is much better than Parseval!!!
• Labeling dependency types is useful
  • Especially in freer word order languages
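The head-rules point can be illustrated: given a phrase-structure tree and a per-category priority list of head children, dependencies fall out mechanically. The rules and tree encoding below are toy assumptions, not the actual Collins or Stanford head tables:

```python
# Toy head rules: for each phrasal category, child labels in priority
# order (a tiny stand-in for a real head-finding table).
HEAD_RULES = {
    "S": ["VP", "S"],
    "VP": ["VBD", "VBZ", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
}

def head_child(label, children):
    """Pick the head child using the priority list (first match wins)."""
    for candidate in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == candidate:
                return child
    return children[-1]  # fallback: rightmost child

def to_dependencies(tree, deps=None):
    """Return (head_word, deps): each non-head child's head word is
    attached to the phrase's head word, so labeling heads is the same
    thing as extracting (untyped) dependencies."""
    if deps is None:
        deps = []
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0], deps  # preterminal: the word heads itself
    head = head_child(label, children)
    head_word, _ = to_dependencies(head, deps)
    for child in children:
        if child is not head:
            child_word, _ = to_dependencies(child, deps)
            deps.append((head_word, child_word))
    return head_word, deps

tree = ["S",
        ["NP", ["NNP", "Sue"]],
        ["VP", ["VBD", "saw"], ["NP", ["NN", "snow"]]]]
root, deps = to_dependencies(tree)
print(root, deps)  # saw [('saw', 'snow'), ('saw', 'Sue')]
```

This is exactly why the phrase-structure/dependency split is largely bogus: once heads are labeled, the dependency structure is already present, and the phrase-structure parser simply gets to use phrasal labels and spans on top of it.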