
Conversion of Penn Treebank Data to Text




Presentation Transcript


  1. Conversion of Penn Treebank Data to Text

  2. Penn TreeBank Project: “A Bank of Linguistic Trees” (as of 11/1992)
     • University of Pennsylvania, LINC Laboratory
     • 4.5 million words of American English
     • Annotation of naturally-occurring text for linguistic structure

  3. Tree Linguistic Components
     • Tokenization
       • Treatment of punctuation, words, etc. as separate tokens
       • Children’s → Children ’s
     • Part-of-speech (POS) tagging
       • Text first assigned POS tags automatically
       • Human annotators correct first-pass POS tags
     • Bracketing
       • (Fidditch, a deterministic parser (Hindle 1983, 1989))
       • Two-stage parsing process made explicit with brackets
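The tokenization step on this slide can be illustrated with a short sketch. The Python snippet below is a deliberately simplified assumption (the Treebank's actual tokenizer handles many more cases); the function name and regexes are illustrative only. It shows how a possessive clitic and sentence punctuation become separate tokens, e.g. Children's → Children 's.

    import re

    def treebank_tokenize(text):
        # Simplified sketch of Treebank-style tokenization (not the project's tool):
        # split off possessive 's and treat sentence punctuation as separate tokens.
        text = re.sub(r"(\w)('s)\b", r"\1 \2", text)       # Children's -> Children 's
        text = re.sub(r"\s*([,.;:!?])\s*", r" \1 ", text)   # punctuation as its own token
        return text.split()

    print(treebank_tokenize("The Children's church, I believe."))
    # ['The', 'Children', "'s", 'church', ',', 'I', 'believe', '.']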

  4. Penn TreeBank: Brown Corpus (as of 11/1992)
     • POS Tags (Tokens): 1,172,041
     • Skeletal Parsing (Tokens): 1,172,041

  5. You know you’re in trouble when …
     “0. You will always have a certain amount of error. Sometimes there is just no way to find the head of a phrase, because it is tagged or parsed completely incorrectly. (no big surprise, that)”
     Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project
     robertm@unagi.cis.upenn.edu
     ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2

  6. Tree Conversion: Clean Case

     ( END_OF_TEXT_UNIT )
     ( END_OF_TEXT_UNIT )
     ( END_OF_TEXT_UNIT )
     ( (`` ``)
       (S
         (S
           (NP (PRP I) )
           (VP (VBP leave)
             (NP (DT this) (NN church) )
             (PP (IN with)
               (NP (DT a) (NN feeling)
                 (SBAR (IN that)
                   (S
                     (NP (DT a) (JJ great) (NN weight) )
                     (AUX (VBZ has) )
                     (VP (VBN been)
                       (VP (VBN lifted)
                         (PP (IN off)
                           (NP (PRP$ my) (NN heart) ))))))))))
         (, ,)
         (S
           (NP (PRP I) )
           (AUX (VBP have) )
           (VP
             (VP (VBN left)
               (NP (PRP$ my) (NN grudge) )
               (PP (IN at)
                 (NP (DT the) (NN altar) )))
             (CC and)
             (VP (VBN forgiven)
               (NP (PRP$ my) (NN neighbor) )))))
       ('' '') (. .) )
     ( END_OF_TEXT_UNIT )

     cb08_42: ``I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.

  7. Tree Conversion: Problematic Case

     Excerpt showing the problem:
     (NP (DT the) (NNS Women) ) (POS 's) (NN S.P.C.A.) ))) (: ;) (: ;)
     (NP (NP ($ $) (CD 15,000) ) (S (NP (-NONE- T) ) (AUX (TO to) ) (VP (VB pay) (NP (NP (CD six) (NNS policemen) )

     ( (S
         (NP (PRP He) )
         (VP (VBD reported)
           (SBAR (IN that)
             (S
               (NP
                 (NP (DT the) (NN city) )
                 (POS 's) (NNS contributions)
                 (PP (IN for)
                   (NP (NN animal) (NN care) )))
               (VP (VBD included)
                 (NP
                   (NP ($ $) (CD 67,000)
                     (PP (TO to)
                       (NP
                         (NP (DT the) (NNS Women) )
                         (POS 's) (NN S.P.C.A.) )))
                   (: ;) (: ;)
                   (NP
                     (NP ($ $) (CD 15,000) )
                     (S
                       (NP (-NONE- T) )
                       (AUX (TO to) )
                       (VP (VB pay)
                         (NP
                           (NP (CD six) (NNS policemen) )
                           (VP (VBN assigned)
                             (PP (IN as)
                               (NP (NN dog) (NNS catchers) )))))))
                   (CC and)
                   (NP
                     (NP ($ $) (CD 15,000) )
                     (S
                       (NP (-NONE- T) )
                       (AUX (TO to) )
                       (VP (VB investigate)
                         (NP (NN dog) (NNS bites) ))))))))))
       (. .) )
     ( END_OF_TEXT_UNIT )

     ca09_46: He reported that the city's contributions for animal care included $67,000 to the Women's S.P.C.A.;; $15,000 to pay six policemen assigned as dog catchers and $15,000 to investigate dog bites.
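A case like this needs post-processing beyond joining the leaves: the data contains a duplicated semicolon, (: ;) (: ;), and a possessive split across brackets, (NNS Women) (POS 's). The sketch below is one assumed way to handle both; the names and token sets are illustrative, not the cleanup the project actually used.

    CLITICS = {"'s", "n't", "'re", "'ve", "'ll", "'d"}
    PUNCT = {",", ";", ":", ".", "!", "?"}

    def detokenize(tokens):
        # Rebuild running text from leaf tokens: drop a punctuation mark that
        # immediately repeats (e.g. ";;") and attach clitics/punctuation to the left.
        out = []
        for tok in tokens:
            if out and tok in PUNCT and out[-1].endswith(tok):
                continue                  # duplicated punctuation in the data
            if out and (tok in CLITICS or tok in PUNCT):
                out[-1] += tok            # no space before 's, ';', '.', ...
            else:
                out.append(tok)
        return " ".join(out)

    print(detokenize(["the", "Women", "'s", "S.P.C.A.", ";", ";", "$", "15,000"]))
    # the Women's S.P.C.A.; $ 15,000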

  8. Summary of Problems Encountered
     • Typing errors
     • Punctuation duplication in data
     • Special notation for delimiter characters
       • RRB, LRB, RSB, LSB, RCB, LCB
     • Special null elements
       • ( -NONE- ): * 0 T NIL **

     Conventions for final output need to consider these lessons.
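The delimiter notation and null elements listed above can be normalized with a small lookup before text is emitted. This is a hedged sketch following the codes on the slide; the names are illustrative, and the full Treebank inventory of null elements is larger than the set shown here.

    # Bracket escape codes used in the data and their surface forms.
    BRACKETS = {
        "-LRB-": "(", "-RRB-": ")",
        "-LSB-": "[", "-RSB-": "]",
        "-LCB-": "{", "-RCB-": "}",
    }
    # Null-element leaves (they appear under the -NONE- tag) carry no surface text.
    NULL_ELEMENTS = {"*", "0", "T", "NIL", "**"}

    def surface_form(tag, word):
        # Return the printable token for a leaf, or None if nothing should be emitted.
        if tag == "-NONE-" or word in NULL_ELEMENTS:
            return None
        return BRACKETS.get(word, word)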

  9. Future Recommendations
     • Put the POS tree data into a proper database
       • Increases confidence in the correctness of the data
       • Minimizes error
       • Spend more effort upfront, *once*, to clean the data
       • SQL queries are more reusable than (write-only) Perl scripts
         • Due to variable graduate-student ability
     • If the database option is not available
       • Avoid duplication of data in the final output
       • Avoid text delimiters that also occur as data tokens (“ ‘ , \s)
       • Use thoughtful labeling conventions
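A minimal sketch of the database recommendation, using Python's built-in sqlite3 for concreteness. The schema, table, and column names are assumptions for illustration, not a schema the project defined; the point is that a token table plus an ordinary SQL query replaces a one-off conversion script.

    import sqlite3

    conn = sqlite3.connect("treebank.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tokens (
            text_unit TEXT,      -- e.g. 'cb08_42'
            position  INTEGER,   -- token index within the text unit
            pos_tag   TEXT,      -- Penn Treebank POS tag
            word      TEXT
        )
    """)
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?, ?)",
        [("cb08_42", 0, "PRP", "I"), ("cb08_42", 1, "VBP", "leave")],
    )
    conn.commit()

    # A reusable query instead of a write-once script: rebuild a text unit in order.
    rows = conn.execute(
        "SELECT word FROM tokens WHERE text_unit = ? ORDER BY position",
        ("cb08_42",),
    )
    print(" ".join(word for (word,) in rows))   # I leave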
