
Regular Expressions and Xkwic

Regular Expressions and Xkwic. LING 5200 Computational Corpus Linguistics, Martha Palmer, February 28, 2006.


Presentation Transcript


  1. Regular Expressions and Xkwic
     LING 5200 Computational Corpus Linguistics
     Martha Palmer
     February 28, 2006

  2. grep/egrep
     • x+ instead of xx* (one or more x)
     • (xxx|yyy) matches xxx OR yyy
     • ? matches zero or one occurrence of the preceding character or character set
     Based on Kevin Cohen's LING 5200
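A quick way to check these three operators is to run them on a few throwaway strings. The sketch below uses `grep -E`, which accepts the same extended syntax as `egrep`; the sample words are invented for illustration and are not from any corpus used in the course.

```shell
# Invented sample words, one per line.
words=$(printf '%s\n' x xx xxx yyy color colour colouur)

# x+ and xx* both mean "one or more x"; -x forces a whole-line match,
# -c prints the number of matching lines (like piping to wc -l).
plus=$(printf '%s\n' "$words" | grep -E -x -c 'x+')
star=$(printf '%s\n' "$words" | grep -E -x -c 'xx*')

# (xxx|yyy) matches either alternative.
alt=$(printf '%s\n' "$words" | grep -E -x -c '(xxx|yyy)')

# u? matches zero or one u: color and colour, but not colouur.
opt=$(printf '%s\n' "$words" | grep -E -x -c 'colou?r')

echo "x+: $plus  xx*: $star  (xxx|yyy): $alt  colou?r: $opt"
```

The first two counts agree because `x+` is shorthand for `xx*`.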

  3. More grepping/egrepping
     • /corpora/celex/english/epw/epw.cd
     • Find all capitalized words
     • grep ^'[0-9][0-9]*.[A-Z]' epw.cd | wc -l
     • OR
     • egrep ^'[0-9]+.[A-Z]' epw.cd | wc -l

  4. Homework 3
     • Please give me command AND results!
     • 1. In the file /corpora/celex/english/epw/epw.cd, find all words that contain only upper-case letters, e.g. USSR and VTOL.
     • ANS: 158
     • grep '^[0-9][0-9]*\\[A-Z][A-Z]*\\' epw.cd | wc -l
     • egrep '^[0-9]+\\[A-Z]+\\' epw.cd | wc -l
     • egrep ^'[0-9]+[\][A-Z]+\\' epw.cd | wc -l
     • egrep ^'[0-9]+.[A-Z]+\\' epw.cd | wc -l
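The answers above spell the backslash field separator in three ways: as an escaped backslash (`\\`), as a bracket expression (`[\]`), and as `.`, which matches any single character. A minimal sketch of why they agree here, run on made-up entries in a simplified `id\word\rest` layout (these are not real CELEX lines, and the temporary file merely stands in for epw.cd):

```shell
# Make [A-Z] mean ASCII upper case only, whatever the user's locale is.
export LC_ALL=C

# Made-up entries in a simplified id\word\rest layout (not real CELEX data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\USSR\x
2\Union Jack\x
3\VTOL\x
4\beetle\x
5\Xmas\x
EOF

# Three spellings of the separator that follows the entry number.
n_escaped=$(grep -E -c '^[0-9]+\\[A-Z]+\\' "$sample")
n_bracket=$(grep -E -c '^[0-9]+[\][A-Z]+\\' "$sample")
n_dot=$(grep -E -c '^[0-9]+.[A-Z]+\\' "$sample")

echo "escaped: $n_escaped  bracket: $n_bracket  dot: $n_dot"
rm -f "$sample"
```

All three count only the USSR and VTOL lines. The closing `\\` is what rules out Xmas and Union Jack, because it requires the upper-case run to fill the whole word field.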

  5. Homework 3
     • 2. How many entries have a syllable that ends with a 4-consonant cluster?
     • ANS: 45
     • egrep 'CCCC]' epw.cd (why not \]?): 56
     • grep 'CCCC]' epw.cd: 56
     • grep 'CCCC]' epw.cd | grep -v 'ed[ \\]': 36
     • egrep 'CCCC]\\' epw.cd: 45
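On the "why not \]" question: outside a bracket expression a closing `]` is an ordinary character, so `CCCC]` and `CCCC\]` select the same lines. The sketch below illustrates that, and shows how the trailing `\\` narrows the match to a cluster that ends the syllable-structure field. The three entries are invented and heavily simplified; they are not real CELEX lines.

```shell
# Invented, simplified entries: id\word\[syllable structure]\[phones]
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\alpha\[CVCCCC]\[aaaaaa]
2\beta\[CVCCCC][CVC]\[bbbbbb][bbb]
3\gamma\[CVCC]\[cccc]
EOF

# A bare ] is literal outside a bracket expression.
n_plain=$(grep -E -c 'CCCC]' "$sample")
# Escaping it selects the same lines (some grep versions may print a warning).
n_escaped=$(grep -E -c 'CCCC\]' "$sample")
# Requiring a backslash right after ] keeps only a word-final cluster.
n_final=$(grep -E -c 'CCCC]\\' "$sample")

echo "plain: $n_plain  escaped: $n_escaped  field-final: $n_final"
rm -f "$sample"
```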

  6. Homework 3
     • 3. Find all multi-word terms in which only the first letter is capitalized, e.g. Colorado potato beetle.
     • ANS: 238/243
     • egrep ^'[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' epw.cd | wc -l
     • egrep ^'[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' epw.cd | wc -l
     • Entries such as these have a one-letter first word, so only the second pattern (with [a-z]*) matches them:
       100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
       100185\X chromosomes\0\52203\1\P\'Eks-"kr5-m@-s5mz\[VCC][CCVV][CV][CVVCC]\[Eks][kr@U][m@][s@Umz]
       100287\Y chromosome\0\52250\1\P\'w2-"kr5-m@-s5m\[CVV][CCVV][CV][CVVC]\[waI][kr@U][m@][s@Um]
       100288\Y chromosomes\0\52250\1\P\'w2-"kr5-m@-s5mz\[CVV][CCVV][CV][CVVCC]\[waI][kr@U][m@][s@Umz]
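The effect of `[a-z]+` versus `[a-z]*` after the capital letter can be checked on a tiny file. In this sketch the two chromosome lines are copied from the slide, while the "Colorado potato beetle" line is an invented, shortened stand-in (the real entry is not shown in the slides).

```shell
# Make [A-Z] and [a-z] mean ASCII letters only, whatever the locale is.
export LC_ALL=C

sample=$(mktemp)
cat > "$sample" <<'EOF'
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
100288\Y chromosomes\0\52250\1\P\'w2-"kr5-m@-s5mz\[CVV][CCVV][CV][CVVCC]\[waI][kr@U][m@][s@Umz]
1\Colorado potato beetle\0
EOF

# [a-z]+ demands at least one lower-case letter after the capital.
strict=$(grep -E -c '^[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' "$sample")
# [a-z]* also accepts a first word that is a single capital letter.
loose=$(grep -E -c '^[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' "$sample")

echo "strict: $strict  loose: $loose"
rm -f "$sample"
```

Only the stand-in line satisfies the stricter pattern; the looser one accepts all three.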

  7. Homework 3
     • 4. Find all multi-word terms in which the first letter (and only the first letter) of each word is capitalized, e.g. Union Jacks and Royal Automobile Club. Note: your regex should be able to accommodate an arbitrary number of words.
     • ANS: 296/298
     • egrep ^'[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' epw.cd
     • egrep ^'[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' epw.cd
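The group `( [A-Z][a-z]*)+` is what lets the pattern cover any number of further words: each repetition consumes one space plus one capitalized word. A small sketch on invented, shortened entries (only the word field resembles the real file):

```shell
# Make [A-Z] and [a-z] mean ASCII letters only, whatever the locale is.
export LC_ALL=C

# Invented, shortened entries in an id\word\rest layout.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\Union Jacks\0
2\Royal Automobile Club\0
3\Colorado potato beetle\0
4\USSR\0
EOF

# One capitalized word, then one or more " Capitalized" words, then the separator.
titlecase=$(grep -E -c '^[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' "$sample")

echo "title-case multi-word entries: $titlecase"
rm -f "$sample"
```

The two- and three-word title-case entries both match, while "Colorado potato beetle" and "USSR" do not.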

  8. Homework 3
     • 5. Find all disyllabic words that contain only vowels.
     • ANS: 4
     • egrep '\\\[V+\]\[V+\]\\' epw.cd
       5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
       6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
       4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
       43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
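In this pattern the leading and trailing `\\` pin the match to a complete syllable-structure field, so exactly two bracketed syllables are allowed and each may hold only `V`. A sketch using two of the result lines above plus the "X chromosome" entry from question 3 as a non-match:

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
EOF

# backslash, [V...], [V...], backslash: two vowel-only syllables and nothing else.
# (Some grep versions may print a warning about the escaped ].)
vowel_only=$(grep -E -c '\\\[V+\]\[V+\]\\' "$sample")

echo "disyllabic vowel-only entries: $vowel_only"
rm -f "$sample"
```

Note that "ayah" is found through its first pronunciation (`[VV][V]`), even though its second one contains a consonant.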

  9. Homework 3
     • 6. Multiword expressions (find a similar phrase in the wsj/raw corpus, and search for all variants of it in the entire corpus)
     • egrep -i '.tip of the *[a-z] iceberg'
     • egrep '[Tt]he tip of (a|the).* iceberg'
     • patriarchical / a more alarming
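The second pattern leaves a `.*` gap before "iceberg" so that modifiers can intervene. A minimal sketch of how it picks up variants, run on sentences invented for illustration (they are not taken from wsj/raw; only the two modifiers come from the slide):

```shell
# Invented sentences, one per line.
sentences=$(mktemp)
cat > "$sentences" <<'EOF'
This is only the tip of the iceberg.
The tip of a more alarming iceberg came into view.
It is the tip of the patriarchical iceberg.
The word was on the tip of his tongue.
EOF

# (a|the) picks the determiner; .* absorbs any modifiers before " iceberg".
variants=$(grep -E -c '[Tt]he tip of (a|the).* iceberg' "$sentences")

echo "lines with an iceberg variant: $variants"
rm -f "$sentences"
```

Because `.*` can also match nothing, the plain phrase is found along with the modified ones.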

  10. Homework 3
     • 6. Other multiword expressions
     • war on (inflation/drugs/the dictator)
     • fight the war on the expenditure side rather
     • rule of (the day/journalism/Ferdinand Marcos)
     • cream of the (British) crop

  11. Searching the treebank
     • cat ??/* | egrep -i '(push|pull)[a-z]*'
     • OR xkwic?
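In the `cat` command, the glob `??/*` expands to every file inside every directory whose name is exactly two characters long. The sketch below imitates that on a throwaway directory tree; the directory names, file names, and sentences are all invented and say nothing about the real treebank layout.

```shell
# Throwaway tree standing in for a corpus directory (all names invented).
corpus=$(mktemp -d)
mkdir "$corpus/00" "$corpus/01" "$corpus/notes"
printf '%s\n' 'Investors pushed prices higher.' > "$corpus/00/a.txt"
printf '%s\n' 'The firm is PULLING out.' 'Nothing here.' > "$corpus/01/b.txt"
printf '%s\n' 'push (this directory name is too long for ??)' > "$corpus/notes/c.txt"

# ?? matches exactly two characters, so only 00/ and 01/ are read;
# -i makes push/pull and the following letters case-insensitive.
hits=$(cd "$corpus" && cat ??/* | grep -E -i -c '(push|pull)[a-z]*')

echo "matching lines: $hits"
rm -r "$corpus"
```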

  12. XWin 32
     • See e-mail
     • Load on laptops, bring laptops to class if any issues
     • Go to Feb 9 Emacs & Xkwic lecture
