Advanced Text Processing Techniques in Corpus Linguistics
Explore advanced text processing techniques using regular expressions and Xkwic for computational corpus linguistics. Learn to find specific patterns in text data for linguistic analysis.
Regular Expressions and Xkwic • LING 5200 Computational Corpus Linguistics • Martha Palmer • February 28, 2006
grep/egrep • x+ instead of xx* • (xxx|yyy) matches xxx OR yyy • ? matches zero or one occurrence of the preceding character or character set • Based on Kevin Cohen's LING 5200
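The three operators on this slide can be tried on a handful of made-up words. This is only a sketch: the word list is invented for illustration, and `grep -E` is used as the current spelling of `egrep`.

```shell
# Made-up word list, one word per line (not taken from any corpus).
words=$(printf '%s\n' color colour cat dog bird xx x y)

# x+ (extended syntax) and xx* (basic syntax) select the same lines.
plus=$(printf '%s\n' "$words" | grep -E 'x+')
star=$(printf '%s\n' "$words" | grep 'xx*')

# (cat|dog) matches either alternative.
alt=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')

# u? makes the u optional, so both spellings match.
opt=$(printf '%s\n' "$words" | grep -E '^colou?r$')

printf '%s\n' "$plus" "$alt" "$opt"
```

Here `plus` and `star` hold the same two lines (`xx` and `x`), which is the point of writing `x+` instead of `xx*`.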
More grepping/egrepping • /corpora/celex/english/epw/epw.cd • Find all capitalized words • grep ^'[0-9][0-9]*.[A-Z]' epw.cd | wc -l • OR • egrep ^'[0-9]+.[A-Z]' epw.cd | wc -l
Homework 3 • Please give me the command AND the results! • 1. In the file /corpora/celex/english/epw/epw.cd, find all words that contain only upper-case letters, e.g. USSR and VTOL. • ANS: 158 • grep '^[0-9][0-9]*\\[A-Z][A-Z]*\\' epw.cd | wc -l • egrep '^[0-9]+\\[A-Z]+\\' epw.cd | wc -l • egrep ^'[0-9]+[\][A-Z]+\\' epw.cd | wc -l • egrep ^'[0-9]+.[A-Z]+\\' epw.cd | wc -l
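A minimal sketch of question 1 on a toy file. The five entries below are hypothetical and keep only three backslash-separated fields (id, word, one more field); real epw.cd lines are much longer. `grep -c` is used in place of `| wc -l` to count matching lines.

```shell
# Hypothetical, heavily shortened entries: id\word\field
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\USSR\0
2\Union Jack\0
3\VTOL\0
4\abc\0
5\Xray\0
EOF

# Basic syntax: xx* stands in for x+, and \\ is one literal backslash.
n_basic=$(grep -c '^[0-9][0-9]*\\[A-Z][A-Z]*\\' "$sample")

# Extended syntax: the same pattern written with +.
n_ext=$(grep -Ec '^[0-9]+\\[A-Z]+\\' "$sample")

echo "$n_basic $n_ext"   # prints "2 2" (USSR and VTOL)
rm -f "$sample"
```

Anchoring on the closing `\\` is what rules out `Xray` and `Union Jack`: the word field has to end right after the run of capitals.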
Homework 3 • 2. How many entries have a syllable that ends with a 4-consonant cluster? • ANS: 45 • egrep 'CCCC]' epw.cd (why not \] ?) → 56 • grep 'CCCC]' epw.cd → 56 • grep 'CCCC]' epw.cd | grep -v 'ed[ \\]' → 36 • egrep 'CCCC]\\' epw.cd → 45
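The difference between `'CCCC]'` and `'CCCC]\\'` can be seen on a toy file. The entries are hypothetical and simplified to four fields (id, word, CV pattern, transcription), and their syllabifications have not been checked against CELEX. The `]` needs no backslash because a closing bracket with no opening `[` before it is an ordinary character.

```shell
# Hypothetical, simplified entries: id\word\CV pattern\transcription
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\texts\[CVCCCC]\[teksts]
2\sixths\[CVCCCC]\[sIksTs]
3\glimpsed\[CCVCCCC]\[glImpst]
4\madeup\[CVCCCC][CV]\[xxxxxx][xx]
5\cat\[CVC]\[k&t]
EOF

# Any syllable ending in four consonants.
any_syll=$(grep -Ec 'CCCC]' "$sample")

# Only when that syllable is the last one in the CV field,
# i.e. the ] is followed directly by the field separator.
last_syll=$(grep -Ec 'CCCC]\\' "$sample")

echo "$any_syll $last_syll"   # prints "4 3"
rm -f "$sample"
```

Entry 4 is the one that drops out: its four-consonant syllable is followed by another syllable rather than by the backslash.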
Homework 3 • 3. Find all multi-word terms in which only the first letter is capitalized, e.g. Colorado potato beetle. • ANS: 238/243 • egrep ^'[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' epw.cd | wc -l • egrep ^'[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' epw.cd | wc -l • Entries whose first word is a single capital letter match only the second pattern ([a-z]* rather than [a-z]+), e.g.:
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
100185\X chromosomes\0\52203\1\P\'Eks-"kr5-m@-s5mz\[VCC][CCVV][CV][CVVCC]\[Eks][kr@U][m@][s@Umz]
100287\Y chromosome\0\52250\1\P\'w2-"kr5-m@-s5m\[CVV][CCVV][CV][CVVC]\[waI][kr@U][m@][s@Um]
100288\Y chromosomes\0\52250\1\P\'w2-"kr5-m@-s5mz\[CVV][CCVV][CV][CVVCC]\[waI][kr@U][m@][s@Umz]
Homework 3 • 4. Find all multi-word terms in which the first letter (and only the first letter) of each word is capitalized, e.g. Union Jacks and Royal Automobile Club. Note: your regex should be able to accommodate an arbitrary number of words. • ANS: 296/298 • egrep ^'[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' epw.cd • egrep ^'[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' epw.cd
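Questions 3 and 4 differ only in what may follow each space, and the `+` versus `*` after the first capital decides whether a one-letter first word is allowed. A sketch on hypothetical shortened entries (id, word, one more field), using `cut` to print just the word field:

```shell
# Hypothetical, heavily shortened entries: id\word\field
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\Union Jacks\0
2\Royal Automobile Club\0
3\Colorado potato beetle\0
4\X chromosome\0
5\USSR\0
6\Union\0
EOF

# Question 3, [a-z]+ after the capital: the first word needs two or more letters.
q3_plus=$(grep -E '^[0-9]+\\[A-Z][a-z]+( [a-z]+)+\\' "$sample" | cut -d'\' -f2)

# Question 3, [a-z]* after the capital: a lone capital such as X is accepted.
q3_star=$(grep -E '^[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' "$sample" | cut -d'\' -f2)

# Question 4: every following word starts with a capital; (...)+ allows any number of words.
q4=$(grep -E '^[0-9]+\\[A-Z][a-z]*( [A-Z][a-z]*)+\\' "$sample" | cut -d'\' -f2)

printf '%s\n' "$q3_plus" "$q3_star" "$q4"
rm -f "$sample"
```

On this toy file `q3_plus` holds only Colorado potato beetle, `q3_star` adds X chromosome, and `q4` holds Union Jacks and Royal Automobile Club.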
Homework 3 • 5. Find all disyllabic words that contain only vowels. • ANS: 4 • egrep '\\\[V+\]\[V+\]\\' epw.cd
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
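The question 5 pattern can be checked against the four entries quoted on this slide, with the X chromosome entry from question 3 added as a non-matching control. The leading and trailing `\\` pin the two `[V+]` syllables to a complete CV field, so the field holds exactly two syllables.

```shell
# The four matching entries from the slide plus one control entry.
sample=$(mktemp)
cat > "$sample" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
EOF

# backslash, [V...], [V...], backslash: a CV field of exactly two all-vowel syllables.
n_vowel=$(grep -Ec '\\\[V+\]\[V+\]\\' "$sample")

echo "$n_vowel"   # prints "4"
rm -f "$sample"
```

The ayah entry matches on its first pronunciation (`[VV][V]`); its second one (`[VV][CV]`) would not match on its own.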
Homework 3 • 6. Multiword expressions (find a similar phrase in the wsj/raw corpus, and search for all variants of it in the entire corpus) • egrep -i '.tip of the *[a-z] iceberg' • egrep '[Tt]he tip of (a|the).* iceberg' • patriarchical / a more alarming
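To see why the looser pattern is needed for variants, compare it with a search for the fixed phrase. The four sentences below are invented for illustration and are not quotations from the wsj corpus.

```shell
# Invented example sentences, one per line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
This is just the tip of the iceberg.
The tip of a more alarming iceberg emerged.
It was the tip of a patriarchical iceberg.
The tip of his tongue.
EOF

# Fixed phrase only.
n_fixed=$(grep -c 'the tip of the iceberg' "$sample")

# a|the as determiner, then any material before " iceberg".
n_variants=$(grep -Ec '[Tt]he tip of (a|the).* iceberg' "$sample")

echo "$n_fixed $n_variants"   # prints "1 3"
rm -f "$sample"
```

Because `.*` matches the empty string, the plain "the tip of the iceberg" is still found alongside the modified variants.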
Homework 3 • 6. Other multiword expressions • war on (inflation/drugs/the dictator) • fight the war on the expenditure side rather • rule of (the day/journalism/Ferdinand Marcos) • cream of the (British) crop
Searching the treebank • cat ??/* | egrep -i '(push|pull)[a-z]*' • OR xkwic?
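The `??/*` glob picks up every file inside every two-character directory. A sketch on a throwaway directory tree: the directory names, file names and sentences are all hypothetical stand-ins for the treebank layout.

```shell
# Hypothetical miniature tree: two-character section directories with one file each.
tb=$(mktemp -d)
mkdir "$tb/00" "$tb/01"
printf '%s\n' 'He pushed the cart .' 'She sat .' > "$tb/00/wsj_0001"
printf '%s\n' 'Pulling out was hard .' > "$tb/01/wsj_0101"
# A file outside the two-character directories is not reached by ??/*
printf '%s\n' 'push (top-level file)' > "$tb/README"

# -i ignores case; -c counts matching lines instead of printing them.
hits=$(cd "$tb" && cat ??/* | grep -Eic '(push|pull)[a-z]*')

echo "$hits"   # prints "2"
rm -rf "$tb"
```

For selecting lines the trailing `[a-z]*` changes nothing, since it can match the empty string; it only matters when the matched text itself is extracted.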
XWin 32 • See e-mail • Load on laptops, bring laptops to class if any issues • Go to Feb 9 Emacs & Xkwic lecture