Extraction of Cooccurrence Data

Presentation Transcript


  1. Extraction of Cooccurrence Data

  2. Basic Assumptions • The collocates of a collocation cooccur more frequently within text than arbitrary word combinations. (Recurrence) • Stricter control of cooccurrence data leads to more meaningful results in collocation extraction.

  3. Word (Co)occurrence • Distribution of words and word combinations in text approximately described by Zipf’s law. • Distribution of combinations is more “extreme” than that of individual words.

  4. Word (Co)occurrence • Zipf's law: nm ≈ C / m², where nm is the number of different words occurring exactly m times • i.e., there is a large number of low-frequency words, and few high-frequency ones
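
Not on the slides: a minimal Perl sketch of how the frequency spectrum nm could be computed, assuming a list @tokens that holds the word tokens of the corpus.

    # type frequencies: how often each word type occurs in @tokens
    my %freq;
    $freq{$_}++ foreach @tokens;

    # frequency spectrum: $spectrum{$m} = number of types occurring exactly $m times
    my %spectrum;
    $spectrum{$_}++ foreach values %freq;

    # print the spectrum, lowest frequency class first
    foreach my $m (sort { $a <=> $b } keys %spectrum) {
        print "$m\t$spectrum{$m}\n";
    }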

  5. Word (Co)occurrence • Collocations will preferably be found among highly recurrent word combinations extracted from text. • Large amounts of text need to be processed to obtain a sufficient number of high-frequency combinations.

  6. Control of Candidate Data • Extract collocations from relational bigrams • Syntactic homogeneity of candidate data • (Grammatical) cleanness of candidates e.g. N+V pairs: Subject+V vs. Object+V • Text type, domain, and size of source corpus influence the outcome of collocation extraction

  7. Terminology • Extraction corpus: tokenized, pos-tagged or syntactically analysed text • Base data: list of bigrams found in the corpus • Cooccurrence data: bigrams with contingency tables • Collocation candidates: ranked bigrams

  8. Types and Tokens • Frequency counts (from corpora) • identify labelled units (tokens), e.g. words, NPs, Adj-N pairs • set of different labels (types) • type frequency = number of tokens labelled with this type • example: ... what the black box does ...

  10. Types and Tokens • Counting cooccurrences • bigram tokens = pairs of word tokens • bigram types = pairs of word types • contingency table = four-way classification of bigram tokens according to their components

  11. Contingency Tables • contingency table for pair type (u, v)
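
The table graphic itself is not part of the transcript; its cells, consistent with the Perl code on slide 40, are:

    O11 = f(u,v)                        pair tokens (u, v)
    O12 = f1(u) - f(u,v)                pair tokens (u, v') with v' ≠ v
    O21 = f2(v) - f(u,v)                pair tokens (u', v) with u' ≠ u
    O22 = N - f1(u) - f2(v) + f(u,v)    pair tokens (u', v') with u' ≠ u and v' ≠ v

Here f(u,v) is the frequency of the pair type, f1(u) and f2(v) are the marginal frequencies of its components, and N is the total number of pair tokens.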

  12. Collocation Extraction: Processing Steps • Corpus preprocessing • tokenization (orthographic words) • pos-tagging • morphological analysis / lemmatization • partial parsing • (full parsing)

  13. Collocation Extraction: Processing Steps • Extraction of base data from corpus • adjacent word pairs • Adj-N pairs from NP chunks • Object-V & Subject-V from parse trees • Calculation of cooccurrence data • compute contingency table for each pair type (u,v)

  14. Collocation Extraction: Processing Steps • Ranking of cooccurrence data by "association scores" • measure statistical association between types u and v • true collocations should obtain high scores • using association measures (AM) • N-best list = listing of N highest-ranked collocation candidates
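
The slides do not commit to a particular association measure. As one concrete illustration only, the sketch below (not from the slides) ranks candidates by pointwise mutual information, reusing the hashes %F, %F1, %F2 and the sample size $N built in the Perl code on slides 39 and 40; the helper name n_best is illustrative.

    # hypothetical helper: return the $n highest-ranked collocation candidates
    sub n_best {
        my ($n) = @_;
        my %score;
        foreach my $pair (keys %F) {
            my ($u, $v) = split /,/, $pair;
            my $E11 = $F1{$u} * $F2{$v} / $N;        # expected frequency under independence
            $score{$pair} = log($F{$pair} / $E11);   # pointwise MI = log(O11 / E11)
        }
        my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
        $n = @ranked if $n > @ranked;
        return @ranked[0 .. $n - 1];                 # N-best list
    }

Other association measures would only change the body of the scoring loop.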

  15. Base Data: How to Get It? • Adj-N • adjacency data • numerical span • NP chunking • (lemmatized)

  16. Base Data: How to Get It? • V-N • adjacency data • sentence window • (partial) parsing • identification of grammatical relations • (lemmatized)

  17. Base Data: How to Get It? • PP-V • adjacency data • PP chunking • separable verb particles (in German) • (full syntactic analysis) • (lemmatization?)

  18. Adj-N • In the first place, the ‘less genes, more behavioural flexibility’ argument is a total red herring. • In/PRP the/ART first/ORD place/N ,/$, the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N is/V a/ART total/ADJ red/ADJ herring/N ./$.

  19. span size 1 (adjacency): wj, j = -1 • first/ORD place/N • less/ADJ genes/N • behavioural/ADJ flexibility/N • ’/$’ argument/N • red/ADJ herring/N • Adj-N: pos(wi) = N

  20. span size 2: wj, j = -2, -1 • first/ORD place/N • the/ART place/N • less/ADJ genes/N • ‘/$’ genes/N • behavioural/ADJ flexibility/N • more/ADJ flexibility/N • ’/$’ argument/N • flexibility/N argument/N • red/ADJ herring/N • total/ADJ herring/N • Adj-N: pos(wi) = N

  21. span size 2: wj, j = -2, -1 • first/ORD place/N • the/ART place/N • less/ADJ genes/N • ‘/$’ genes/N • behavioural/ADJ flexibility/N • more/ADJ flexibility/N • ’/$’ argument/N • flexibility/N argument/N • red/ADJ herring/N • total/ADJ herring/N • Adj-N: pos(wj) = ADJ, pos(wi) = N
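
A minimal sketch (not from the slides) of the span-based extraction shown on slides 19-21: every noun wi is paired with the words wj at offsets j = -span, ..., -1, optionally keeping only adjectives (the filter added on slide 21). The function and variable names are illustrative.

    sub span_pairs {
        my ($span, $adj_only, @tagged) = @_;   # @tagged: "word/TAG" tokens of one sentence
        my @pairs;
        for my $i (0 .. $#tagged) {
            my ($noun, $tag_i) = split m{/}, $tagged[$i];
            next unless $tag_i eq 'N';                    # pos(wi) = N
            for my $j (1 .. $span) {
                next if $i - $j < 0;
                my ($word, $tag_j) = split m{/}, $tagged[$i - $j];
                next if $adj_only and $tag_j ne 'ADJ';    # pos(wj) = ADJ
                push @pairs, [$word, $noun];
            }
        }
        return @pairs;
    }

span_pairs(1, 0, @tagged) reproduces the candidate list of slide 19; span_pairs(2, 1, @tagged) keeps only the Adj-N pairs of slide 21.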

  22. Adj-N (S (PP In/PRP (NP the/ART first/ORD place/N ) ) ,/$, (NP the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N ) (VP is/V (NP a/ART total/ADJ red/ADJ herring/N ) ) ) ./$.

  23. Adj-N (S (PP-mod In/PRP (NP the/ART first/ORD place/N ) ) ,/$, (NP-subj the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N ) (VP-copula is/V (NP a/ART total/ADJ red/ADJ herring/N ) ) ) ./$.

  24. Adj-N: NP chunks • NP chunks • (NP the/ART first/ORD place/N ) • (NP the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N ) • (NP a/ART total/ADJ red/ADJ herring/N ) • Adj-N Pairs • less/ADJ genes/N • more/ADJ flexibility/N • behavioural/ADJ flexibility/N • more/ADJ argument/N • behavioural/ADJ argument/N • total/ADJ herring/N • red/ADJ herring/N
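
The slide does not spell out the pairing rule explicitly. One simple heuristic (an assumption, but it reproduces the pairs listed above) is to split each NP chunk at commas and pair every adjective with every noun that follows it within the same segment:

    sub adj_n_pairs_from_chunk {
        my @tagged = @_;                  # "word/TAG" tokens of one NP chunk
        my (@pairs, @adjs);
        foreach my $tok (@tagged) {
            my ($word, $tag) = split m{/}, $tok;
            if ($tag eq '$,') {           # a comma closes the current segment
                @adjs = ();
            } elsif ($tag eq 'ADJ') {
                push @adjs, $word;        # adjectives seen so far in this segment
            } elsif ($tag eq 'N') {
                push @pairs, [$_, $word] for @adjs;   # pair the noun with each of them
            }
        }
        return @pairs;
    }

Applied to (NP a/ART total/ADJ red/ADJ herring/N ) this yields (total, herring) and (red, herring).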

  25. N-V: Object-VERB • spill the beans • Good for you for guessing the puzzle but from the beans Mike spilled to me, I think those kind of twists are more maddening than fun. • bury the hatchet • Paul McCartney has buried the hatchet with Yoko Ono after a dispute over the songwriting credits of some of the best-known Beatles songs.

  26. N-V: Object-Mod-VERB • keep <one's> nose to the grindstone • I'm very impressed with you for having kept your nose to the grindstone, I'd like to offer you a managerial position. • We’ve learned from experience and kept our nose to the grindstone to make sure our future remains a bright one. • She keeps her nose to the grindstone.

  27. N-V: Object-Mod-VERB • keep <one‘s> nose to the grindstone (VP {kept, keeps, ...} {(NP-obj your nose), (NP-obj our nose), (NP-obj her nose), ... } (PP-mod to the grindstone) )

  28. PN-V: P-Object-VERB • zur Verfügung stellen (make available): Peter stellt sein Auto Maria zur Verfügung (Peter makes his car available to Maria) • in Frage stellen (question): Peter stellt Marias Loyalität in Frage (Peter questions Maria’s loyalty) • in Verbindung setzen (to contact): Peter setzt sich mit Maria in Verbindung (Peter contacts Maria)

  29. Contingency Tables for Relational Cooccurrences • (big, dog) (black, box) (black, dog) (small, cat) (small, box) (black, box) (old, box) (tabby, cat) • pair type: (u,v) = (black, box)

  35. Contingency Tables for Relational Cooccurrences • (big, dog) (black, box) (black, dog) (small, cat) (small, box) (black, box) (old, box) (tabby, cat) • f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8

  36. Contingency Tables for Relational Cooccurrences • f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8
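
The filled-in table shown on this slide did not survive the transcript; with these counts and the formulas from slide 40 it works out to O11 = 2, O12 = f1 - f = 1, O21 = f2 - f = 2, O22 = N - f1 - f2 + f = 3 (the four cells sum to N = 8).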

  38. Contingency Tables for Relational Cooccurrences • real data from the BNC (adjacent adj-noun pairs, lemmatised)

  39. Contingency Tables in Perl

    # collect pair and marginal frequencies from the base data
    %F = (); %F1 = (); %F2 = (); $N = 0;
    while (($u, $v) = get_pair()) {
        $F{"$u,$v"}++;   # pair frequency f(u,v)
        $F1{$u}++;       # marginal frequency f1(u)
        $F2{$v}++;       # marginal frequency f2(v)
        $N++;            # sample size: total number of pair tokens
    }

  40. Contingency Tables in Perl

    # build the contingency table for each pair type (u,v)
    foreach $pair (keys %F) {
        ($u, $v) = split /,/, $pair;
        $f  = $F{$pair};
        $f1 = $F1{$u};
        $f2 = $F2{$v};
        $O11 = $f;                    # pairs (u, v)
        $O12 = $f1 - $f;              # pairs (u, not v)
        $O21 = $f2 - $f;              # pairs (not u, v)
        $O22 = $N - $f1 - $f2 + $f;   # pairs (not u, not v)
        # ...
    }

  41. Reminder: Contingency Table with Row and Column Sums
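
The table graphic is missing here as well; the row and column sums it refers to are R1 = O11 + O12 = f1(u), R2 = O21 + O22 = N - f1(u), C1 = O11 + O21 = f2(v), C2 = O12 + O22 = N - f2(v), with R1 + R2 = C1 + C2 = N.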

  42. Why are Positional Cooccurrences Different? • adjectives and nouns cooccurring within sentences • "I saw a black dog": (black, dog); f(black, dog) = 1, f1(black) = 1, f2(dog) = 1 • "The old man with the silly brown hat saw a black dog": (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); f(black, dog) = 1, f1(black) = 3, f2(dog) = 4

  43. Why are Positional Cooccurrences Different? • "wrong" combinations could be considered extraction noise (association measures distinguish noise from recurrent combinations) • but: very large amount of noise • statistical models assume that noise is completely random • but: marginal frequencies often increase in large steps

  44. Contingency Tables for Segment-Based Cooccurrences • within pre-determined segments (e.g. sentences) • components of cooccurring pairs may be syntactically restricted (e.g. adj-noun, noun(Sg)-verb(3.Sg)) • for a given pair type (u,v), the set of all sentences is classified into four categories

  45. Contingency Tables for Segment-Based Cooccurrences • u ∈ S = at least one occurrence of u in sentence S • u ∉ S = no occurrence of u in sentence S • v ∈ S = at least one occurrence of v in sentence S • v ∉ S = no occurrence of v in sentence S

  46. Contingency Tables for Segment-Based Cooccurrences • fS(u,v) = number of sentences containing both u and v • fS(u) = number of sentences containing u • fS(v) = number of sentences containing v • NS = total number of sentences

  47. Frequency Counts for Segment-Based Cooccurrences • adjectives and nouns cooccurring within sentences • "I saw a black dog": (black, dog); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1 • "The old man with the silly brown hat saw a black dog": (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1

  48. Segment-Based Cooccurrences in Perl

    foreach $S (@sentences) {
        %words = map { $_ => 1 } words($S);   # word types occurring in sentence S
        %pairs = map { $_ => 1 } pairs($S);   # pair types occurring in sentence S
        foreach $w (keys %words) { $FS_w{$w}++; }   # sentence frequencies fS(u), fS(v)
        foreach $p (keys %pairs) { $FS_p{$p}++; }   # sentence frequency fS(u,v)
        $NS++;                                      # NS = total number of sentences
    }
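
Not on the slides: assuming pairs($S) returns keys of the same "$u,$v" form used on slide 39, the segment-based contingency table for each pair type then follows the same pattern as before, with sentences as the counting unit.

    foreach $p (keys %FS_p) {
        ($u, $v) = split /,/, $p;
        $O11 = $FS_p{$p};                            # sentences containing both u and v
        $O12 = $FS_w{$u} - $O11;                     # sentences with u but without v
        $O21 = $FS_w{$v} - $O11;                     # sentences with v but without u
        $O22 = $NS - $FS_w{$u} - $FS_w{$v} + $O11;   # sentences with neither u nor v
        # ...
    }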

  49. Contingency Tables for Distance-Based Cooccurrences • problems are similar to segment-based cooccurrence data • but: no pre-defined segments • accurate counting is difficult • here: sketched for a special case • all orthographic words • numerical span: nL left, nR right • no stop word lists

  50. Contingency Tables for Distance-Based Cooccurrences • nL = 3, nR = 2
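
The illustration for nL = 3, nR = 2 is not in the transcript. Below is a minimal sketch (not from the slides) of how the distance-based pair token frequencies could be counted over a flat list @tokens of orthographic words, with no sentence segmentation and no stop word list; %Fdist, $nL and $nR are illustrative names. As slide 49 notes, obtaining consistent marginal frequencies for this design is the harder part and is not attempted here.

    $nL = 3; $nR = 2;
    for my $i (0 .. $#tokens) {
        my $lo = $i - $nL < 0        ? 0        : $i - $nL;   # span must not cross the text start
        my $hi = $i + $nR > $#tokens ? $#tokens : $i + $nR;   # ... or the text end
        for my $j ($lo .. $hi) {
            next if $j == $i;                      # a word does not cooccur with itself
            $Fdist{"$tokens[$i],$tokens[$j]"}++;   # one distance-based cooccurrence token (u, v)
        }
    }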
