Dataset Profiling
Anne De Roeck, Udo Kruschwitz, Nick Webb, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite
Centre for Research in Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK
Fact or Factoid: Hyperlinks
• Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999; Hawking et al. 1999).
• Hyperlinks do significantly improve recall and precision in narrow domains and intranets (Chen et al. 1999; Kruschwitz 2001).
Fact or Factoid: Stemming
• Stemming does not improve effectiveness of retrieval (Harman 1991).
• Stemming improves performance for morphologically complex languages (Popovitch and Willett 1992).
• Stemming improves performance on short documents (Krovetz 1993).
Fact or Factoid: Long or Short
• Stemming improves performance on short documents (Krovetz 1993).
• Short keyword-based queries behave differently from long structured queries (Fujii and Croft 1999).
• Keyword-based retrieval works better on long texts (Jurafsky and Martin 2000).
Assumption
• Successful (statistical?) techniques can be successfully ported to other languages:
  • Western European languages
  • Japanese, Chinese, Malay, …
• WordSmith: effective use requires a 5M-word corpus (Garside 2000).
Fact
• Performance of IR and NLP techniques depends on the characteristics of the dataset.
• Performance will vary with task, technique and language.
Cargo Cult Science?
• Richard Feynman (1974): “It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.”
Cargo Cult Science?
• Richard Feynman (1974): “Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it.”
• “In summary, the idea is to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another.”
Cargo Cult Science?
• The role of data in the outcome of experiments must be clarified.
• Why?
• How?
Why Profile Datasets?
• Methodological: replicability
  • Barbu and Mitkov (2001) – anaphora resolution
  • Donaway et al. (2000) – automatic summarisation
• Epistemological: theory induction
  • What is the relationship between dataset properties and application performance?
• Practical: application
  • What is the relationship between two datasets?
  • What is this dataset (language?) like?
Why Profile Datasets?
• And by the way, the others think it is vital (machine learning, data mining, pattern matching, etc.).
• And so did we! (Or do we?)
Profiling: An Abandoned Agenda?
• Sparck Jones (1973) “Collection properties influencing automatic term classification performance.” Information Storage and Retrieval, Vol. 9.
• Sparck Jones (1975) “A performance yardstick for test collections.” Journal of Documentation, 31:4.
Profiling: An Abandoned Agenda
• Term weighting formula tailored to query
  • Salton 1972
• Stop word identification relative to collection/query
  • Wilbur & Sirotkin 1992; Yang & Wilbur 1996
• Effect of collection homogeneity on language model quality
  • Rose & Haddock 1997
What has changed?
• Proliferation of (test) collections
• More data per collection
• Increased application need
• Better (ways of computing) measures?
What has changed?
• Sparck Jones (1973):
  • Is a collection useably classifiable?
    • Number of query terms which can be used for matching
  • Is a collection usefully classifiable?
    • Number of useful, linked terms in document or collection
  • Is a collection classifiable?
    • Size of vocabulary and rate of incidence
Profiling Measures
• Requirements: measures should be
  • relevant to NLP techniques
  • fine grained
  • cheap to implement(!)
• Simple starting point:
  • Vital statistics
  • Zipf (sparseness; idiosyncrasy)
  • Type-to-token ratio (sparseness, specialisation)
  • Manual sampling (quality; content)
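The "vital statistics" named above are indeed cheap to compute. A minimal sketch (the function name and the returned fields are illustrative choices of mine, not from the slides):

```python
from collections import Counter

def profile(tokens):
    """Simple 'vital statistics' for a tokenised corpus: token count,
    vocabulary size, type-to-token ratio, and the head of the Zipf
    rank-frequency curve (rank * frequency is roughly constant)."""
    freqs = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freqs)
    ttr = n_types / n_tokens  # type-to-token ratio (sparseness)
    # Zipf check on the ten most frequent types: rank * frequency
    ranked = freqs.most_common(10)
    zipf_head = [(rank + 1) * f for rank, (_, f) in enumerate(ranked)]
    return {"tokens": n_tokens, "types": n_types, "ttr": ttr,
            "zipf_head": zipf_head}

stats = profile("the cat sat on the mat and the dog sat too".split())
```

On a real corpus the `zipf_head` products staying roughly constant is a quick sanity check that the collection is Zipf-like; a sharply deviating curve flags idiosyncrasy.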
Profiling by Measuring Heterogeneity
• Homogeneity assumption
  • Bag of words
  • Function word distribution
  • Content word distribution
• Measure of heterogeneity as dataset profile
  • Measure distance between corpora
  • Identify genre
Heterogeneity Measures
• χ² (Kilgarriff 1997; Rose & Haddock 1997)
• G² (Rose & Haddock 1997; Rayson & Garside 2000)
• Correlation, Mann-Whitney (Kilgarriff 1996)
• Log-likelihood (Rayson & Garside 2000)
• Spearman's S (Rose & Haddock 1997)
• Kullback-Leibler divergence (Cavaglia 2002)
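As an illustration of one of the listed measures, Kullback-Leibler divergence between the word distributions of two corpus halves can be sketched as below. The add-one smoothing is my own assumption (needed so that terms unseen in one half do not make the divergence infinite); it is not specified in the cited work.

```python
import math
from collections import Counter

def kl_divergence(corpus_a, corpus_b, smoothing=1.0):
    """D(P || Q) between the word distributions of two token lists,
    with additive smoothing over the joint vocabulary."""
    vocab = set(corpus_a) | set(corpus_b)
    fa, fb = Counter(corpus_a), Counter(corpus_b)
    na = len(corpus_a) + smoothing * len(vocab)
    nb = len(corpus_b) + smoothing * len(vocab)
    d = 0.0
    for w in vocab:
        p = (fa[w] + smoothing) / na
        q = (fb[w] + smoothing) / nb
        d += p * math.log(p / q)
    return d

# Identical halves diverge by 0; disjoint vocabularies diverge more.
same = kl_divergence("a b a c".split(), "a b a c".split())  # -> 0.0
diff = kl_divergence("a a a a".split(), "b b b b".split())
```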
Kilgarriff's Methodology
• Divide the corpus into 5000-word chunks and place the chunks in random halves
• Build a frequency list for each half
• Calculate χ² for term frequency distribution differences between the halves
• Normalise for corpus length
• Iterate over successive random halves
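The chunk-and-halve methodology can be sketched as follows. This is a simplification, not Kilgarriff's actual code: the length normalisation and the restriction to frequent terms follow the slides, but the parameter defaults and the exact χ² form (observed vs. expected counts in each half) are illustrative assumptions.

```python
import random
from collections import Counter

def kilgarriff_chi2(tokens, chunk_size=5000, iterations=10, top_n=500, seed=0):
    """Average a chi-squared heterogeneity score over several random
    splits: chunk the corpus, shuffle chunks into two halves, and
    compare term frequencies between the halves."""
    rng = random.Random(seed)
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if len(chunks) < 2:
        raise ValueError("corpus too small for the chosen chunk size")
    scores = []
    for _ in range(iterations):
        rng.shuffle(chunks)
        half = len(chunks) // 2
        a = [t for c in chunks[:half] for t in c]
        b = [t for c in chunks[half:] for t in c]
        fa, fb = Counter(a), Counter(b)
        na, nb = len(a), len(b)
        chi2 = 0.0
        # focus on the most frequent terms, as the slides suggest
        for w, _ in (fa + fb).most_common(top_n):
            expected_a = (fa[w] + fb[w]) * na / (na + nb)
            expected_b = (fa[w] + fb[w]) * nb / (na + nb)
            chi2 += (fa[w] - expected_a) ** 2 / expected_a
            chi2 += (fb[w] - expected_b) ** 2 / expected_b
        scores.append(chi2 / (na + nb))  # normalise for corpus length
    return sum(scores) / len(scores)
```

A perfectly repetitive corpus scores 0 under this sketch; a corpus whose chunks have disjoint vocabularies scores well above it.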
Kilgarriff's Findings
• Registers values of the χ² statistic
• A high value indicates high heterogeneity
• Finds high heterogeneity in all texts
Defeating the Homogeneity Assumption
• Assume word distribution is homogeneous (random)
• Kilgarriff methodology
• Explore chunk sizes
  • Chunk size 1 -> homogeneous (random)
  • Chunk size 5000 -> heterogeneous (Kilgarriff 1997)
• χ² test (statistic + p-value)
• Defeat the assumption with statistical significance
• Focus on frequent terms (!)
Homogeneity Detection at a Level of Statistical Significance
• p-value: evidence for/against the hypothesis
  • < 0.1 -- weak evidence against
  • < 0.05 -- significant (moderate evidence against the hypothesis)
  • < 0.01 -- strong evidence against
  • < 0.001 -- very strong evidence against
• Indication of statistically significant non-homogeneity
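Mapping a χ² statistic to the evidence labels above can be sketched with the standard library alone, at least for the one-degree-of-freedom case, where the survival function reduces to p = erfc(√(x/2)). The function names are mine; the thresholds follow the slide.

```python
import math

def chi2_pvalue_df1(statistic):
    """p-value for a chi-squared statistic with one degree of freedom,
    via the complementary error function (stdlib only)."""
    return math.erfc(math.sqrt(statistic / 2.0))

def evidence(p):
    """Map a p-value to the evidence labels used on the slide."""
    if p < 0.001:
        return "very strong evidence against homogeneity"
    if p < 0.01:
        return "strong evidence against homogeneity"
    if p < 0.05:
        return "significant (moderate evidence against)"
    if p < 0.1:
        return "weak evidence against homogeneity"
    return "no significant evidence against homogeneity"
```

For the general multi-degree-of-freedom case one would reach for a statistics library rather than this closed form.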
Frequent Term Distribution
• Lots of them
• Reputedly “noise-like” (random?)
• Present in most datasets (comparison)
• Cheap to model
Dividing a Corpus
• docDiv: place documents in random halves
  • term distribution across documents
• halfdocDiv: place half-documents in random halves
  • term distribution within the same document
• chunkDiv: place chunks (between 1 and 5000 words) in random halves
  • term distribution between text chunks (genre?)
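The three division strategies can be sketched as a single function. The names docDiv, halfdocDiv and chunkDiv are the slide's; the function signature and parameter names are my own illustrative choices.

```python
import random

def divide(documents, strategy="docDiv", chunk_size=100, seed=0):
    """Split a corpus (a list of token lists) into two random halves.

    docDiv:     whole documents assigned to random halves
    halfdocDiv: each document split in two; halves assigned randomly
    chunkDiv:   fixed-size token chunks assigned randomly
    """
    rng = random.Random(seed)
    if strategy == "docDiv":
        units = [list(d) for d in documents]
    elif strategy == "halfdocDiv":
        units = []
        for d in documents:
            mid = len(d) // 2
            units.extend([d[:mid], d[mid:]])
    elif strategy == "chunkDiv":
        tokens = [t for d in documents for t in d]
        units = [tokens[i:i + chunk_size]
                 for i in range(0, len(tokens), chunk_size)]
    else:
        raise ValueError(strategy)
    rng.shuffle(units)
    half = len(units) // 2
    a = [t for u in units[:half] for t in u]
    b = [t for u in units[half:] for t in u]
    return a, b
```

Feeding each strategy's halves into a heterogeneity measure then probes, respectively, variation across documents, within documents, and between text chunks.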