This paper discusses the implementation of Term Co-occurrence Analysis (TCA) as a vital interface tool for digital libraries. The authors present methodologies on how to store and retrieve information efficiently, emphasizing the importance of user-defined terms and visual information retrieval interfaces. They explore the significance of co-occurring terms and their associations through examples, demonstrating how TCA can facilitate text mining and enhance user experience. Future plans for user studies and interactive retrieval systems are also addressed.
Term Co-occurrence Analysis as an Interface to Digital Libraries Jan W. Buzydlowski Howard D. White Xia Lin College of Information Science and Technology Drexel University, Philadelphia, Pennsylvania, USA
Digital Library Research • First Wave • How to store it • Next Wave • How to retrieve it (IR) • Text Mining • Visual Information Retrieval Interface (VIRI) • Term Co-occurrence Analysis (TCA) • Co-occurrence vs. lexical associations • Maps vs. lists
Term Definition • Unit of Analysis • Words • Documents • Authors • Journals • Section of Focus • Abstract/Text • Title • Bibliography • Keywords
Example • Words in Title • Term, Co-occurrence, Analysis, Interface, Digital, Library • Authors in Bibliography • Salton-G, Chen-C, White-HD, Ding-Y, Cleveland-W, McCain-K, Lin-X, Schvaneveldt-R, Kamada-T, Fruchterman-T
Term Co-occurrence Methodology • User determines which terms are of interest • Via a seed term • From a pre-defined list • The system returns the pair-wise co-occurrence counts of the terms over the collection of records
Example • Unit: Author; Section: Bibliography • User Supplied List: Plato, Aristotle, Smith, Brown • For a given data set (N = 4 unique terms) • Article 1: Plato, Aristotle, Smith, … • Article 2: Plato, Smith, … • Article 3: Plato, Aristotle, Smith, Brown, … • The following co-citations (C(4,2) = 6) are found • COMBINATION | COUNT | ARTICLES • Plato and Smith | 3 | 1, 2, 3 • Plato and Aristotle | 2 | 1, 3 • Plato and Brown | 1 | 3 • Aristotle and Smith | 2 | 1, 3 • Aristotle and Brown | 1 | 3 • Smith and Brown | 1 | 3
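The counts in the example above can be reproduced with a minimal Python sketch. The function name `cooccurrence_counts` and the dictionary layout are assumptions made for illustration, not part of the authors' system, and the elided authors ("…") in each article are simply left out.

```python
from itertools import combinations

# The three articles from the example, reduced to the cited authors of interest.
articles = {
    1: {"Plato", "Aristotle", "Smith"},
    2: {"Plato", "Smith"},
    3: {"Plato", "Aristotle", "Smith", "Brown"},
}
terms = ["Plato", "Aristotle", "Smith", "Brown"]  # user-supplied list


def cooccurrence_counts(terms, articles):
    """Return {(a, b): [article ids]} for every unordered pair of terms."""
    found = {pair: [] for pair in combinations(terms, 2)}
    for article_id, cited in sorted(articles.items()):
        for a, b in found:
            # A pair co-occurs when both terms appear in the same record.
            if a in cited and b in cited:
                found[(a, b)].append(article_id)
    return found


pairs = cooccurrence_counts(terms, articles)
# Print the table in descending order of co-occurrence count.
for (a, b), ids in sorted(pairs.items(), key=lambda kv: -len(kv[1])):
    print(f"{a} and {b}: {len(ids)} {ids}")
```

Keeping the article ids alongside each count mirrors the ARTICLES column, so a count can always be traced back to the records that produced it.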
Term Co-occurrence Significance • The frequent co-occurrence of term pairs within a set of documents indicates a strong association between those terms, whereas an infrequent co-occurrence indicates the opposite • The association you would expect is borne out by the frequency • The frequency you compute suggests a level of association • Pain and Management vs. Pain and Obtainment • Plato and Aristotle vs. Plato and Cher • Science and Nature vs. Science and National Tattler • A and B vs. C and D
Term Co-occurrence Uses • Allows a user to get a “foothold” with just one term • One seed term returns many other related terms • Allows a user to get an “overview” with user-supplied/system-supplied terms • Co-occurrence counts with visualization
Seeding • User types in • One term, e.g., Plato • Boolean expression, e.g., Plato AND Brown • System supplies top n terms, in ranked order of frequency of co-occurrence with the initial term
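Seeding can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `top_related` is hypothetical, records are modeled as plain sets of terms, a Boolean AND seed is treated as "every seed term is present in the record", and the data is the toy author example from earlier in the deck.

```python
from collections import Counter


def top_related(seed_terms, records, n):
    """Rank terms by how many records they share with all seed terms."""
    seeds = set(seed_terms)
    counts = Counter()
    for record in records:
        if seeds <= record:  # Boolean AND: every seed term is present
            counts.update(record - seeds)
    # Descending count, then alphabetical so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Toy records reused from the author example (hypothetical stand-in data).
records = [
    {"Plato", "Aristotle", "Smith"},
    {"Plato", "Smith"},
    {"Plato", "Aristotle", "Smith", "Brown"},
]

single = top_related(["Plato"], records, 2)          # one seed term
both = top_related(["Plato", "Brown"], records, 5)   # Plato AND Brown
print(single)
print(both)
```

With a single seed the ranking is driven by the whole collection; adding a second seed with AND narrows the candidate records first, so the returned list is shorter and more specific.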
Example • For Plato seed: • ARISTOTLE • PLUTARCH • CICERO • HOMER • BIBLE • EURIPIDES • ARISTOPHANES • XENOPHON • AUGUSTINE • HERODOTUS • KANT-I • AESCHYLUS • SOPHOCLES • THUCYDIDES • OVID • HESIOD • DIOGENES-LAERTI • HEIDEGGER-M • DERRIDA-J • PINDAR • NIETZSCHE-F • HEGEL-GWF • VERGIL • AQUINAS-T
Need for Visualization • Given a list of user- / system-supplied terms • Find the frequency of co-occurrence of each pair-wise combination of terms • Plato AND Aristotle = 1,920 • Plato AND Plutarch = 380 • … • Too many numbers to take in at once • C(25, 2) = (25 * 24) / 2 = 300 pairs • Three major visualization techniques • Multidimensional Scaling (MDS) • Self-Organizing (Kohonen) Maps (SOMs) • PathFinder Networks (PFNETs)
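The pair arithmetic, and the step from a list of pair counts to something a mapping technique can consume, can be sketched as follows. The helper names `pair_count` and `to_matrix` are hypothetical, the symmetric-matrix layout is an assumption about what a visualization step would typically take as input, and the counts are the toy author example from earlier in the deck rather than real data.

```python
from math import comb


def pair_count(n):
    """Number of unordered pairs among n terms: n * (n - 1) / 2."""
    return comb(n, 2)


def to_matrix(terms, counts):
    """Expand {(a, b): count} into a symmetric matrix with a zero diagonal."""
    index = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in terms]
    for (a, b), count in counts.items():
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count  # co-occurrence is symmetric
    return matrix


# The number of pairs grows quadratically with the number of terms.
for n in (4, 25, 50, 100):
    print(n, pair_count(n))

terms = ["Plato", "Aristotle", "Smith", "Brown"]
counts = {
    ("Plato", "Smith"): 3, ("Plato", "Aristotle"): 2, ("Plato", "Brown"): 1,
    ("Aristotle", "Smith"): 2, ("Aristotle", "Brown"): 1, ("Smith", "Brown"): 1,
}
matrix = to_matrix(terms, counts)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Even at 25 terms the matrix holds 300 distinct values, which is the motivation for handing it to a map rather than showing it as a list.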
White’s MDS map of 15 co-cited classificationists, ca. 1990 • Authors shown: P Arabie, JH Ward, JC Gower, M Wish, RN Shepard, RR Sokal, JB Kruskal, SC Johnson, PHA Sneath, JD Carroll, PE Green, JA Hartigan, HA Skinner, VE McGee, RK Blashfield
White’s PFNet of co-cited authors in Biblical and literary hermeneutics, 1988-1997
Our System • Three-tiered • User interface • Server • Database • Real-time and interactive • Significant data sources • ISI AHCI • MedLine • Live interface for retrieval
Database Interface • API • String[] findRel(String, int) • int[] findOcc(String[]) • Implemented on: • BRS • API via a wrapper • Oracle • API via JDBC • Noah • Specialized co-occurrence database • API via JNI
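The point of the two-call API is that different backends can sit behind the same interface. The sketch below renders that idea in Python rather than the Java-style notation of the slide. Everything in it is an assumption made for illustration: the class names `CoOccurrenceStore` and `InMemoryStore`, the snake_case method names, the reading of the `int` argument as "how many terms to return", and the choice to return pair counts in `itertools.combinations` order. The slide does not specify any of these.

```python
from abc import ABC, abstractmethod
from collections import Counter
from itertools import combinations


class CoOccurrenceStore(ABC):
    """Hypothetical Python rendering of the two calls listed on the slide."""

    @abstractmethod
    def find_rel(self, seed, n):
        """Return the n terms that co-occur most often with the seed term."""

    @abstractmethod
    def find_occ(self, terms):
        """Return one count per unordered pair, in combinations() order."""


class InMemoryStore(CoOccurrenceStore):
    """Toy backend; a real backend would query a database instead."""

    def __init__(self, records):
        self.records = [set(record) for record in records]

    def find_rel(self, seed, n):
        counts = Counter()
        for record in self.records:
            if seed in record:
                counts.update(record - {seed})
        # Descending count, then alphabetical to keep ties deterministic.
        return sorted(counts, key=lambda t: (-counts[t], t))[:n]

    def find_occ(self, terms):
        return [
            sum(1 for record in self.records if a in record and b in record)
            for a, b in combinations(terms, 2)
        ]


# Toy records reused from the author example (hypothetical stand-in data).
store = InMemoryStore([
    ["Plato", "Aristotle", "Smith"],
    ["Plato", "Smith"],
    ["Plato", "Aristotle", "Smith", "Brown"],
])
print(store.find_rel("Plato", 2))
print(store.find_occ(["Plato", "Aristotle", "Smith", "Brown"]))
```

Because the user interface only ever calls the two abstract methods, swapping one backend for another would not require changes to the retrieval front end.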
Future Plans • User Study • Preference • Type of map, etc. • Cognitive map • How well does the map match experts’ mental models • Larger datasets • Additional data sources