This presentation by Nina Wacholder explores the critical role of language in information science. It covers various aspects of language, including its unique human characteristics, complexities such as ambiguity and synonymy, and how they influence human-computer interaction. The session also delves into Natural Language Processing (NLP), comparing rule-based approaches with statistical methods, and discusses the effectiveness of different information access tasks such as retrieval and extraction. Examples highlight techniques for identifying key terms and their relevance in document understanding and retrieval.
Language and Information LIS 610 November 6, 2002 Nina Wacholder nina@scils.rutgers.edu
Agenda • Role of language in information science • Current research: Human Computer Interaction with Electronic Indexes and Index Terms Language and Information 11/06/02 Nina Wacholder
Textual information • Information conveyed by alphabets, digits and punctuation • Organized into meaningful units recognized by some group of people
Other techniques for conveying information • Spoken language • Gesture • Facial expression • Sound • Images (drawings, photographs …)
Language • Uniquely human • Learned • Conventional
Understanding language is hard • Expresses complex concepts • Ambiguity – words, phrases and sentences have more than one meaning • Synonymy – different words, phrases and sentences have the same meaning
Complex concepts • Pencil • Face • Directions to Alexander Library • Theory of relativity • U.S. election law
Synonymy • child, kid, adolescent, baby • flammable, inflammable • I was walking up the street that day. • I was walking down the street that day. • Moxie wrote that report. That report was written by Moxie.
Ambiguity: semantic • Bat • Make a bed • Moxie ate potatoes with a fork. • Moxie ate potatoes with fish.
Ambiguity: structural (syntactic) • Red airplane terminal • [[red airplane] terminal] • [red [airplane terminal]] • Moxie saw Toxie in the park with a telescope • Moxie saw [Toxie in the park with a telescope] • Moxie [saw] Toxie in the park [with a telescope]
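The two bracketings of "red airplane terminal" are exactly the two ways of grouping three words pairwise. As a rough illustration (a sketch, not something shown in the talk), the hypothetical helper `bracketings` below enumerates every binary grouping of a word sequence, using the slide's square-bracket notation.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as a string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Split the sequence at every possible point and combine the
    # bracketings of the left part with those of the right part.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append(f"[{left} {right}]")
    return results


phrase = ["red", "airplane", "terminal"]
for reading in bracketings(phrase):
    print(reading)
```

The number of groupings grows quickly with phrase length, which is one reason structural ambiguity is hard for a program to resolve without further knowledge.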
Natural language processing (NLP) • Natural language • Computer language
The NLP controversy: rules vs. statistics
NLP by rule • Lexicon (vocabulary) • Det: a • ProperName: Moxie • Noun: report • Verb: wrote • Syntactic rules • NounPhrase[a report] → Det[a] Noun[report] • NounPhrase[Moxie] → ProperName[Moxie] • VerbPhrase[wrote a report] → Verb[wrote] NounPhrase[a report] • Sentence[Moxie wrote a report] → NounPhrase[Moxie] VerbPhrase[wrote a report]
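The lexicon and syntactic rules on this slide are enough to drive a very small parser. The following is a minimal sketch, assuming a simple top-down, left-to-right strategy; the function names (`parse`, `expand`, `parse_sentence`) and the nested output format are illustrative choices, not taken from the talk.

```python
# Lexicon and rules copied from the slide.
LEXICON = {"a": "Det", "Moxie": "ProperName", "report": "Noun", "wrote": "Verb"}

RULES = {
    "Sentence": [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Det", "Noun"], ["ProperName"]],
    "VerbPhrase": [["Verb", "NounPhrase"]],
}


def parse(symbol, words, start):
    """Yield (tree, next_position) for each way `symbol` matches from `start`."""
    if symbol not in RULES:
        # Lexical category: it matches exactly one word of that category.
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield f"{symbol}[{words[start]}]", start + 1
        return
    for expansion in RULES[symbol]:
        yield from expand(symbol, expansion, words, start, [])


def expand(symbol, remaining, words, position, children):
    """Match the remaining right-hand-side symbols one after another."""
    if not remaining:
        yield f"{symbol}[{' '.join(children)}]", position
        return
    for subtree, next_position in parse(remaining[0], words, position):
        yield from expand(symbol, remaining[1:], words, next_position,
                          children + [subtree])


def parse_sentence(text):
    """Return all parses of `text` that cover every word."""
    words = text.split()
    return [tree for tree, end in parse("Sentence", words, 0) if end == len(words)]


print(parse_sentence("Moxie wrote a report"))
```

A word order the rules do not license, such as "wrote Moxie a report", yields an empty list. That illustrates both the precision of rule-based processing and its brittleness when input falls outside the grammar.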
NLP by statistics • Luhn (1958) • tf*idf (Salton and Buckley 1988) • Maximum entropy (Berger, Della Pietra and Della Pietra 1996)
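To make tf*idf concrete, here is a minimal sketch of one common variant: raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Many other weightings exist, so this should not be read as the specific formula of the cited paper. The function name `tf_idf` and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy documents, made up for this example.
docs = ["moxie wrote a report", "moxie read a report", "moxie ate potatoes"]
for doc, scores in zip(docs, tf_idf(docs)):
    print(doc, "->", scores)
```

A term that occurs in every document ("moxie" here) gets weight zero, while a term confined to one document ("potatoes") gets the highest weight. No grammar or lexicon is involved, only counts.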
Information-access tasks with significant natural language component • Information retrieval • Information extraction • Automatic summarization • Question answering
Sparck Jones (2001) • Task core vs. task context • Information retrieval: 30-40% accuracy for systems in natural environment • Information extraction: 50% for core systems • Automatic summarization: no sound basis for core evaluation
Task • Compare domain-independent, corpus-independent methods for automatic identification of terms to represent a document or collection of documents • Methods for term identification • Head-sorted NPs (HS) (Wacholder 1998) • Keywords (KW) • Technical Terms (TT) (Justeson and Katz 1995) • Evaluation of Head Sorting Mechanism: Wacholder, Klavans and Evans (2000)
Examples of terms identified by indexing method • Keywords: asbestos/asbestosis, worker/workers/worked, cancer, death, make, lorillard, fiber, dr., … • Head-sorted NPs: workers, asbestos workers, 160 workers, cancer, lung cancer, asbestos, cancer causing asbestos, lung cancer deaths, ... • Technical terms: cancer deaths, lung cancer, kent cigarette, dr. talcott, cigarette filter, u.s.
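The head-sorted examples above group noun phrases that share a head, such as "workers", "asbestos workers" and "160 workers". The sketch below shows the grouping idea only. It assumes the head is simply the last word of the phrase, which is a simplification, and it is not the method of Wacholder (1998). The function name `head_sort` is hypothetical.

```python
from collections import defaultdict


def head_sort(noun_phrases):
    """Group noun phrases under their head, taken here to be the last word."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        head = phrase.split()[-1]  # assumption: head = final word
        groups[head].append(phrase)
    # Order heads alphabetically, and phrases alphabetically within each head.
    return {head: sorted(groups[head]) for head in sorted(groups)}


# Noun phrases taken from the slide's examples.
phrases = ["lung cancer", "asbestos workers", "cancer",
           "160 workers", "workers", "lung cancer deaths"]
for head, group in head_sort(phrases).items():
    print(head, "->", group)
```

Grouping by head places related phrases next to each other in an index, so a reader browsing "workers" also sees "asbestos workers".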
Ranking of terms by cumulative percentage
Ranking by cumulative number of terms 1 = best; 5 = worst
Summary of results • Head-sorted terms • mixed quality terms • good document coverage • Technical terms • high quality terms • poor document coverage • Keywords • low quality terms • good document coverage
ISATC Pilot Project • Nina Wacholder, PI • PhD Students: Lu Liu, Mark Sharp, Peng Song, Xiaojun Yuan
Research question • Null hypothesis: Properties of index terms do not affect information seeker’s selection of terms • What properties of index terms affect the selection of terms? • What effects do these properties have?
Material • Text • Rice, McCreadie and Chang (2001) • Index terms • Head sorted terms (Wacholder 1998) • Technical terms (Justeson and Katz 1995) • Human index terms
Experimental Searching and Browsing Interface (ESBI) http://www.scils.rutgers.edu/cgi-bin/indexer.cg
Initial results
Future work • Further analysis of experimental data • Compare subjects by type (e.g., undergraduate, MLIS) • Effectiveness of searches (i.e., did they get the right answer) • Overlap of words in index terms with words in question • … • Evaluation of ESBI interface • Comparison of additional techniques for identifying terms • Use of different texts
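One of the analyses listed above, the overlap of words in index terms with words in the question, could be measured in many ways. The following is only a sketch of one simple possibility: the fraction of an index term's words that also appear in the question. The function name `word_overlap`, the measure itself and the sample question are assumptions for illustration, not the project's actual analysis.

```python
def word_overlap(index_term, question):
    """Return the fraction of words in the index term that occur in the question."""
    term_words = set(index_term.lower().split())
    # Drop the question mark so "cancer?" matches "cancer".
    question_words = set(question.lower().replace("?", "").split())
    if not term_words:
        return 0.0
    return len(term_words & question_words) / len(term_words)


question = "What causes lung cancer?"  # invented sample question
for term in ["lung cancer", "lung cancer deaths", "cigarette filter"]:
    print(term, word_overlap(term, question))
```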