LIS618 lecture 4 before searching + introduction to dialog Thomas Krichel 2011-11-01
structure of talk • some generalities about searching • Working with DIALOG • Overview • Search command
online information retrieval • This subject can be though off as a subset of information retrieval (IR). Most IR is online or digital. • IR concentrates on textual data. • We can think of online IR to fall under two categories • database IR • web IR
database / web IR • Database IR look at systems that have • controlled set of record • low heterogeneity • use requires authentication • advanced search features • Web IR has opposite characteristics
traditional social model • User goes to a library • Describes problem to the librarian • Librarian does the search • without the user present • with the user present • Hands over the result to the user • User fetches full-text or asks a librarian to fetch the full text.
economic rational for traditional model • In olden days the cost of telecommunication was high. • Database use costs • cost of communication • cost of access time to the database • The traditional model controls an upper limit to the costs.
disintermediation • With access cost time gone, the traditional model is under threat • There is disintermediation where the librarian looses her role of doing the search. • But that may not be good news for information retrieval results • user knows subject matter best • librarian knows searching best
Web searching • IR has received a lot of impetus through the web, which poses unprecedented search challenges. • With more and more data appearing on the web DS may be a subject in decline • It is primarily concerned with non-web databases • There is more and more web-based methods of searching
Public access vs quality • Now the public at large is able to do online searching. • At the same time need for quality answers has grown. • Quality-filtered services will become more important. • In the current databases, there is as lot that would already be available for free mixed with quality-controlled stuff. • Publishers have direct offerings and intermediated vending is in decline.
components of the IR process • provider • define data that is available • documents that can be used • document operations • document structure • index • user • user need • IR system familiarity
the IR process • Query expresses user need in a query language • Processing of query yields retrieved documents • Calculation of relevance ranking • Examination of retrieved documents • Possible return to the start, another query.
main problem • User is not an expert at the formulation of a query • Garbage in garbage out, the retrieval yields poor result • Ways around that problem • design very intuitive interface for the query • give expert guidance
before a search I • What is the purpose of the query? • brief overview • comprehensive search • What perspective on the topic is required? • scholarly • technical • business • popular
before search II • What type of information does the patron want? • fulltext • bibliographic • directory • numeric • Are there any known sources? • authors • journals • papers • conferences
before search III • What are the language restrictions? • What, if any, are the cost restrictions? • How current need the data to be? • How much of each record is required?
concept analysis • This is the art/science of taking the topic to search for and develop facets. Example “Internet filtering in Libraries” • Internet filter • Libraries • Controversy not technical issues • We may also need the think about the aim of the search.
search aims • a known needle in a known haystack • a known needle in an unknown haystack • an unknown needle in an unknown haystack • any needle in a haystack • the sharpest needle in a haystack • most of the sharpest needles in a haystack
search aims • all the needles in a haystack • affirmation of no needles in a haystack • things like needles in a haystack • is there a new needle in the haystack • where are the haystacks • needles, haystacks, anything
types of searches • known-item searches • negative searches • selective dissemination of information • topical or subject searches • passage searching, where the user is only interested in part of the item
search strategies I • Building block approach • Do a number of elementary searches • Combine the resulting sets with Boolean operators • This is what I did in the example in the previous lecture • Works only with the Boolean model
search strategies II • Snowballing approach • Start with a very specific query • Think of other term that can be added to get more results • Stop when a reasonable number of results are achieved. • Not sure this really works well in practice.
search strategies III • The successive fraction approach is the opposite of the snowballing approach • First search for a broad concept • Then repeat the query by adding various limiting factors. • Can work well if the IR system allows to repeat and edit queries. • But queries can become unwieldy.
search strategies IV • Most specific facet first • Conduct concept analysis • Look for the most specific facet • Search that first, add others later • Presupposes that you have done a decent concept analysis.
taxonomy of classic IR models • Boolean, or set-theoretic • fuzzy set models • extended Boolean • Vector, or algebraic • generalized vector model • latent semantic indexing • neural network model • Probabilistic • inference network • belief network
summary • There are three basic types of models in classic information retrieval. • Extensions of these types are a matter of research concern and require good mathematical skills. • All classic models treat document as individual pieces.
key aid: index • An index is a list of terms, with a list of locations where the term is to be found. • The way to express locations usually depends on the form that the indexed data takes. • for a book, it is usually the page number, e.g. "shmoo 34, 75" • for computer files it is usually the name of the file plus the number of the byte where the indexed term starts, e.g. "krichel index.html 34, cv.html 890 1209" • There is usually more than one location of the term.
key aid: index terms • The index term is a part of the document that has a meaning on its own. • It is usually a noun word. • Retrieval based on index term raises questions • semantics in query or document is lost • matching done in imprecise space of index terms • One way out is to specify several terms and require that they have to be close to each other.
basic concept: weight of index term • Given all nouns, not all appear to have the same relevance to the text • Sometimes, we can have a simple measure of the importance of a term, example? • More generally, for each indexing term and each document we can associate a weight with the term and the document. • Usually, if the document does not contain the term, its weight is zero
Boolean model • In the Boolean model, the index weight of all index term for any document is 1 if the term appears in the document. It is 0 otherwise. • This allows to combine query terms with Boolean operator AND, OR, and NOT • Thus powerful queries can be written.
Dialog is a databank • over 500 databases • these are also known as files and cover • references and abstracts for published literature, • business information and financial data; • complete text of articles and news stories; • statistical tables • directories • DIALOG uses the Boolean model
DIALOG interface • It is still rooted in "traditional" database systems. • It has been dismissed as "dial-a-dog". • It uses a command-driven interface. • It is very complicated to learn fully. • It is not suitable for the end-user. • It therefore offers a valuable skill to the information professional.
Accessing DIALOG • On the web, go to • http://www.dialogclassic.com/ • Enter username and password. • Forget about subaccount. • Then click on logon. • You are in the classic interface. Let’s hear three cheers for being old-fashioned.
two steps in DIALOG • Step one: select databases (aka files) to look at • Step two: perform searches on the selected databases • You may wonder why one does not have one single step like in a search engine. Discuss.
sample search • We want to know something about “current awareness in digital libraries”. • Let us assume something of this is in the ERIC database and we know that ERIC is the database number 1. • We issue the command "b 1" to begin working with ERIC.
Boolean search • Do a number of searches • s current(N)awarness • s digital(N)library • s digital(N)libraries • Each search retrieves a set of documents • The sets can be combined • s s1 and (s2 or s3)
What is the deal? • There are two stages. • At stage two we make Boolean queries. • Each query splits the records into matching and non-matching records. • The set of matching records is return. • It can be further searched or combined with other sets using Boolean operators. • Try this at home.
two steps in DIALOG • step one: select databases (aka files) to look at • step two: perform searches on the selected databases • You may wonder why one does not have one single step like in a search engine. Discuss. • today we concentrate on the second step
working on selected files • We assume that we have selected database that we know and we look at the search interface on the selected database. • The database selection process is a bit more complicated, covered next week. • First, let us login and look at the command prompt. • Then we select the first database (file) with the begin command
the ‘begin’ command • As its name suggests, usually the first command. • begin number, number,… • selects files with numbers number • Once they are selected they can be searched. • Now select the ERIC "begin 1" • "Begin 1" can be abbreviated as "b 1"
substeps in the second step • Identify search terms • Use Dialog basic commands to conduct a search • View records online or print the results
the 's' (select) command • Once issued the "begin" command to select a database, we issue the "s" command on the database. • "s query_expression" where query_expression is a query expression. • This will search the index of selected database in full-text view for the query issued • It will not find any of the following: "an and by for from of the to with". They are stop words.
query expression • A query expression contains search terms expressed in special ways • You can truncate search terms. • You can build an elementary expression by putting several keywords together. This is achieved by DIALOG's connectors. • You can combine several expressions with the use of Boolean operators • We will cover this is in turn now.
truncation of terms I • Open Truncation • "select path?" retrieves all words that begin with path: paths, pathos, pathway, pathology • Controlled-Length Truncation • "select path??" retrieves the root and up to two additional characters: paths, pathos
truncation of terms II • Embedded Character truncation can be used for variant spellings: • "select organi?ation" -> organization organisation • "select fib??board" -> fiberboard fibreboard • This truncation feature is also useful for searching for unusual plural forms: • "select wom?n" -> woman women • Apparently you can also do prefixes by putting the ? in the beginning. • "?mobile" -> automobile metamobile
use of connectors • Connectors are used to put several words together. • One instance where this is useful is when you have words that on their own mean different things. • For example "mate" is a herbal beverage consumed in South America. Looking for mate on the Internet retrieves a lot of singles' pages.
example: terms related to "mate" What other terms to be used? • matear (drink mate) • matero (mate drinker) • cebar (prepare mate) • cebador (mate preparer) • yerba (mate herb) • bombilla (mate straw)
connectors I • '(W)' requires terms to appear one after the other next to each other e.g. 'yerba(W)mate?' matches "yerba mate". • '(i W)' where i is an integer, means followed by at most i words, e.g. 'ceba?(3W)mate?' matches "cebar un maravilloso mate" but not "cebador guapo mirando un buen mate"
connectors II • '(N)' requires terms to be next to each other e.g. 'yerba(N)mate?' matches "yerba mate" or "mate yerba". • '(i N)' where i is an integer, means proximity by at most i words, e.g. 'ceba?(3N)mate?' matches "cebar mate" or "matear con la cebadora". • '(S)' searches for the occurrence of connected terms in the same paragraph.
using Boolean operators • In your query, you can combine several expressions with Boolean operators • Example: "S LIBRARY(W)SCHOOL? AND DISTANCE(W)EDUCATION" • But I usually do not issue such fancy queries.
executing several searches • There can be several searches done sequentially, and the results sets are saved by the system. • Each time the system assigns a set number, Si, • These can be combined in Boolean expressions, e.g. 's S1 or S2 and S3' • Remember that Boolean operations are set-theoretic!