
LIS618 lecture 4 before searching + introduction to dialog

LIS618 lecture 4 before searching + introduction to dialog. Thomas Krichel 2011-11-01. structure of talk. some generalities about searching Working with DIALOG Overview Search command. online information retrieval.



  1. LIS618 lecture 4 before searching + introduction to dialog Thomas Krichel 2011-11-01

  2. structure of talk • some generalities about searching • Working with DIALOG • Overview • Search command

  3. online information retrieval • This subject can be thought of as a subset of information retrieval (IR). Most IR is online or digital. • IR concentrates on textual data. • We can think of online IR as falling into two categories • database IR • web IR

  4. database / web IR • Database IR looks at systems that have • a controlled set of records • low heterogeneity • use that requires authentication • advanced search features • Web IR has the opposite characteristics

  5. traditional social model • User goes to a library • Describes problem to the librarian • Librarian does the search • without the user present • with the user present • Hands over the result to the user • User fetches full-text or asks a librarian to fetch the full text.

  6. economic rationale for traditional model • In olden days the cost of telecommunication was high. • Database use costs • cost of communication • cost of access time to the database • The traditional model puts an upper limit on the costs.

  7. disintermediation • With the cost of access time gone, the traditional model is under threat • There is disintermediation, where the librarian loses her role of doing the search. • But that may not be good news for information retrieval results • user knows subject matter best • librarian knows searching best

  8. Web searching • IR has received a lot of impetus through the web, which poses unprecedented search challenges. • With more and more data appearing on the web DS may be a subject in decline • It is primarily concerned with non-web databases • There are more and more web-based methods of searching

  9. Public access vs quality • Now the public at large is able to do online searching. • At the same time the need for quality answers has grown. • Quality-filtered services will become more important. • In the current databases, there is a lot that would already be available for free mixed with quality-controlled stuff. • Publishers have direct offerings and intermediated vending is in decline.

  10. components of the IR process • provider • define data that is available • documents that can be used • document operations • document structure • index • user • user need • IR system familiarity

  11. the IR process • Query expresses user need in a query language • Processing of query yields retrieved documents • Calculation of relevance ranking • Examination of retrieved documents • Possible return to the start, another query.

  12. main problem • User is not an expert at the formulation of a query • Garbage in, garbage out: the retrieval yields poor results • Ways around that problem • design a very intuitive interface for the query • give expert guidance

  13. before a search I • What is the purpose of the query? • brief overview • comprehensive search • What perspective on the topic is required? • scholarly • technical • business • popular

  14. before search II • What type of information does the patron want? • fulltext • bibliographic • directory • numeric • Are there any known sources? • authors • journals • papers • conferences

  15. before search III • What are the language restrictions? • What, if any, are the cost restrictions? • How current do the data need to be? • How much of each record is required?

  16. concept analysis • This is the art/science of taking the topic to search for and developing facets. Example: “Internet filtering in Libraries” • Internet filter • Libraries • Controversy not technical issues • We may also need to think about the aim of the search.

  17. search aims • a known needle in a known haystack • a known needle in an unknown haystack • an unknown needle in an unknown haystack • any needle in a haystack • the sharpest needle in a haystack • most of the sharpest needles in a haystack

  18. search aims • all the needles in a haystack • affirmation of no needles in a haystack • things like needles in a haystack • is there a new needle in the haystack • where are the haystacks • needles, haystacks, anything

  19. types of searches • known-item searches • negative searches • selective dissemination of information • topical or subject searches • passage searching, where the user is only interested in part of the item

  20. search strategies I • Building block approach • Do a number of elementary searches • Combine the resulting sets with Boolean operators • This is what I did in the example in the previous lecture • Works only with the Boolean model
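The building-block approach can be pictured with ordinary set operations. The following is a minimal Python sketch, not DIALOG itself: the record ids, the record texts, and the helper `elementary_search` are all invented for illustration.

```python
# Toy collection: record id -> text (invented for illustration).
records = {
    1: "current awareness services in digital libraries",
    2: "digital library architecture",
    3: "current awareness for chemists",
    4: "school library budgets",
}

def elementary_search(term):
    """Return the set of record ids whose text contains the term as a word."""
    return {rid for rid, text in records.items() if term in text.split()}

# Step 1: a number of elementary searches, each yielding a set.
s1 = elementary_search("awareness")
s2 = elementary_search("library")
s3 = elementary_search("libraries")

# Step 2: combine the resulting sets with Boolean operators.
combined = s1 & (s2 | s3)   # s1 AND (s2 OR s3)
print(combined)
```

Because every elementary search returns a plain set, the combination step only makes sense in a model where a record either matches or does not, which is why the approach is tied to the Boolean model.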

  21. search strategies II • Snowballing approach • Start with a very specific query • Think of other terms that can be added to get more results • Stop when a reasonable number of results is achieved. • Not sure this really works well in practice.

  22. search strategies III • The successive fraction approach is the opposite of the snowballing approach • First search for a broad concept • Then repeat the query, adding various limiting factors. • Can work well if the IR system allows queries to be repeated and edited. • But queries can become unwieldy.

  23. search strategies IV • Most specific facet first • Conduct concept analysis • Look for the most specific facet • Search that first, add others later • Presupposes that you have done a decent concept analysis.

  24. taxonomy of classic IR models • Boolean, or set-theoretic • fuzzy set models • extended Boolean • Vector, or algebraic • generalized vector model • latent semantic indexing • neural network model • Probabilistic • inference network • belief network

  25. summary • There are three basic types of models in classic information retrieval. • Extensions of these types are a matter of research concern and require good mathematical skills. • All classic models treat documents as individual pieces.

  26. key aid: index • An index is a list of terms, with a list of locations where the term is to be found. • The way to express locations usually depends on the form that the indexed data takes. • for a book, it is usually the page number, e.g. "shmoo 34, 75" • for computer files it is usually the name of the file plus the number of the byte where the indexed term starts, e.g. "krichel index.html 34, cv.html 890 1209" • There is usually more than one location of the term.
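As a rough illustration of the "file name plus position" form of an index, here is a small Python sketch. The file names and contents are made up, and `build_index` is a hypothetical helper; positions are character offsets, which coincide with byte offsets only because the sample text is plain ASCII.

```python
import re
from collections import defaultdict

def build_index(files):
    """Map each term to {file name: [offsets where the term starts]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in files.items():
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()][name].append(match.start())
    return index

# Invented sample files.
files = {
    "index.html": "Thomas Krichel home page",
    "cv.html": "Krichel, Thomas. CV of Thomas Krichel",
}
index = build_index(files)

# One term, several locations across files.
for name, offsets in index["krichel"].items():
    print("krichel", name, offsets)
```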

  27. key aid: index terms • The index term is a part of the document that has a meaning on its own. • It is usually a noun. • Retrieval based on index terms raises questions • semantics in query or document is lost • matching is done in the imprecise space of index terms • One way out is to specify several terms and require that they be close to each other.

  28. basic concept: weight of index term • Given all nouns, not all appear to have the same relevance to the text • Sometimes, we can have a simple measure of the importance of a term, example? • More generally, for each indexing term and each document we can associate a weight with the term and the document. • Usually, if the document does not contain the term, its weight is zero
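One possible simple measure, offered here only as an assumption and not as the measure the lecture has in mind, is relative term frequency: how often a term occurs in a document divided by the document's length. A minimal Python sketch with an invented sample text:

```python
from collections import Counter

def term_weights(text):
    """Weight each term by its share of the words in the document."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {term: count / total for term, count in counts.items()}

weights = term_weights("mate yerba mate bombilla")

print(weights["mate"])            # appears twice in four words
print(weights.get("cebar", 0.0))  # absent terms get weight zero
```

Looking up an absent term with a default of zero mirrors the last bullet: a document that does not contain the term gives it weight zero.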

  29. Boolean model • In the Boolean model, the weight of an index term for any document is 1 if the term appears in the document. It is 0 otherwise. • This allows query terms to be combined with the Boolean operators AND, OR, and NOT • Thus powerful queries can be written.
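A small Python sketch of the Boolean model, with invented documents: the weight is 1 or 0, and AND, OR, and NOT become set intersection, union, and difference over the documents whose weight is 1. The names `weight` and `matching` are illustrative only.

```python
# Invented toy documents.
docs = {
    "d1": "library school distance education",
    "d2": "library school",
    "d3": "distance education online",
}

def weight(term, doc_id):
    """Binary weight: 1 if the term occurs in the document, 0 otherwise."""
    return 1 if term in docs[doc_id].split() else 0

def matching(term):
    """The set of documents in which the term has weight 1."""
    return {d for d in docs if weight(term, d) == 1}

result_and = matching("library") & matching("distance")  # library AND distance
result_or = matching("library") | matching("online")     # library OR online
result_not = matching("library") - matching("distance")  # library NOT distance

print(result_and, result_or, result_not)
```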

  30. Dialog is a databank • over 500 databases • these are also known as files and cover • references and abstracts for published literature, • business information and financial data; • complete text of articles and news stories; • statistical tables • directories • DIALOG uses the Boolean model

  31. DIALOG interface • It is still rooted in "traditional" database systems. • It has been dismissed as "dial-a-dog". • It uses a command-driven interface. • It is very complicated to learn fully. • It is not suitable for the end-user. • It therefore offers a valuable skill to the information professional.

  32. Accessing DIALOG • On the web, go to • http://www.dialogclassic.com/ • Enter username and password. • Forget about subaccount. • Then click on logon. • You are in the classic interface. Let’s hear three cheers for being old-fashioned.

  33. two steps in DIALOG • Step one: select databases (aka files) to look at • Step two: perform searches on the selected databases • You may wonder why one does not have one single step like in a search engine. Discuss.

  34. sample search • We want to know something about “current awareness in digital libraries”. • Let us assume something of this is in the ERIC database and we know that ERIC is the database number 1. • We issue the command "b 1" to begin working with ERIC.

  35. Boolean search • Do a number of searches • s current(N)awareness • s digital(N)library • s digital(N)libraries • Each search retrieves a set of documents • The sets can be combined • s s1 and (s2 or s3)

  36. What is the deal? • There are two stages. • At stage two we make Boolean queries. • Each query splits the records into matching and non-matching records. • The set of matching records is returned. • It can be further searched or combined with other sets using Boolean operators. • Try this at home.

  37. two steps in DIALOG • step one: select databases (aka files) to look at • step two: perform searches on the selected databases • You may wonder why one does not have one single step like in a search engine. Discuss. • today we concentrate on the second step

  38. working on selected files • We assume that we have selected a database that we know, and we look at the search interface on the selected database. • The database selection process is a bit more complicated, covered next week. • First, let us log in and look at the command prompt. • Then we select the first database (file) with the begin command

  39. the ‘begin’ command • As its name suggests, usually the first command. • begin number, number,… • selects files with numbers number • Once they are selected they can be searched. • Now select ERIC with "begin 1" • "Begin 1" can be abbreviated as "b 1"

  40. substeps in the second step • Identify search terms • Use Dialog basic commands to conduct a search • View records online or print the results

  41. the 's' (select) command • Once we have issued the "begin" command to select a database, we issue the "s" command on the database. • "s query_expression" where query_expression is a query expression. • This will search the index of the selected database in full-text view for the query issued • It will not find any of the following: "an and by for from of the to with". They are stop words.

  42. query expression • A query expression contains search terms expressed in special ways • You can truncate search terms. • You can build an elementary expression by putting several keywords together. This is achieved by DIALOG's connectors. • You can combine several expressions with the use of Boolean operators • We will cover these in turn now.
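To see why a stop word can never be found, consider which words would reach an index at all. The Python sketch below uses the stop-word list quoted on the slide; how DIALOG handles stop words internally is not described here, so the filtering function `searchable_terms` is only an illustrative assumption.

```python
# Stop words as listed on the slide.
STOP_WORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}

def searchable_terms(text):
    """Return the words that would remain searchable after dropping stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

terms = searchable_terms("Introduction to the history of libraries")
print(terms)
```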

  43. truncation of terms I • Open Truncation • "select path?" retrieves all words that begin with path: paths, pathos, pathway, pathology • Controlled-Length Truncation • "select path??" retrieves the root and up to two additional characters: paths, pathos

  44. truncation of terms II • Embedded Character truncation can be used for variant spellings: • "select organi?ation" -> organization, organisation • "select fib??board" -> fiberboard, fibreboard • This truncation feature is also useful for searching for unusual plural forms: • "select wom?n" -> woman, women • Apparently you can also do prefixes by putting the ? at the beginning. • "?mobile" -> automobile, metamobile
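The truncation rules on these two slides can be mimicked with regular expressions. The Python sketch below is an approximation built only from the slide examples, not DIALOG's actual matching: a single trailing "?" is read as any number of extra characters, several trailing "?" as up to that many characters, an embedded "?" as exactly one character, and a leading "?" as any prefix. The function names are hypothetical.

```python
import re

def truncation_to_regex(term):
    """Translate a truncated term, read as on the slides, into a regex."""
    lead = len(term) - len(term.lstrip("?"))
    core = term.lstrip("?")
    trail = len(core) - len(core.rstrip("?"))
    core = core.rstrip("?")

    pattern = r"\w*" if lead else ""
    # Embedded '?' stands for exactly one character.
    pattern += "".join(r"\w" if ch == "?" else re.escape(ch) for ch in core)
    if trail == 1:
        pattern += r"\w*"                # open truncation
    elif trail > 1:
        pattern += r"\w{0,%d}" % trail   # controlled-length truncation
    return re.compile(pattern, re.IGNORECASE)

def matches(term, word):
    """True if the whole word is matched by the truncated term."""
    return truncation_to_regex(term).fullmatch(word) is not None

for term, word in [("path?", "pathology"), ("path??", "pathos"),
                   ("path??", "pathway"), ("organi?ation", "organisation"),
                   ("wom?n", "women"), ("?mobile", "automobile")]:
    print(term, word, matches(term, word))
```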

  45. use of connectors • Connectors are used to put several words together. • One instance where this is useful is when you have words that on their own mean different things. • For example "mate" is a herbal beverage consumed in South America. Looking for mate on the Internet retrieves a lot of singles' pages.

  46. example: terms related to "mate" What other terms to be used? • matear (drink mate) • matero (mate drinker) • cebar (prepare mate) • cebador (mate preparer) • yerba (mate herb) • bombilla (mate straw)

  47. connectors I • '(W)' requires terms to appear one after the other next to each other e.g. 'yerba(W)mate?' matches "yerba mate". • '(i W)' where i is an integer, means followed by at most i words, e.g. 'ceba?(3W)mate?' matches "cebar un maravilloso mate" but not "cebador guapo mirando un buen mate"

  48. connectors II • '(N)' requires terms to be next to each other e.g. 'yerba(N)mate?' matches "yerba mate" or "mate yerba". • '(i N)' where i is an integer, means proximity by at most i words, e.g. 'ceba?(3N)mate?' matches "cebar mate" or "matear con la cebadora". • '(S)' searches for the occurrence of connected terms in the same paragraph.
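The word-order and proximity connectors can be approximated in a few lines of Python. This sketch covers (W), (iW), (N), and (iN) as described on the slides, treating i as the maximum number of words allowed between the two terms; it supports only open truncation with a trailing "?" and does not model (S). The function names are invented for illustration.

```python
def term_matches(term, word):
    """Open truncation only: 'mate?' matches any word starting with 'mate'."""
    if term.endswith("?"):
        return word.startswith(term[:-1])
    return word == term

def positions(term, words):
    """Indexes of the words matched by the term."""
    return [i for i, w in enumerate(words) if term_matches(term, w)]

def within(first, second, text, gap=0, ordered=True):
    """True if at most `gap` words separate the two terms.

    ordered=True models (W)/(iW): `second` must follow `first`.
    ordered=False models (N)/(iN): either order is accepted.
    """
    words = text.lower().split()
    for i in positions(first, words):
        for j in positions(second, words):
            distance = j - i if ordered else abs(j - i)
            if 1 <= distance <= gap + 1:
                return True
    return False

print(within("yerba", "mate?", "yerba mate"))                            # (W)
print(within("ceba?", "mate?", "cebar un maravilloso mate", gap=3))      # (3W)
print(within("yerba", "mate?", "mate yerba", ordered=False))             # (N)
print(within("ceba?", "mate?", "matear con la cebadora", gap=3, ordered=False))  # (3N)
```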

  49. using Boolean operators • In your query, you can combine several expressions with Boolean operators • Example: "S LIBRARY(W)SCHOOL? AND DISTANCE(W)EDUCATION" • But I usually do not issue such fancy queries.

  50. executing several searches • Several searches can be done sequentially, and the result sets are saved by the system. • Each time, the system assigns a set number, Si. • These can be combined in Boolean expressions, e.g. 's S1 or S2 and S3' • Remember that Boolean operations are set-theoretic!
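Since the operations are set-theoretic, the grouping of an unparenthesized expression such as 'S1 or S2 and S3' matters. The Python sketch below uses invented record ids to show that the two possible groupings give different results; the slide does not say which grouping DIALOG applies, so writing explicit parentheses is the safe choice.

```python
# Numbered result sets, as if three searches had been run (record ids invented).
sets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}}

# Reading 1: AND binds first, i.e. S1 or (S2 and S3).
and_first = sets[1] | (sets[2] & sets[3])

# Reading 2: OR binds first, i.e. (S1 or S2) and S3.
or_first = (sets[1] | sets[2]) & sets[3]

print(and_first)
print(or_first)
```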
