
Dealing with Information Overload in the Digital Age

Learn about the challenges of information overload and how disciplines like Information Science and Information Retrieval help provide efficient search and retrieval methods. Explore fundamental concepts and methods in information sciences.

rhoughton


Presentation Transcript


  1. SEL2211: Contexts. Lecture 17: Information Science. One of the consequences of the advent and rapid development of digital electronic language technology has been information overload. In many areas of human activity the large amount of available digital electronic text makes it impractical or even impossible to find information on any given topic using the traditional, paper-based methods of sequential reading or, where available, of printed indexes. The Web is an extreme example: on current estimates it comprised 6.57 billion pages on 9 January 2012, and sequential search is clearly out of the question.

  2. In response, research disciplines like Information Science, Information Retrieval, and Library Science have come into being, whose remit is to alleviate information overload by providing users with methods for efficient search and retrieval of the information they require from very large document collections like the Web. One of these methods, the Web search engine, has become an indispensable tool for many millions of people worldwide. This lecture gives an overview of some fundamental concepts and methods in IR: (1) the nature of information; (2) the nature of information overload; (3) a selection of concepts in the information sciences.

  3. 1. The nature of information: It would be useful to be able to give a single, clear definition of 'information', but this is impossible at present. The word is used as a technical term in a variety of research disciplines in often-incompatible and sometimes careless ways, and attempts to unify these uses into a coherent general definition have thus far been unsuccessful, as reading of the relevant links at the end of this lecture will readily show. The philosophical issues surrounding what information might be, and how it relates to equally difficult-to-define words like 'data' and 'knowledge', are much debated, and the associated academic literatures are vast.

  4. 1. The nature of information: We can't go into all this, but, to get any further, we need a working definition of 'information' which doesn't beg too many philosophical questions. To that end, its colloquial meaning is adopted. When we or others use 'information' in everyday speech, we know or think we know what it means: it is a sense-perception which adds to what we know about our natural environment and our place in it. One might, for example, take a child to a zoo where it perceives --that is, sees, hears, and smells-- a lion for the first time; these perceptions are information, that is, additions to what the child knows about the world. One might then say that this creature is called a 'lion' and that it is dangerous; the child's linguistic perception is information that adds further to its knowledge of the world.

  5. 2. Information overload: When 'information' is understood in the general sense just proposed, information overload can take many forms. In stressful situations such as war or civil unrest, for example, humans can and do become disoriented because the senses are overwhelmed with sensory input and can no longer cope. Less dramatically, too many road signs can confuse drivers and cause the accidents they were intended to prevent. For the remainder of this lecture, however, we shall deal with only one type of information: natural language text.

  6. 2. Information overload: As early as the 3rd or 4th century BC, and presumably before, people felt oppressed by excessive information. The Book of Ecclesiastes in the Bible complains that 'of the making of books there is no end'. In the 1st century AD the Roman author Seneca lamented that 'the abundance of books is a distraction' (Wikipedia, Information overload). This was at a time when books were hand-written and, compared to the present day, in very short supply, as we saw in an earlier lecture.

  7. 2. Information overload: Several centuries after the advent of printing and the consequent increase in the book supply, the 18th-century French author Diderot predicted that: 'as long as the centuries continue to unfold, the number of books will grow continually, and one can predict that a time will come when it will be almost as difficult to learn anything from books as from the direct study of the whole universe. It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes' (Wikipedia, Information overload).

  8. 2. Information overload: Diderot's prediction has come to pass. The advent of electronic text has greatly added to the already-immense amount of printed material available on virtually any topic, and the consequent body of available textual information has become a tsunami. No academic or medical doctor or lawyer, for example, can any longer claim to have read everything in his or her subject, even in restricted topic domains: there is too much to know and not enough time in a lifetime to know it, that is, to read and, equally importantly, to assimilate it. Information is overwhelming the humans who generate it and whom it is intended to serve, and, ironically, books and online commentaries on the problems this causes are adding to the deluge (Wikipedia, Information overload). This is information overload.

  9. 3. Information sciences: The information sciences aim to render textual information overload tractable, so that it once again becomes a help rather than a hindrance.

  10. 3. Information sciences, 3.1 History: Though the expression is recent, the information sciences have existed since literacy was first invented. Text documents are only useful if the information they contain can be recovered and used. To achieve this the documents must be archived in a systematic manner. The need for this was realized soon after the invention of writing, and the solution was to create libraries.

  11. 3. Information sciences, 3.1 History: The first libraries appeared in ancient Mesopotamia to store cuneiform tablets, and to the present day libraries have been a consistent feature of literate cultures. The earliest surviving library is in Mari on the Euphrates in present-day Syria. Mari was an ancient Mesopotamian city which flourished from about 3000 BC until it was destroyed in 1759 BC. (From: http://www.bible-history.com/geography/maps/Map-Ancient-Near-East.gif)

  12. 3. Information sciences, 3.1 History: [Image: The river Euphrates near Mari]

  13. 3. Information sciences, 3.1 History: [Image: Part of the library at Mari]

  14. 3. Information sciences, 3.1 History: Each document in a library is labelled, and a catalogue recording its presence together with its shelf location is maintained. Until recently these catalogues were collections of cards, one per document, stored in alphabetically-arranged drawers which, in large libraries, could occupy entire rooms.

  15. 3. Information sciences, 3.1 History: Much of this traditional information retrieval system remains a standard aspect of every library, and will remain so for the foreseeable future because so much legacy information is in manuscript and printed form. For digital electronic text, however, new forms of information retrieval have been developed.

  16. 3. Information sciences, 3.2 Digital electronic information retrieval: Digital electronic text is archived and searched, and the information it contains is retrieved, by software applications designed for the purpose. These applications fall into two broad categories: those that are essentially digital electronic versions of the traditional library-based organizational structure for manuscript and print text, and those that dispense with this structure and are able to access, search, and retrieve information from unstructured document collections. We will look at these separately.

  17. 3. Information sciences, 3.2 Digital electronic information retrieval, 3.2.1 Computational implementations of traditional library-based information structure: Many digital electronic information retrieval systems are based on a single and highly efficient conceptual structure: the tree. Linguists are familiar with trees as a means of representing sentence phrase structure, but they are not confined to that and can represent structure of any kind. Most computer users will, for example, be familiar with a representation of their computer's file system which looks something like the next slide.

  18. 3. Information sciences: The structure in (a) is in fact a tree shown horizontally, that is, with the branches pointing to the side rather than down. The exactly equivalent and more familiar vertical version is shown in (b). This tree structure is a way of organizing the computer's files conceptually.

  19. 3. Information sciences: Books in an electronic library catalogue can be organized in exactly the same way. The following tree is identical in terms of structure to (b) previously, but has been relabelled in a way appropriate to a library catalogue. This tree imposes a conceptual structure on the library's holdings, and any cataloguing software application can use it to locate books efficiently.
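
As an illustration of a catalogue organized as a tree, the following minimal Python sketch stores a tiny catalogue as nested dictionaries and recovers the path from the root to one author. Every label except 'Computational linguistics' and 'Mann' is invented for the example, and the function name find_path is hypothetical; the sketch shows the structure only and walks the tree naively rather than searching it efficiently.

```python
# Hypothetical catalogue tree: inner nodes are dicts, leaves are lists of authors.
# All labels are invented for illustration.
catalogue = {
    "Library": {
        "Humanities": {
            "Linguistics": {
                "Computational linguistics": ["Mann", "Jones"],
                "Phonetics": ["Smith"],
            },
            "History": {},
        },
        "Sciences": {
            "Physics": {},
        },
    }
}


def find_path(tree, target, path=()):
    """Return the tuple of branch labels leading to `target`, or None if absent."""
    if isinstance(tree, list):  # a leaf: a list of author names
        return path + (target,) if target in tree else None
    for label, subtree in tree.items():
        found = find_path(subtree, target, path + (label,))
        if found is not None:
            return found
    return None


path = find_path(catalogue, "Mann")
print(" > ".join(path))
```

The printed path names each branch taken from the root down to the author, which is the conceptual structure the tree imposes on the holdings.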

  20. 3. Information sciences, 3.2 Digital electronic information retrieval, 3.2.1 Computational implementations of traditional library-based information structure: In what sense is a tree efficient? Let's say a library has grown over centuries, and each time a new book comes in it is added to a list of its holdings; the library has now grown very large, and contains 10 million books. Because the list grew haphazardly over time and thus lacks any structure, users have to search the card index sequentially one card at a time until they find the books they are looking for, or, if the catalogue is electronic, the software has to carry out such sequential searches.

  21. 3. Information sciences, 3.2 Digital electronic information retrieval, 3.2.1 Computational implementations of traditional library-based information structure: Let's say a user wants to see books by an author named Mann: she might be lucky and find this author as the first entry in the list, or be unlucky and have to search until she reaches the last, that is, the 10 millionth, entry; on average, a user can expect to have to search 5 million entries until the required item is found. This level of inefficiency is intractable for a human and needlessly time-consuming for a computer.
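
The best, worst, and average cases of sequential search can be checked with a short Python sketch. It uses a list of 1,000 made-up entries as a small stand-in for the 10 million in the text; the entry names and the function name sequential_search are assumptions for illustration.

```python
def sequential_search(entries, target):
    """Scan the list from the start; return how many entries were examined."""
    for steps, entry in enumerate(entries, start=1):
        if entry == target:
            return steps
    return None  # target is not in the list


n = 1_000  # small stand-in for the 10 million entries in the text
entries = [f"author{i}" for i in range(n)]  # invented, unordered-by-design labels

best = sequential_search(entries, "author0")           # first entry
worst = sequential_search(entries, f"author{n - 1}")   # last entry
average = sum(sequential_search(entries, e) for e in entries) / n

print(best, worst, average)  # 1 1000 500.5
```

The average is (n + 1) / 2, so for 10 million entries it is about 5 million, as stated above.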

  22. 3. Information sciences, 3.2 Digital electronic information retrieval: Alternatively, one could reform the haphazard list of holdings into a tree. A small fragment of the 'computational linguistics' subtree in the preceding figure might look like this.

  23. 3. Information sciences, 3.2 Digital electronic information retrieval: The full tree containing all the authors in the catalogue would be very large and could not conveniently be displayed. Only the path to 'MANN' is therefore shown here; the small triangles indicate subtrees similar to the one for MANN. Only 12 steps rather than an average of 5 million are needed to find the required author. Searches for other authors would take one or two fewer steps or a few more, but the efficiency gain over sequential search remains huge.

  24. 3. Information sciences, 3.2 Digital electronic information retrieval: Such a tree is called a binary search tree because at each node (except at the leaves) there are two and only two branches. The efficiency lies in this: at each choice, the number of search possibilities is halved, and this narrows the search extremely quickly. Referring again to the foregoing example, the search sequence goes like this:
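
The halving principle can be sketched in Python. This is not the tree from the slide: it is a binary search over a sorted sequence, which makes the same sequence of halving choices as a perfectly balanced binary search tree. Sorted integers stand in for alphabetically ordered author names, and the function name binary_search is an assumption.

```python
import math


def binary_search(sorted_entries, target):
    """Return (index, steps); index is None if the target is absent."""
    lo, hi = 0, len(sorted_entries) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_entries[mid] == target:
            return mid, steps
        if sorted_entries[mid] < target:
            lo = mid + 1   # discard the lower half
        else:
            hi = mid - 1   # discard the upper half
    return None, steps


n = 10_000_000
entries = range(n)  # sorted integers standing in for ordered author names
index, steps = binary_search(entries, 7_654_321)

# Worst case for n entries: floor(log2(n)) + 1 halving steps.
max_steps = math.floor(math.log2(n)) + 1
print(index, steps, max_steps)
```

For 10 million entries the worst case is 24 steps; the exact count for any particular search depends on how many entries the tree holds and where the target sits in it.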

  25. 3. Information sciences, 3.2 Digital electronic information retrieval, 3.2.2 Information retrieval from unstructured document collections: The traditional primacy of libraries as an information source is increasingly being challenged by large collections of online text documents in general and by the Web in particular. These are not generally in disciplined composite structures like library catalogues, and the Web certainly is not, so some new approach to finding the documents which contain information of interest has to be found. That approach is to search for documents not by name or title, as in libraries, but by conceptual content.

  26. 3. Information sciences, 3.2 Digital electronic information retrieval, 3.2.2 Information retrieval from unstructured document collections: A foundational principle underlying such search is the mathematical concept of vector space. We will first look at this concept and then at its application in document retrieval.

  27. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: Geometry is based on human intuitions about the world around us: that we exist in a space, that there are directions in that space, that distances along those directions can be measured, that relative distances between and among objects in the space can be compared, that objects in the space themselves have size and shape which can be measured and described. The earliest geometries were attempts to define these intuitive notions of space, direction, distance, size, and shape in terms of abstract principles which could, on the one hand, be applied to scientific understanding of physical reality, and on the other to practical problems like construction and navigation.

  28. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: Basing their ideas on the first attempts at this by the early Mesopotamians and Egyptians, Greek philosophers from the seventh century BC onwards developed such abstract principles systematically, and their work culminated in the geometrical system attributed to Euclid. This Euclidean geometry was the unquestioned framework for understanding of physical reality until the 18th century CE.

  29. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: Developments in European mathematics from the 18th century CE onwards showed that Euclidean geometry was not the only logically possible one, and that it was in fact problematical with respect to a fundamental Euclidean notion --that of parallel lines. In the 19th century, moreover, developments in science increasingly showed that human intuitions about space were in fact mistaken and therefore not a good basis for a geometrical understanding of the physical world. New geometries that both addressed the mathematical problems of Euclidean geometry and were better suited to scientific results about the structure of physical reality were developed.

  30. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: For the human-scale world, however, Euclidean geometry remains an adequate and useful interpretative framework for physical reality, and it is so used in a range of sciences and branches of engineering that deal with human-scale problem areas. Document retrieval is one of these areas, and as such the present discussion is based on Euclidean geometry. Some fundamental ideas of that geometry are presented in what follows.

  31. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: Euclidean geometry describes the structure of the physical world in terms of an abstract space defined by axes. A 1-dimensional Euclidean space is one in which certain types of physical property can be described, such as distance between objects. Only one dimension is required to describe distance fully --a single numerical measure.

  32. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: The corresponding 1-dimensional Euclidean space is an axis, graphically represented as a line, which has a maximum length and which is divided into intervals between 0 and the maximum. Any physical measurement is then represented by a point on that axis line.

  33. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: There are some kinds of physical property which cannot be described by only one dimension, such as the area of, say, a farmer's field. Two measurements are required, length and width, and these are represented in Euclidean geometry as a 2-dimensional space defined by two axes at right angles to one another. One axis represents length and the other width, each with appropriate maximum and gradations; the axes are at right angles to represent the independence of the two dimensions --a field can be as long as one likes, and that length has no implications for its width.

  34. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: There are still other kinds of physical property which cannot be described in two dimensions but require 3, such as the volume of a box. Three measurements are required, length, width, and height, and these are represented in Euclidean geometry as a 3-dimensional space defined by three axes all at right angles to one another for the reason just given in the 2-dimensional case.

  35. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: Euclid stopped at three dimensions, since he and Greek philosophers generally were concerned with what they took to be abstractions of fundamental forms in the natural world --lines, squares, triangles, circles, spheres, and so on-- and three dimensions were sufficient for this. Modern geometry has, however, extended the notion of Euclidean space to arbitrary dimensionalities --4, 5, 10, 20, 1000, and beyond. The motivation for doing this is the insight that Euclidean space can be used far more generally in description of the world than the Greeks originally intended.

  36. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: The Greeks wanted to describe fundamental natural forms, as noted, but there is no reason to restrict Euclidean space to that. IQ, for example, has nothing to do with fundamental forms, but it is 1-dimensional in that it requires only one measurement and can be represented in a 1-dimensional Euclidean space. A social profile in terms of income and age again has nothing to do with fundamental forms, but it requires two measurements and can be represented in a 2-dimensional Euclidean space. Characteristics of plants in terms of height, petal length, and flowering duration have nothing to do with fundamental forms but can be represented in 3-dimensional space.

  37. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: This continues indefinitely: the national economy can be represented by an arbitrarily large number of dimensions --GDP, balance of payments, taxation revenue, average income, interest rate, and so on to some number n of dimensions, and this could be represented using an n-dimensional Euclidean space.

  38. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: The obvious objection is that it is pretty much impossible to think about spaces of dimensionality higher than 3 or to represent them graphically as in preceding examples, but this objection is based on an ambiguity in and thus confusion of senses of the word 'space'. The Greeks were concerned with physical space with a maximum of three dimensions; we have moved away from physical space to a sense which would be better described as 'measurements' that have no connection with physical space. Mathematically there is no problem with Euclidean spaces of dimensionality higher than 3, and we will be using such spaces extensively; physical notions of space are, however, a good metaphor for conceptualizing n-dimensional spaces.

  39. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: In mathematical terms, a Euclidean space is a vector space. A vector is a sequence of n numbers, and the sequence is conventionally represented as comma-separated numerals between square brackets. The figure below shows n = 4 real-valued numbers, where the first number v1 is 1.6, the second v2 is 2.4, and so on.

  40. 3. Information sciences, 3.2 Digital electronic information retrieval, Vector space: A vector has a Euclidean geometrical interpretation: (1) the dimensionality of the vector, that is, the number of its components n, defines an n-dimensional space; (2) the sequence of n numbers comprising the vector specifies the coordinates of the vector in the space; (3) the vector itself is a point at the specified coordinates.

  41. 3. Information sciences, 3.2 Digital electronic information retrieval: For example, the components of the 2-dimensional vector v = [36, 160] in (a) below are its coordinates in a 2-dimensional vector space with axes 0..100 and 0..200, counting 36 along the horizontal axis and 160 along the vertical, and the components of the 3-dimensional vector v = [36, 160, 30] in (b) are its coordinates in a 3-dimensional vector space with axes 0..100, 0..300, and 0..100, counting 36 along the horizontal axis, 160 along the vertical, and 30 along the third.
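
The geometrical reading of a vector can be made concrete with a small Python sketch. It uses the two example vectors in their bracket notation; the range 0..100 for the third axis is an assumption, and the function names dimensionality and within_axes are hypothetical.

```python
def dimensionality(v):
    """The number of components n defines an n-dimensional space."""
    return len(v)


def within_axes(v, axes):
    """True if every coordinate lies inside its axis range (low, high)."""
    return len(v) == len(axes) and all(
        low <= x <= high for x, (low, high) in zip(v, axes)
    )


v2 = [36, 160]                            # 2-dimensional example vector
axes2 = [(0, 100), (0, 200)]

v3 = [36, 160, 30]                        # 3-dimensional example vector
axes3 = [(0, 100), (0, 300), (0, 100)]    # third range is assumed

print(dimensionality(v2), within_axes(v2, axes2))  # 2 True
print(dimensionality(v3), within_axes(v3, axes3))  # 3 True
```

Each vector is a single point: its components are read off one axis at a time, in order.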

  42. 3. Information sciences, 3.2 Digital electronic information retrieval: More than one vector can exist in a given vector space, as shown below.

  43. 3. Information sciences, 3.2 Digital electronic information retrieval, Application of vector space concept to document retrieval: Let's say we have a large number of 3-dimensional vectors and plot them in a three-dimensional space. They might look like this: Looking at their distribution, they look random, that is, there is no obvious patterning in their distance relationships to one another.

  44. 3. Information sciences, 3.2 Digital electronic information retrieval, Application of vector space concept to document retrieval: Now take another set of vectors and plot them in the space: This time there is a pattern: the vectors cluster in three groups: A and B are quite close to one another, and both are quite distant from C.
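
"Close" and "distant" can be given a number by measuring Euclidean distance between points. The Python sketch below uses three invented clusters of 3-dimensional vectors (the coordinates are assumptions, not the values plotted on the slide) and compares the distances between the cluster centres.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two points of equal dimensionality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    """Component-wise mean of a list of vectors: the centre of a cluster."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Invented coordinates, chosen so that A and B lie near each other and C far away.
clusters = {
    "A": [[10, 12, 11], [12, 10, 13], [11, 11, 12]],
    "B": [[20, 22, 21], [22, 20, 23], [21, 21, 22]],
    "C": [[80, 85, 90], [82, 83, 88], [84, 84, 92]],
}

centres = {name: centroid(vectors) for name, vectors in clusters.items()}
d_ab = euclidean_distance(centres["A"], centres["B"])
d_ac = euclidean_distance(centres["A"], centres["C"])
d_bc = euclidean_distance(centres["B"], centres["C"])

print(round(d_ab, 1), round(d_ac, 1), round(d_bc, 1))
```

With these values the A-B distance is several times smaller than either distance to C, which is the pattern described above.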

  45. 3. Information sciences, 3.2 Digital electronic information retrieval: Now, what if the vectors had an interpretation, that is, each vector represented a document? We would be able to say, based on relative distance, that the documents in cluster A and those in cluster B are quite similar, and that both are quite different from cluster C. Such an interpretation presupposes that the documents have been described in some meaningful way, that is, in a way that captures what each one is about and that allows them to be compared in terms of that 'aboutness'.

  46. 3. Information sciences, 3.2 Digital electronic information retrieval: Here's a simple way of doing this. Say we have 100 documents, and each one is described by how frequently three words which are taken to be of particular interest occur --one might, for example, be interested in retrieving all the documents having to do with libraries. Construct a list of 100 vectors, one for each document, such that each vector is 3-dimensional, each of the dimensions represents a different word, and the numbers in the vectors are frequencies of occurrence in the documents.
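
Building such frequency vectors can be sketched in Python as follows. The two short documents are invented, the third word 'library' is an assumption (only 'book' and 'date' are named in this lecture), and the function name term_vector is hypothetical. The sketch counts exact word forms only, so 'books' would not count as 'book'.

```python
import re
from collections import Counter

# The three words of interest; 'library' is assumed for illustration.
TERMS = ("book", "date", "library")


def term_vector(text, terms=TERMS):
    """Return one frequency per term, in the fixed order given by `terms`."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [counts[term] for term in terms]


# Invented documents standing in for the 100 in the text.
documents = {
    "doc_a": "The book was returned after its due date. Another book is on loan.",
    "doc_b": "The weather was fine and the match went ahead.",
}

vectors = {name: term_vector(text) for name, text in documents.items()}
print(vectors)  # {'doc_a': [2, 1, 0], 'doc_b': [0, 0, 0]}
```

Because every vector lists the same words in the same order, each word corresponds to one axis of the space and the documents become comparable points in it.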

  47. 3. Information sciences, 3.2 Digital electronic information retrieval: Document 1 contains the word 'book' 3 times, document 4 the word 'date' 27 times, and so on. Just looking at the table reveals three groups: (i) documents 1, 2, 3, 5, 8, which contain few instances of the words, (ii) documents 9, 100, which contain a moderate number of them, and (iii) documents 4, 6, 7, which contain a large number of instances. If one were interested in retrieving the documents which are most about libraries, one would select group (iii), that is, cluster C.
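
The grouping can be reproduced in a short Python sketch. The table below is hypothetical: only the two counts quoted above (document 1, 'book', 3; document 4, 'date', 27) come from the lecture, and every other number, the third word 'library', and the two thresholds are assumptions chosen to give the same three groups. Documents are grouped by the length of their vectors, that is, by their distance from the origin of the space.

```python
import math

# Columns: (book, date, library). Hypothetical values except
# document 1 'book' = 3 and document 4 'date' = 27.
table = {
    1: (3, 2, 1),
    2: (2, 1, 3),
    3: (1, 3, 2),
    4: (25, 27, 30),
    5: (2, 2, 2),
    6: (28, 24, 26),
    7: (30, 29, 27),
    8: (1, 2, 3),
    9: (12, 14, 13),
    100: (13, 12, 15),
}


def length(v):
    """Euclidean distance of the vector from the origin."""
    return math.sqrt(sum(x * x for x in v))


def group_of(v, low=10.0, high=35.0):
    """Assign a group label; the thresholds are arbitrary illustrative cut-offs."""
    size = length(v)
    if size < low:
        return "few"
    if size < high:
        return "moderate"
    return "many"


groups = {"few": [], "moderate": [], "many": []}
for doc_id, vector in table.items():
    groups[group_of(vector)].append(doc_id)

print(groups)
# {'few': [1, 2, 3, 5, 8], 'moderate': [9, 100], 'many': [4, 6, 7]}
```

The documents in the 'many' group are the ones a search for library-related material would return first.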

  48. 3. Information sciences, 3.2 Digital electronic information retrieval: Web search engines use this basic idea, though the word-vectors are much higher-dimensional. Using the vector space concept, therefore, it is possible to extract information from a huge unstructured document collection like the Web, which would otherwise be completely intractable.
