Building an Intelligent Web by Rajendra Akerkar and Pawan Lingras
Presentation Outline • Introduction To Web Intelligence • Information Retrieval • Semantic Web • The Role of Traditional Data Mining
The past few years have produced an enormous amount of written information, resulting in the advent of massive digital libraries. • With the advent of the World Wide Web, publishing is no longer the domain of a small number of elite scholars.
Search engines make it possible for readers around the world to discover publications. • But despite the best efforts of today's search engines, the abundance of information on the Web is mostly unorganized. Making sense of the available data is a very difficult task.
Data Mining and the Web Data-mining techniques are used extensively to extract required information from databases. Why not use the same techniques to extract implicit and previously unknown information from the massive collection of documents available on the Web, which, in a sense, can be viewed as one large database?
What is Web Mining? • In order to extract useful information from the Web, we may use existing data-mining techniques, as well as new techniques designed specifically for the Web. Web mining—data mining applied to the Web—can be said to include the following techniques: • Clustering: Finding natural groupings of users or pages • Classification and prediction: Determining the class or behavior of a user or resource • Associations: Determining which URLs tend to be requested together • Sequence Analysis: Studying the order in which URLs tend to be accessed
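The association technique above can be sketched with a small example: counting which URLs tend to be requested together across user sessions. The session data and minimum-support threshold here are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from itertools import combinations

def url_associations(sessions, min_support=2):
    """Count how often each pair of URLs occurs in the same session.

    sessions: list of sets of URLs, one set per user session (hypothetical input).
    Returns the pairs seen together in at least `min_support` sessions.
    """
    pair_counts = Counter()
    for session in sessions:
        # every unordered pair of URLs requested in this session
        for pair in combinations(sorted(session), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
]
print(url_associations(sessions))  # → {('/home', '/products'): 2}
```

A production system would mine such pairs from server logs and use measures such as support and confidence, but the core counting step looks like this.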
How do we interact with the Web? • Gaining new knowledge from the Web • Searching for relevant information • Personalizing Web pages • Learning about individual users
Benefits of Web Mining • It provides techniques to create direct solutions for our Web interactions. • Can be used as part of a bigger application that addresses a wider issue. • For more information on Web intelligence, visit the Web Intelligence Consortium website at http://wi-consortium.org
Information Retrieval • The Information Retrieval (IR) community has been tackling querying issues for several decades, but we are just beginning to see the appearance of text-based knowledge-discovery systems. • It is important to study the relationship of areas such as information retrieval, information extraction, and computational linguistics to text-data mining.
What is IR? • IR is concerned with finding and ranking documents that match the users' information needs. • Most search engines are adept at information retrieval, bringing a list of documents to the user's attention that may contain the desired information. • It is the user's responsibility to go through these documents to extract the information.
Early History of IR Vannevar Bush is credited with an early vision of information retrieval and hypertext. He proposed an imaginary information-retrieval machine called Memex in his 1945 essay "As We May Think."
SMART Gerard Salton, one of the prominent personalities in information retrieval, developed SMART, the System for the Manipulation and Retrieval of Text. SMART, first developed at Harvard and matured at Cornell, provided the first practical implementation of an IR system. The basic theoretical foundations of SMART still play a major role in today's IR systems.
The IR process consists of retrieving desired information from textual data. • A single generic IR solution cannot be applied to every website. • Web developers need to understand the fundamentals of information retrieval, including document representation, retrieval models, and analysis of retrieval performance.
The 4 Major Components of an IR process • document representation • query representation • ranking the documents by comparing them against a query using a retrieval model • evaluation of the quality of retrieval.
Text-based IR Documents on the Web come in a variety of formats, and their information may include text, graphics, audio, and video. However, multimedia information retrieval is still in its infancy.
Creating a list of words The first step in the transformation of a document is simply listing all the words in the document by splitting on spaces, tabs, new-line characters, and other special characters such as commas, periods, exclamation points, and parentheses.
Removing Stopwords • The second step is the removal of some of the most commonly occurring words. • Words that appear in the majority of documents will not be very useful in discriminating documents. • Natural candidates for stopwords are articles, prepositions, and conjunctions. • Another advantage of eliminating stopwords is the reduction in the size of the document representation.
Tokenizing Documents A list of words (also referred to as terms, index terms, or tokens) is created from a file. The words are listed in alphabetical order and exclude stopwords.
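The two steps above — splitting a document into words and removing stopwords — can be sketched as a short pipeline. The stopword list here is only an illustrative subset; real systems use much longer lists.

```python
import re

# Illustrative subset of a stopword list (articles, prepositions, conjunctions)
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on"}

def tokenize(text):
    """Split on whitespace and punctuation, lowercase, drop stopwords,
    and return the remaining tokens in alphabetical order.
    Duplicates are kept so that term frequencies can be counted later."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(w for w in words if w not in STOPWORDS)

print(tokenize("Data mining is the analysis of data."))
# → ['analysis', 'data', 'data', 'mining']
```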
Sample Document Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning) , the practice and use of these techniques have been largely ad-hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.
The list is alphabetically sorted to make it easy to count the frequency of each word. • It also underscores the fact that keyword-based retrieval tends to ignore the semantic structure of documents.
Stemming • A given word may occur in a variety of syntactic forms, such as plurals, past tense, or gerund forms (a noun derived from a verb). • A stem is what is left after a word’s affixes (prefixes and suffixes) are removed.
Why Use Stemming? • It can be argued that the use of stems will improve retrieval performance. • Users rarely specify the exact forms of the word they are looking for. • It seems reasonable to retrieve documents that contain a word similar to the one included in a user request. • Reduces the storage required for a document representation by reducing the number of distinct index terms.
Stemming Strategies • Maintain a table of all the words and their corresponding stems. • Use methods based on structural linguistics, or n-grams based on term clustering. • Affix removal, which is one of the simplest stemming strategies because it is intuitive and can be easily implemented. It may also be combined with table lookup for those words that cannot be easily stemmed.
Porter’s Stemming Algorithm • Although affixes mean prefixes and suffixes, suffixes appear more frequently than prefixes. • Martin Porter proposed the most popular algorithm, the Porter algorithm, which is known for its simplicity and elegance. • Even though it is simple, the stemming results from the Porter algorithm compare favorably to more sophisticated algorithms. • http://www.tartarus.org/~martin/PorterStemmer/index.html
Basics for Porter’s Stemmer • Letters A, E, I, O, and U are vowels. A consonant is a letter other than A, E, I, O, or U, with the exception of Y. The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant. • A consonant in the algorithm description is denoted by c, and a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms: • CVCV ... C • CVCV ... V • VCVC ... C • VCVC ... V
Square brackets are used to denote the optional presence of a sequence. Therefore, a single form can represent the previous four forms : [C] VCVC ... [V]
Braces { } are used to represent repetition; for example, (VC){m} means VC repeated m times, therefore all words can also be written as [C] (VC) {m} [V]
Measure of a Word m is called the "measure" of any word or word part. For a "null" word, m = 0. The following are some examples of various values of the measure: • m = 0 TR, EE, TREE, Y, BY. • m = 1 TROUBLE, OATS, TREES, IVY. • m = 2 TROUBLES, PRIVATE, OATEN, ORRERY.
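The measure m can be computed directly from the definitions above: classify each letter as a consonant or vowel, collapse runs into the [C](VC){m}[V] form, and count the VC transitions. A minimal sketch:

```python
def is_consonant(word, i):
    """Porter's rule: A, E, I, O, U are vowels; Y is a vowel when
    preceded by a consonant, otherwise Y is a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Return m in the form [C](VC){m}[V]: the number of
    vowel-run/consonant-run (VC) transitions in the word."""
    forms = ""
    for i in range(len(word)):
        f = "c" if is_consonant(word, i) else "v"
        if not forms or forms[-1] != f:  # collapse runs, e.g. 'ccvv' -> 'cv'
            forms += f
    return forms.count("vc")

for w in ["TREE", "BY", "TROUBLE", "OATS", "TROUBLES", "PRIVATE"]:
    print(w, measure(w))  # reproduces the m = 0, 1, 2 examples above
```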
Rule Forms • The rules for removing a suffix are given in the form (condition) S1 → S2 • The condition is usually given in terms of m. • If the stem before S1 satisfies the condition, then S1 is replaced by S2.
The Porter algorithm consists of five steps which strip away suffixes in successive passes. For example, step 1 removes suffixes denoting plurals and past participles. Complex suffixes are removed in several stages.
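As an example of how these ordered rules are applied, step 1a of Porter's published algorithm handles plurals with four rules (SSES → SS, IES → I, SS → SS, S → null), where the longest-matching suffix wins:

```python
def porter_step_1a(word):
    """Step 1a of Porter's algorithm: strip plural endings.
    Rules are tried in order; the first (longest) matching suffix applies."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS:  caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES -> I:    ponies -> poni
    if word.endswith("ss"):
        return word        # SS -> SS:    caress -> caress
    if word.endswith("s"):
        return word[:-1]   # S -> null:   cats -> cat
    return word

print(porter_step_1a("caresses"), porter_step_1a("ponies"), porter_step_1a("cats"))
# → caress poni cat
```

The full algorithm applies the remaining steps in the same rule-table style, with conditions on the measure m guarding each replacement.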
Examples Using Porter’s Stemmer
GENERALIZATIONS → Step 1: GENERALIZATION → Step 2: GENERALIZE → Step 3: GENERAL → Step 4: GENER
OSCILLATORS → Step 1: OSCILLATOR → Step 2: OSCILLATE → Step 4: OSCILL → Step 5: OSCIL
Effect of Measure on Stemming The algorithm does not remove a suffix when the length of the stem is small. The words from List A have small measures, hence -ate is not removed. However, -ate is removed from the words from List B, which have larger measures.
Reductions Due to Suffix Stripping Example based on a vocabulary of 10,000 words: • Number of words reduced in step 1: 3597 • Step 2: 766 • Step 3: 327 • Step 4: 2424 • Step 5: 1373 • Number of words not reduced: 3650
Term Document Matrix • A two-dimensional representation of a document collection. • The rows of the matrix represent various documents, and the columns correspond to various index terms. • The values in the matrix can be either the frequency or weight of the index term in the document.
Representation of TDM Using Triplets Each triplet is (document, term, frequency): (0,2,5) (0,3,2) (0,5,1) (0,6,2) (1,0,4) (1,1,1) (1,2,1) (2,1,3) (2,4,4) (2,5,2) (2,6,7) (3,0,2) (3,1,2) (3,3,7) (3,4,1) (4,3,5) (4,4,2) (4,5,1) (4,6,1) (5,0,7) (5,2,2) (5,6,3) (6,4,2) (6,5,3)
Weight of Each Term in the TDM A common choice normalizes each frequency by the length of the document vector: wij = freqij / √(freqi1² + freqi2² + ... + freqim²), where wij is the weight, and freqij is the frequency of the jth keyword in the ith document. It is assumed that there are m terms in the document collection; that is, the number of columns in the TDM is m.
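The sparse-triplet representation and the weighting step can be tied together in a short sketch. The normalization used here (dividing each frequency by the Euclidean length of the document's vector, so every row has unit length) is one common choice; other weighting schemes such as tf-idf are equally valid.

```python
import math

# (document, term, frequency) triplets from the sample TDM
triplets = [
    (0, 2, 5), (0, 3, 2), (0, 5, 1), (0, 6, 2),
    (1, 0, 4), (1, 1, 1), (1, 2, 1),
    (2, 1, 3), (2, 4, 4), (2, 5, 2), (2, 6, 7),
    (3, 0, 2), (3, 1, 2), (3, 3, 7), (3, 4, 1),
    (4, 3, 5), (4, 4, 2), (4, 5, 1), (4, 6, 1),
    (5, 0, 7), (5, 2, 2), (5, 6, 3),
    (6, 4, 2), (6, 5, 3),
]

n_docs = 1 + max(d for d, _, _ in triplets)
m_terms = 1 + max(t for _, t, _ in triplets)

# Expand the sparse triplets into a dense frequency matrix
freq = [[0] * m_terms for _ in range(n_docs)]
for d, t, f in triplets:
    freq[d][t] = f

# Length-normalize each document row so its vector has unit length
weights = []
for row in freq:
    norm = math.sqrt(sum(f * f for f in row))
    weights.append([f / norm for f in row])

print(weights[0])
```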
Standard Document Collections The explanations so far have been based on small document sets with short documents. Real experimentation requires a larger document set with longer documents. Standard document collections are available to IR researchers for conducting studies.
Text Retrieval Conference (TREC) – collection cosponsored by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense. http://trec.nist.gov/ The TREC test collections and evaluation software are available for anyone to evaluate the retrieval effectiveness of their systems at any time • CACM and CISI Collections – part of the SMART project. The CACM collections (ftp://ftp.cs.cornell.edu/pub/smart/cacm/) consist of abstracts from the Communications of the ACM (CACM) from the first issue in 1958 to the last issue in 1979. The CISI (ftp://ftp.cs.cornell.edu/pub/smart/cisi/) collections consist of the 1460 most cited documents in information sciences.
Linguistic Model for Document Representation • A term phrase consists of terms that tend to occur close to each other in the same sentence. Use of term phrases is expected to improve precision. • A thesaurus consists of a list of words with similar meaning. The thesaurus-group generation is designed to improve recall.
Handling HTML and Other Formats • Text-based IR can be easily extended to documents in other formats. • For most of the document formats, there usually exists a utility for converting it to machine-readable text; but these conversions usually do not retain all the formatting features. • The text conversion also leads to a loss of other embedded objects, such as pictures or mathematical equations.
Classic Retrieval Models • Boolean • Vector Space • Arguably the most popular IR model • Probabilistic
Boolean Retrieval Model The Boolean retrieval model uses the standard and, or, and not Boolean operators.
DNF Boolean queries can be represented in a “disjunctive normal form” (DNF) which can make processing very efficient. If any one of the conjunctive expressions is true, the entire DNF will be true. If none of the conjunctive expressions match a given document, that document is considered nonrelevant to the query.
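The DNF evaluation described above can be sketched directly: represent a document as a set of index terms and the query as a list of conjunctions, where each conjunction lists the required and the negated terms. The document matches if any one conjunction is satisfied. The example query and documents are hypothetical.

```python
def matches_dnf(doc_terms, dnf_query):
    """Evaluate a Boolean query in disjunctive normal form.

    doc_terms: set of index terms appearing in the document.
    dnf_query: list of conjunctions; each is a pair
               (required_terms, negated_terms).
    The query is true if any one conjunction is satisfied.
    """
    return any(
        required <= doc_terms and not (negated & doc_terms)
        for required, negated in dnf_query
    )

# (data AND mining AND NOT gold) OR (text AND retrieval)
query = [({"data", "mining"}, {"gold"}),
         ({"text", "retrieval"}, set())]

print(matches_dnf({"data", "mining", "web"}, query))   # prints True
print(matches_dnf({"data", "mining", "gold"}, query))  # prints False
```

Because evaluation stops at the first satisfied conjunction, and a document failing every conjunction is nonrelevant, this mirrors the efficiency argument in the text.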