Lecture Outline • Introduction • Data Representations For Text • Preprocessing • Dimensionality Reduction • Text-data Analysis • IE • IR & IF • Clustering • Categorization
Introduction • The importance of text is directly associated with the fact that it currently expresses a vast, rich range of information in an unstructured or semi-structured format. • However, information in text is difficult to retrieve automatically. • Text mining is defined as the process of automatically extracting useful knowledge from enormous collections of natural text documents (referred to as document collections) which are very dynamic and contain documents from various sources.
One of the main reasons for the rapid growth in the sizes of those collections is the vast and continuous spread of information over the Internet • Text mining emerged as an independent research area from the combination of older research areas like machine learning, natural language processing, and information retrieval • It is sometimes viewed as an adapted form of a very similar research field, namely data mining.
Data mining deals with structured data represented in relational tables or multidimensional cubes. • Being based on machine learning, both fields share a great deal of ideas and algorithms; however, each deals with a different type of data and thus has to adapt its ideas and algorithms to its own perspective.
Data Representations For Text • Text mining deals with text data which are unstructured (or semi-structured in the case of XML) in nature. • In order to alleviate this problem, indexing is utilized. • Indexing is the process of mapping a document into a structured format that represents its content. • It can be applied to the whole document or to some parts of it, though the former option is the rule.
In indexing, usually the terms occurring in the given collection of documents are used to represent the documents. • Documents contain a lot of terms that are frequently repeated or that have no significant relation to the context in which they exist. • Using all the terms would certainly result in high inefficiency that could be eliminated with some preprocessing steps, as we shall see later.
The Vector Space Model • One very widely used indexing model is the vector space model which is based on the bag-of-words or set-of-words approach. • This model has the advantages of being relatively computationally efficient and having conceptual simplicity. • However, it suffers from the fact that it loses important information about the original text, such as information about the order of the terms in the text or about the frontiers between sentences or paragraphs.
Each document is represented as a vector, the dimensions of which are the terms in the initial document collection. • The set of terms used as dimensions is referred to as the term space. • Each vector coordinate is a term having a numeric value representing its relevance to the document. Usually, higher values imply higher relevance. • The process of giving numeric values to vector coordinates is referred to as weighting. • From an indexing point of view, weighting is the process of giving more emphasis to more important terms.
Three popular weighting schemes have been discussed in the literature: • binary, • TF, and • TF*IDF • For a term t in document d, the binary scheme records binary coordinate values, where a 1-value is given to t if it occurs at least once in d, and a 0-value is given otherwise.
The term frequency (TF) scheme records the number of occurrences of t in d. Usually, TF measurements are normalized to help overcome the problems associated with document sizes. • Normalization may be achieved by dividing all coordinate measurements for every document by the highest coordinate measurement for that document.
The term frequency by inverse document frequency (TF*IDF) scheme simply weights TF measurements with a global weight, the IDF measurement. • The IDF measurement for a term t is defined as log2(N/Nt), where N is the total number of documents in the collection, and Nt is the total number of documents containing at least one occurrence of t. • Note that the IDF weight increases as Nt decreases, i.e., as the uniqueness of the term among the documents increases, thus giving the term a higher weight. • As in TF, normalization is usually done here.
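The three weighting schemes can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the toy collection, and the four-term term space are assumptions made for the example. Normalization divides by the highest coordinate of the document, as described above.

```python
import math
from collections import Counter

def binary_weights(doc, term_space):
    """1 if the term occurs at least once in the document, else 0."""
    present = set(doc)
    return [1 if t in present else 0 for t in term_space]

def normalized_tf(doc, term_space):
    """Raw counts divided by the highest count in the same document."""
    counts = Counter(doc)
    raw = [counts[t] for t in term_space]
    peak = max(raw, default=0)
    return [c / peak if peak else 0.0 for c in raw]

def idf(docs):
    """IDF(t) = log2(N / Nt) for every term t in the collection."""
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    return {t: math.log2(n / nt) for t, nt in doc_freq.items()}

def tf_idf(doc, term_space, idf_weights):
    """TF * IDF per term, divided by the document's highest value."""
    counts = Counter(doc)
    raw = [counts[t] * idf_weights.get(t, 0.0) for t in term_space]
    peak = max(raw, default=0.0)
    return [w / peak if peak else 0.0 for w in raw]

# Toy collection (hypothetical); each document is a list of terms.
docs = [
    "the blood test blood".split(),
    "the world cup".split(),
    "the blood cell count".split(),
    "the cup final".split(),
]
term_space = ["blood", "cup", "test", "the"]
idf_weights = idf(docs)

print(binary_weights(docs[0], term_space))       # [1, 0, 1, 1]
print(normalized_tf(docs[0], term_space))        # [1.0, 0.0, 0.5, 0.5]
print(tf_idf(docs[0], term_space, idf_weights))  # [1.0, 0.0, 1.0, 0.0]
```

Note how "the", which occurs in every document, has an IDF of 0 and drops out under TF*IDF, while the rarer "test" ends up with the same weight as the more frequent "blood".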
More Sophisticated Representations • A number of experiments were conducted in an attempt to find representations that are more sophisticated than the term-based scheme. • Some tried using phrases for indexing instead of terms, while others used string kernels • One noteworthy model which has been applied mainly in document categorization is the Darmstadt Indexing Approach (DIA)
DIA considers properties of terms, documents, categories, or any pair-wise combination of any of those as dimensions. • Examples: • a property of a term is its IDF measurement • a property of a document is its length • a property of a relationship between a term t and a document d is the TF measurement of t in d or the location of t in d.
For every considered pair-wise combination of dimensions, every possible value is collected in a relevance description vector, rd (di,dj), where di,dj is a selected pair-wise combination of dimensions. • An example is the (document, category) pair where we collect the (document, category) values of all possible combinations and store them in the corresponding relevance description vector.
Sophisticated representations may appear to have superior qualities; however, most of the conducted experiments did not yield any significant improvement over the traditional term-based schemes • There are suggestions that the use of a combination of terms and phrases together might improve results • Technically, using single terms as the dimensions of the vectors used to describe documents is referred to as post-coordination of terms, while using compound words for the same purpose is referred to as pre-coordination of terms.
Preprocessing • Before indexing is performed, a sequence of preprocessing steps is usually performed in an attempt to optimize the indexing process • The main target is to reduce the number of terms used, thus leading to more efficient text-based applications later. • Case folding is the process of converting all the characters in a document into the same case, either all upper case or all lower case. • For example, the words “Did,” “DiD,” and “dID” are all converted to “did” or “DID” depending on the chosen case. • This step has the advantage of speeding up comparisons in the indexing process
Stemming is the process of removing prefixes and suffixes from words so that all words are reduced to their stems or original forms. • For example, the words “Computing,” “Computer,” and “Computational” all map to “Compute.” • This step has the advantage of eliminating suffixes and prefixes indicating part-of-speech and verbal or plural inflections. • Stemming algorithms employ a great deal of linguistics and are language dependent.
Stop words are words having no significant semantic relation to the context in which they exist. • Stop words can be terms that occur frequently in most of the documents in a document collection, i.e., have low uniqueness and thus low IDF measurements. • A list containing all the stop words in a system or application is referred to as a stop list. Stop words are not included as indexing terms. • For example, the words “the,” “on,” and “with” are usually stop words. Also the word “blood” would probably be a stop word in a collection of articles addressing blood infections, but not in a collection of articles describing the events of the 2002 FIFA World Cup.
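Case folding, stemming, and stop-word removal can be chained as in the sketch below. The stop list and the suffix list are toy assumptions for illustration. Real stemmers rely on far more linguistic knowledge, and this one returns truncated stems ("comput") instead of dictionary forms such as "Compute."

```python
STOP_LIST = {"the", "on", "with"}          # illustrative stop list (assumption)
SUFFIXES = ("ational", "ing", "er", "s")   # toy suffix list, longest first

def case_fold(text):
    """Convert every character to lower case."""
    return text.lower()

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Case-fold, split on whitespace, drop stop words, then stem."""
    tokens = case_fold(text).split()
    return [toy_stem(t) for t in tokens if t not in STOP_LIST]

print(preprocess("The Computer is Computing with Computational tools"))
# ['comput', 'is', 'comput', 'comput', 'tool']
```

Three distinct surface forms collapse into one indexing term, which is the reduction in term count that preprocessing aims for.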
An n-gram representation uses n-character slices of longer strings, referred to as n-grams, as dimensions. • The string “MINE” can be represented by the tri-grams _MI, MIN, INE, and NE_ • or the quad-grams _MIN, MINE, and INE_, where the underscore character is a space. • This representation offers an alternative to stemming and stop words. • The advantage of this representation is that it is less sensitive to grammatical and typographical errors and, unlike stemming, requires no linguistic knowledge. • Therefore, it is more language independent; however, in practice, it is not very effective in dimensionality reduction.
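The "MINE" example can be reproduced with a few lines. The function name `char_ngrams` is an assumption, and the underscore stands for the padding space, as in the text.

```python
def char_ngrams(word, n):
    """Return the n-character slices of the word padded with one '_' per side."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("MINE", 3))  # ['_MI', 'MIN', 'INE', 'NE_']
print(char_ngrams("MINE", 4))  # ['_MIN', 'MINE', 'INE_']
```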
Dimensionality Reduction • Perhaps one of the most evident problems in dealing with text is the high dimensionality of its representations • As a matter of fact, some claim that text mining came to light as a major research area only in the last decade when computer hardware became very affordable.
Dimensionality reduction is the process through which the number of vector coordinates (or dimensions) in the vector space model is reduced from T to T’, where T’<<T. • The new set of dimensions is referred to as the reduced term set. The issue of dimensionality reduction might be application dependent, i.e., text-clustering applications might use different schemes for dimensionality reduction than text-categorization applications.
The rationale behind this application dependency is that after dimensionality reduction is performed, some information is lost, regardless of the significance of this information. • To reduce the effect of this loss, each application uses dimensionality reduction schemes that ensure that the lost information has minimal significance on the application at hand. • However, many generic schemes have been developed that apply to most text-mining applications.
The general reduction schemes can be classified into two categories: • Reduction by term selection (term space reduction or TSR) • Reduces the number of terms from T to T1 (T1<<T) such that when T1 is used for indexing instead of T, T1 yields the most effective results when compared to any other subset of T, T2. • One simple approach to perform TSR is to randomly select a set of terms from T, apply the algorithm of the application, and compare the results to those received while using all the terms in T. • Then repeat this process a fixed number of times, each time with another random set from T. • Finally, we select the subset that yields the best results.
Another approach to determine T1 is to select the terms in T that have the highest scores according to some function. • One function might be the IDF measurement, i.e., select the terms having the highest IDF weights. • Reduction by term extraction • attempts to generate from T another set T1 where the terms in T1 do not necessarily exist in T. • A term in T1 can be a combination of terms from T or a transformation of some term (or group of terms) in T.
One approach is to use term clustering, where all the terms in T are clustered such that terms in the same cluster have a high degree of semantic relatedness. • The centers of the clusters are then used as the terms in T1. • This approach helps solve the problem of synonymy: terms having similar meaning with different spelling
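The two term-selection approaches described above (repeated random subsets, and picking the highest-scoring terms) can be sketched as follows. All names are assumptions. `evaluate` is a hypothetical callback standing in for "apply the algorithm of the application and measure the result", and the toy scoring function at the bottom exists only to make the example runnable.

```python
import math
import random
from collections import Counter

def select_terms_by_idf(docs, k):
    """Keep the k terms with the highest IDF = log2(N / Nt)."""
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = {t: math.log2(n / nt) for t, nt in doc_freq.items()}
    # Descending score; ties broken alphabetically so the result is stable.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

def random_term_search(terms, subset_size, trials, evaluate, seed=0):
    """Try `trials` random subsets and keep the one that `evaluate` scores best."""
    rng = random.Random(seed)
    pool = sorted(terms)
    best_subset, best_score = None, float("-inf")
    for _ in range(trials):
        subset = rng.sample(pool, subset_size)
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

docs = [
    "the blood test".split(),
    "the world cup".split(),
    "the blood cell count".split(),
    "the cup final".split(),
]
print(select_terms_by_idf(docs, 3))  # ['cell', 'count', 'final']

# Hypothetical quality measure: how many "useful" terms the subset retains.
useful = {"blood", "cup"}
all_terms = {t for d in docs for t in d}
subset, score = random_term_search(all_terms, 3, 20, lambda s: len(useful & set(s)))
print(subset, score)
```

Selecting purely by highest IDF favors terms that occur in a single document, which is one reason the scoring function is left open in the text.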
Text-data Analysis • Text mining is a very broad process which is usually refined into a number of tasks. • Today, the number of tasks available in text mining exceeds the number available in data mining, which is based entirely on three pillars: association rule mining, classification, and clustering.
In text mining, we have: • Information Extraction • Information Retrieval and Information Filtering • Document Clustering • Document Categorization
Information Extraction • Considered to be the most important text-mining task • Some even go to the extreme of using the two terms interchangeably, though this is not the case • IE has emerged as a joint research area between text-mining and natural language processing (NLP)
It is the process of extracting predefined information about objects and relationships among those objects from streams of documents and usually storing this information in pre-designed templates. • It is very important to associate information extraction with streams of documents rather than static collections.
As an example, some might extract information like promotions and sales from a stream of documents. • The information extracted might be the event, companies involved, or dates. • These kinds of systems are usually referred to as news-skimming systems.
IE proceeds in two steps • divides a document into relevant and irrelevant parts, • fills the predefined templates with the information extracted from the relevant parts. • Simple IE tasks, such as extracting proper names or companies from text, can be performed with high precision; • however, high precision is still not the case in more complex tasks, like determining sequences of events from a document. • In complex tasks, IE systems are usually defined over and applied on very restricted domains, and transforming the system from one domain to another needs a lot of work and requires the support of domain experts.
In short, IE systems scan streams of documents in order to transform the associated documents into much smaller bits of extracted relevant information which are easier to comprehend.
IE applications: • Template filling • fills templates with information extracted from a document. • templates are then stored in structured environments such as databases for fast information retrieval later • Question-answering • a variant of template filling
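The two IE steps can be illustrated with a deliberately naive sketch for a news-skimming template with event, companies, and dates. Everything here is an assumption made for the example: the trigger words, the company pattern (a capitalized word followed by Inc, Corp, or Ltd), the four-digit year pattern, and the sentence splitting on periods. Real IE systems use much richer linguistic analysis.

```python
import re

TRIGGERS = ("promotion", "sale")                        # hypothetical event cues
COMPANY = re.compile(r"\b[A-Z][a-zA-Z]+ (?:Inc|Corp|Ltd)\b")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def relevant_parts(text):
    """Step 1: keep only the sentences that mention a trigger word."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(t in s.lower() for t in TRIGGERS)]

def fill_template(sentence):
    """Step 2: fill the predefined template from one relevant sentence."""
    lowered = sentence.lower()
    event = next(t for t in TRIGGERS if t in lowered)
    return {
        "event": event,
        "companies": COMPANY.findall(sentence),
        "dates": YEAR.findall(sentence),
    }

def extract(text):
    return [fill_template(s) for s in relevant_parts(text)]

text = ("Acme Corp announced the sale of a unit to Widget Ltd in 1999. "
        "The weather was mild that week.")
print(extract(text))
# [{'event': 'sale', 'companies': ['Acme Corp', 'Widget Ltd'], 'dates': ['1999']}]
```

The hard-coded patterns also show why moving an IE system to a new domain takes so much work: every cue is domain specific.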
can answer questions like, “Where is Lebanon?” • cannot answer complicated questions such as, "Which country had the lowest inflation in 1999?" • Summarization • maps documents into extracts which are machine-made summaries, as opposed to abstracts which are man-made summaries. • Applications usually extract a group of highly relevant sentences and present them as a summary.
Term extraction • extracts the most important terms from a set of documents • those extracted terms can be used for efficient indexing of the documents, as opposed to using all the terms in the document
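The summarization application, which extracts a group of highly relevant sentences, can be sketched with a simple frequency heuristic. This is one possible scoring rule chosen for illustration, not the method the slides prescribe. The stop list, the period-based sentence splitting, and the scoring function are all assumptions.

```python
from collections import Counter

STOP_LIST = {"the", "a", "of", "is", "in", "and", "was"}   # illustrative

def summarize(text, k=1):
    """Return the k highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tokenized = [
        [w for w in s.lower().split() if w not in STOP_LIST] for s in sentences
    ]
    # Collection-wide frequency of every non-stop word.
    freq = Counter(w for words in tokenized for w in words)
    # A sentence scores the sum of the frequencies of its words.
    scores = [sum(freq[w] for w in words) for words in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]

text = ("Text mining extracts knowledge from text. "
        "Mining text needs preprocessing. "
        "Lunch was served at noon.")
print(summarize(text, k=1))
# ['Text mining extracts knowledge from text']
```

A sum-based score favors long sentences; dividing by sentence length is a common variation.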
Information Retrieval and Information Filtering • Information retrieval (IR) and information filtering (IF) are two separate processes having equivalent underlying goals • They deal with the problem of information seeking • IR is viewed as the ancestor of IF • The reason for this view is that IR is older, and IF bases a lot of its foundations on IR • Most of the research done in IR has been used and/or adapted to fit IF • Many differences exist between the two that render them as two separate application areas.
Given some information need represented by the user in a suitable manner, they are concerned with giving back a set of documents that satisfy that need. • Information need is represented via queries in IR systems and via profiles in IF systems
Differences • Goal • For IR the primary goal is to collect and organize documents that match a given query according to some ranking function. The primary goal of IF is to distribute newly received documents to all users with matching profiles. • Users • IR systems are usually used once by a single user (a one-time query user), while IF systems are repeatedly used by the same user with some profile. • Usage • This fact makes IR systems more suitable to serve users with short-term needs as opposed to IF systems which serve users having needs that are relatively static over long periods of time.
Information Needs • IR systems are developed to tolerate some inadequacies in the query representation of the information need. On the other hand, profiles are assumed to be highly accurate in IF systems. • Data types • In terms of data, IR systems usually operate over static collections of documents, while IF systems deal with dynamic streams of documents. In IF, the timeliness of a document is of great significance, unlike in IR, where this can be negotiated.
Evaluation • The precision and recall measures are used. • Precision is the percentage of retrieved documents that are relevant to the query or profile. • It is calculated as the number of relevant documents retrieved divided by the total number of retrieved documents.
Recall is the percentage of all relevant documents in the database that are retrieved. • It is calculated as the number of relevant documents retrieved divided by the total number of relevant documents in the database. • Roughly speaking, those two measures have an inverse relationship. • As the number of results returned increases, the probability of having higher recall increases; but at the same time, the probability of returning wrong answers also increases, which lowers precision. • Ideally, applications aim to maximize both.
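Both measures follow directly from the set of retrieved documents and the set of relevant documents. A minimal sketch, with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)           # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision)  # 0.5 (2 of the 4 retrieved documents are relevant)
print(recall)     # 2 of the 3 relevant documents were retrieved, about 0.67
```

Returning every document in the database would push recall to 1.0 while precision collapses, which illustrates the inverse relationship.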
IR • IR refers to the retrieval of a set of documents from a collection of documents that match a certain query posed by a user • The retrieved documents are then rank ordered and presented to the user. • The query is simply the representation of the user’s information need in a language understood by the system. • This representation is considered as an approximation due to the difficulty associated with representing information needs accurately. The query is then matched against the documents, which are organized into text surrogates. • The collection of text surrogates can be viewed as a summarized structured representation of unstructured text data, such as lists of keywords, titles, or abstracts.
They provide an alternative to original documents as they take far less time to examine and at the same time encode enough semantic cues to be used in matching instead of the original documents. • As a result of matching, a set of documents would be selected and presented to the user. • The user either uses those documents or gives some feedback to the system resulting in modifications in the query and original information need or, in rare cases, in text surrogates • This interactive process goes on until the user is satisfied or until the user leaves the system.
IF • IF systems deal with large streams of incoming documents usually broadcast via remote sources. • It is sometimes referred to as document routing. • The system maintains profiles created by users to describe their long-term interests. • Profiles may describe what the user likes or dislikes. New incoming documents are removed from the stream routed to some user if those documents do not match the user’s profile. • As a result, the user only views what is left in the stream after the mismatching documents have been removed; • an email filter, for example, removes all “junk” email
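Matching a query against text surrogates and rank ordering the result can be sketched as below. Keyword-set surrogates and the overlap-count ranking function are assumptions chosen for simplicity; the document ids and keywords are hypothetical.

```python
def rank(query, surrogates):
    """Return ids of matching documents, best match first.

    `surrogates` maps a document id to its set of keywords; the score is
    the number of query words that appear in the surrogate.
    """
    query_terms = set(query.lower().split())
    scored = [(len(query_terms & kw), doc_id) for doc_id, kw in surrogates.items()]
    matches = [(s, d) for s, d in scored if s > 0]
    # Highest score first; ties broken by document id.
    return [d for s, d in sorted(matches, key=lambda p: (-p[0], p[1]))]

surrogates = {
    "d1": {"text", "mining", "clustering"},
    "d2": {"football", "world", "cup"},
    "d3": {"text", "retrieval"},
}
print(rank("text mining", surrogates))  # ['d1', 'd3']
```

User feedback would then modify the query string and the ranking would be recomputed, which is the interactive loop the text describes.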
The first step in using an IF system is to create a profile. • A profile represents a user’s or a group of users’ information need, which is assumed to be stable over a long period of time. • Whenever a new document is received through the data stream, the system represents it as text surrogates and compares it against every profile stored in the system. • If the document matches a profile, it will be routed to the corresponding user. The user can then use the received documents and/or provide feedback. • The feedback provided may lead to modifications in the profile and the information need.
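The routing step can be sketched with profiles that list liked and disliked keywords. The profile structure, the user names, and the matching rule (at least one liked term and no disliked term) are assumptions for illustration.

```python
def route(document, profiles):
    """Return the users whose profile matches the incoming document."""
    surrogate = set(document.lower().split())   # keyword-set text surrogate
    recipients = []
    for user, profile in profiles.items():
        liked = surrogate & profile["likes"]
        disliked = surrogate & profile["dislikes"]
        if liked and not disliked:
            recipients.append(user)
    return recipients

profiles = {
    "alice": {"likes": {"mining", "clustering"}, "dislikes": {"lottery"}},
    "bob": {"likes": {"football"}, "dislikes": set()},
}
stream = [
    "New clustering results for text mining",
    "Win the lottery with text mining",
    "Football world cup final",
]
for doc in stream:
    print(route(doc, profiles))
# ['alice']
# []
# ['bob']
```

Unlike the IR case, the profiles stay fixed while the documents arrive one at a time; feedback would update `profiles`, not the stream.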
Constructing IR and IF systems • A number of models have been utilized to construct IR and IF systems. • Most of those models were initially developed for the purpose of IR. • With the advent of IF, IR methods were adapted to fit IF needs. • IF stimulated more research which resulted in the development of more new models that are currently used in both areas.
In the literature, IR and IF models are categorized into two groups: traditional and modern • The string-matching traditional model: • The user specifies his/her information needs by a string of words. • A document would match the information need of a user if the user-specified string exists in the document. • This method is one of the earliest and simplest approaches.
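The string-matching model amounts to a substring test. In this sketch the case-insensitive comparison and the document ids are assumptions.

```python
def string_match(query, documents):
    """Return ids of documents that contain the query string verbatim."""
    needle = query.lower()
    return [doc_id for doc_id, text in documents.items() if needle in text.lower()]

documents = {
    "d1": "Text mining extracts knowledge from document collections.",
    "d2": "Data mining deals with structured data.",
}
print(string_match("text mining", documents))  # ['d1']
print(string_match("mining", documents))       # ['d1', 'd2']
```

The model returns an unranked set: a document either contains the string or it does not.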