Processing of Large Document Collections: Part 1 Course Schedule

Processing of large document collections Part 1 Helena Ahonen-Myka Fall 2002

Organization of the course • Classes: 30.9., 28.10., 29.10., 18.11. • lectures (Helena Ahonen-Myka): 13.15-17 • exercise sessions (Juha Makkonen): 10.15-12 • Exercises are given and returned each week • deadline: Thursday midnight • Exam: Tue 3.12. at 16-20 (Auditorio) • Points: exam 40 pts, exercises 20 pts • required: exam 20 pts, exercises 10 pts • exercise sessions give extra points (1/session)

Schedule • 30.9. • preprocessing of text; representation of textual data: vector model; term selection; text categorization • 28.-29.10. • character set issues; text summarization; text compression • 18.11. • text indexing and querying

Schedule • self-study • basic transformations for text data • using Unix tools, Perl and XSLT • get to know some linguistic tools • morphological and grammatical analysis • Wordnet

1. Large document collections • What is a document? • “a document records a message from people to people” (Wilkinson et al., 1998) • each document has content, structure, and metadata (context) • in this course, we concentrate on content • particularly: textual content

Large document collections • large? • a person may have written each document, but it is not possible to process the documents manually later -> automatic processing is needed • large w.r.t. the capacity of a device (e.g. a mobile phone) • collection? • documents somehow similar -> automatic processing is possible

Applications • text categorization • text summarization • text compression • text indexing and retrieval • information extraction • question answering • machine translation …

2. Representation of textual information • Text cannot be directly interpreted by many document processing applications • we need a compact representation of the content • which are the meaningful units of text?

Terms • Words • typical choice • set of words, bag of words • phrases • syntactical phrases (e.g. noun phrases) • statistical phrases (e.g. frequent pairs of words) • usefulness not yet known?

Terms • Part of the text is not considered as terms • very common words (function words): • articles (a, the) , prepositions (of, in), conjunctions (and, or), adverbs (here, then) • numerals (30.9.2002, 2547) • these words can be removed • stopword list • other preprocessing possible • stemming (recognization -> recogn), base words (skies -> sky)
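The stopword removal described above can be sketched in a few lines of Python. This is only an illustration: the stopword set is a tiny hypothetical list built from the examples on the slide, the tokenizer keeps alphabetic tokens only (so numerals are dropped as well), and stemming is left out.

```python
import re

# Hypothetical, tiny stopword list taken from the slide's examples;
# real stopword lists are much longer.
STOPWORDS = {"a", "the", "of", "in", "and", "or", "here", "then"}

def extract_terms(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

terms = extract_terms(
    "The project started in 2002 and the application of the results is here."
)
print(terms)
```

Words such as "is" survive here only because the illustrative list is so short.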

Vector model • A document is usually represented as a vector of term weights • the vector has as many dimensions as there are terms in the whole collection of documents • the weight represents how much the term contributes to the semantics of the document

Vector model • in our sample document collection, there are 118 words (terms) • in alphabetical order, the list of terms starts with: • absorption • agriculture • anaemia • analyse • application • …

Vector model • each document can be represented by a vector of 118 dimensions • we can think of a document vector as an array of 118 elements, one for each term, indexed, e.g. 0-117

Vector model • let d1 be the vector for document 1 • record only which terms occur in document: • d1[0] = 0 -- absorption doesn’t occur • d1[1] = 0 -- agriculture -”- • d1[2] = 0 -- anaemia -”- • d1[3] = 0 -- analyse -”- • d1[4] = 1 -- application occurs • ... • d1[21] = 1 -- current occurs • …
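A minimal sketch of how such a binary vector could be built. Only the first five terms of the sample vocabulary are used, and the term list for the document is hypothetical.

```python
def binary_vector(vocabulary, document_terms):
    """Return a 0/1 list with one entry per vocabulary term."""
    present = set(document_terms)
    return [1 if term in present else 0 for term in vocabulary]

# First five terms of the sample collection, in alphabetical order.
vocabulary = ["absorption", "agriculture", "anaemia", "analyse", "application"]

# Hypothetical term list for document 1; terms outside the vocabulary are ignored.
d1 = binary_vector(vocabulary, ["application", "current", "application"])
print(d1)
```

Repeated occurrences make no difference here, since only presence is recorded.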

Weighting terms • Usually we want to say that some terms are more important than the others -> weighting • weights usually range between 0 and 1 • binary weights may be used • 1 denotes presence, 0 absence of the term in the document

Weighting terms • if a word occurs many times in a document, it may be more important • but what about very frequent words? • often the TF*IDF function is used • higher weight, if the term occurs often in the document • lower weight, if the term occurs in many documents

Weighting terms: TF*IDF • TF*IDF = term frequency * inverse document frequency • weight of term $t_k$ in document $d_j$: $\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$ • where • $\#(t_k, d_j)$: the number of times $t_k$ occurs in $d_j$ • $\#_{Tr}(t_k)$: the number of documents in $Tr$ in which $t_k$ occurs • $Tr$: the documents in the collection

Weighting terms: TF*IDF • in document 1: • term ’application’ occurs once, and in the whole collection (10 documents) it occurs in 2 documents: • tfidf(application, d1) = 1 * log(10/2) = log 5 ~ 0.7 (base-10 logarithm) • term ’current’ occurs once, and in the whole collection it occurs in 9 documents: • tfidf(current, d1) = 1 * log(10/9) ~ 0.05

Weighting terms: TF*IDF • if there were some word that occurs 7 times in doc 1 and only in doc 1, the TF*IDF weight would be: • tfidf(doc1word, d1) = 7 * log(10/1) = 7
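The three worked examples can be checked with a short Python sketch. The function name and parameter names are illustrative, and the base-10 logarithm is an assumption inferred from the numeric results on the slides (log 5 ~ 0.7).

```python
import math

def tfidf(term_count, n_docs, doc_freq):
    """TF*IDF weight: occurrences in the document times log10(N / document frequency)."""
    return term_count * math.log10(n_docs / doc_freq)

# Sample collection of 10 documents, as in the slides.
print(round(tfidf(1, 10, 2), 2))   # 'application': once in d1, in 2 documents
print(round(tfidf(1, 10, 9), 2))   # 'current': once in d1, in 9 documents
print(tfidf(7, 10, 1))             # a word occurring 7 times, only in d1
```

A term that occurs in every document gets the weight 0, because log(10/10) = 0.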

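Weighting terms: normalization • in order for the weights to fall in the [0,1] interval, the weights are often normalized by cosine normalization (T is the set of terms): $w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \mathrm{tfidf}(t_s, d_j)^2}}$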
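A minimal sketch of the normalization step, assuming the cosine normalization commonly used with TF*IDF weights: each weight is divided by the Euclidean length of the document's weight vector. The function name and the example weights are hypothetical.

```python
import math

def cosine_normalize(weights):
    """Divide every weight by the Euclidean length of the vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        # A document with no weighted terms is left unchanged.
        return list(weights)
    return [w / norm for w in weights]

normalized = cosine_normalize([3.0, 4.0, 0.0])
print(normalized)
```

For non-negative weights such as TF*IDF values, every normalized weight lies in [0,1] and the normalized vector has length 1.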

Effect of structure • Either the full text of the document or selected parts of it are indexed • e.g. in a patent categorization application • title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention • some parts of a document may be considered more important • e.g. higher weight for the terms in the title

Term selection • a large document collection may contain millions of words -> document vectors would contain millions of dimensions • many algorithms cannot handle high dimensionality of the term space (= large number of terms) • usually only a subset of the terms is used • how to select the terms that are used? • term selection (often called feature selection or dimensionality reduction) methods

Term selection • Goal: select terms that yield the highest effectiveness in the given application • wrapper approach • the reduced set of terms is found iteratively and tested with the application • filtering approach • keep the terms that receive the highest score according to a function that measures the ”importance” of the term for the task

Term selection • Many functions available • document frequency: keep the high frequency terms • stopwords have been already removed • 50% of the words occur only once in the document collection • e.g. remove all terms occurring in at most 3 documents

Term selection functions: document frequency • document frequency is the number of documents in which a term occurs • in our sample, the ranking of terms: • 9 current • 7 project • 4 environment • 3 nuclear • 2 application • 2 area … 2 water • 1 use …

Term selection functions: document frequency • we might now set the threshold to 2 and remove all the words that occur only once • result: 29 words of 118 words (~25%) selected
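Document-frequency thresholding can be sketched as follows. The three small documents are hypothetical and are not the course's sample collection; the function names are illustrative.

```python
from collections import Counter

def document_frequencies(documents):
    """Count, for each term, the number of documents in which it occurs."""
    df = Counter()
    for terms in documents:
        df.update(set(terms))  # each document counts a term at most once
    return df

def select_terms(documents, min_df):
    """Keep the terms that occur in at least min_df documents."""
    df = document_frequencies(documents)
    return sorted(term for term, n in df.items() if n >= min_df)

# Hypothetical documents, already tokenized and with stopwords removed.
docs = [
    ["current", "project", "water"],
    ["current", "project", "nuclear"],
    ["current", "use", "water", "water"],
]
selected = select_terms(docs, min_df=2)
print(selected)
```

With the threshold set to 2, the terms that occur in only one document ("nuclear" and "use") are removed.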

Term selection: other functions • Information-theoretic term selection functions, e.g. • chi-square • information gain • mutual information • odds ratio • relevancy score

3. Text categorization • Text classification, topic classification/spotting/detection • problem setting: • assume: a predefined set of categories, a set of documents • label each document with one (or more) categories

Text categorization • for instance • Categorizing newspaper articles based on the topic area, e.g. into the following 17 “IPTC” categories: • Arts, culture and entertainment • Crime, law and justice • Disaster and accident • Economy, business and finance • Education • Environmental issue • Health • …

Text categorization • categorization can be hierarchical • Arts, culture and entertainment • archaeology • architecture • bullfighting • festive event (including carnival) • cinema • dance • fashion • ...

Text categorization • ”Bullfighting as we know it today, started in the village squares, and became formalised, with the building of the bullring in Ronda in the late 18th century. From that time,...” • class: • Arts, culture and entertainment • Bullfighting • or both?

Text categorization • Another example: filtering spam • ”Subject: Congratulation! You are selected! • It’s Totally FREE! EMAIL LIST MANAGING SOFTWARE! EMAIL ADDRESSES RETRIEVER from web! GREATEST FREE STUFF!” • two classes only: Spam and Not-spam

Text categorization • Two major approaches: • knowledge engineering -> end of 80’s • manually defined set of rules encoding expert knowledge on how to classify documents under the given categories • machine learning, 90’s -> • an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories

Text categorization • Let • D: a domain of documents • C = {c1, …, c|C|} : a set of predefined categories • T = true, F = false • The task is to approximate the unknown target function Φ’: D x C -> {T,F} by means of a function Φ: D x C -> {T,F}, such that the functions ”coincide as much as possible” • function Φ’: how documents should be classified • function Φ: classifier (hypothesis, model…)

We assume... • Categories are just symbolic labels • no additional knowledge of their meaning is available • No knowledge outside of the documents is available • all decisions have to be made on the basis of the knowledge extracted from the documents • metadata, e.g., publication date, document type, source etc. is not used

-> general methods • Methods do not depend on any application-dependent knowledge • but: in operational applications all kinds of knowledge can be used (e.g. in spam filtering) • content-based decisions are necessarily subjective • it is often difficult to measure the effectiveness of the classifiers • even human classifiers do not always agree

Single-label, multi-label TC • Single-label text categorization • exactly 1 category must be assigned to each dj ∈ D • Multi-label text categorization • any number of categories may be assigned to the same dj ∈ D • Special case of single-label: binary • each dj must be assigned either to category ci or to its complement ¬ci

Single-label, multi-label TC • The binary case (and, hence, the single-label case) is more general than the multi-label • an algorithm for binary classification can also be used for multi-label classification • the converse is not true

Single-label, multi-label TC • in the following, we will use the binary case only: • classification under a set of categories C = set of |C| independent problems of classifying the documents in D under a given category ci, for i = 1, ..., |C|
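The decomposition into |C| independent binary problems can be sketched as below. The keyword-based binary classifiers are purely hypothetical stand-ins; any function that answers true or false for one category could take their place.

```python
def classify_multilabel(document_terms, binary_classifiers):
    """Run one independent binary decision per category and collect the accepted ones."""
    return {
        category
        for category, decide in binary_classifiers.items()
        if decide(document_terms)
    }

# Hypothetical binary classifiers, one per category.
classifiers = {
    "Health": lambda terms: "anaemia" in terms,
    "Environmental issue": lambda terms: "water" in terms or "environment" in terms,
    "Education": lambda terms: "school" in terms,
}

labels = classify_multilabel(["anaemia", "water", "project"], classifiers)
print(labels)
```

Because each decision is made independently, a document may receive several categories or none at all.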

Hard-categorization vs. ranking categorization • Hard categorization • the classifier answers T or F • Ranking categorization • given a document, the classifier might rank the categories according to their estimated appropriateness to the document • respectively, given a category, the classifier might rank the documents

Machine learning approach • A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert • from these characteristics the learner extracts the characteristics that a new unseen document should have in order to be classified under ci • supervised learning (= supervised by the knowledge of the training documents)

Machine learning approach • The learner is domain independent • usually available ’off-the-shelf’ • the inductive process is easily repeated, if the set of categories changes • manually classified documents often already available • manual process may exist • if not, it is still easier to manually classify a set of documents than to build and tune a set of rules

Training set, test set, validation set • Initial corpus of manually classified documents • let dj belong to the initial corpus • for each pair <dj, ci> it is known if dj should be filed under ci • positive examples, negative examples of a category

Training set, test set, validation set • The initial corpus is divided into two sets • a training (and validation) set • a test set • the training set is used to build the classifier • the test set is used for testing the effectiveness of the classifier • each document is fed to the classifier and the decision is compared to the manual category

Training set, test set, validation set • The documents in the test set are not used in the construction of the classifier • alternative: k-fold cross-validation • k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs, where k-1 sets construct a training set and 1 set is used as a test set • individual results are then averaged
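A minimal sketch of k-fold cross-validation in plain Python. The fold assignment (every k-th item) and the `train_and_score` callback are assumptions made for illustration; the callback stands for whatever learner and effectiveness measure are in use.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs: each of the k disjoint folds is the test set once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(items, k, train_and_score):
    """Average the scores returned by train_and_score(train, test) over the k folds."""
    scores = [train_and_score(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

corpus = list(range(10))  # stand-in for 10 manually classified documents
for train, test in k_fold_splits(corpus, 5):
    print(len(train), len(test))
```

In practice the corpus would usually be shuffled before it is partitioned, so that the folds do not reflect the original document order.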

Training set, test set, validation set • Training set can be split into two parts • one part (the validation set) is used for optimising parameters • test which values of parameters yield the best effectiveness • test set and validation set must be kept separate

Inductive construction of classifiers • A ranking classifier for a category ci • definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1 • documents are ranked according to their categorization status value

Inductive construction of classifiers • A hard classifier for a category • definition of a function that returns true or false, or • definition of a function that returns a value between 0 and 1, followed by a definition of a threshold • if the value is higher than the threshold -> true • otherwise -> false
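The second construction, a value between 0 and 1 followed by a threshold, can be sketched as below. The scoring function, the word list, and the threshold value are hypothetical and chosen only to make the example run.

```python
def make_hard_classifier(csv_function, threshold):
    """Turn a categorization status value function into a true/false classifier."""
    def classify(document):
        return csv_function(document) > threshold
    return classify

# Hypothetical scoring function: the share of a document's terms found in a small word list.
SPAM_WORDS = {"free", "congratulation", "selected"}

def spam_score(terms):
    if not terms:
        return 0.0
    return sum(1 for t in terms if t in SPAM_WORDS) / len(terms)

is_spam = make_hard_classifier(spam_score, threshold=0.5)
print(is_spam(["free", "free", "software"]))
print(is_spam(["project", "free", "water", "use"]))
```

The same scoring function could also serve as a ranking classifier if the threshold step is left out.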

Learners • probabilistic classifiers (Naïve Bayes) • decision tree classifiers • decision rule classifiers • regression methods • on-line methods • neural networks • example-based classifiers (k-NN) • support vector machines

Rocchio method • learner • for each category, an explicit profile (or prototypical document) is constructed • the same representation as for the documents • benefit: profile is understandable even for humans
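One common way to build such a profile is sketched below. The formula is an assumption, since the slide does not give one: each profile weight is beta times the average weight of the term in the positive examples minus gamma times its average weight in the negative examples. The parameter values and the tiny vectors are hypothetical.

```python
def rocchio_profile(positive, negative, beta=1.0, gamma=0.5):
    """Build a category profile from positive and negative example vectors.

    Assumed formula: beta * mean(positive) - gamma * mean(negative), per term.
    """
    n_terms = len(positive[0])

    def mean(vectors, k):
        return sum(v[k] for v in vectors) / len(vectors) if vectors else 0.0

    return [
        beta * mean(positive, k) - gamma * mean(negative, k)
        for k in range(n_terms)
    ]

# Hypothetical two-term document vectors.
profile = rocchio_profile(
    positive=[[1.0, 0.0], [1.0, 1.0]],
    negative=[[0.0, 1.0]],
)
print(profile)
```

The profile has one weight per term, just like a document vector, which is why it can be read as a prototypical document of the category.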
