Explore types of data organization, search methods, text indexing, word normalization, indexing images, audio, and video for effective web searches. Learn how computers interpret and retrieve information based on user queries.
Lecture #32 WWW Search
Review: Data Organization • Kinds of things to organize • Menu items • Text • Images • Sound • Videos • Records (i.e., a person’s name, address, & phone number, or a car’s year, make, & model)
Review: Data Organization • Three ways to find things: • Lists (in-order search, binary search) • Trees (balance number of branches with time to decide which is correct branch) • Search
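As a reminder of how searching a sorted list works, here is a minimal binary search sketch. The `menu` list and the function name are made up for illustration; any sorted list would do.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2          # look at the middle item
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1                # target can only be in the upper half
        else:
            high = mid - 1               # target can only be in the lower half
    return -1

# Hypothetical sorted list of menu items.
menu = ["copy", "cut", "open", "paste", "print", "save"]
print(binary_search(menu, "paste"))
```

Each step discards half of the remaining items, which is why a sorted list can be searched much faster than by checking items in order.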
Search issues • How do we say what we want? • I want a story about pigs • I want a picture of a rooster • How many televisions were sold in Vietnam during 2000? • Find a movie like this one • How does the computer find what we said?
Things to search for • Records • Text • Images • Audio • Video
Records • Car • Price • Miles • Year • Make • Doors • Queries • Price < 6000 & Miles<100000 • Make == Toyota & Year > 1993
Queries • Make == Toyota & Year >1993
Queries • Year >1993 or Price < $3,000
Databases • Large collections of records • Accessed by queries
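The record queries above can be sketched in Python by treating each car as a dictionary and each query as a true/false test. The sample cars and the `query` helper are invented for illustration; a real database would use its own query language.

```python
# Hypothetical car records; the field names follow the slide.
cars = [
    {"make": "Toyota", "year": 1995, "price": 5500, "miles": 90000, "doors": 4},
    {"make": "Honda", "year": 1992, "price": 2800, "miles": 140000, "doors": 2},
    {"make": "Toyota", "year": 1990, "price": 7000, "miles": 60000, "doors": 4},
]

def query(records, predicate):
    """Return every record for which predicate(record) is true."""
    return [r for r in records if predicate(r)]

# Price < 6000 & Miles < 100000
cheap_low_miles = query(cars, lambda c: c["price"] < 6000 and c["miles"] < 100000)
# Make == Toyota & Year > 1993
newer_toyotas = query(cars, lambda c: c["make"] == "Toyota" and c["year"] > 1993)
# Year > 1993 or Price < $3,000
newer_or_cheap = query(cars, lambda c: c["year"] > 1993 or c["price"] < 3000)

print(len(cheap_low_miles), len(newer_toyotas), len(newer_or_cheap))
```

Note that "and" narrows the result while "or" widens it: the last query matches both the 1995 Toyota and the cheap 1992 Honda.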
Things to search for • Records • Text • Images • Audio • Video
Text searching • How do I say what I want? • Type some phrase • I want a story about pigs • How will the computer match this? • What is text? • An array of characters • What can a computer do with text? • Match characters
Text searching • People think in words not characters • How do I convert an array of characters into an array of words? • Collect together sequences of letters • How do I know if character C is a letter? • (C >= “a” & C <= “z”) | (C >= “A” & C <= “Z”)
Convert to words • Because people think in words
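Collecting sequences of letters into words might look like the sketch below. The function names are illustrative, and the letter test only covers unaccented English letters, as on the slide.

```python
def is_letter(c):
    """The letter test from the slide: a-z or A-Z."""
    return ("a" <= c <= "z") or ("A" <= c <= "Z")

def to_words(text):
    """Turn an array of characters into an array of words."""
    words, current = [], []
    for c in text:
        if is_letter(c):
            current.append(c)                # still inside a word
        elif current:
            words.append("".join(current))   # a non-letter ends the word
            current = []
    if current:                              # text ended in the middle of a word
        words.append("".join(current))
    return words

print(to_words("I want a story about pigs."))
```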
Every document is an array of words • I want a story about pigs • How will I find the right documents? • Find all documents that have the word “pigs”
Searching text • How will I find pigs fast? • Create an index of all words • With each word store the name or address of each document that contains that word • Search the index for “pigs” • Return the list of documents • Use a binary search on the word list (50,000 words)
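One way to picture the index is a sorted word list with a list of documents stored beside each word. This is a sketch under assumptions: the three tiny documents and the helper names are invented, and `bisect_left` from Python's standard library performs the binary search.

```python
from bisect import bisect_left

# Hypothetical documents, keyed by name.
documents = {
    "farm.txt": "three little pigs built houses",
    "barn.txt": "the rooster woke the pigs",
    "city.txt": "cars and televisions",
}

def build_index(docs):
    """Return (sorted word list, parallel list of document names per word)."""
    index = {}
    for name, text in docs.items():
        for word in set(text.split()):       # count each document once per word
            index.setdefault(word, []).append(name)
    words = sorted(index)
    return words, [sorted(index[w]) for w in words]

def lookup(words, postings, word):
    """Binary-search the word list and return the matching documents."""
    i = bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return postings[i]
    return []

words, postings = build_index(documents)
print(lookup(words, postings, "pigs"))
```

With 50,000 words, a binary search needs about 16 comparisons, since 2 to the 16th power is 65,536.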
Problems • What if a document has the word “Pig” but not “pigs”? • Normalize • Case - make all words lower case • Pig -> pig • Stemming - remove all suffixes and prefixes before putting a word into the index • pigs -> pig • piggy -> pig
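A toy version of normalization is sketched below. The suffix list is a made-up example chosen to handle the words on the slide; real stemmers use far more careful rules, and this sketch does not remove prefixes.

```python
# Hypothetical suffix list, just enough for the slide's examples.
SUFFIXES = ("gy", "s")

def normalize(word):
    """Lower-case a word, then strip one known suffix if at least 3 letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Pig", "pigs", "piggy"]:
    print(w, "->", normalize(w))
```

The same `normalize` step has to be applied both when building the index and when reading the user's query, or the two will not match.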
Problems • I want a story about pigs? • How does the computer know to search for pigs? • It doesn’t • How does the computer know what a story is? • It doesn’t
Searching • I want a story about pigs • Pick out the important words and search for them • Which words are important? • D = number of times a word appears in a document • A = average number of times a word appears in all documents • Importance = D/A • Why?
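The Importance = D/A rule can be tried on a few tiny documents. The documents below are invented, and each one is already split into words.

```python
def importance(word, doc_words, all_docs):
    """D / A: count in this document divided by the average count per document."""
    d = doc_words.count(word)
    a = sum(doc.count(word) for doc in all_docs) / len(all_docs)
    return d / a if a else 0.0

# Hypothetical documents, already converted to arrays of words.
docs = [
    ["the", "pigs", "ate", "the", "corn"],
    ["the", "rooster", "saw", "the", "barn"],
    ["the", "pigs", "and", "the", "pigs"],
]

print(importance("pigs", docs[2], docs))  # D = 2, A = 1
print(importance("the", docs[2], docs))   # D = 2, A = 2
```

This suggests an answer to "Why?": a word like "the" appears about equally often everywhere, so its ratio stays near 1, while a word that is unusually frequent in one document gets a higher score there.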
How do we create an index of all documents on the Web?
• Try = a list of URLs
• Seen = all URLs you have seen
While (Try is not empty) {
    Page = take a URL from Try
    Words = all the “important” words in Page
    add Page to the index using all of Words
    Links = all URLs in Page
    for every Link that is not in Seen
        add Link to Try and to Seen
}
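The loop above can be sketched in runnable form. To keep the example self-contained, a small dictionary stands in for the Web: the `.example` URLs and page contents are invented, and every word is indexed rather than only the "important" ones.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (page text, links on the page).
WEB = {
    "a.example": ("pigs and roosters", ["b.example", "c.example"]),
    "b.example": ("a story about pigs", ["a.example"]),
    "c.example": ("televisions sold", ["b.example", "d.example"]),
}

def crawl(start_urls, web):
    """Follow links from start_urls and return word -> set of URLs."""
    to_try = deque(start_urls)
    seen = set(start_urls)
    index = {}
    while to_try:
        url = to_try.popleft()
        if url not in web:                  # dead link: nothing to index
            continue
        text, links = web[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:            # only queue pages we have not seen
                seen.add(link)
                to_try.append(link)
    return index

index = crawl(["a.example"], WEB)
print(sorted(index["pigs"]))
```

The Seen set is what keeps the crawler from looping forever when pages link back to each other, as a.example and b.example do here.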
Other ways to find important words and important documents • A Document is important if many other documents point to it • A word is important in document D if that word occurs frequently in documents that link to document D.
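The simplest reading of "many other documents point to it" is to count inbound links, as sketched below. The link graph is invented, and real search engines use more refined measures than a raw count.

```python
def inbound_counts(links):
    """Count how many other pages link to each page."""
    counts = {page: 0 for page in links}
    for page, targets in links.items():
        for target in set(targets):          # ignore repeated links on one page
            if target != page:               # ignore links from a page to itself
                counts[target] = counts.get(target, 0) + 1
    return counts

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(inbound_counts(links))
```

Here page "c" scores highest because three other pages point to it.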
Images • What will I say when searching for an image? • I want a rooster picture • Draw a picture of a rooster?
Search by picture? Is this possible? If so, how?
What’s in a picture? • Computers don’t understand the contents of images • To a computer an image is a bunch of colored pixels
I want a picture of a rooster • Label all of the pictures • How does Google Images do it? • File name of the picture “rooster-crossingSt.jpg” • Words around the picture in the HTML • Use “Safe Search” and set filters appropriately (http://www.youtube.com/watch?v=maWx-ApkBCs)
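Labeling a picture from its file name and the words around it could be sketched as follows. This only illustrates the idea on the slide, not how any particular search engine does it; the nearby text is invented.

```python
import re

def image_labels(file_name, nearby_text):
    """Collect lower-case label words from a file name and surrounding text."""
    stem = file_name.rsplit(".", 1)[0]                    # drop the extension
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", stem)    # split "crossingSt"
    words = re.findall(r"[A-Za-z]+", spaced + " " + nearby_text)
    return {w.lower() for w in words}

labels = image_labels("rooster-crossingSt.jpg", "A rooster crosses Main Street")
print(sorted(labels))
```

Once a picture has labels like these, it can be put into the same kind of word index used for text.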
Audio • Talking • Use speech recognition to convert audio to text • With each recognized word keep track of where in the audio it was recognized. • Build an index using the recognized text • Normalize based on how words sound rather than how they are spelled.
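Assuming a speech recognizer has already produced a list of (word, time in seconds) pairs, an audio index might be built like this. The recognizer output below is invented, and sound-based normalization is left out.

```python
def build_audio_index(recognized):
    """Map each recognized word to the times (in seconds) where it was heard."""
    index = {}
    for word, seconds in recognized:
        index.setdefault(word.lower(), []).append(seconds)
    return index

# Hypothetical recognizer output: (word, time in seconds).
recognized = [("The", 0.4), ("rooster", 0.7), ("saw", 1.2), ("the", 1.5), ("pigs", 1.8)]
audio_index = build_audio_index(recognized)
print(audio_index["the"])
```

Storing the times means a search can jump straight to the point in the recording where the word was spoken.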
Video • Where in “Casablanca” does Bogart say “Play it again Sam”? • He never does; he just says “play it” • How can the computer find that? • Transcribe the audio • Speech recognition on the audio
Video • Does Woody ever kiss Bo Peep? • Exactly what color is a kiss?
Video • Does Woody ever kiss Bo Peep? • Annotate every frame with who is in the frame and search for frames with both Woody and Bo Peep.
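If every frame is annotated with the set of characters in it, the search becomes a set test. The annotations below are invented, and finding frames with both characters only narrows the search; it does not detect a kiss.

```python
def frames_with(annotations, *characters):
    """Return the numbers of the frames that contain all the given characters."""
    wanted = set(characters)
    return [i for i, present in enumerate(annotations) if wanted <= present]

# Hypothetical per-frame annotations: who appears in each frame.
frames = [{"Woody"}, {"Woody", "Buzz"}, {"Woody", "Bo Peep"}, {"Bo Peep"}]
print(frames_with(frames, "Woody", "Bo Peep"))
```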
Search • Records • Queries • < > = And Or • Text • Normalized words (case, stemming, thesaurus) • Images • Add words • Audio • Transcribe or recognize as words • Video • Transcribe • Annotate
“Re-Search” Directions in Image Recognition, Search and Retrieval
Face Detection in Commercial Digital Cameras • Train on • 1000’s of faces • Millions of non-faces • Face Detection – Viola & Jones
Face Recognition (Eigenfaces [Turk and Pentland 1991]) • Project image into higher-dimensional space: an N × N image becomes a point in an N²-dimensional space • [Figure: grid of pixel values for a sample image] • “Recognize” by grouping unknown image with closest training example
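The "closest training example" step can be sketched with plain lists of pixel values. This is only the nearest-neighbor part: the eigenface projection itself is skipped, and the 2 × 2 "images" and the names are invented.

```python
def flatten(image):
    """Turn an N x N grid of pixel values into one list of N*N numbers."""
    return [pixel for row in image for pixel in row]

def distance_sq(a, b):
    """Squared straight-line distance between two pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recognize(unknown, training):
    """Return the label of the training image closest to the unknown image."""
    vec = flatten(unknown)
    return min(training, key=lambda label: distance_sq(vec, flatten(training[label])))

# Hypothetical 2 x 2 training images, one per person.
training = {
    "alice": [[0, 71], [250, 68]],
    "bob": [[210, 44], [128, 53]],
}
print(recognize([[5, 70], [240, 60]], training))
```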
Face Recognition (Picasa - Google) • Image search/organization • Automatically finds, crops and groups images of the same person from a collection of photos • Allows user feedback (trainable) - user can indicate if it found the wrong person.
Face/Object Recognition/Search: Feature-Based Technology • Bag of “words”* • Object • Extract Features • Create visual “words” from image features. • *Li Fei-Fei (Princeton)
Face/Object Recognition/Search: Feature-Based Technology • Do this for multiple objects • *Li Fei-Fei (Princeton)
Face/Object Recognition/Search: Bag of Words • How to get matching images/documents? • Use “word” frequencies: $\mathrm{tf}_{id} = n_{id} / n_d$, where $n_{id}$ = # times word $i$ occurs in document $d$ and $n_d$ = total # words in document $d$ • Then combine word frequency with inverse document frequency weighting to downweight words that occur frequently across many documents (D = # of occurrences; A = average # of occurrences)
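Word frequency combined with inverse document frequency can be sketched as below. The logarithmic form of the inverse document frequency is a common choice and an assumption here, since the slide does not spell it out; the visual "words" are invented.

```python
import math

def tf(word, doc):
    """Word frequency: n_id / n_d."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: log(# documents / # documents with the word)."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Hypothetical bags of visual "words", one per image.
docs = [
    ["eye", "eye", "nose"],
    ["eye", "wheel", "wheel"],
    ["eye", "wing"],
]
print(tf_idf("eye", docs[0], docs))   # "eye" is in every image, so its weight is 0
print(tf_idf("nose", docs[0], docs))  # "nose" is only in this image
```

A "word" that shows up in every image says nothing about which image is a good match, so it is weighted down to zero.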
Face/Object Recognition/Search: Feature-Based Technology • Drop word features through a “vocabulary tree” to classify • *Li Fei-Fei (Princeton)