Lecture #32 WWW Search
Review: Data Organization • Kinds of things to organize • Menu items • Text • Images • Sound • Videos • Records (e.g., a person’s name, address, & phone number, or a car’s year, make, & model)
Review: Data Organization • Three ways to find things: • Lists (in-order search, binary search) • Trees (balance number of branches with time to decide which is correct branch) • Search
Search issues • How do we say what we want? • I want a story about pigs • I want a picture of a rooster • How many televisions were sold in Vietnam during 2000? • Find a movie like this one • How does the computer find what we said?
Things to search for • Records • Text • Images • Audio • Video
Records • Car • Price • Miles • Year • Make • Doors • Queries • Price < 6000 & Miles<100000 • Make == Toyota & Year > 1993
Queries • Make == Toyota & Year >1993
Queries • Year >1993 or Price < $3,000
Databases • Large collections of records • Accessed by queries
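The two example queries can be sketched as filters over a list of records. This is only an illustration: the `Car` class, the sample cars, and the `query` helper are made up here, and a real database would use its own query language rather than Python predicates.

```python
from dataclasses import dataclass


@dataclass
class Car:
    make: str
    year: int
    price: int
    miles: int
    doors: int


# Made-up sample records, for illustration only.
cars = [
    Car("Toyota", 1995, 5500, 90000, 4),
    Car("Toyota", 1990, 2500, 150000, 2),
    Car("Ford", 1998, 7000, 60000, 4),
    Car("Honda", 1992, 4000, 120000, 4),
]


def query(records, predicate):
    """Return every record for which the predicate is true."""
    return [r for r in records if predicate(r)]


# Make == Toyota & Year > 1993
toyota_after_1993 = query(cars, lambda c: c.make == "Toyota" and c.year > 1993)

# Year > 1993 or Price < $3,000
newer_or_cheap = query(cars, lambda c: c.year > 1993 or c.price < 3000)
```

With "&" both conditions must hold, so only one Toyota matches; with "or" either condition is enough, so more records come back.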
Things to search for • Records • Text • Images • Audio • Video
Text searching • How do I say what I want? • Type some phrase • I want a story about pigs • How will the computer match this? • What is text? • An array of characters • What can a computer do with text? • Match characters
Text searching • People think in words, not characters • How do I convert an array of characters into an array of words? • Collect together sequences of letters • How do I know if character C is a letter? • (C >= “a” & C <= “z”) | (C >= “A” & C <= “Z”)
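A minimal sketch of that conversion, assuming only the unaccented letters a-z and A-Z count as letters, as in the test above. The names `is_letter` and `to_words` are chosen here for illustration.

```python
def is_letter(c):
    """The slide's test: C is between a and z, or between A and Z."""
    return ("a" <= c <= "z") or ("A" <= c <= "Z")


def to_words(text):
    """Collect runs of consecutive letters into a list of words."""
    words = []
    current = ""
    for c in text:
        if is_letter(c):
            current += c
        elif current:
            # A non-letter ends the word we were collecting.
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


words = to_words("I want a story about pigs.")
```

Anything that is not a letter (spaces, punctuation, digits) simply ends the current word, so the final period is dropped.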
Convert to words • Because people think in words
Every document is an array of words • I want a story about pigs • How will I find the right documents? • Find all documents that have the word “pigs”
Searching text • How will I find pigs fast? • Create an index of all words • With each word store the name or address of each document that contains that word • Search the index for “pigs” • Return the list of documents • Use a binary search on the word list (50,000 words)
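The index described above can be sketched as a sorted word list plus, for each word, the set of documents containing it. The document names and contents below are invented, and Python's standard `bisect` module stands in for the binary search.

```python
from bisect import bisect_left

# Made-up documents, already converted to arrays of words.
docs = {
    "farm.txt": ["pigs", "eat", "corn"],
    "city.txt": ["cars", "and", "buses"],
    "fable.txt": ["three", "little", "pigs"],
}


def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = {}
    for name, words in docs.items():
        for w in words:
            index.setdefault(w, set()).add(name)
    word_list = sorted(index)  # sorted so it can be binary searched
    return word_list, index


def lookup(word_list, index, word):
    """Binary search the word list, then return the matching documents."""
    i = bisect_left(word_list, word)
    if i < len(word_list) and word_list[i] == word:
        return sorted(index[word])
    return []


word_list, index = build_index(docs)
pig_docs = lookup(word_list, index, "pigs")
```

A binary search over 50,000 words needs about 16 comparisons, since 2^16 = 65,536 is the first power of two above 50,000.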
Problems • What if a document has the word “Pig” but not “pigs”? • Normalize • Case - make all words lower case • Pig -> pig • Stemming - remove all suffixes and prefixes before putting a word into the index • pigs -> pig • piggy -> pig
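A toy normalizer that handles the three examples above. The suffix list and the doubled-consonant rule are assumptions made for this sketch and will mangle many English words; real stemmers, such as the Porter stemmer, use far more careful rules.

```python
def stem(word):
    """Lower-case a word and strip one common suffix (toy rules only)."""
    w = word.lower()
    for suffix in ("ing", "es", "s", "y"):
        # Only strip if at least three letters remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Collapse a doubled final consonant: "pigg" -> "pig".
    if len(w) >= 4 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]
    return w


normalized = [stem(w) for w in ["Pig", "pigs", "piggy"]]
```

The same function has to be applied both when building the index and when reading the query, so that both sides agree on "pig".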
Problems • I want a story about pigs? • How does the computer know to search for pigs? • It doesn’t • How does the computer know what a story is? • It doesn’t
Searching • I want a story about pigs • Pick out the important words and search for them • Which words are important? • D = number of times a word appears in a document • A = average number of times a word appears in all documents • Importance = D/A • Why?
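One way to read Importance = D/A, sketched on made-up documents. This assumes A means the word's total count across the collection divided by the number of documents; the slide does not pin that down.

```python
from collections import Counter

# Made-up documents, already converted to arrays of words.
docs = {
    "a": ["the", "pigs", "and", "the", "pigs"],
    "b": ["the", "cars"],
    "c": ["the", "the", "boats"],
}


def importance(docs, target):
    """Score each word in one document as D / A."""
    counts = {name: Counter(words) for name, words in docs.items()}
    scores = {}
    for word, d in counts[target].items():
        total = sum(c[word] for c in counts.values())
        a = total / len(docs)  # average occurrences per document
        scores[word] = d / a
    return scores


scores = importance(docs, "a")
```

Here "the" and "pigs" both appear twice in document "a", but "the" is common everywhere, so it scores 1.2 while "pigs" scores 3.0. That suggests an answer to "Why?": a word matters when it is unusually frequent in this document compared with documents in general.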
How do we create an index of all documents on the Web?
• Try = a list of URLs
• Seen = all URLs you have seen
While (Try is not empty) {
    Page = take a URL from Try
    Words = all the “important” words in Page
    add Page to the index using all of Words
    Links = all URLs in Page
    for every Link that is not in Seen
        add Link to Try and to Seen
}
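The crawl loop can be sketched in Python against a tiny in-memory stand-in for the Web, so that it runs without a network. `FAKE_WEB`, its URLs, and the `fetch` argument are hypothetical; a real crawler would download each page and parse out its words and links, and the step of choosing only the "important" words is skipped here.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (words on page, links on page).
FAKE_WEB = {
    "a.example": (["pigs", "farm"], ["b.example", "c.example"]),
    "b.example": (["rooster"], ["a.example"]),
    "c.example": (["pigs", "story"], ["b.example"]),
}


def crawl(start_urls, fetch):
    """Visit every reachable page once and index its words."""
    to_try = deque(start_urls)  # "Try" in the pseudocode
    seen = set(start_urls)      # "Seen" in the pseudocode
    index = {}
    while to_try:
        page = to_try.popleft()
        words, links = fetch(page)
        for w in words:
            index.setdefault(w, set()).add(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                to_try.append(link)
    return index


index = crawl(["a.example"], lambda url: FAKE_WEB[url])
```

The Seen set is what keeps the loop from running forever: pages a and b link to each other, but each is added to Try only once.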
Other ways to find important words and important documents • A Document is important if many other documents point to it • A word is important in document D if that word occurs frequently in documents that link to document D.
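The first idea can be sketched by counting how many other documents link to each one. The link graph below is invented, and plain in-link counting is a deliberately simple stand-in; production search engines weight links in more elaborate ways.

```python
from collections import Counter

# Made-up link graph: document -> documents it points to.
links = {
    "a": ["c"],
    "b": ["c", "a"],
    "c": ["a"],
    "d": ["c"],
}


def inlink_counts(links):
    """Count, for each document, how many other documents point to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each source at most once
            if target != source:     # ignore links from a page to itself
                counts[target] += 1
    return counts


counts = inlink_counts(links)
most_important = counts.most_common(1)[0][0]
```

Document "c" has three documents pointing to it, so it comes out as the most important in this toy graph.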
Images • What will I say when searching for an image? • I want a rooster picture • Draw a picture of a rooster?
Search by picture? Is this possible? If so, how?
What’s in a picture? • Computers don’t understand the contents of images • To a computer an image is a bunch of colored pixels
I want a picture of a rooster • Label all of the pictures • How does Google Images do it? • File name of the picture “rooster-crossingSt.jpg” • Words around the picture in the HTML • Use “Safe Search” and set filters appropriately (http://www.youtube.com/watch?v=maWx-ApkBCs)
Audio • Talking • Use speech recognition to convert audio to text • With each recognized word keep track of where in the audio it was recognized. • Build an index using the recognized text • Normalize based on how words sound rather than are spelled.
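A sketch of an index that remembers where each word was heard. The `recognized` list is a made-up example of what a speech recognizer might produce; no actual recognition happens here, and sound-based normalization is reduced to lower-casing.

```python
# Hypothetical recognizer output: (word, seconds from the start of the audio).
recognized = [("Play", 12.0), ("it", 12.4), ("Sam", 12.9), ("play", 47.5)]


def build_time_index(recognized):
    """Map each recognized word to the list of times it was heard."""
    index = {}
    for word, seconds in recognized:
        index.setdefault(word.lower(), []).append(seconds)
    return index


time_index = build_time_index(recognized)
play_times = time_index.get("play", [])
```

Searching for "play" then returns the moments in the recording to jump to, rather than just the name of the file.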
Video • Where in “Casablanca” does Bogart say “Play it again Sam”? • He never does; he just says “Play it” • How can the computer find that? • Transcribe the audio • Speech recognition on the audio
Video • Does Woody ever kiss Bo Peep? • Exactly what color is a kiss?
Video • Does Woody ever kiss Bo Peep? • Annotate every frame with who is in the frame and search for frames with both Woody and Bo Peep.
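Once frames are annotated, the search itself is a simple set test. The frame numbers and annotations below are invented for illustration; producing the annotations is the hard part and is not shown.

```python
# Made-up annotations: frame number -> characters visible in that frame.
frames = {
    0: {"Woody"},
    1: {"Woody", "Buzz"},
    2: {"Woody", "Bo Peep"},
    3: {"Bo Peep"},
}


def frames_with(frames, *characters):
    """Return the frame numbers in which all named characters appear."""
    wanted = set(characters)
    return sorted(n for n, who in frames.items() if wanted <= who)


candidates = frames_with(frames, "Woody", "Bo Peep")
```

This only narrows the search to frames where both characters appear; a person still has to look at those frames to see whether there is a kiss.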
Search • Records • Queries • < > = And Or • Text • Normalized words (case, stemming, thesaurus) • Images • Add words • Audio • Transcribe or recognize as words • Video • Transcribe • Annotate
“Re-Search” Directions in Image Recognition, Search and Retrieval
Face Detection in Commercial Digital Cameras • Face detection: Viola & Jones • Train on • 1000s of faces • Millions of non-faces
Face Recognition (Eigenfaces [Turk and Pentland 1991]) • Project image into higher-dimensional space: an N × N image of pixel values becomes a single point in N²-dimensional space • “Recognize” by grouping unknown image with closest training example
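The "closest training example" step can be sketched as a nearest-neighbor search over pixel vectors. The names and the tiny 4-pixel "images" are made up, and this skips the step that gives eigenfaces their name (computing principal components of the training faces and comparing in that reduced space).

```python
import math

# Made-up training images, each flattened to a vector of pixel values.
training = {
    "alice": [10, 200, 30, 90],
    "bob": [220, 40, 130, 50],
}


def distance(a, b):
    """Euclidean distance between two equal-length pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(unknown, training):
    """Return the label of the training image closest to the unknown one."""
    return min(training, key=lambda name: distance(unknown, training[name]))


who = recognize([15, 190, 35, 80], training)
```

The unknown image is labeled with whichever training face it sits nearest to, even if that face is not actually a good match; a practical system would also reject matches that are too far away.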
Face Recognition (Picasa - Google) • Image search/organization • Automatically finds, crops and groups images of the same person from a collection of photos • Allows user feedback (trainable) - user can indicate if it found the wrong person.
Face/Object Recognition/Search: Feature-Based Technology • Bag of “words”* • Extract features from the object • Create visual “words” from image features *Li Fei-Fei (Princeton)
Face/Object Recognition/Search: Feature-Based Technology • Do this for multiple objects *Li Fei-Fei (Princeton)
Face/Object Recognition/Search: Bag of Words • How to get matching images/documents? • Use “word” frequencies: “word” frequency = $n_{id} / n_d$, where $n_{id}$ = # times word i occurs in document d and $n_d$ = total # words in document d • Then combine word frequency with inverse document frequency weighting to downweight words that occur frequently (D = # of occurrences; A = average # of occurrences)
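A sketch of the weighting, treating each image as a document of visual "words". The slide does not spell out the inverse document frequency term, so the common form log(N / n_i) is assumed here, where N is the number of documents and n_i is the number of documents containing word i. The image names and words are invented.

```python
import math
from collections import Counter

# Made-up images, each described by the visual "words" found in it.
docs = {
    "img1": ["eye", "eye", "nose", "wheel"],
    "img2": ["wheel", "wheel", "door"],
    "img3": ["eye", "wheel"],
}


def tf_idf(docs):
    """Weight each word in each document by frequency times log(N / n_i)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each document once per word
    weights = {}
    for name, words in docs.items():
        counts = Counter(words)
        n_d = len(words)  # total number of words in this document
        weights[name] = {
            w: (n_id / n_d) * math.log(n_docs / doc_freq[w])
            for w, n_id in counts.items()
        }
    return weights


weights = tf_idf(docs)
```

The word "wheel" occurs in every image, so its weight is zero everywhere: it cannot help tell the images apart. This is the same intuition as Importance = D/A from the text-search slides.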
Face/Object Recognition/Search: Feature-Based Technology • Drop word features through a “vocabulary tree” to classify *Li Fei-Fei (Princeton)