Internet-scale MM retrieval

RNDr. Jakub Lokoč, Ph.D.
Siret Research Group (www.siret.cz)
Department of SW Engineering
Faculty of Mathematics and Physics
Charles University in Prague
Bezpečnostní seminář BIG DATA, Policejní akademie ČR v Praze (Security Seminar BIG DATA, Police Academy of the Czech Republic in Prague)

What does "internet-scale" mean?
Statistics for 2011 (source: http://royal.pingdom.com)
• 2.1 billion: Internet users worldwide
• 3.146 billion: number of email accounts worldwide
• 800+ million: number of users on Facebook
• 555 million: number of websites (+300 million in 2011)
• 1 trillion: number of video playbacks on YouTube
• 48 hours: amount of video uploaded to YouTube every minute
• 100 billion: estimated number of photos on Facebook
• 4.5 million: number of photos uploaded to Flickr each day
MM data

Many problems to solve…

Searching huge MM collections
• Text-based techniques
  • Advantage: scalable retrieval by inverted files
  • Problem: missing or misleading annotations
• Content-based techniques
  • Advantage: no annotation needed, visual similarity
  • Problem: slow retrieval for complex similarity models
• Hybrid techniques
  • Text-based query + content-based reranking/exploration
  • Content-based query + text-based filtering
  • Adapting content-based data for inverted files
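The first hybrid variant (text-based query + content-based reranking) can be sketched as follows. This is only an illustration: the collection, the toy 3-bin histograms and the helper names `text_filter` and `rerank` are assumptions, not part of any system mentioned in the talk.

```python
import math

# Hypothetical collection: id -> (annotation keywords, toy 3-bin colour histogram).
collection = {
    "a": ({"jaguar", "car"}, [0.7, 0.2, 0.1]),
    "b": ({"jaguar", "animal"}, [0.1, 0.6, 0.3]),
    "c": ({"jaguar", "cat"}, [0.2, 0.5, 0.3]),
    "d": ({"tiger"}, [0.1, 0.6, 0.3]),
}

def text_filter(keyword):
    """Cheap text-based step: keep only objects annotated with the keyword."""
    return [oid for oid, (tags, _) in collection.items() if keyword in tags]

def rerank(candidates, example_descriptor):
    """Content-based step: sort the few candidates by Euclidean distance."""
    def distance(oid):
        return math.dist(collection[oid][1], example_descriptor)
    return sorted(candidates, key=distance)

candidates = text_filter("jaguar")
ranked = rerank(candidates, collection["b"][1])  # use object "b" as the visual example
print(ranked)
```

The expensive distance is evaluated only on the objects that survive the keyword filter, which is the point of the hybrid approach.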

Text-based retrieval
• Document vector model
  • User issues a keyword query (Google, Bing, …)
  • Efficient query evaluation using inverted files
• Problems
  • Manual annotation only for small data
  • Subjectivity of the annotation
  • Homonyms, etc.
• Automatic annotation
  • Surrounding text + linguistic methods + ontologies
  • Content-based keyword assignment
  • Still a lot of problems to solve…
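A minimal sketch of keyword query evaluation with an inverted file, assuming whitespace tokenization and AND semantics. The annotations and the function names `build_inverted_index` and `search` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document ids whose annotation contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query keyword."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical image annotations.
annotations = {
    1: "red sports car",
    2: "red apple on table",
    3: "jaguar car",
    4: "jaguar in jungle",
}
index = build_inverted_index(annotations)
print(search(index, "red car"))
print(search(index, "jaguar"))  # homonym: both the car and the animal match
```

Only the posting lists of the query keywords are touched, which is why this scales well. The second query also shows the homonym problem listed above.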

Example - www.google.com
• Text-based retrieval

Content-based retrieval
• All objects transformed into a similarity model
  • Objects represented by descriptors (histograms, signatures)
  • Descriptors compared by a distance measure d (Lp, SQFD, EMD)
  • User issues an example object as a query q
  • Objects x sorted according to the visual similarity d(q, x)
• How to solve the efficiency problem?
  • Hybrid techniques: not the whole DB is searched in the CB way
  • Distance-based indexes
  • Distributed architectures needed (storage, throughput, …)
[Figure labels: query object, feature extraction, similarity evaluation, feature extraction]
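The query-by-example model above can be sketched as a sequential scan, here with an Lp distance on toy histograms. The database content and the names `lp_distance` and `knn_query` are illustrative assumptions. SQFD or EMD could be plugged in as `dist`, at a much higher cost per comparison.

```python
import heapq

def lp_distance(x, y, p=2.0):
    """Minkowski (Lp) distance between two equal-length descriptors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn_query(database, q, k, dist=lp_distance):
    """Sequential scan: return the k nearest objects as (distance, id) pairs."""
    return heapq.nsmallest(k, ((dist(q, x), oid) for oid, x in database.items()))

# Hypothetical descriptors (3-bin histograms).
database = {
    "img1": [0.9, 0.1, 0.0],
    "img2": [0.5, 0.4, 0.1],
    "img3": [0.0, 0.2, 0.8],
}
q = [0.8, 0.2, 0.0]
result = knn_query(database, q, k=2)
print(result)
```

Every query computes one distance per database object, which is exactly the efficiency problem raised on the slide.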

Example - www.google.com
• Hybrid techniques: reranking, page 1

Example - www.google.com
• Hybrid techniques: reranking, page 2

Example - siret.ms.mff.cuni.cz/sir
• Hybrid techniques: exploration
J. Lokoč, T. Grošup, T. Skopal: Image Exploration using Online Feature Extraction and Reranking. ICMR, 2012, Hong Kong, China, ACM
J. Lokoč, T. Grošup, T. Skopal: SIR: The Smart Image Retrieval Engine. SISAP, 2012, Toronto, Canada, Springer

Distance-based indexing
• MM objects organized into clusters according to their similarity
• Effectiveness depends on the similarity model
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
J. Lokoč, P. Čech, J. Novák, T. Skopal: Cut-region: A Compact Building Block For Hierarchical Metric Indexing. SISAP, 2012, Toronto, Canada, Springer
D. Novak, M. Batko, P. Zezula: Metric Index: An efficient and scalable solution for precise and approximate similarity search. Information Systems, 2011, Elsevier
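One common building block of metric indexes is filtering with precomputed pivot distances and the triangle inequality. The sketch below is a generic pivot table, not the Cut-region or M-Index structure from the cited papers. The 2D points, the pivots and the function names are assumptions for illustration.

```python
import math

def build_pivot_table(objects, pivots, dist):
    """Precompute the distance from every object to every pivot."""
    return [[dist(x, p) for p in pivots] for x in objects]

def range_query(objects, pivots, table, q, radius, dist):
    """Return objects within radius of q and the number of distances computed."""
    q_to_pivots = [dist(q, p) for p in pivots]
    hits, computed = [], 0
    for x, row in zip(objects, table):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for every pivot p.
        lower_bound = max(abs(qp - xp) for qp, xp in zip(q_to_pivots, row))
        if lower_bound > radius:
            continue  # x cannot qualify, so d(q, x) is never computed
        computed += 1
        if dist(q, x) <= radius:
            hits.append(x)
    return hits, computed

# Hypothetical 2D data; real descriptors would be high-dimensional.
objects = [(0, 0), (1, 2), (9, 9), (10, 0), (2, 1)]
pivots = [(0, 0), (10, 0)]
table = build_pivot_table(objects, pivots, math.dist)
hits, computed = range_query(objects, pivots, table, (1, 1), 1.5, math.dist)
print(hits, computed)
```

The filtering is exact only if `dist` is a metric, which is one reason the choice of similarity model matters for indexing.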

Example - Mufin
• Content-based search in 100 million Flickr images

Example - Mufin
• MPEG-7 descriptors used: efficient, but effective?

Distance-based indexing
• Effective measure
  • Often complex and expensive
• Efficiency
  • Depends on the index performance
  • Depends also on the data "indexability"

Distance-based indexing
• Indexability depends on the distance distribution of the distance space used
E. Chavez, G. Navarro, R. Baeza-Yates, J. L. Marroquin: Searching in Metric Spaces. ACM Computing Surveys, 2001
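The cited survey quantifies indexability by the intrinsic dimensionality ρ = μ² / (2σ²), where μ and σ² are the mean and variance of the distance distribution. A rough sketch of estimating it from a sample follows. The uniform random data and the function name are assumptions for illustration.

```python
import itertools
import math
import random
import statistics

def intrinsic_dimensionality(sample, dist):
    """Estimate rho = mu^2 / (2 * sigma^2) from all pairwise distances of a sample."""
    distances = [dist(x, y) for x, y in itertools.combinations(sample, 2)]
    mu = statistics.fmean(distances)
    variance = statistics.pvariance(distances)
    return mu * mu / (2.0 * variance)

random.seed(0)
# Hypothetical data: uniform random vectors in 2 and in 32 dimensions.
low = [[random.random() for _ in range(2)] for _ in range(100)]
high = [[random.random() for _ in range(32)] for _ in range(100)]
rho_low = intrinsic_dimensionality(low, math.dist)
rho_high = intrinsic_dimensionality(high, math.dist)
print(rho_low, rho_high)
```

A high ρ means the distances concentrate around their mean, so lower bounds derived from the triangle inequality rarely exclude anything and the index degrades towards a sequential scan.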

Facing bad indexability
• Centralized computing
  • Approximate search
  • Parallel processing
• Distributed computing
  • Peer-to-peer architecture

Approximate search
• Based on various ideas
  • Early termination for good results
  • Reducing query radius
  • When time elapses
  • Accessing % of DB
• Also distance modifications
• However, for fast retrieval, the quality deteriorates rapidly
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
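The "accessing % of DB" idea can be sketched with a simple cluster-based index: clusters are visited in order of the query's distance to their pivot, and the scan stops once a fixed fraction of the database has been read. The random data, the pivot choice and the names `build_clusters` and `approx_knn` are assumptions for illustration.

```python
import heapq
import itertools
import math
import random

def build_clusters(objects, pivots):
    """Assign every object to the cluster of its nearest pivot."""
    clusters = [[] for _ in pivots]
    for x in objects:
        nearest = min(range(len(pivots)), key=lambda i: math.dist(x, pivots[i]))
        clusters[nearest].append(x)
    return clusters

def approx_knn(clusters, pivots, q, k, max_fraction):
    """Approximate kNN that reads at most max_fraction of the DB (but at least k objects)."""
    total = sum(len(c) for c in clusters)
    budget = max(k, int(total * max_fraction))
    # Most promising clusters first: those whose pivot is closest to q.
    order = sorted(range(len(pivots)), key=lambda i: math.dist(q, pivots[i]))
    candidates = itertools.chain.from_iterable(clusters[i] for i in order)
    visited = [(math.dist(q, x), x) for x in itertools.islice(candidates, budget)]
    return heapq.nsmallest(k, visited), len(visited)

random.seed(0)
objects = [(random.random(), random.random()) for _ in range(1000)]
pivots = objects[:10]  # hypothetical pivot selection
clusters = build_clusters(objects, pivots)
q = (0.5, 0.5)
approx, accessed = approx_knn(clusters, pivots, q, k=5, max_fraction=0.2)
print(accessed, approx[0])
```

Lowering `max_fraction` makes the query faster but increases the chance that true neighbours sitting in unvisited clusters are missed.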

Parallel processing
• Multi-core CPUs cheap and available
• Intel Xeon Phi coprocessor
• GPU cards with thousands of cores
• Amdahl's and Gustafson's law
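The two laws can be written down directly. In the sketch below, `parallel_fraction` is the share of the work that can be parallelized and `workers` is the number of cores. Both parameter names and the sample values are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup for a fixed problem size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's law: scaled speedup when the problem grows with the number of workers."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers

for workers in (8, 64, 1024):
    print(workers, amdahl_speedup(0.95, workers), gustafson_speedup(0.95, workers))
```

With a 5% serial part, Amdahl's law caps the speedup below 20 regardless of the number of cores, while Gustafson's law shows nearly linear scaling if the database (and thus the parallel part) grows with the hardware.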

Distributed indexes
• Peer-to-peer architecture
  • Chord protocol (efficient routing)
• M-Chord, M-Index
  • Map objects to real domain R
  • Use Chord protocol for object distribution
  • Query causes interval queries, results merged
D. Novak, P. Zezula: M-Chord: a scalable distributed similarity search structure. InfoScale, 2006, ACM
D. Novak, M. Batko, P. Zezula: Large-scale similarity data management with distributed Metric Index. Information Processing & Management
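The mapping step can be sketched on a single machine, assuming an iDistance-style key (cluster offset plus distance to the nearest pivot). This is a simplification and not the exact M-Chord or M-Index scheme. The constant `c`, the pivots, the data and the function names are assumptions, and the peer-to-peer routing is replaced by a sorted list.

```python
import bisect
import math
import random

def map_to_key(x, pivots, c):
    """Map x to a real number: cluster i occupies the key range [i*c, (i+1)*c)."""
    i = min(range(len(pivots)), key=lambda j: math.dist(x, pivots[j]))
    return i * c + math.dist(x, pivots[i])

def build(objects, pivots, c):
    """Sorted (key, object) list; a distributed index would split it among peers."""
    return sorted((map_to_key(x, pivots, c), x) for x in objects)

def range_query(index, pivots, c, q, r):
    """One interval query per cluster, then merge the verified results."""
    keys = [k for k, _ in index]
    hits = []
    for i, p in enumerate(pivots):
        dq = math.dist(q, p)
        lo = i * c + max(0.0, dq - r)
        hi = i * c + min(dq + r, c)
        if lo > hi:
            continue  # the query ball cannot intersect cluster i
        # In a P2P setting this interval would be routed to the responsible peers.
        for k, x in index[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]:
            if k < (i + 1) * c and math.dist(q, x) <= r:
                hits.append(x)
    return hits

random.seed(1)
objects = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(500)]
pivots = [(2, 2), (2, 8), (8, 2), (8, 8)]  # hypothetical pivots
c = 20.0  # assumed to exceed the largest possible distance (about 14.2 here)
index = build(objects, pivots, c)
hits = range_query(index, pivots, c, (5, 5), 1.5)
print(len(hits))
```

Because each cluster maps to its own key range, a similarity query turns into a handful of one-dimensional interval queries whose results are verified and merged.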

And all together

Thanks for your attention… Any questions?
