The "Security Seminar on BIG DATA" at the Police Academy of the Czech Republic in Prague explores the concept of "internet-scale" multimedia retrieval. Presenting statistics on global internet use, the seminar addresses the issues of searching vast multimedia collections. It reviews various retrieval techniques, including text-based, content-based, and hybrid approaches, highlighting their advantages and limitations. Key topics include the challenges of annotation, efficient query evaluation, and feature extraction. This discussion aims to shed light on future directions for improving multimedia retrieval systems.
Internet-scale MM retrieval
RNDr. Jakub Lokoč, Ph.D.
Siret Research Group (www.siret.cz)
Department of SW Engineering, Faculty of Mathematics and Physics
Charles University in Prague
Security Seminar BIG DATA, Police Academy of the Czech Republic in Prague
What does "internet-scale" mean?
Statistics for 2011 (source: http://royal.pingdom.com)
• 2.1 billion – Internet users worldwide
• 3.146 billion – number of email accounts worldwide
• 800+ million – number of users on Facebook
• 555 million – number of websites (+300 million in 2011)
• 1 trillion – number of video playbacks on YouTube
• 48 hours – amount of video uploaded to YouTube every minute
• 100 billion – estimated number of photos on Facebook
• 4.5 million – number of photos uploaded to Flickr each day
MM data
Many problems to solve…
Searching huge MM collections
• Text-based techniques
  • Advantage – scalable retrieval by inverted files
  • Problem – missing or misleading annotations
• Content-based techniques
  • Advantage – no annotation needed, visual similarity
  • Problem – slow retrieval for complex similarity models
• Hybrid techniques
  • Text-based query + content-based reranking/exploration
  • Content-based query + text-based filtering
  • Adapting content-based data for inverted files
Text-based retrieval
• Document vector model
  • User issues a keyword query (google, bing, …)
  • Efficient query evaluation using inverted files
• Problems
  • Manual annotation only for small data
  • Subjectivity of the annotation
  • Homonyms, etc.
• Automatic annotation
  • Surrounding text + linguistic methods + ontologies
  • Content-based keyword assignment
• Still a lot of problems to solve…
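
The keyword query evaluation over inverted files mentioned above can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: the function names, the whitespace tokenization, the AND semantics, and the sample annotations are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to the set of document ids whose annotation contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def keyword_query(index, query):
    """Return ids of documents containing every query keyword (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    # Intersect starting from the shortest posting list to keep the work small.
    postings.sort(key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result


# Hypothetical annotations attached to three images.
annotations = {
    "img1": "prague castle night",
    "img2": "prague bridge",
    "img3": "jaguar car",
}
index = build_inverted_index(annotations)
print(keyword_query(index, "prague castle"))
```

Only the posting lists of the query keywords are touched, which is why this style of retrieval scales; the result is only as good as the annotations, which is the problem the slide points out.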
Example – www.google.com
• Text-based retrieval
Content-based retrieval
• All objects transformed into a similarity model
  • Objects represented by descriptors (histograms, signatures)
  • Descriptors compared by a distance measure d (Lp, SQFD, EMD)
• User issues an example object as a query q
  • Objects x sorted according to the visual similarity d(q, x)
• How to solve the efficiency problem?
  • Hybrid techniques – not the whole DB is searched in the CB way
  • Distance-based indexes
  • Distributed architectures needed (storage, throughput, …)
[Figure labels: query object, feature extraction, similarity evaluation]
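
The query-by-example model above can be sketched as a sequential scan that ranks every object by d(q, x). This is a minimal sketch under stated assumptions: the descriptors are hypothetical 3-bin color histograms, and only the Lp distance is shown (SQFD and EMD are not implemented here).

```python
def lp_distance(x, y, p=2.0):
    """Minkowski (L_p) distance between two equal-length descriptors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def knn_query(database, q, k, dist=lp_distance):
    """Sequential scan: sort all objects by d(q, x) and keep the k nearest."""
    ranked = sorted(database.items(), key=lambda item: dist(q, item[1]))
    return [obj_id for obj_id, _ in ranked[:k]]


# Hypothetical 3-bin color histograms standing in for extracted features.
database = {
    "sunset": [0.7, 0.2, 0.1],
    "forest": [0.1, 0.8, 0.1],
    "ocean": [0.1, 0.2, 0.7],
}
query = [0.6, 0.3, 0.1]
print(knn_query(database, query, k=2))
```

The scan computes one distance per database object, which is exactly the efficiency problem the slide raises for large collections and expensive distances.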
Example – www.google.com
• Hybrid techniques – reranking, page 1
Example – www.google.com
• Hybrid techniques – reranking, page 2
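
The reranking idea in these examples (a text-based query followed by content-based reordering) can be sketched roughly as below. The item structure, the L1 distance, and the sample data are assumptions for illustration; the slides do not say how Google implements this.

```python
def l1_distance(x, y):
    """Manhattan distance between two equal-length descriptors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def text_then_rerank(items, keyword, example_descriptor, top_n):
    """Filter items by keyword, then reorder the candidates by visual distance."""
    candidates = [it for it in items if keyword in it["tags"]]
    candidates.sort(key=lambda it: l1_distance(example_descriptor, it["descriptor"]))
    return [it["id"] for it in candidates[:top_n]]


# Hypothetical collection: tags from annotations, descriptors from image content.
items = [
    {"id": "a", "tags": {"jaguar", "car"}, "descriptor": [0.9, 0.1]},
    {"id": "b", "tags": {"jaguar", "animal"}, "descriptor": [0.2, 0.8]},
    {"id": "c", "tags": {"jaguar", "cat"}, "descriptor": [0.3, 0.7]},
    {"id": "d", "tags": {"bridge"}, "descriptor": [0.25, 0.75]},
]
# The user picks an example image that looks like the animal.
print(text_then_rerank(items, "jaguar", [0.2, 0.8], top_n=2))
```

Only the keyword matches are compared visually, so the expensive similarity model runs on a small candidate set rather than on the whole database.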
Example – siret.ms.mff.cuni.cz/sir
• Hybrid techniques – exploration
J. Lokoč, T. Grošup, T. Skopal: Image Exploration using Online Feature Extraction and Reranking. ICMR, 2012, Hong Kong, China, ACM
J. Lokoč, T. Grošup, T. Skopal: SIR: The Smart Image Retrieval Engine. SISAP, 2012, Toronto, Canada, Springer
Distance-based indexing
• MM objects organized into clusters according to their similarity
• Effectiveness depends on the similarity model
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
J. Lokoč, P. Čech, J. Novák, T. Skopal: Cut-region: A Compact Building Block For Hierarchical Metric Indexing. SISAP, 2012, Toronto, Canada, Springer
D. Novak, M. Batko, P. Zezula: Metric Index: An efficient and scalable solution for precise and approximate similarity search. Information Systems, 2011, Elsevier
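
One common building block of distance-based indexes is pruning with the triangle inequality. The sketch below uses a flat pivot table rather than the hierarchical or cluster-based structures cited above, so it is a simplified stand-in and not the Cut-region or M-Index method; the class name, the Euclidean distance, and the sample points are assumptions.

```python
def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


class PivotTable:
    """Stores precomputed distances from every object to a few pivots."""

    def __init__(self, objects, pivots, dist=euclidean):
        self.objects = objects
        self.pivots = pivots
        self.dist = dist
        self.table = [[dist(x, p) for p in pivots] for x in objects]
        self.computed = 0  # distance computations spent on the last query

    def range_query(self, q, radius):
        """Return all objects x with d(q, x) <= radius."""
        q_to_pivots = [self.dist(q, p) for p in self.pivots]
        self.computed = len(self.pivots)
        result = []
        for x, x_to_pivots in zip(self.objects, self.table):
            # Triangle inequality: |d(q,p) - d(x,p)| is a lower bound on d(q,x).
            lower = max(abs(dq - dx) for dq, dx in zip(q_to_pivots, x_to_pivots))
            if lower > radius:
                continue  # pruned without computing d(q, x)
            self.computed += 1
            if self.dist(q, x) <= radius:
                result.append(x)
        return result


objects = [(0, 0), (1, 1), (5, 5), (9, 9), (10, 10)]
index = PivotTable(objects, pivots=[(0, 0), (10, 10)])
print(index.range_query((1, 0), radius=1.5), index.computed)
```

The pruning is exact only when d is a metric; how many objects it removes depends on the data and the distance, which the following slides discuss as indexability.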
Example – Mufin
• Content-based search in 100 million Flickr images
Example – Mufin
• MPEG-7 descriptors used – efficient, but effective?
Distance-based indexing
• Effective measure
  • Often complex and expensive
• Efficiency
  • Depends on the index performance
  • Depends also on the data "indexability"
Distance-based indexing
• Indexability depends on the distance distribution of the used distance space
E. Chavez, G. Navarro, R. Baeza-Yates, J. L. Marroquin: Searching in Metric Spaces. ACM Computing Surveys, 2001
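
The cited survey quantifies this dependence through the intrinsic dimensionality ρ = μ² / (2σ²), where μ and σ² are the mean and variance of the distance distribution; a higher ρ means distances concentrate around the mean and pruning gets harder. The sketch below estimates ρ from sampled pairs. The sampling scheme, the uniform random vectors, and the function names are assumptions for illustration.

```python
import random
import statistics


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def intrinsic_dimensionality(objects, dist, samples=2000, seed=0):
    """Estimate rho = mu^2 / (2 * sigma^2) from sampled pairwise distances."""
    rng = random.Random(seed)
    distances = []
    for _ in range(samples):
        x, y = rng.sample(objects, 2)  # two distinct objects
        distances.append(dist(x, y))
    mu = statistics.mean(distances)
    var = statistics.pvariance(distances)
    return mu * mu / (2.0 * var)


def random_vectors(n, dim, seed):
    """Hypothetical data: n vectors drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


low = intrinsic_dimensionality(random_vectors(500, 2, seed=1), euclidean)
high = intrinsic_dimensionality(random_vectors(500, 32, seed=1), euclidean)
print(round(low, 1), round(high, 1))
```

On such synthetic data the 32-dimensional set yields a much larger estimate than the 2-dimensional one, which matches the intuition that its distances are harder to index.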
Facing bad indexability
• Centralized computing
  • Approximate search
  • Parallel processing
• Distributed computing
  • Peer-to-peer architecture
Approximate search
• Based on various ideas
  • Early termination once the results are good enough
  • Reducing the query radius
  • Stopping when the time limit elapses
  • Accessing only a given % of the DB
  • Also distance modifications
• However, for fast retrieval, the quality deteriorates rapidly
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
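
One of the ideas listed above, accessing only a percentage of the database, can be sketched as follows. This is a hypothetical illustration, not an algorithm taken from the cited book: the cluster layout, the visiting order by distance to the cluster center, and the parameter `max_fraction` are assumptions.

```python
def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def approximate_knn(clusters, q, k, max_fraction, dist=euclidean):
    """k-NN that stops after scanning at most max_fraction of all objects.

    clusters is a list of (center, members) pairs; the clusters whose
    centers are nearest to q are visited first.
    """
    total = sum(len(members) for _, members in clusters)
    budget = max(1, int(total * max_fraction))
    visited = 0
    candidates = []
    for center, members in sorted(clusters, key=lambda c: dist(q, c[0])):
        for x in members:
            if visited >= budget:
                break
            candidates.append((dist(q, x), x))
            visited += 1
        if visited >= budget:
            break  # early termination: the access budget is spent
    candidates.sort()
    return [x for _, x in candidates[:k]], visited


clusters = [
    ((0, 0), [(0, 0), (1, 0), (0, 1)]),
    ((10, 10), [(10, 10), (9, 10), (10, 9)]),
]
print(approximate_knn(clusters, (1, 1), k=2, max_fraction=0.5))
```

Lowering `max_fraction` makes the query faster but increases the chance that true nearest neighbors sit in clusters that are never visited, which is the quality trade-off the slide warns about.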
Parallel processing
• Multi-core CPUs are cheap and available
• Intel Xeon Phi coprocessor
• GPU cards with thousands of cores
• Amdahl's and Gustafson's laws
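
The two laws named above give different estimates of the speedup from n workers when a fraction p of the work can run in parallel. Amdahl's law assumes a fixed problem size, while Gustafson's law assumes the problem grows with the number of workers. A small sketch, with p = 0.95 chosen purely as an example value:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Fixed problem size: speedup = 1 / ((1 - p) + p / n)."""
    p, n = parallel_fraction, workers
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(parallel_fraction, workers):
    """Problem size scaled with the workers: speedup = (1 - p) + p * n."""
    p, n = parallel_fraction, workers
    return (1.0 - p) + p * n


# Example value only: 95 % of the work is assumed to be parallelizable.
for n in (4, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2), round(gustafson_speedup(0.95, n), 2))
```

Under Amdahl's law the speedup can never exceed 1 / (1 - p), so thousands of GPU cores help a fixed-size query only if the serial part is tiny; under Gustafson's law the same hardware is used to process proportionally more data in the same time.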
Distributed indexes
• Peer-to-peer architecture
  • Chord protocol (efficient routing)
• M-Chord, M-Index
  • Map objects to the real domain R
  • Use the Chord protocol for object distribution
  • A query causes interval queries, results are merged
D. Novak, P. Zezula: M-Chord: A Scalable Distributed Similarity Search Structure. InfoScale, 2006, ACM
D. Novak, M. Batko, P. Zezula: Large-scale similarity data management with distributed Metric Index. Information Processing & Management
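
The mapping of objects to a real domain and the resulting interval queries can be sketched with an iDistance-style key. The slide does not spell out the mapping, so the key formula, the separation constant `c`, and the sorted list standing in for the peers of a Chord ring are all assumptions of this sketch.

```python
def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def idistance_key(x, pivots, c, dist=euclidean):
    """Map x to a real key: index of its nearest pivot times c, plus the distance to it.

    c must exceed the largest possible distance so that clusters do not overlap.
    """
    i, d = min(((i, dist(x, p)) for i, p in enumerate(pivots)), key=lambda t: t[1])
    return i * c + d


def query_intervals(q, radius, pivots, c, dist=euclidean):
    """Key intervals that may hold objects within radius of q, at most one per pivot."""
    intervals = []
    for i, p in enumerate(pivots):
        d = dist(q, p)
        lo = i * c + max(0.0, d - radius)
        hi = i * c + min(c, d + radius)
        if lo <= hi:
            intervals.append((lo, hi))
    return intervals


def range_query(objects, q, radius, pivots, c, dist=euclidean):
    """Run the interval queries over the keyed objects and merge the verified results."""
    # A sorted list stands in for the key space that peers would share.
    keyed = sorted((idistance_key(x, pivots, c, dist), x) for x in objects)
    result = []
    for lo, hi in query_intervals(q, radius, pivots, c, dist):
        for key, x in keyed:
            if lo <= key <= hi and dist(q, x) <= radius:
                result.append(x)
    return result


pivots = [(0, 0), (10, 10)]
objects = [(1, 1), (2, 0), (9, 9), (10, 8)]
print(range_query(objects, (1, 0), 1.5, pivots, c=20.0))
```

Because each interval is a contiguous range of real keys, a routing protocol that partitions the key space among peers can forward every interval only to the peers responsible for it, and the partial answers are merged at the end.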
And all together
Thanks for your attention … any questions?