Efficient summarization framework for multi-attribute uncertain data
Jie Xu, Dmitri V. Kalashnikov, Sharad Mehrotra
The Summarization Problem
Given an uncertain data set of objects O1, …, On with attributes such as face (e.g. Jeff, Kate), location (e.g. LA), and visual concepts (e.g. water, plant, sky), produce either:
• an extractive summary: a small subset of the objects, or
• an abstractive summary: a description such as "Kate and Jeff's wedding at LA".
Summarization Process
Model the information contained in each object of the dataset (what information does this image contain?), then extract the best subset of objects as the summary.
Metrics:
• Coverage (Agrawal, WSDM'09; Li, WWW'09; Liu, SDM'09; Sinha, WWW'11)
• Diversity (Vee, ICDE'08; Ziegler, WWW'05)
• Quality (Sinha, WWW'11)
Existing Techniques
• Image: Kennedy et al. WWW'08; Simon et al. ICCV'07
• Customer review: Hu et al. KDD'04; Ly et al. CoRR'11
• Doc/micro-blog: Inouye et al. SocialCom'11; Li et al. WWW'09; Liu et al. SDM'09; Sinha et al. WWW'11
These techniques do not consider information in multiple attributes and do not deal with uncertain data.
Challenges
Design a summarization framework for:
• Multi-attribute data: objects carry several attributes, e.g. face tags, visual concepts, event, time, location.
• Uncertain/probabilistic data: attribute values come from data processing (e.g. vision analysis) with confidence scores, e.g. P(sky) = 0.7, P(people) = 0.9.
Limitations of existing techniques – 1
Existing techniques typically model and summarize a single information dimension:
• only information about visual content (Kennedy et al. WWW'08; Simon et al. ICCV'07)
• only information about review content (Hu et al. KDD'04; Ly et al. CoRR'11)
What information is in the image?
• Elemental IUs – single attribute values: {sky}, {plant}, …, {Kate}, {Jeff}, {wedding}, {12/01/2012}, {Los Angeles}. Is that all?
• Intra-attribute IUs – combinations of values from the same attribute: {Kate, Jeff}, {sky, plant}, …
• Inter-attribute IUs – combinations of values across attributes: {Kate, LA}, {Kate, Jeff, wedding}, …
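To make the three IU types concrete, here is a minimal sketch; the dictionary layout and helper names are hypothetical illustrations, not the paper's API:

```python
from itertools import combinations

# Hypothetical layout: an object maps each attribute to its set of values.
photo = {
    "face": {"Kate", "Jeff"},
    "visual": {"sky", "plant"},
    "event": {"wedding"},
}

def elemental_ius(obj):
    # One IU per single attribute value, e.g. {Kate}, {sky}.
    return [frozenset([v]) for vals in obj.values() for v in vals]

def intra_attribute_ius(obj):
    # Pairs of values drawn from the same attribute, e.g. {Kate, Jeff}.
    return [frozenset(p) for vals in obj.values()
            for p in combinations(sorted(vals), 2)]

def inter_attribute_ius(obj):
    # Pairs of values drawn from two different attributes, e.g. {Kate, wedding}.
    return [frozenset((v1, v2))
            for a1, a2 in combinations(list(obj), 2)
            for v1 in obj[a1] for v2 in obj[a2]]

print(intra_attribute_ius(photo))  # [frozenset({'Jeff', 'Kate'}), ...]
```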
Are all information units interesting?
• Is {Sharad, Mike} an interesting intra-attribute IU? Yes: they often have coffee together and appear frequently in other photos.
• Is {Liyan, Ling} interesting? Yes from my perspective, because they are both my close friends.
• Are all of the 2^n combinations of people interesting? Shall we select a summary that covers all this information? Probably not: I don't care about person X and person Y who happen to appear together in the photo of a large group.
Mine for interesting information units
Treat each object's face set as a transaction:
T1 = {Jeff, Kate}, T2 = {Tom}, T3 = {Jeff, Kate, Tom}, T4 = {Kate, Tom}, T5 = {Jeff, Kate}, …, Tn = {Jeff, Kate}
A modified item-set mining algorithm extracts the IUs that are frequent and correlated, e.g. {Jeff, Kate}.
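A much-simplified stand-in for that mining step; the thresholds and the lift-based correlation test are illustrative assumptions, not the paper's modified algorithm:

```python
from collections import Counter
from itertools import combinations

def interesting_pairs(transactions, min_support=0.4, min_lift=1.2):
    # Keep value pairs that are both frequent (support) and positively
    # correlated (lift > 1 means they co-occur more than chance predicts).
    n = len(transactions)
    singles, pairs = Counter(), Counter()
    for t in transactions:
        for v in t:
            singles[v] += 1
        for p in combinations(sorted(t), 2):
            pairs[p] += 1
    result = []
    for (a, b), c in pairs.items():
        support = c / n
        lift = support / ((singles[a] / n) * (singles[b] / n))
        if support >= min_support and lift >= min_lift:
            result.append(frozenset((a, b)))
    return result

# On the slide's transactions this keeps {Jeff, Kate} and drops the rest:
print(interesting_pairs([{"Jeff", "Kate"}, {"Tom"}, {"Jeff", "Kate", "Tom"},
                         {"Kate", "Tom"}, {"Jeff", "Kate"}]))
```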
Mine for interesting information units
Interesting IUs can also be mined from social context (e.g. Jeff is a friend of Kate; Tom is a close friend of the user), yielding IUs such as {Jeff, Kate} and {Tom}.
Limitation of existing techniques – 2
Existing techniques cannot handle probabilistic attributes: when an IU appears with, say, P(Jeff) = 0.6 in one object and P(Jeff) = 0.8 in another, it is not certain whether an object in the summary covers an IU in another object, so deterministic coverage is no longer well defined.
Deterministic Coverage Model – Example
With deterministic attributes, coverage is the fraction of the dataset's information units covered by the summary's objects, e.g. Coverage = 8 / 14.
Probabilistic Coverage Model
Cov(S, O) = E[I(S)] / E[I(O)]: the expected amount of information covered by the summary S, divided by the expected amount of total information in the dataset O. The model is simplified so that it can be computed efficiently: the expectation can be evaluated in polynomial time, and the function is submodular.
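A sketch of how such an expected coverage could be evaluated, assuming, as an illustration and not necessarily the paper's exact model, that IU memberships are independent across objects:

```python
def expected_coverage(S, O, p):
    # p[o][u]: probability that object o contains information unit u.
    # Independence across objects is an illustrative simplifying assumption.
    ius = {u for o in O for u in p[o]}
    def expected_info(objs):
        total = 0.0
        for u in ius:
            miss = 1.0
            for o in objs:
                miss *= 1.0 - p[o].get(u, 0.0)
            total += 1.0 - miss  # P(at least one object in objs contains u)
        return total
    return expected_info(S) / expected_info(O)
```

Each IU contributes the probability that at least one selected object contains it, so the whole ratio is computable in time polynomial in |O| and the number of IUs.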
Optimization Problem for Summarization
• Parameters: dataset O = {o1, o2, …, on} and a positive number K.
• Goal: find a summary S ⊆ O with |S| ≤ K that maximizes the expected coverage Cov(S, O).
• Finding the summary with maximum expected coverage is NP-hard.
• We developed an efficient greedy algorithm to solve it.
Basic Greedy Algorithm
Initialize S = ∅. While |S| < K: for each object o in O \ S, compute Cov(S ∪ {o}, O); select the o* with maximum coverage and add it to S.
Two costs dominate:
• Computing Cov is expensive (addressed by the object-level optimization).
• Too many Cov computations per iteration (addressed by the iteration-level optimization).
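A direct sketch of this loop, reusing the hypothetical expected_coverage above; it performs K iterations, each scanning all remaining objects and recomputing the full coverage, which is exactly the cost the two optimizations target:

```python
def greedy_summary(O, p, K, cov=expected_coverage):
    # Basic greedy: repeatedly add the object with the largest coverage gain.
    S = []
    for _ in range(K):
        base = cov(S, O, p)
        best, best_gain = None, 0.0
        for o in O:
            if o in S:
                continue
            gain = cov(S + [o], O, p) - base
            if best is None or gain > best_gain:
                best, best_gain = o, gain
        if best is None:
            break
        S.append(best)
    return S
```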
Efficiency optimization – Object-level
• Reduce the time required to compute the coverage contribution of one object.
• Instead of directly computing and optimizing the coverage in each iteration, compute the gain of adding one object o to the summary S: gain(S, o) = Cov(S ∪ {o}, O) − Cov(S, O).
• Updating gain(S, o) incrementally is much more efficient.
Submodularity of Coverage
The expected coverage Cov(S, O) is submodular: for all S ⊆ T ⊆ O and o ∉ T,
Cov(S ∪ {o}, O) − Cov(S, O) ≥ Cov(T ∪ {o}, O) − Cov(T, O).
Efficiency optimization – Iteration-level
• Reduce the number of object-level computations (i.e. of gain(S, o)) in each iteration of the greedy process.
• While traversing the objects in O \ S, we maintain:
• the maximum gain so far, gain*;
• an upper bound Upper(S, o) ≥ gain(S, o) for each remaining object, which follows from submodularity and can be updated in constant time.
• Prune an object o if Upper(S, o) < gain*: by definition its gain cannot beat the current best.
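This pruning is in the same spirit as the classic lazy-greedy trick (Minoux): by submodularity, a gain computed against an earlier, smaller summary is an upper bound on the current gain. A sketch, again assuming the hypothetical expected_coverage rather than the paper's exact bookkeeping:

```python
import heapq

def lazy_greedy_summary(O, p, K, cov=expected_coverage):
    # Stale gains on the heap play the role of Upper(S, o): by
    # submodularity they can only overestimate the current gain.
    S = []
    base = cov(S, O, p) if O else 0.0
    # Max-heap of (-upper_bound, tiebreaker, object), seeded with true gains.
    heap = [(-(cov([o], O, p) - base), i, o) for i, o in enumerate(O)]
    heapq.heapify(heap)
    while len(S) < K and heap:
        _, i, o = heapq.heappop(heap)
        gain = cov(S + [o], O, p) - base  # refresh the stale bound
        if not heap or gain >= -heap[0][0]:
            # No remaining upper bound beats o's true gain: select o.
            # Every object still on the heap was pruned without recomputation.
            S.append(o)
            base += gain
        else:
            heapq.heappush(heap, (-gain, i, o))  # re-insert tightened bound
    return S
```

On a monotone submodular objective this returns the same summary as the basic greedy while typically refreshing far fewer gains per iteration.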
Experiment – Datasets
• Facebook Photo Set: 200 photos uploaded by 10 Facebook users; attributes: visual concept, face, event, time.
• Review Dataset: reviews of 10 hotels from TripAdvisor, about 250 reviews per hotel; attributes: rating, facets.
• Flickr Photo Set: 20,000 photos from Flickr; attributes: visual concept, event, time.
Experiment – Efficiency
The basic greedy algorithm without the optimizations runs for more than 1 minute.
Summary
• Developed a new extractive summarization framework for multi-attribute and uncertain/probabilistic data.
• Generates high-quality summaries.
• Highly efficient.