
Searching Web Forums




Presentation Transcript


  1. Searching Web Forums Amélie Marian, Rutgers University Joint work with Gayatree Ganu

  2. Forum Popularity and Search • Forums with most traffic [http://rankings.big-boards.com] • BMW • 50K unique visitors/day • 25M Posts • 0.6M Members • Filipino Community • Subaru Impreza Owners • Rome Total War • … • Pakistan Cricket Fan Site • Prison Talk • Online Money Making Despite their popularity, forums lack good search capabilities

  3. Outline • Multi-Granularity Search • Challenges • Unstructured text • Background information omitted • Discussion digression • Contributions • Return each result at varying focus levels, allowing more or less context (CIKM 2013) • Egocentric Search • Challenges • Multiple interpersonal relations with varying importance • Contributions • Proposed a multidimensional user similarity measure • Used authorship to improve personalized and keyword search

  4. Dataset: Hierarchical Model [Figure: hierarchy with Thread, Post, Sentence, and Word levels, connected by weighted containment edges] • Hierarchy over objects at three searchable levels • pertinent sentences, larger posts, entire discussions or threads • The hierarchy captures strength of association and the containment relationship • Lower levels hold smaller objects • An edge represents containment • An edge weight of 2 indicates that the text of the child was repeated in the text of the parent
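
To make the hierarchy concrete, here is a minimal Python sketch under assumed names (Node, children, parents, and the example text are illustrative, not the authors' implementation):

```python
# Minimal sketch of the three-level forum hierarchy (hypothetical names).
# Each node keeps weighted containment edges to its children; an edge
# weight of 2 marks child text that is repeated (quoted) in the parent.

class Node:
    def __init__(self, label, level, text=""):
        self.label = label     # e.g., "Thread 1", "Post 3", "Sent 2"
        self.level = level     # "thread", "post", or "sentence"
        self.text = text       # raw text; used for matching at the leaves
        self.children = []     # list of (child, edge_weight) pairs
        self.parents = []      # a quoted sentence can have several parents

    def add_child(self, child, weight=1):
        self.children.append((child, weight))
        child.parents.append(self)

# Build a tiny fragment: a thread containing one post and one sentence.
thread1 = Node("Thread 1", "thread")
post1 = Node("Post 1", "post")
sent1 = Node("Sent 1", "sentence", text="has anyone tried a cold cap")
thread1.add_child(post1)
post1.add_child(sent1, weight=2)   # sentence text repeated in the post
```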

  5. Alternate Scoring Functions Score_tf*idf(t, d) = (1 + log(tf_t,d)) × log(N / df_t) × 1 / CharLength(d)
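
A one-function sketch of this length-normalized tf*idf variant (the argument names are illustrative):

```python
import math

def tfidf_score(tf_td, df_t, N, char_length):
    """Score_tf*idf(t, d) = (1 + log(tf_t,d)) * log(N / df_t) * 1 / CharLength."""
    if tf_td == 0 or df_t == 0 or char_length == 0:
        return 0.0
    return (1 + math.log(tf_td)) * math.log(N / df_t) / char_length

# Example: term appears 3 times in a 500-character post, in 40 of 301,000 posts.
print(tfidf_score(3, 40, 301_000, 500))
```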

  6. Scoring Multi-Granularity Results Goal: Unified scoring for objects at multiple granularity levels • largely varying sizes • with an inherent containment relationship Hierarchical Scoring Function (HScore) • Score for node i with respect to search term t, where i has children j:
HScore(i, t) = Σ_j (ew_ij × HScore(j, t) / P(j)) / C(i) … if i is a non-leaf node
HScore(i, t) = 1 … if i is a leaf node containing t
HScore(i, t) = 0 … if i is a leaf node not containing t
where ew_ij = edge weight between parent i and child j, P(j) = number of parents of j, C(i) = number of children of i
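
A recursive sketch of HScore over the Node class sketched earlier; how the slide's normalizers ew_ij, P(j), and C(i) combine is reconstructed from their definitions, so treat the exact arrangement as an assumption rather than the paper's verbatim formula:

```python
def hscore(node, term):
    """Hierarchical score of `node` for search term `term` (sketch).

    Leaves score 1 if they contain the term and 0 otherwise; a non-leaf
    averages its children's scores over C(i), weighting each child by the
    containment edge weight ew_ij and dividing by its parent count P(j)
    so that repeated (quoted) text is not over-counted.
    """
    if not node.children:                              # leaf node
        return 1.0 if term in node.text else 0.0
    total = sum(ew * hscore(child, term) / len(child.parents)
                for child, ew in node.children)
    return total / len(node.children)                  # divide by C(i)
```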

  7. Effect of Size Weighting Parameter α on HScore • The parameter α controls the intermixing of granularities [Chart: number of results at each granularity in the top-20 list, as the size parameter α varies]

  8. Multi-Granularity Result Generation [Figure: scored hierarchy; Thread1(0.1), Thread2(0.1); Post1(2.1), Post2(2), Post3(2.5), Post4(0.1); Sent1(1.6), Sent2(1.5), Sent3(1.4), Sent4(1.3), Sent5(0.1), Sent6(0.4)] Sorted Ordering: Post3(2.5), Post1(2.1), Post2(2), Sent1(1.6), Sent2(1.5), Sent3(1.4), Sent4(1.3), Sent6(0.4), Sent5(0.1), Post4(0.1), Thread1(0.1), Thread2(0.1) For result size k=4, optimizing for the sum of scores: • Overlap: {Post3, Post1, Post2, Sent1} Sum Score = 8.2, but Sent1 is contained in Post1, so its 1.6 is double-counted • Greedy: {Post3, Post1, Post2, Sent6} Sum Score = 7.0 • Best: {Post3, Post2, Sent1, Sent2} Sum Score = 7.6 In 33% of sample queries, at least 3 of the top-10 results overlapped

  9. Multi-Granularity Result Generation Goal: Generate a non-overlapping result set maximizing “quality” • Quality = sum of scores of all results in the set • Maximal independent set problem (NP-hard) • Existing algorithm: Lexicographic All Independent Sets (LAIS) outputs maximal independent sets with polynomial delay, in a specific order

  10. Optimal Algorithm for k-set (OAKS) • Fix a node ordering by decreasing scores • Efficient OAKS algorithm (typically k << n): • Start with the k-sized first independent set, i.e., the greedy set • Branch from nodes preceding the kth node of the set; check whether the branch is maximal • Find new k-sized maximal sets and save them in a priority queue • Reject sets from the priority queue whose starting node occurs after the current best set’s kth node
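
A Python sketch of one branching iteration of this idea (matching the worked example on the next slide); the helper names are assumptions, and a complete OAKS run would keep branching from queued sets while applying the rejection rule above:

```python
import heapq
import itertools

def greedy_complete(nodes, k, overlaps, forced=()):
    """Greedily extend `forced` to a k-sized non-overlapping set, or None.

    `nodes` is a list sorted by decreasing score; `overlaps(u, v)` is True
    when one object contains the other in the hierarchy.
    """
    chosen = list(forced)
    for u in nodes:
        if len(chosen) == k:
            break
        if u not in chosen and all(not overlaps(u, v) for v in chosen):
            chosen.append(u)
    return chosen if len(chosen) == k else None

def oaks_first_iteration(nodes, k, score, overlaps):
    """Greedy set plus every branch from a node before its k-th member.

    Assumes at least one k-sized non-overlapping set exists.
    """
    tie = itertools.count()              # tiebreaker for equal sum-scores
    best = greedy_complete(nodes, k, overlaps)
    queue = [(-sum(map(score, best)), next(tie), best)]
    for u in nodes[:nodes.index(best[-1])]:          # candidate branch nodes
        if u in best:
            continue
        # Keep set members ranked before u that do not overlap u, add u,
        # then complete greedily; a full OAKS would also test maximality.
        kept = [v for v in best
                if nodes.index(v) < nodes.index(u) and not overlaps(u, v)]
        branch = greedy_complete(nodes, k, overlaps, forced=kept + [u])
        if branch is not None:
            heapq.heappush(queue, (-sum(map(score, branch)), next(tie), branch))
    return heapq.heappop(queue)[2]       # highest sum-score set found
```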

  11. OAKS [Figure: the same scored hierarchy as in slide 8] Sorted Ordering: Post3(2.5), Post1(2.1), Post2(2), Sent1(1.6), Sent2(1.5), Sent3(1.4), Sent4(1.3), Sent6(0.4), Sent5(0.1), Post4(0.1), Thread1(0.1), Thread2(0.1) For k=4, Greedy = {Post3, Post1, Post2, Sent6}, SumScore = 7.0. Branches are taken from the nodes before Sent6, i.e., Sent1, Sent2, Sent3, Sent4. Branching from Sent1 removes all nodes adjacent to Sent1 → {Post3, Post2, Sent1}. Maximal on the first 4 nodes? Yes! Complete it to size k and insert it in the queue: {Post3, Post2, Sent1, Sent2}. After the 1st iteration: {Post3, Post2, Sent1, Sent2} SumScore = 7.6; {Post3, Post1, Sent3, Sent4} SumScore = 7.3

  12. Evaluating the OAKS Algorithm Comparing OAKS runtime: small overhead for practical k (=20) • Scoring time = 0.96 sec • OAKS result set generation time = 0.09 sec OAKS improves over the Greedy SumScore in 31% of queries @ top-20 Comparing LAIS and OAKS • 100 relatively infrequent queries, with corpus frequency in the ranges 20-30, 30-40, … • OAKS is very efficient; the time required by OAKS depends on k

  13. Dataset and Evaluation Setting • Data collected from breastcancer.org • 31K threads, 301K posts, 1.8M unique sentences, 46K keywords • 18 sample queries • e.g., broccoli, herceptin side effects, emotional meltdown, scarf or wig, shampoo recommendation, … • Experimental search strategies – top-20 results • Mixed-Hierarchy: optimal mixed-granularity results • Posts-Hierarchy: hierarchical scoring of posts only • Posts-tf*idf: existing traditional search • Mixed-BM25: mixed-granularity results with BM25 scoring

  14. Evaluating Perceived Relevance Graded relevance scale: Exactly relevant answer, Relevant but too broad, Relevant but too narrow, Partially relevant answer, Not relevant Crowd-sourced relevance using Mechanical Turk • Over 7 annotations per result • Quality control: honey-pot questions • EM algorithm for consensus

  15. Evaluating Perceived Relevance [Chart: relevance judgments per strategy] Mixed-Hierarchy clearly outperforms the post-only methods: users perceive higher relevance in mixed-granularity results

  16. Egocentric Search • The previous technique did not take the authorship of posts into account • Some forum participants are similar, sharing the same topics of interest or having the same needs, though not necessarily at the same time • Rank similar authors’ posts higher for personalized search • Some forum participants are experts: prolific and knowledgeable • Expert opinions carry more weight in keyword search • An author score can enhance both personalized and keyword search

  17. Author Score [Figure: multidimensional heterogeneous graph linking authors, queries, and topics, with edge weights W(a,t), W(q,t), and W(a1,a2) for relations such as co-participation and explicit references] • Forum participants have several reasons to be linked • Build a multidimensional heterogeneous graph over authors, incorporating many relations • But users assign different importance to different relations • User profiles: • Location • Age • Cancer stage • Treatment • …

  18. Contributions Critical problem for leveraging authorship for search: incorporating multiple user relations with varying importance, learned egocentrically from user behavior Outline: • Author score computation using a multidimensional graph • Personalized predictions of user interactions: which authors are most likely to provide answers • Re-ranking keyword search results using author expertise

  19. Multi-Dimensional Random Walks (MRW) [Figure: example graph over nodes a, b, c with its relation matrix A, dangling-node matrix D, and uniform matrix E] • Random Walks (RW) for finding the most influential users • P_t+1 = M × P_t … iterate until convergence • M = α(A + D) + (1 − α)E … relation matrix A, D for dangling nodes, uniform matrix E, α usually set to 0.85 • Rooted RW for node similarity • Teleport back to the root node with probability (1 − α) • Computes the similarity of all nodes w.r.t. the root node • Multidimensional RW for heterogeneous networks: • Transition matrix computed as A = λ_1 × A_1 + λ_2 × A_2 + … + λ_n × A_n, where Σ_i λ_i = 1 and all λ_i ≥ 0 • Egocentric weights – for root node r: λ_i(r) = Σ_m ew_A_i(r, m) / Σ_k Σ_j ew_A_k(r, j), where m ranges over edges in A_i and j over edges in each A_k
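
A sketch of the rooted multidimensional walk, assuming each relation comes as a row-stochastic transition matrix A_i plus a raw edge-weight matrix for the egocentric weights (all names are illustrative):

```python
import numpy as np

def egocentric_lambdas(weight_mats, root):
    """lambda_i(r) = sum_m ew_A_i(r, m) / sum_k sum_j ew_A_k(r, j)."""
    totals = np.array([W[root].sum() for W in weight_mats], dtype=float)
    return totals / totals.sum()

def rooted_mrw(transition_mats, lambdas, root, alpha=0.85, iters=100):
    """Rooted multidimensional random walk with restart (sketch).

    Combines the per-relation transition matrices into a single matrix
    A = sum_i lambda_i * A_i, then iterates P <- alpha * P A + (1 - alpha) * e_r,
    teleporting back to the root node with probability 1 - alpha.
    """
    n = transition_mats[0].shape[0]
    A = sum(l * M for l, M in zip(lambdas, transition_mats))
    restart = np.zeros(n)
    restart[root] = 1.0
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = alpha * (p @ A) + (1 - alpha) * restart
    return p                    # similarity of every node w.r.t. the root
```

For the non-rooted walk used later (slide 23), the restart vector would be uniform over all nodes rather than concentrated on the root.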

  20. Personalized Answer Search • Link prediction by leveraging user similarities: • Given participant behavior, find users similar to the one asking the question • Predict who will respond to this question • Learn similarities from the first 90% of threads (training set) • Relations used: • Topics covered in text, co-participation in threads, signature profiles, proximity of posts • MRW similarity compared with baselines: • Single relations • PathSim: • Existing approach for heterogeneous networks • Predefined paths of fixed length • No dynamic choice of path Link prediction enables suggesting which threads or which users to follow

  21. Predicting User Interactions [Chart: MAP for link prediction across methods] The multidimensional RW has the best prediction performance

  22. Predicting User Interactions • Leverage the content of the initial post to find users who are experts on the question • TopicScore computed as the cosine similarity between the author’s history and the initial post • UserScore = β × MRWScore + (1 − β) × TopicScore [Chart: MAP % improvement over purely-MRW scoring, as β ranges from purely topical expertise to purely MRW]
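
A small sketch of this combination, assuming bag-of-words (or similar) vectors for the author's history and the initial post; β = 0.5 is only a placeholder:

```python
import numpy as np

def topic_score(author_history_vec, initial_post_vec):
    """Cosine similarity between an author's posting history and the post."""
    denom = np.linalg.norm(author_history_vec) * np.linalg.norm(initial_post_vec)
    return float(author_history_vec @ initial_post_vec / denom) if denom else 0.0

def user_score(mrw_score, topic, beta=0.5):
    """UserScore = beta * MRWScore + (1 - beta) * TopicScore."""
    return beta * mrw_score + (1 - beta) * topic   # beta tunes the trade-off
```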

  23. Enhanced Keyword Search • Non-rooted RW to find the most influential expert users • Re-rank the top-k results of IR scoring using author scores • Final score of a post = ω × IR_score_λ + (1 − ω) × Authority_score • Posts-only results, tf*idf scoring with size parameter λ [Chart: MAP improvement of roughly 4-5%] Re-ranking search results with the author score yields higher MAP relevance
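
A sketch of the re-ranking step with hypothetical field names; ω = 0.8 is only a placeholder weight:

```python
def rerank(results, authority, omega=0.8):
    """Re-rank top-k IR results: omega * IR_score + (1 - omega) * Authority_score.

    `results` is a list of (post_id, author_id, ir_score) tuples and
    `authority` maps author_id to a non-rooted random-walk score.
    """
    def final(r):
        _, author_id, ir_score = r
        return omega * ir_score + (1 - omega) * authority.get(author_id, 0.0)
    return sorted(results, key=final, reverse=True)
```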

  24. Patient Emotion and stRucture Search USer tool (PERSEUS) – Conclusions • Designed a hierarchical model and score that allow generating search results at several granularities of web forum objects • Proposed the OAKS algorithm for the best non-overlapping result set • Conducted extensive user studies, showing that a mixed collection of granularities yields better relevance than post-only results • Combined multiple relations linking users for computing similarities • Enhanced search results using multidimensional author similarity • Future directions: • Multi-granular search on web pages, blogs, emails; dynamic focus level selection • Search in and out of context over dialogue, interviews, Q&A • Optimal result set selection for targeted advertising, result diversification • Time-sensitive recommendations – changing friendships, progressive search needs

  25. Thank you!

  26. Why Random Walks? [Figure: rooted RW examples (a)-(d) on small graphs with root r, plus a multidimensional example over two relations A1 and A2 with roots r1 and r2; resulting similarities: score(b w.r.t. r1) = 0.072, score(c w.r.t. r1) = 0.096, score(b w.r.t. r2) = 0.097, score(c w.r.t. r2) = 0.066]

  27. LAIS [Figure: the same scored hierarchy as in slide 8] Sorted Ordering: Post3(2.5), Post1(2.1), Post2(2), Sent1(1.6), Sent2(1.5), Sent3(1.4), Sent4(1.3), Sent6(0.4), Sent5(0.1), Post4(0.1), Thread1(0.1), Thread2(0.1) Greedy = {Post3, Post1, Post2, Sent6}. In the 1st iteration: {Post3, Post2, Sent1, Sent2, Sent6} {Post3, Post1, Sent3, Sent4, Sent6} {Post1, Post2, Sent6, Sent5} {Post3, Post1, Post2, Post4} {Post3, Sent6, Thread1} {Post1, Post2, Thread2}

  28. Current Search Functionality on breastcancer.org • Filtering criteria: keyword search, member search • Ranking based on date • Posts-only results
