Tag-Cloud Drawing: Algorithms for Cloud Visualization

Tag-Cloud Drawing: Algorithms for Cloud Visualization. Owen Kaser, University of New Brunswick, Saint John, NB, Canada; Daniel Lemire, Université du Québec à Montréal, Montréal, QC, Canada. WWW'07: 16th International World Wide Web Conference.


Presentation Transcript


  1. Tag-Cloud Drawing: Algorithms for Cloud Visualization. Owen Kaser, University of New Brunswick, Saint John, NB, Canada; Daniel Lemire, Université du Québec à Montréal, Montréal, QC, Canada. WWW'07: 16th International World Wide Web Conference

  2. Introduction • Tag clouds usually use font size to show the relative importance or frequency of each tag. • A consequence is wasted white space, which is problematic on small-display devices. • Clumps of white space are not aesthetically pleasing. • We try to optimize the display of tag clouds and to place associated tags near one another. • We use EDA algorithms, such as min-cut placement, for area minimization and clustering in tag clouds. • We use the Knuth-Plass algorithm for text justification and a book-placement exercise considered by Skiena.

  3. Related Work • Tag clouds have been attributed to Coupland but were popularized by the Web site Flickr. • Tag clouds are commonly associated with folksonomies and social software. • Graph drawing suggests some metrics that make graphs easy to understand and pleasing to the eye.

  4. Related Work (Cont.) • Other types of tag-cloud display. • Hassan-Montero and Herrero-Solana have proposed improving tag clouds by clustering similar tags together. • Millen et al. have proposed that users be able to dynamically remove less significant tags and to add an index to large clouds. • Bielenberg has proposed circular clouds, where the most heavily weighted tags appear closer to the center. • Dubinko et al. have proposed a model to represent tags over a time line. • Russell has proposed Cloudalicious, a tool to study the evolution of a tag cloud over time. • Jaffe et al. have integrated tag clouds inside maps for displaying tags that carry geographical information, such as pictures taken at a given location.

  5. Related Work (Cont.) • Improving HTML layout • Hurst et al. showed that it is possible to make HTML tables more pleasing. • There is ongoing work to improve the layout of text in HTML pages using Cascading Style Sheets.

  6. Background • Typesetting • EDA: Physical Design

  7. Typesetting Greedy Method • Fits as many words per line as possible, starting a new line whenever further words cannot be placed on the current line. • This approach is used by most browsers. • The greedy approach can be done on-line, without waiting for the end of the paragraph. • It is fast but can produce suboptimal solutions.
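
The greedy method described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `greedy_break`, the fixed inter-word `space`, and the sample widths are assumptions made for the example, not details taken from the presentation.

```python
def greedy_break(widths, line_width, space=4):
    """Greedy (first-fit, in order) line breaking.

    widths: widths of the words/tags, in reading order.
    Returns a list of lines, each a list of 0-based item indices.
    """
    lines, current, used = [], [], 0
    for i, w in enumerate(widths):
        # Width of the current line if item i were appended to it.
        needed = w if not current else used + space + w
        if current and needed > line_width:
            # Item does not fit: close the line and start a new one.
            lines.append(current)
            current, needed = [], w
        current.append(i)
        used = needed
    if current:
        lines.append(current)
    return lines


print(greedy_break([32, 45, 24, 60, 50], line_width=128))
```

Because each item is placed as soon as it is read, the method never revisits an earlier break, which is what makes it usable on-line and also what makes it potentially suboptimal.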

  8. Typesetting (Cont.) Dynamic Programming • Knuth and Plass compute an optimal solution using dynamic programming. • The TeX system can quickly determine where to break lines and fit text onto the page. • Their total-fit algorithm minimizes the sum of the squares of each line's badness.

  9. Typesetting (Cont.) • The total-fit algorithm can be summarized as follows, excluding hyphenation and penalties. • Label the words of a paragraph from 1 to n. • b_{k,j}: the badness of a line containing words k to j, with b_{k,j} = 0 when k > j. • t_j: the minimal possible sum of squared line badnesses when the j-th word ends a line, with t_0 = 0. • K_j: for j > 1, the last word of the line prior to the one ending with the j-th word. • We can compute t_j and K_j for all j = 1, …, n in O(n^2) time and O(n) space.
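
A minimal sketch of such a dynamic program follows. It assumes the usual recurrence t_j = min over 1 ≤ k ≤ j of (t_{k-1} + b_{k,j}^2), which is not written out on the slide, and it uses a simplified, hypothetical badness (the leftover width of the line, infinite if the line overflows unless it holds a single word). The last line is penalized like any other, for simplicity.

```python
import math
from itertools import accumulate


def total_fit(widths, line_width, space=4):
    """Simplified total-fit line breaking (no hyphenation, no penalties).

    Returns (t_n, lines), where lines lists the 1-based word indices per line.
    """
    n = len(widths)
    prefix = [0] + list(accumulate(widths))  # prefix[j] = w_1 + ... + w_j

    def badness(k, j):
        # Assumed badness b_{k,j}: leftover width when words k..j share a line.
        slack = line_width - (prefix[j] - prefix[k - 1]) - space * (j - k)
        if slack >= 0:
            return slack
        return 0 if k == j else math.inf  # a lone overwide word is tolerated

    t = [0.0] + [math.inf] * n   # t[j]: best cost when word j ends a line
    prev = [0] * (n + 1)         # prev[j]: last word of the preceding line (K_j)
    for j in range(1, n + 1):
        for k in range(1, j + 1):
            cost = t[k - 1] + badness(k, j) ** 2
            if cost < t[j]:
                t[j], prev[j] = cost, k - 1

    lines, j = [], n
    while j > 0:                 # walk the K_j pointers backwards
        lines.append(list(range(prev[j] + 1, j + 1)))
        j = prev[j]
    return t[n], lines[::-1]


print(total_fit([32, 45, 24, 60, 50], line_width=128))
```

The two nested loops give the O(n^2) running time, and the arrays `t` and `prev` account for the O(n) space.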

  10. EDA: Physical Design • Electronic design automation (EDA) is the category of tools for designing and producing electronic systems. • Placement and floorplanning are two closely related stages in many physical design flows. • Mathematically, floorplanning and placement solve the same problem. • Floorplanning is often done early in the design stage and gives a "bird's-eye" view of the layout. • Placement, on the other hand, is typically done with complete knowledge of module shapes. • Recent tools blur the distinction.

  11. EDA: Physical Design (Cont.) Placement Approaches in EDA • Placement problems are typically NP-hard. • Approaches include force-directed placement, simulated annealing, and min-cut placement. • For speed, min-cut placement is often chosen.

  12. Models For Cloud Optimization • Tag Clouds with Inline Text • Tag Clouds with Arbitrary Placement • Tag Relationships

  13. Tag Clouds with Inline Text • Inline text is a paragraph (block) made exclusively of inline HTML elements such as span, font, em, b, i, strong, a and br. • Any area outside a tag but inside the tag cloud will be referred to as "white". • In the primary view, the width and height of each tag are fixed. • However, the height and width of a tag can still be changed using HTML style attributes or CSS. • The model does not include a penalty for squeezing tags or spaces. • It does not take into account symmetry or homogeneity.

  14. Tag Clouds with Inline Text Badness Measuring • k: number of tags on the line. • i: index ranging from 1 to k. • h_i: height of tag i. • w_i: width of tag i. • W: normal width of a white space. • h: line height, h = max h_i. • The badness of a line is a function only of the set of tag dimensions (w_i, h_i).

  15. Tag Clouds with Inline Text Example 1. • The tags on the line, in (width, height) format, are (32, 14), (45, 16), (24, 12). • The tag-cloud width w is 128 pixels, and the expected white-space width between tags is 4 pixels, so W = 4. • The line height is h = max{14, 16, 12} = 16. • The extra white space on the line is 128 − 32 − 45 − 24 − (2 × 4) = 19 pixels, contributing 19 × 16 = 304 to the badness. • The first and last tags are shorter than the second tag; together they contribute 32(16 − 14) + 24(16 − 12) = 160. • Total badness: 304 + 160 = 464.
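
The arithmetic of Example 1 can be checked with a short Python sketch. The formula used here, badness = h · (w − Σ w_i − (k − 1) · W) + Σ w_i · (h − h_i), is read off the worked example rather than quoted from the slides, and the function name `line_badness` is made up for the illustration.

```python
def line_badness(tags, cloud_width, space=4):
    """White area of one line of tags, following the steps of Example 1.

    tags: list of (width, height) pairs; space: normal white-space width W.
    """
    h = max(height for _, height in tags)               # line height
    extra = cloud_width - sum(w for w, _ in tags) - space * (len(tags) - 1)
    white_at_end = extra * h                             # unused width, full height
    white_above = sum(w * (h - height) for w, height in tags)  # above short tags
    return white_at_end + white_above


print(line_badness([(32, 14), (45, 16), (24, 12)], cloud_width=128))  # 464
```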

  16. Tag Clouds with Inline Text • In the spirit of the Knuth-Plass total-fit algorithm, we might define the overall badness of a tag cloud as the sum of the squares of the line badnesses. • (l2) Summing the squares of the line badnesses has the benefit of penalizing more heavily solutions with some very bad lines. • (l1) Merely summing the line badnesses tends to produce shorter clouds. • (l∞) Minimizing the maximum badness across all lines might generate very tall clouds.
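
As an illustration, the three ways of aggregating line badness can be written as one small Python function. The function name, the `norm` labels and the sample values are assumptions made for this sketch.

```python
def cloud_badness(line_badnesses, norm="l2"):
    """Aggregate per-line badness values into one cloud badness."""
    if norm == "l1":      # plain sum of the line badnesses
        return sum(line_badnesses)
    if norm == "l2":      # sum of squares, as in Knuth-Plass total-fit
        return sum(b * b for b in line_badnesses)
    if norm == "linf":    # worst single line
        return max(line_badnesses)
    raise ValueError(f"unknown norm: {norm}")


lines = [464, 120, 30]    # hypothetical per-line badness values
for name in ("l1", "l2", "linf"):
    print(name, cloud_badness(lines, name))
```

With these sample values the l2 aggregate is dominated by the single bad line (464), which shows why it penalizes solutions containing a few very bad lines.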

  17. Tag Clouds with Arbitrary Placement Assumptions • tags may be reordered and placed arbitrarily (but without overlap or rotation) in the plane; • tag relationships are known, and strongly related tags should be in close proximity; • tag-cloud width has an upper bound; • tag-cloud height should be small, to reduce scrolling; • (optional) tags may be deformed slightly (made shorter but wider, for instance), so long as tag area remains (nearly) constant; • (optional) large clumps of white space are bad.

  18. Tag Clouds with Arbitrary Placement (Cont.) • There is no analogue to a "line" of tags when arbitrary placement is allowed. • Instead, we sum the white area surrounding the tags. • Another goal is to obtain spatial clustering of semantically related tags. • Small values of the clustering measure indicate better clustering.

  19. Tag Relationships • One method of determining tag relationships counts co-occurrences, that is, the number of times a pair of tags has been assigned to the same resource. • Another view is that each resource corresponds to a hyperedge in a hypergraph, whose members are the tags assigned to it. • For instance, the hyperedge {bottle, gas, beer} in the hypergraph view corresponds to the edges { (bottle, gas), (bottle, beer), (gas, beer) } in the pairwise view. • We use a graph rather than a hypergraph.
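
The conversion from hyperedges (one per resource) to a weighted co-occurrence graph can be sketched as follows. The function name and the sample resources are hypothetical; only the {bottle, gas, beer} hyperedge comes from the slide.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(resources):
    """Turn per-resource tag sets (hyperedges) into weighted pairwise edges.

    Returns a Counter mapping (tag_a, tag_b), with tag_a < tag_b, to the
    number of resources on which both tags appear.
    """
    edges = Counter()
    for tags in resources:
        # Each hyperedge of k tags expands into k*(k-1)/2 graph edges.
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


graph = cooccurrence_graph([{"bottle", "gas", "beer"}, {"beer", "bottle"}])
for edge, weight in sorted(graph.items()):
    print(edge, weight)
```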

  20. Tag Relationships (Cont.)

  21. Solutions • Cloud Layout with Inline Text • Cloud Layout with Arbitrary Placement

  22. Cloud Layout with Inline Text • Apply dynamic programming or shelf-packing. • The first breed of algorithms takes an ordered list of tags and chooses where to break lines. • First, design a simple greedy method: tags are added to a line until the line is full, and a new line is created when needed. • Then apply the Knuth-Plass algorithm, except that: • the last line is not an exception: it cannot be half empty without penalty; • if, and only if, a tag exceeds the maximal width, then it will be given a line of its own; no other overfull lines are allowed.

  23. Cloud Layout with Inline Text (Cont.) • The second breed of algorithms attempts to decrease the badness (an NP-hard problem). • Strip packing problem (SPP) (tags are randomly shuffled 10 times) • Other heuristic methods are based on approximation algorithms for SPP: • NEXT FIT DECREASING HEIGHT (NFDH) • FIRST FIT DECREASING HEIGHT (FFDH) • FIRST FIT DECREASING HEIGHT WEIGHT (FFDHW)
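
The textbook NFDH and FFDH shelf heuristics can be sketched in Python as below. This is a generic strip-packing illustration under simplifying assumptions (no white space between tags, a tag wider than the cloud gets its own shelf); it is not the presenters' implementation, and the FFDHW variant is not covered because the slides do not define it.

```python
def shelf_pack(tags, cloud_width, first_fit=True):
    """Pack (width, height) tags onto shelves, tallest tags first.

    first_fit=True  -> FFDH: use the first shelf with enough room.
    first_fit=False -> NFDH: only the most recent shelf is considered.
    """
    shelves = []  # each shelf: {"used": width used, "height": ..., "tags": [...]}
    for w, h in sorted(tags, key=lambda t: t[1], reverse=True):
        candidates = shelves if first_fit else shelves[-1:]
        for shelf in candidates:
            if shelf["used"] + w <= cloud_width:
                shelf["used"] += w
                shelf["tags"].append((w, h))
                break
        else:
            # No candidate shelf had room: open a new shelf for this tag.
            shelves.append({"used": w, "height": h, "tags": [(w, h)]})
    return shelves


def total_height(shelves):
    """Cloud height: each shelf is as tall as its first (tallest) tag."""
    return sum(shelf["height"] for shelf in shelves)


tags = [(60, 20), (70, 18), (50, 16), (30, 10), (40, 8)]
print("NFDH height:", total_height(shelf_pack(tags, 100, first_fit=False)))
print("FFDH height:", total_height(shelf_pack(tags, 100, first_fit=True)))
```

On this made-up input FFDH reuses leftover room on earlier shelves and ends up shorter than NFDH.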

  24. Cloud Layout with Arbitrary Placement Min-cut Placement • Min-cut placement decomposes a collection of tags by bipartitioning; each group is then recursively split. • Ideally, the bipartition should be fairly balanced. • The cut size (the number, or perhaps total weight, of edges/hyperedges containing tags in both groups) should be small. • "Outside" tags should influence the partitioning. • Min-cut placement can run in O(m log n) time if we use the Fiduccia-Mattheyses bipartitioning heuristic.
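
To make the notions of cut size and recursive bipartitioning concrete, here is a toy Python sketch. It finds each balanced bipartition by brute force, so it is exponential and only suitable for a handful of tags; it is not the Fiduccia-Mattheyses heuristic, and it ignores the influence of "outside" tags. All names and the sample weights are hypothetical.

```python
from itertools import combinations


def cut_size(left, tags, weights):
    """Total weight of edges with exactly one endpoint in `left`."""
    return sum(w for (a, b), w in weights.items()
               if a in tags and b in tags and (a in left) != (b in left))


def bipartition(tags, weights):
    """Brute-force balanced bipartition with minimum cut (tiny inputs only)."""
    ordered = sorted(tags)
    tag_set = set(ordered)
    best = min(combinations(ordered, len(ordered) // 2),
               key=lambda left: cut_size(set(left), tag_set, weights))
    left = list(best)
    right = [t for t in ordered if t not in best]
    return left, right


def place(tags, weights):
    """Recursively split a non-empty tag list; returns a nested-list tree."""
    if len(tags) == 1:
        return tags[0]
    left, right = bipartition(tags, weights)
    return [place(left, weights), place(right, weights)]


weights = {("a", "b"): 3, ("c", "d"): 3, ("b", "c"): 1}
print(place(["a", "b", "c", "d"], weights))
```

Strongly related tags (a with b, c with d) end up in the same subtree, which is the clustering effect min-cut placement aims for.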

  25. Cloud Layout with Arbitrary Placement (Cont.)

  26. Cloud Layout with Arbitrary Placement (Cont.) Slicing Floorplans • Recursive bipartitioning’s effect can be represented in a slicing tree.

  27. Cloud Layout with Arbitrary Placement (Cont.) Nested Tables for Slicing Floorplans • Each table is either 2x1 or 1x2, depending on whether the slicing-tree node is tagged 'H' or 'V'.
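
A slicing tree can be turned into nested HTML tables with a short recursive function. The sketch below assumes that an 'H' node is a horizontal cut (two rows, one column) and a 'V' node a vertical cut (one row, two columns); the slide does not say which is which, and the tuple-based tree encoding is made up for the example.

```python
from html import escape


def to_table(node):
    """Render a slicing tree as nested HTML tables.

    A leaf is a tag string; an internal node is (cut, first, second)
    with cut equal to "H" or "V".
    """
    if isinstance(node, str):
        return escape(node)
    cut, first, second = node
    a, b = to_table(first), to_table(second)
    if cut == "H":   # assumed: 2x1 table, children stacked vertically
        return f"<table><tr><td>{a}</td></tr><tr><td>{b}</td></tr></table>"
    if cut == "V":   # assumed: 1x2 table, children side by side
        return f"<table><tr><td>{a}</td><td>{b}</td></tr></table>"
    raise ValueError(f"unknown cut type: {cut!r}")


tree = ("H", ("V", "beer", "bottle"), "gas")
print(to_table(tree))
```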

  28. Cloud Layout with Arbitrary Placement (Cont.) EDA Placement Is Not (Quite) Tag Placement • We could simply feed our tag-cloud data to an EDA placement tool, but we found it appropriate to modify the tool. • Long tags are unusual for EDA. • Tags cannot be rotated. • Tags do not need to consider wire area. • Each tag in a cluster is related to every other tag, and thus dividing them should be much more expensive. • Solution-quality and running-time requirements differ.

  29. Experimental Results • Test Data • Tag Clouds with In-line Text • Tag Clouds with Arbitrary Placement

  30. Test Data • Tags and their accompanying importance levels (0-9) were obtained from ZoomClouds and Project Gutenberg. On average, clouds had 93 tags. • ZoomClouds is a Web site using the Yahoo! Content Analysis API. • The experiment retrieved 65 different tag clouds and normalized the weights with a linear function. • Test data were also derived from word co-occurrences in 20 e-books produced by Project Gutenberg. • The importance i of tag T was determined from f, r and t, which are respectively the frequencies of the most frequent tag, the least frequent retained tag, and the tag T.

  31. Tag Clouds with In-line Text • Clouds of alphabetically sorted tags are, on average, 40% larger than clouds of weight-sorted tags. • Dynamic programming does not reduce the area of the tag clouds for weight-sorted tags, but offers a reduction of about 3% for alphabetically sorted tags. • The random-shuffling algorithm does worse than sorting by weight.

  32. Tag Clouds with In-line Text (Cont.) • The NFDH heuristic gives about the same average tag-cloud height as does the weight-sorted greedy algorithm. • The FFDH and FFDHW heuristics offer an average reduction of about 3% in the height of the ZoomClouds tag clouds, and of 1% and 2% respectively for the Project Gutenberg tag clouds.

  33. Tag Clouds with In-line Text (Cont.) • (l∞) can generate unacceptably tall tag clouds (3 times taller than normal). • The difference in height between the (l1) and (l2) aggregates is well below 1%. • The most competitive algorithms are FFDH, FFDHW and either the greedy or dynamic-programming algorithms applied to weight-sorted tags. • If the l1 norm is chosen, the FFDHW heuristic is the clear winner and dynamic programming is not worth the effort. • If the sum of squares is preferred, it is a close race.

  34. Tag Clouds with In-line Text (Cont.)

  35. Tag Clouds with Arbitrary Placement • Some changes to the EDA algorithms: • Modified the program to perform graph bipartitioning rather than hypergraph partitioning. • Fiduccia-Mattheyses heuristic. • Estimating the correct amount of "padding" area is not required for tag placement. • Added an estimate of the absolute width of a floorplan area.

  36. Tag Clouds with Arbitrary Placement (Cont.) Results • Interestingly, the 100-tag case ran faster than the 50-tag case. • Floorplan sizing was only a small part of the overall time. • C-soft shows unacceptably long runtimes.

  37. Experimental Results – Tag Clouds with Arbitrary Placement (Cont.) • compaSS produces tighter clouds than the other algorithms. • The sorted greedy heuristic used 2–19% less area than the min-cut heuristic. • With 200 tags, it is remarkable that the more sophisticated compaSS approach was not as good as the greedy heuristic.

  38. Tag Clouds with Arbitrary Placement (Cont.) • The min-cut approach clearly (and unsurprisingly) outperformed the greedy approaches and compaSS. • compaSS is apparently better at grouping than the sorted greedy heuristic. This is counterintuitive and reveals a weakness in using Equation 1.

  39. Conclusion • Future work should include browser-based implementations. • For in-line text, our cloud-badness model is probably incomplete since it ignores some basic symmetry issues. • Given the differences between tag-cloud layout and EDA placement, we plan to test an industrial-strength min-cut placement tool. • The new hyphenate property might encourage the use of slightly more sophisticated line-breaking algorithms in browsers.
