
Efficient Visualization of Document Streams

Efficient Visualization of Document Streams. Miha Grčar {miha.grcar@ijs.si}, Vid Podpečan, Matjaž Juršič, Prof. Dr. Nada Lavrač. Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia. Discovery Science, Canberra, October 2010.


Presentation Transcript


  1. Efficient Visualization of Document Streams • Miha Grčar {miha.grcar@ijs.si}, Vid Podpečan, Matjaž Juršič, Prof. Dr. Nada Lavrač • Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia • Discovery Science, Canberra, October 2010

  2. Outline • Motivation • Original algorithm • Document corpus visualization pipeline • Our modified algorithm • Visualization of document streams • Experiments (speed tests) • Conclusions and further work DS 2010

  3. Motivation • Visualization of Document Corpora

  4. Motivation • Goal: Visualization of Document Streams • [Diagram labels: Document stream, Outdated documents]

  5. Corpus Visualization Pipeline (Paulovich et al., 2006) • [Diagram labels: Document corpus, Corpus preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Layout]

  6. Corpus Visualization Pipeline • Corpus preprocessing: tokenization, stop-word removal, lemmatization, n-grams → sparse TF-IDF vectors in a high-dimensional space • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation
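The preprocessing steps on this slide can be sketched in a few lines. This is only an illustration, not the authors' code: the stop-word list is a tiny made-up sample, lemmatization is left out, and the function name `preprocess` and the parameter `max_n` are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}

def preprocess(text, max_n=2):
    """Turn raw text into a sparse bag of n-grams (term -> count)."""
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]     # stop-word removal
    # Lemmatization is omitted in this sketch; a real pipeline would map
    # each token to its lemma at this point.
    terms = []
    for n in range(1, max_n + 1):                           # n-grams, n = 1..max_n
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return Counter(terms)
```

The resulting term counts are the TF values that a later weighting step would turn into TF-IDF weights.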

  7. Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Annotation: iterative method

  8. Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Annotations: iterative method; high-dimensional → 2D
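Assuming the "high-dimensional → 2D" annotation refers to the stress majorization step, the following sketch shows one standard form of it: the SMACOF update (Guttman transform) with unit weights. It is a textbook variant chosen for illustration, not necessarily what the authors implemented; `stress` and `smacof_step` are hypothetical names, and `D` is an assumed matrix of target distances between points.

```python
import math

def stress(X, D):
    """Raw stress: sum over pairs of (layout distance - target distance)^2."""
    n = len(X)
    return sum((math.dist(X[i], X[j]) - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def smacof_step(X, D):
    """One Guttman-transform update with unit weights: X <- (1/n) B(X) X."""
    n = len(X)
    new_X = []
    for i in range(n):
        x = y = 0.0
        for j in range(n):
            if i == j:
                continue
            d = math.dist(X[i], X[j])
            ratio = D[i][j] / d if d > 1e-12 else 0.0   # target / current distance
            x += ratio * (X[i][0] - X[j][0])
            y += ratio * (X[i][1] - X[j][1])
        new_X.append((x / n, y / n))
    return new_X
```

Repeating `smacof_step` never increases the stress, which is why the method is iterative and why a good starting layout shortens the run.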

  9. Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation

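  10. Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • $p_i = \frac{1}{|N_p|} \sum_{r \in N_p} r$ • $p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) r = (0, 0)$, $k = |N_p|$ • $c_i = (x_i^*, y_i^*)$ • $AX = B$, solved as $\arg\min_X \{\|AX - B\|^2\}$ • $A$: one row per point, with 1 on the diagonal and $-1/k$ in the columns of its neighbors, followed by one row per control point containing a single 1 • $X = [(x_1, y_1), (x_2, y_2), \dots, (x_{n-1}, y_{n-1}), (x_n, y_n)]^T$ • $B = [(0, 0), \dots, (0, 0), (x_1^*, y_1^*), \dots, (x_r^*, y_r^*)]^T$ • Annotation: iterative method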
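To make the system on this slide concrete, here is a minimal sketch that builds A and B from neighbor lists and control-point positions and minimizes ||AX - B||^2. The slide annotates this step as an iterative method; for brevity this sketch swaps in a direct solve of the normal equations, which is only practical for small examples. The names `solve` and `interpolate` and the input format are assumptions.

```python
def solve(M, R):
    """Solve M X = R by Gauss-Jordan elimination with partial pivoting.
    M is n x n, R is n x 2 (one column per layout coordinate)."""
    n = len(M)
    aug = [list(M[i]) + list(R[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [(aug[i][n] / aug[i][i], aug[i][n + 1] / aug[i][i]) for i in range(n)]

def interpolate(neighbors, controls):
    """neighbors[i]: indices of the neighbors of point i.
    controls: dict mapping a point index to its fixed 2D position."""
    n = len(neighbors)
    A, B = [], []
    for i, nbrs in enumerate(neighbors):   # p_i - (1/k) * sum of neighbors = (0, 0)
        row = [0.0] * n
        row[i] = 1.0
        for j in nbrs:
            row[j] = -1.0 / len(nbrs)
        A.append(row)
        B.append((0.0, 0.0))
    for i, pos in controls.items():        # p_i = c_i for each control point
        row = [0.0] * n
        row[i] = 1.0
        A.append(row)
        B.append(pos)
    # Normal equations (A^T A) X = A^T B give the minimizer of ||AX - B||^2.
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(n)] for i in range(n)]
    AtB = [[sum(r[i] * b[c] for r, b in zip(A, B)) for c in range(2)] for i in range(n)]
    return solve(AtA, AtB)

# Three points on a chain; the two end points are control points.
layout = interpolate([[1], [0, 2], [1]], {0: (0.0, 0.0), 2: (2.0, 0.0)})
```

Because the control-point rows enter the least-squares problem like any other row, the control points are pulled toward their neighbors rather than pinned exactly.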

  11. Stream Visualization Pipeline • [Diagram labels: Document stream, Buffer (FIFO), Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Outdated documents]

  12. Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • TF-IDF weights • TF: the number of times the term occurs in the document • DF: the number of documents in the corpus containing the term • IDF: log(|D| / DF) • Not possible to compute IDF from (infinite) real-time streams
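The definitions on this slide translate directly into code for a finite corpus. The sketch below is a plain batch computation under two assumptions: the logarithm is the natural log (the slide does not give a base), and documents arrive as token lists. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one sparse {term: weight} dict per document."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))             # DF: number of documents containing the term
    n_docs = len(corpus)                # |D|
    vectors = []
    for doc in corpus:
        tf = Counter(doc)               # TF: occurrences of the term in this document
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors
```

Both |D| and DF need the whole corpus, which is the reason the slide gives for why IDF cannot be computed from an unbounded stream.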

  13. Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • [Diagram labels: buffered TF vectors; Vocabulary, DF values; TF vector → TF-IDF vector]
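One way to read this diagram is that the buffer stores raw TF vectors while a shared vocabulary tracks DF values for the documents currently buffered, so TF-IDF weights can be derived on demand. The sketch below follows that reading; it is an assumption about the design, not the authors' code, and the class name `StreamingTfIdf` and its methods are hypothetical.

```python
import math
from collections import Counter, deque

class StreamingTfIdf:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()       # FIFO buffer of TF vectors
        self.df = Counter()         # vocabulary with DF values over the buffer

    def add(self, tokens):
        """Insert a document; return the evicted (outdated) TF vector, if any."""
        tf = Counter(tokens)
        self.buffer.append(tf)
        self.df.update(tf.keys())               # each distinct term adds 1 to its DF
        evicted = None
        if len(self.buffer) > self.capacity:
            evicted = self.buffer.popleft()     # oldest document leaves first
            self.df.subtract(evicted.keys())
            for term in evicted:
                if self.df[term] == 0:
                    del self.df[term]           # drop terms no longer in the buffer
        return evicted

    def tf_idf(self, tf):
        """Weight a buffered TF vector with the current DF values."""
        n = len(self.buffer)
        return {t: c * math.log(n / self.df[t]) for t, c in tf.items()}
```

Updating DF costs time proportional to the number of distinct terms in the entering and leaving documents, independent of the buffer size.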

  14. Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Warm start!

  15. Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Warm start!
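A warm start means that an iterative step begins from the previous result instead of from scratch. The sketch below illustrates the idea for k-means using plain Lloyd iterations; it assumes 2D tuples as points and a hypothetical function `kmeans` that takes the starting centroids as an argument, and it is not the authors' implementation.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm started from the given centroids.
    Returns (centroids, number of iterations used)."""
    for it in range(1, max_iter + 1):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if the cluster is empty.
        new = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:            # assignments are stable: converged
            return new, it
        centroids = new
    return centroids, max_iter
```

After a small update to the buffer, passing the centroids from the previous run as the `centroids` argument usually leaves only a few iterations of work, because most documents stay in the same cluster.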

  16. Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Remove outdated instances • Add new instances …

  17. Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Remove outdated instances • Add new instances … • Warm start! • [Figure: the least-squares system $AX = B$ shown repeatedly as instances are removed and added, with unknowns $(x_1, y_1), \dots, (x_n, y_n)$ and right-hand side $(0, 0), \dots, (0, 0), (x_1^*, y_1^*), \dots, (x_r^*, y_r^*)$]

  18. Speed Tests • First 30,000 news items from Reuters Corpus Vol. 1 ("natural" rate: 1.4 news items / minute) • Experimental setting • Maximum rate? • 10 news items in a batch (u = 10) • Buffer capacity: n_Q = 5,000 news items • 100 control points, 30 + 30 neighbors

  19. Speed Tests

  20. Speed Tests • Processing delay: ~9 sec + 4 sec to form a batch • Exit delay: ~4 sec • Exit frequency: ~1/4 batches per sec (2.5 docs / sec) • [Diagram labels: Document stream, Buffer (FIFO), Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Outdated documents]
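The throughput figure follows from the batch size and the exit frequency. The short calculation below reproduces it and compares it with the "natural" Reuters rate quoted in the experimental setting; the variable names are illustrative only.

```python
batch_size = 10                  # u: documents per batch
exit_interval_s = 4.0            # about one batch leaves the pipeline every 4 s
throughput = batch_size / exit_interval_s        # documents per second

natural_rate = 1.4 / 60          # "natural" Reuters rate, documents per second
speedup = throughput / natural_rate              # how much faster than the natural rate

batch_wait_s = batch_size / throughput           # time to collect one batch at full rate
total_delay_s = 9.0 + batch_wait_s               # processing delay plus batch formation
```

With these numbers the pipeline sustains 2.5 documents per second, roughly 107 times the natural rate, and a document waits about 13 seconds in total when the 4 seconds of batch formation are counted.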

  21. Speed Tests

  22. Conclusions and Further Work • Conclusions • Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes) • Tricks: warm start, pipelining, parallelization • Further work • Performance at different n_Q and u? • Optimize k-means (done!) and k-NN (easy) • Find use cases, perform user studies • Decision making in financial domain (FIRST) • Press clipping (media monitoring)
