Efficient Visualization of Document Streams
This presentation describes a method for visualizing document streams, based on an existing corpus visualization algorithm (Paulovich et al., 2006) modified to work on streams. It covers the full visualization pipeline: preprocessing, k-means clustering, stress majorization, neighborhood computation, and least-squares interpolation. Speed tests on the first 30,000 news articles of Reuters Corpus Vol. 1 show a throughput of about 2.5 documents per second. The conclusions summarize the online, distance-preserving visualization technique and list directions for further work.
Presentation Transcript
Efficient Visualization of Document Streams • Miha Grčar {miha.grcar@ijs.si}, Vid Podpečan, Matjaž Juršič, Prof. Dr. Nada Lavrač • Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia • Discovery Science, Canberra, October 2010
Outline • Motivation • Original algorithm • Document corpus visualization pipeline • Our modified algorithm • Visualization of document streams • Experiments (speed tests) • Conclusions and further work DS 2010
Motivation • Goal: Visualization of Document Streams • Diagram labels: document stream, outdated documents
Corpus Visualization Pipeline • Paulovich et al. (2006) • Diagram labels: document corpus, corpus preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, layout
Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Preprocessing steps: tokenization, stop-word removal, lemmatization, n-grams • Result: sparse TF-IDF vectors in a high-dimensional space
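The preprocessing steps listed on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `preprocess`, the tiny `STOP_WORDS` set, and the regular-expression tokenizer are all assumptions, and lemmatization is left out because it needs a language-specific resource.

```python
import re

# Tiny illustrative stop-word list (hypothetical; a real list is much longer).
STOP_WORDS = {"the", "of", "a", "in", "and", "to"}

def preprocess(text, max_n=2):
    """Tokenize, drop stop words, and emit word n-grams up to length max_n."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    terms = []
    for n in range(1, max_n + 1):                         # n-grams (n = 1 .. max_n)
        terms += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms

terms = preprocess("The visualization of document streams")
```

The resulting terms are what a TF-IDF weighting step would then count.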
Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Iterative method
Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Iterative method • High-dimensional → 2D
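An iterative method that maps points from a high-dimensional space to 2D while preserving distances can be sketched with the standard stress majorization update (the Guttman transform with unit weights). The slides do not say which variant is used, so that choice is an assumption, as are the names `pairwise_dist`, `stress`, and `stress_majorization`, the toy distance matrix, and the use of NumPy.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(X, D):
    """Sum of squared differences between layout distances and target distances."""
    iu = np.triu_indices(len(X), k=1)
    return float(((pairwise_dist(X)[iu] - D[iu]) ** 2).sum())

def stress_majorization(D, X0, n_iter=200):
    """Repeat the Guttman transform, starting from the layout X0 (n x 2)."""
    n = len(D)
    X = np.array(X0, dtype=float)
    for _ in range(n_iter):
        d = pairwise_dist(X)
        # ratio[i, j] = D[i, j] / d[i, j] where d[i, j] > 0, else 0
        ratio = np.divide(D, d, out=np.zeros_like(d), where=d > 0)
        G = -ratio
        np.fill_diagonal(G, 0.0)
        np.fill_diagonal(G, -G.sum(axis=1))   # each row of G sums to zero
        X = G @ X / n
    return X

# Toy target distances (hypothetical): a 3-4-5 triangle.
D = np.array([[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]])
X0 = np.random.default_rng(0).normal(size=(3, 2))
X = stress_majorization(D, X0)
```

Each update cannot increase the stress, so the final layout is at least as good as the starting one. Passing a previous layout as `X0` is one way such a method could be started from an earlier result instead of from scratch.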
Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation
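Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Iterative method
• $p_i = \frac{1}{|N_p|} \sum_{r \in N_p} r$
• $p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) r = (0, 0)$, $k = |N_p|$
• $c_i = (x_i^*, y_i^*)$
• $AX = B$, solved as $\arg\min_X \|AX - B\|^2$
• $A$: one row per point, with $1$ on the diagonal and $-1/k$ in the columns of that point's neighbors, followed by one row per control point containing a single $1$
$$
X = \begin{bmatrix} (x_1, y_1) \\ \vdots \\ (x_n, y_n) \end{bmatrix},
\qquad
B = \begin{bmatrix} (0, 0) \\ \vdots \\ (0, 0) \\ (x_1^*, y_1^*) \\ \vdots \\ (x_r^*, y_r^*) \end{bmatrix}
$$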
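The least-squares system on this slide can be illustrated with a small dense example. This is a sketch under assumptions: the function name `interpolate_layout`, the neighbor-list input format, and the toy chain of five points are hypothetical, and a direct dense solve with NumPy stands in for the iterative solver the slide mentions.

```python
import numpy as np

def interpolate_layout(neighbors, control_idx, control_xy):
    """Solve argmin_X ||AX - B||^2 for the 2D positions of all points.

    neighbors[i]  : indices of the neighbors of point i (hypothetical format)
    control_idx   : indices of the control points
    control_xy    : fixed 2D positions of the control points
    """
    n, r = len(neighbors), len(control_idx)
    A = np.zeros((n + r, n))
    B = np.zeros((n + r, 2))
    for i, nbrs in enumerate(neighbors):
        A[i, i] = 1.0                       # p_i ...
        for j in nbrs:
            A[i, j] = -1.0 / len(nbrs)      # ... minus the mean of its neighbors
    for row, (i, xy) in enumerate(zip(control_idx, control_xy)):
        A[n + row, i] = 1.0                 # control point i is pulled to xy
        B[n + row] = xy
    X, *_ = np.linalg.lstsq(A, B, rcond=None)
    return X

# Toy example: a chain of five points with both ends used as control points.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
X = interpolate_layout(neighbors, control_idx=[0, 4],
                       control_xy=[(0.0, 0.0), (4.0, 0.0)])
```

Because the system is overdetermined, the control points are matched in the least-squares sense rather than exactly. For a real corpus `A` is large and sparse, which is where an iterative solver becomes the natural choice.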
Stream Visualization Pipeline • Diagram labels: document stream, buffer (FIFO), preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, outdated documents
Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • TF-IDF weights • TF: the number of times the term occurs in the document • DF: the number of documents in the corpus containing the term • IDF: log(|D| / DF), where |D| is the number of documents in the corpus • Not possible to compute IDF from (infinite) real-time streams
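For a fixed corpus, the definitions above translate directly into code. The sketch below assumes the usual weighting TF · IDF and uses a hypothetical function name `tf_idf` and toy documents; it also shows why a complete corpus is needed, since both |D| and DF are computed over all documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))        # DF: number of documents containing the term
    n_docs = len(docs)                # |D|
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)          # TF: occurrences of the term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Toy corpus (hypothetical).
docs = [["stock", "market", "stock"], ["market", "news"], ["weather", "news"]]
vectors = tf_idf(docs)
```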
Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Diagram labels: TF vectors, vocabulary, DF values, TF-IDF vector
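One way to read this diagram is that documents are stored as TF vectors while a vocabulary with DF values is maintained alongside them, so TF-IDF vectors can be derived when needed. The sketch below follows that reading and additionally assumes that DF is counted over the documents currently in the FIFO buffer; the class name `StreamTfIdf` and its methods are hypothetical.

```python
import math
from collections import Counter, deque

class StreamTfIdf:
    """Keeps TF vectors in a FIFO buffer and DF values for the buffered documents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()     # TF vectors, oldest first
        self.df = Counter()       # vocabulary with DF values

    def add(self, tokens):
        """Append a document; return the evicted TF vector, or None."""
        tf = Counter(tokens)
        self.buffer.append(tf)
        self.df.update(tf.keys())             # each distinct term counts once
        evicted = None
        if len(self.buffer) > self.capacity:
            evicted = self.buffer.popleft()   # outdated document leaves the buffer
            for term in evicted:
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]
        return evicted

    def tfidf(self, tf):
        """TF-IDF vector of a buffered document, using the current DF values."""
        n = len(self.buffer)
        return {t: c * math.log(n / self.df[t]) for t, c in tf.items() if self.df[t] > 0}

stream = StreamTfIdf(capacity=2)
stream.add(["stock", "market"])
stream.add(["market", "news"])
stream.add(["weather", "news"])               # evicts the first document
weights = stream.tfidf(stream.buffer[-1])
```

Storing TF vectors rather than TF-IDF vectors means the weights of buffered documents never go stale when DF values change.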
Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Warm start!
Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Warm start!
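Assuming "warm start" means that an iterative step begins from the previous result instead of a fresh initialization, the idea can be sketched with plain k-means (Lloyd's algorithm). The function name `kmeans`, the `init` parameter, and the toy 2D points are hypothetical, and Euclidean distance is used only for simplicity.

```python
import math
import random

def kmeans(points, k, init=None, n_iter=50, seed=0):
    """Lloyd's algorithm. `init` optionally supplies starting centroids (warm start)."""
    rng = random.Random(seed)
    centroids = [tuple(c) for c in init] if init is not None else rng.sample(points, k)
    assign, iterations = None, 0
    for _ in range(n_iter):
        new_assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:          # assignments stable: converged
            break
        assign = new_assign
        iterations += 1
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                   # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign, iterations

# Toy data (hypothetical): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cold_centroids, _, cold_iters = kmeans(points, k=2)

# The buffer changes slightly: one point leaves, one arrives.
updated = points[1:] + [(1.0, 1.0)]
warm_centroids, warm_assign, warm_iters = kmeans(updated, k=2, init=cold_centroids)
```

When only a small part of the buffer changes between updates, the previous centroids are already close to a solution, so few iterations are typically needed.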
Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Remove outdated instances • Add new instances …
Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation • Remove outdated instances • Add new instances … • Warm start! • [Figure: matrix form of the least-squares system, with unknowns (x1, y1), …, (xn, yn) and right-hand side (0, 0), …, (0, 0), (x1*, y1*), …, (xr*, yr*), shown as instances are removed and added]
Speed Tests • First 30,000 news articles from Reuters Corpus Vol. 1 (“natural” rate: 1.4 articles / minute) • Experimental setting • Maximum rate? • 10 articles in a batch (u = 10) • Buffer capacity: nQ = 5,000 articles • 100 control points, 30 + 30 neighbors
Speed Tests • Processing delay: ~9 sec + 4 sec to form a batch • Exit delay: ~4 sec • Exit frequency: ~1/4 batches per sec (2.5 docs / sec) • Diagram labels: document stream, buffer (FIFO), preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, outdated documents
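The figures on this slide are consistent with each other, which a few lines of arithmetic make explicit. The variable names are hypothetical, and reading the total delay as the processing delay plus the batch-forming time is an interpretation of the slide, not something it states.

```python
batch_size = 10               # u: documents per batch
exit_interval_s = 4.0         # one batch leaves the pipeline roughly every 4 s
processing_delay_s = 9.0      # time a batch spends in the pipeline
natural_rate_per_min = 1.4    # "natural" arrival rate of the Reuters stream

throughput = batch_size / exit_interval_s            # documents per second
batch_fill_s = batch_size / throughput               # time to collect one batch at that rate
total_delay_s = processing_delay_s + batch_fill_s    # assumed end-to-end delay
speedup = throughput * 60 / natural_rate_per_min     # maximum rate vs. natural rate

print(throughput, batch_fill_s, total_delay_s, round(speedup))
```

A processing delay longer than the exit interval is what one would expect from pipelining: several batches are in different stages at the same time.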
Conclusions and Further Work • Conclusions • Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes) • Tricks: warm start, pipelining, parallelization • Further work • Performance at different nQ and u? • Optimize k-means (done!) and k-NN (easy) • Find use cases, perform user studies • Decision making in financial domain (FIRST) • Press clipping (media monitoring)