## Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects


**Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects**
Dennis Shasha, joint work with Yunyue Zhu, Xiaojian Zhao, Zhihua Wang, and Alberto Lerner
{shasha, yunyue, xiaojian, zhihua, lerner}@cs.nyu.edu
Courant Institute, New York University

**Goal of this work**
• Time series are important in many applications: biology, medicine, finance, music, physics, …
• A few fundamental operations occur all the time: burst detection, correlation, pattern matching.
• Doing them fast makes data exploration faster, real time, and more fun.

**Sample Needs**
• Pairs trading in finance: find two stocks that track one another closely. When they go out of correlation, buy one and sell the other.
• Match a person's humming against a database of songs to help him or her buy a song.
• Find bursts of activity even when you don't know the window size over which to measure.
• Query and manipulate ordered data.

**Why Speed Is Important**
• Person on the street: "As processors speed up, algorithmic efficiency no longer matters."
• True if problem sizes stay the same. They don't: as processors speed up, sensors improve, e.g. satellites spewing out a terabyte a day, magnetic resonance imagers producing higher-resolution images, etc.
• There is a desire for real-time response to queries.

**Surprise, surprise**
• More data, real-time response, and the increasing importance of correlation IMPLY that efficient algorithms and data management matter more than ever!

**Corollary**
• Important area, lots of new problems.
• Small advertisement: High Performance Discovery in Time Series (Springer 2004). At this conference.

**Outline**
• Correlation across thousands of time series
• Query by humming: correlation + shifting
• Burst detection: when you don't know the window size
• Aquery: a query language for time series

**Real-time Correlation Across Thousands (and scaling) of Time Series**

**Scalable Methods for Correlation**
• Compress streaming data into moving synopses.
• Update the synopses in constant time.
• Compare synopses in near-linear time with respect to the number of time series.
• Use transforms + simple data structures. (Avoid the curse of dimensionality.)

**GEMINI framework**
Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, Minneapolis, MN, May 25-27, pp. 419-429.

**StatStream (VLDB 2002): Example**
• Stock price streams at the New York Stock Exchange (NYSE): 50,000 securities (streams), 100,000 ticks (trade and quote).
• Pairs trading, a.k.a. correlation trading.
• Query: "Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?"
• Example scenario: XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now they become less correlated as XYZ goes up and ABC goes down. They should converge back later, so I sell XYZ and buy ABC.

**Online Detection of High Correlation**
• Given tens of thousands of high-speed time series data streams, detect high-value correlation, both synchronized and time-lagged, over sliding windows in real time.
• Real time means a high update frequency of the data streams and a fixed, online response time.
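As a baseline for the correlation query above, the naive per-pair computation can be sketched as a rolling Pearson correlation over a sliding window. The data, seed, and function name below are illustrative, not from the paper; this is the O(w)-per-update approach that the digests later avoid.

```python
import numpy as np

def sliding_correlation(x, y, w):
    """Pearson correlation of the last w points, recomputed at every step.

    Naive approach: O(w) work per pair per update, for every pair of streams."""
    return [np.corrcoef(x[t - w:t], y[t - w:t])[0, 1]
            for t in range(w, len(x) + 1)]

# Synthetic pair: y tracks x closely, as in pairs trading.
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))      # random-walk "price"
y = x + rng.normal(scale=0.1, size=500)  # a stock tracking x with small noise
corrs = sliding_correlation(x, y, w=100)
# For this pair the rolling correlation stays high, above a 0.9 alert threshold.
```

Doing this for all pairs of N streams costs O(N²w) time per update, which motivates the synopses below.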
**StatStream: Naïve Approach**
• Goal: find the most highly correlated stream pairs over sliding windows.
• N: number of streams; w: size of the sliding window.
• Space O(N) and time O(N²w).
• Suppose the streams are updated every second. With a Pentium 4 PC, the exact method can monitor only 700 streams, with each result produced at a separation of two minutes.
• Note the "punctuated result model": results are not continuous, but they are online.

**StatStream: Our Approach**
• Use the Discrete Fourier Transform to approximate correlation, as in the GEMINI approach.
• Every two minutes (the "basic window size"), update the DFT for each time series over the last hour (the "window size").
• Use a grid structure to filter out unlikely pairs.
• This approach can report highly correlated pairs among 10,000 streams for the last hour with a delay of 2 minutes: at 2:02, find the highly correlated pairs between 1:00 and 2:00; at 2:04, between 1:02 and 2:02; and so on.

**StatStream: Stream synoptic data structure**
[Diagram: time points aggregate into basic windows, which aggregate into the sliding window; both the basic windows and the sliding window carry digests, i.e. sums of DFT coefficients.]
• Three-level time interval hierarchy: time point, basic window, sliding window.
• The basic window is the key to the technique: the computation for basic window i must finish by the end of basic window i+1, so the basic window length is the system response time.
• Digests: sums of DFT coefficients, kept per basic window and per sliding window.
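The three-level structure can be sketched in code: buffer points until a basic window fills, then keep only the first few DFT coefficients of that window, with a bounded deque of digests covering the sliding window. The class and parameter names are made up for illustration; this is a sketch of the idea, not the paper's implementation.

```python
import numpy as np
from collections import deque

class StreamSynopsis:
    def __init__(self, basic_size, n_basic, n_coef):
        self.basic_size = basic_size          # points per basic window
        self.n_coef = n_coef                  # DFT coefficients kept per digest
        self.buffer = []                      # current, still-filling basic window
        self.digests = deque(maxlen=n_basic)  # digests spanning the sliding window

    def append(self, value):
        self.buffer.append(value)
        if len(self.buffer) == self.basic_size:
            # One FFT per basic window; the oldest digest falls off the deque,
            # so the sliding window advances without touching old raw points.
            self.digests.append(np.fft.fft(self.buffer)[: self.n_coef])
            self.buffer = []

# One hour at one tick per second, basic windows of 120 points, 4 coefficients each:
s = StreamSynopsis(basic_size=120, n_basic=30, n_coef=4)
for v in np.sin(np.arange(3600) / 60.0):
    s.append(v)
print(len(s.digests), len(s.digests[0]))  # 30 4
```

The hour of raw data (3600 points) is summarized by 30 digests of 4 complex coefficients each.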
**How the general technique is applied**
• Compress streaming data into moving synopses: the Discrete Fourier Transform.
• Update the synopses in time proportional to the number of coefficients: the basic window idea.
• Compare synopses in real time: compare DFTs.
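Comparing DFTs works because the DFT preserves inner products (Parseval's theorem), and for series whose energy concentrates in low frequencies, a few coefficients already carry most of the inner product. A small numpy check on synthetic random-walk data (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=128))   # random walk: energy sits at low frequencies
y = x + rng.normal(scale=0.5, size=128)

X, Y = np.fft.fft(x), np.fft.fft(y)
exact = np.dot(x, y)
# Parseval: the full-spectrum inner product matches the time-domain one exactly.
full = np.vdot(X, Y).real / len(x)
# Truncated digests: DC term plus coefficients 1..7 (conjugate pairs count twice).
trunc = ((X[0] * np.conj(Y[0])).real + 2 * np.vdot(X[1:8], Y[1:8]).real) / len(x)
print(np.isclose(exact, full))  # True
# trunc stays close to exact because the kept coefficients hold most of the energy.
```

For white-noise-like series this truncation fails, which is exactly the problem the sketches below address.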
• Use transforms + simple data structures (the grid structure).

**Synchronized Correlation Uses Basic Windows**
[Diagram: streams x and y over a sliding window divided into aligned basic windows.]
• Take the inner product of aligned basic windows.
• The inner product within a sliding window is the sum of the inner products over all the basic windows in the sliding window.

**Approximate Synchronized Correlation**
[Diagram: the points x1 … x8 of a basic window projected onto orthogonal basis functions f1, f2, f3; likewise for y1 … y8.]
• Approximate each basic window with an orthogonal function family (e.g. the DFT).
• The inner product of the time series is approximated by the inner product of the digests.
• The time and space complexity is reduced from O(b) to O(n), where b is the basic window size and n is the digest size (n << b); e.g. 120 time points reduce to 4 digests.

**Approximate Lagged Correlation**
• Inner product with unaligned windows.
• The time complexity is reduced from O(b) to O(n²), as opposed to O(n) for synchronized correlation. Reason: terms for different frequencies are non-zero in the lagged case.

**Grid Structure (to avoid checking all pairs)**
• The DFT coefficients yield a vector.
• High correlation => closeness in the vector space.
• We can use a grid structure and look only in the neighborhood; this returns a superset of the highly correlated pairs.

**Empirical Study: Speed**
[Chart: timing results omitted.] Our algorithm is parallelizable.

**Empirical Study: Accuracy**
• Approximation errors: a larger digest, a larger sliding window, and a smaller basic window give a better approximation.
• The approximation errors (mistakes in the correlation coefficient) are small.

**Sketches: Random Projection***
• Goal: correlation between time series of stock returns.
• Since most stock price time series are close to random walks, their return time series are close to white noise.
• DFT/DWT can't capture approximately white-noise series because the energy is distributed across many frequency components.
• Solution: sketches (a form of random landmark).
• Sketch pool: a list of random vectors drawn from a stable distribution.
• Sketch: the list of inner products from a data vector to the sketch pool.
• The Euclidean distance (correlation) between time series is approximated by the distance between their sketches, with a probabilistic guarantee.
• W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26:189-206, 1984.
• D. Achlioptas. "Database-friendly random projections". In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), ACM Press, 2001.

**Sketches: Intuition**
• You are walking in a sparse forest and you are lost.
• You have an old-time cell phone without GPS.
• You want to know whether you are close to your friend.
• You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak, etc.
• If your friend is at similar distances from several of these landmarks, you might be close to one another.
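In vector terms, the landmarks are random vectors and the "distances" are inner products with them. A minimal random-projection sketch (sizes, seed, and variable names are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 256, 64                                # window size, sketch size
pool = rng.normal(size=(k, d)) / np.sqrt(k)   # sketch pool: k random "landmarks"

x = rng.normal(size=d)                        # white-noise-like return series
y = rng.normal(size=d)
sx, sy = pool @ x, pool @ y                   # sketches: inner products with the pool

true_dist = np.linalg.norm(x - y)
sketch_dist = np.linalg.norm(sx - sy)
# sketch_dist concentrates around true_dist with high probability,
# even though DFT digests would fail on white-noise data like this.
```

Here 64 numbers stand in for a 256-dimensional window, and the distance survives the compression up to a small random distortion.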
• The sketch is just the set of distances.

**Sketches: Random Projection**
[Diagram: a raw time series is multiplied by random vectors; the resulting inner products form the sketch.]

**Sketches approximate distance well**
[Chart: ratio of real distance to sketch distance, with sliding window size 256 and sketch size 80.]

**Empirical Study: Sketch on Price and Return Data**
• DFT and DWT work well for prices (today's price is a good predictor of tomorrow's) but badly for returns, (todayprice − yesterdayprice)/todayprice.
• Data length = 256; the first 14 DFT coefficients are used in the distance computation; a db2 wavelet is used with coefficient size 16; the sketch size is 64.

**Sketch Guarantees**
• Note: sketches do not provide approximations of individual time series windows; they help make comparisons.
• Johnson-Lindenstrauss Lemma: for any 0 < ε < 1 and any integer n, let k be a positive integer such that k ≥ 4 (ε²/2 − ε³/3)⁻¹ ln n. Then for any set V of n points in Rᵈ, there is a map f: Rᵈ → Rᵏ such that for all u, v ∈ V, (1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².
• Further, this map can be found in randomized polynomial time.

**Overcoming the curse of dimensionality***
• May need many random projections.
• Can partition sketches into disjoint pairs or triplets and perform comparisons on those.
• Each such small group is placed into an index.
• The algorithm must adapt to give the best results.
*Idea from P. Indyk, N. Koudas, and S. Muthukrishnan. "Identifying representative trends in massive time series data sets using sketches". VLDB 2000.

[Diagram: streams X, Y, Z and their inner products with random vectors r1, …, r6.]

**Further Performance Improvements**
• Suppose we have R random projections of window size WS. It might seem that we must do R·WS work per time point per time series.
• In ongoing work with colleague Richard Cole, we show that this can be cut down by using convolution and the oxymoronic notion of "structured random vectors".*
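The Johnson-Lindenstrauss bound above also says how large a sketch must be: k grows with ln n and with 1/ε². A quick calculator for that bound (illustrative function name; the formula is the one stated in the lemma):

```python
import math

def jl_dimension(n_points, eps):
    """Smallest integer k satisfying k >= 4 * ln(n) / (eps^2/2 - eps^3/3)."""
    return math.ceil(4 * math.log(n_points) / (eps**2 / 2 - eps**3 / 3))

# Sketch sizes needed for 10,000 streams at two distortion levels:
print(jl_dimension(10_000, 0.5))  # 443
print(jl_dimension(10_000, 0.1))  # 7895
```

Note the trade-off: tolerating 50% distortion needs only a few hundred dimensions, while 10% distortion costs thousands, which is why loose sketches plus exact re-checking of candidates is attractive.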
*Idea from Dimitris Achlioptas, "Database-friendly random projections", Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2001.

**Empirical Study: Speed**
• Sketch/DFT + grid structure.
• Sliding window size = 3616, basic window size = 32, correlation > 0.9.

**Query By Humming**
• You have a song in your head and want to get it, but you don't know its title.
• If you're not too shy, you hum it to your friends or to a salesperson and you find it.
• They may grimace, but you get your CD.

**With a Little Help From My Warped Correlation**
• Karen's humming and its match [audio omitted]
• Dennis's humming and its match: "What would you do if I sang out of tune?" [audio omitted]
• Yunyue's humming and its match [audio omitted]

**Related Work in Query by Humming**
• Traditional method: string matching [Ghias et al. 95; McNab et al. 97; Uitdenbogerd and Zobel 99].
• Music is represented by a string of pitch directions: U, D, S (a degenerate interval).
• The hum query is segmented into discrete notes, then into a string of pitch directions.
• Edit distance is computed between the hum query and the music score.
• Problem: it is very hard to segment the hum query. Partial solution: users are asked to hum articulately.
• New method: matching directly from audio [Mazzoni and Dannenberg 00].
• We use both.
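The pitch-direction representation and edit-distance matching described above can be sketched as follows. This is a textbook Levenshtein distance, not the paper's tuned matcher, and the note sequences are made-up MIDI-style pitches:

```python
def pitch_directions(pitches):
    """Encode a note sequence as U/D/S pitch directions (S = same pitch)."""
    return "".join(
        "U" if b > a else "D" if b < a else "S"
        for a, b in zip(pitches, pitches[1:])
    )

def edit_distance(s, t):
    """Levenshtein distance using a single rolling row."""
    row = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        diag, row[0] = row[0], i
        for j, ct in enumerate(t, 1):
            diag, row[j] = row[j], min(row[j] + 1,         # delete
                                       row[j - 1] + 1,     # insert
                                       diag + (cs != ct))  # substitute
    return row[-1]

hum = pitch_directions([60, 62, 62, 60, 64])       # hummed query
song = pitch_directions([60, 62, 62, 60, 64, 67])  # candidate score
print(hum, song, edit_distance(hum, song))  # USDU USDUU 1
```

The segmentation difficulty noted above happens before this step: turning raw hummed audio into the discrete pitch list is the hard part, which is why matching directly from audio is also used.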