

Classification Through Conversations With Data: a next-generation search engine using machine learning, semantics and GPGPU. Pavlos Protopapas (Harvard CfA and SEAS), Gabriel Wachman, Matthias Lee, Patrick Ohiomoba, Roni Khardon.



Presentation Transcript


  1. Classification Through Conversations With Data: a next-generation search engine using machine learning, semantics and GPGPU. Pavlos Protopapas (Harvard CfA and SEAS), Gabriel Wachman, Matthias Lee, Patrick Ohiomoba, Roni Khardon

  2. Time Series Center: a short overview of what it is and what we do. Classification: machine learning using kernels and SVMs; using time series and metadata in a search engine; using knowledge from published papers.

  3. Time Series Center. Idea: create the largest collection of time series in the world and do interesting science. Discoveries. Recipe: 5 tons of data, a dozen people with science questions, 2-3 people with skills, 2 tons of hardware. Focus: astronomy; light curves = time series. We have other data too: labor data, real estate data, heart-monitor data, etc.

  4. Ubiquity of Time Series: cyclist's heart rate, TCP packets, Cadence design, stock prices, fish catch of the North-east Pacific.

  5. Supporting Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures. VLDB 2006: 882-893. Eamonn Keogh, Li Wei, Xiaopeng Xi, Michail Vlachos, Sang-Hee Lee, Pavlos Protopapas.

  6. Data: astronomical data. MACHO (microlensing survey): 66 million objects, 1000 flux observations per object in 2 bands (wavelengths). SuperMACHO (another microlensing survey): close to a million objects, 100 flux observations per object. TAOS (outer solar system survey): 100,000 objects, 100K flux observations per object, 4 telescopes. ESSENCE (supernova survey): thousands of objects, hundreds of observations. Minor Planet Center light curves: a few hundred objects, a few hundred observations. Pan-STARRS (general-purpose panoramic survey): billions of objects, hundreds of observations per object. OGLE (microlensing and extra-solar-planet surveys) and a few others: MMT variability studies, some HAT-Net and SDSS 82, (EROS). DASCH: Digital Access to a Sky Century @ Harvard.

  7. Demo of the data accessibility

  8. Astronomy. Extra-solar planets: either discovery of extra-solar planets or statistical estimates of the abundance of planetary systems. Cosmology: supernovae from Pan-STARRS will help determine cosmological constants. Asteroids, trans-Neptunian objects, etc.: using occultation signals for the detection of the smaller objects at the edge of the Solar System; understanding of the solar system. AGN (Active Galactic Nuclei): automatic classification of AGN from the time series. Eclipsing binaries: determining masses of and distances to objects. Variable stars: study of variable stars; automatic classification. Microlensing: address the dark matter question. And many more.

  9. Computer Science and Statistics. Outlier/anomaly detection: find the anomalous cases, for error control or novel phenomena. Clustering: unsupervised clustering could help identify new subclasses. Classification: automatic classification, either supervised or semi-supervised. Motif detection: finding patterns, especially at low signal-to-noise ratio. Scalability: analyzing a large data set requires efficient algorithms that scale linearly in the number of time series. The feature space: representation of the time series (Fourier transform, wavelets, piecewise linear, and symbolic methods). A distance metric: to determine similarities between time series.

  10. Computational. The sizes of data sets in astronomy, medicine and other fields are presently exploding. The Time Series Center needs to be prepared for data rates starting in the tens of gigabytes per night, scaling up to terabytes per night by the end of the decade. Parallel file systems [GPFS, Lustre, NFS do not perform well]. The interplay between the algorithms used to study the time series and the appropriate database indexing of the time series itself must be optimized. Real-time: real-time access and real-time processing. Distributed computing: standards (VO), subscription queries, etc.

  11. Hardware. Disk: ~100 TB of disk (GPFS, Lustre, NFS). Computing nodes: part of the Odyssey cluster (~5362 cores). DB server: dual-core with 16 GB of memory and 2 TB of disk; a few servers for development. GPGPU: a dedicated machine with xxx and a cluster of GPGPU servers; Nvidia GTX285 (1 GB, 240 cores, core speed 700 MHz), Intel i7 920 quad-core (4 physical, 8 logical) at 2.67 GHz, 1 TB HDD. GPU cluster (Nikola): 16 machines, each with an Nvidia Tesla T10 GPU attached. Web server: 2 dual machines [who cares].

  12. Rest of the talk: automatic classification of variable stars; scan statistics for event detection; efficient searches in large data sets.

  13. Description: astronomy surveys contain millions of stars, with no end in sight. Astronomers spend years doing manual classification, and many stars remain unclassified. Stars are represented as a time series, plus overall brightness, color, and period if they are periodic. A great opportunity to bring machine learning to another domain. Work with Gabriel Wachman and Roni Khardon (Tufts University).

  14. Variable Stars. Most stars are of almost constant luminosity (the Sun has about 0.1% variability over an 11-year solar cycle in the optical). Stars that undergo significant variations in luminosity are called variable stars. Intrinsic: pulsating (Cepheid, RR Lyrae, Mira, delta Scuti), eruptive (luminous blue variables), explosive (supernovae). Extrinsic: rotating, eclipsing binaries, planetary transits. Periodic and non-periodic.

  15. Goals Summary. Right now: start with the MACHO, EROS, and MMT variable surveys; periodic stars only (Cepheids, RRLs, EBs). Input: data (time series and other features) from a survey. Output: a list of periodic variable stars (Cepheids, RRLs, eclipsing binaries). Let's do something real. Working on: other events (e.g. other variables); early prediction.

  16. Typical light curves

  17. Other features: brightness, color, period

  18. What do time series actually look like?

  19. Folding

  20. Variables -> Periodic. Challenges: sensitivity of the period; range of potential periods (0.25 d - 50 d); overfitting; computational cost.

  21. Kernel for Time Series. We want to capture what it means for two time series to have the same "shape". S1: Pros: does exactly what we want; can be computed using the FFT in O(n log n). Cons: is not positive semidefinite.
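The slide's definition of S1 is not reproduced in the transcript; a common construction matching its description (an O(n log n), FFT-based "same shape up to a shift" score) is the maximum of the circular cross-correlation via the convolution theorem. A minimal sketch under that assumption:

```python
import numpy as np

def s1_similarity(x, y):
    """Maximum circular cross-correlation of two equal-length series,
    computed with the convolution theorem in O(n log n)."""
    x = (x - x.mean()) / (x.std() * len(x))  # normalize so a perfect match scores 1
    y = (y - y.mean()) / y.std()
    # cross-correlation at every circular shift, via FFT
    cc = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y))).real
    return cc.max()  # best alignment over all shifts
```

Under this construction a series compared with any circular shift of itself scores 1, which is why the score is shift-invariant but, as the slide notes, not positive semidefinite as a kernel.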

  22. Convolution

  23. Kernel for Time Series. Theorem 1: S1 satisfies the Cauchy-Schwarz inequality. Theorem 2: a distance measure can be constructed from S1 that satisfies the triangle inequality. Theorem 3: any 3x3 Gram matrix of S1 is positive semidefinite. Theorem 4: S1 is NOT positive semidefinite.

  24. Kernel for Time Series. K1: Pros: positive semidefinite; intuitively approximates the maximum alignment; works just as well; O(n log n).

  25. Classification Stage Overview. SVM. Kernel K1: similarity measure of "shape". Kernel K2: magnitude (brightness), color, period. Final kernel: K1 + K2.
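A sum of positive-semidefinite kernels is itself positive semidefinite, so K1 + K2 can be handed directly to a kernel SVM. A toy numpy sketch of the composition, using stand-in RBF Gram matrices rather than the paper's actual K1 and K2:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2); always PSD."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
shape_feats = rng.normal(size=(40, 16))  # stand-in for a "shape" representation
meta_feats = rng.normal(size=(40, 3))    # magnitude, color, period

K1 = rbf_gram(shape_feats)
K2 = rbf_gram(meta_feats)
K = K1 + K2  # sum of PSD kernels is PSD -> a valid combined SVM kernel
```

A Gram matrix like `K` can then be passed to a kernel SVM, e.g. scikit-learn's `SVC(kernel='precomputed').fit(K, labels)`.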

  26. Current Results. Cross-validation confusion matrix over OGLE II (rows and columns in the order CEPH, EB, RRL): CEPH: 3413, 0, 13; EB: 0, 3388, 0; RRL: 12, 2, 7259.

  27. Training Set: OGLE. 14087 periodic variables of type Cepheid, EB, RRL; periods given. 99.8% accuracy on cross-validation ... so we're done, right? We know we can classify well given a group of Cepheids, EBs, and RRLs and their periods.
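Reading the slide-26 numbers as a confusion matrix with rows and columns in the order Cepheid, EB, RRL (an assumed orientation), the quoted figures are self-consistent: the counts sum to the 14087 training objects and the diagonal yields the 99.8% accuracy.

```python
import numpy as np

# Confusion matrix from the slide, rows/columns in the order CEPH, EB, RRL
cm = np.array([[3413,    0,   13],
               [   0, 3388,    0],
               [  12,    2, 7259]])

total = cm.sum()         # 14087 -- matches the stated OGLE training-set size
correct = np.trace(cm)   # 3413 + 3388 + 7259 = 14060 correctly classified
accuracy = correct / total
print(f"{accuracy:.1%}")  # -> 99.8%
```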

  28. Test Set: MACHO. ~25 million stars (LMC and SMC); ~50,000 are periodic variables. Two primary issues: finding those ~50,000 periodic variables (or, equivalently, eliminating just the other 24,950,000 stars), and finding the periods.

  29. Approach Overview. Train on OGLE. Multi-stage processing of MACHO: eliminate non-variables; eliminate non-periodic variables; eliminate non-Cepheid/RRL/EB periodic variables. Test on MACHO: rank classifications by confidence; set aside low-confidence predictions.

  30. Eliminate Non-Periodic Variables. For each time series (9 million of them): find the period; fold the time series to the period. Finding the period is hard.
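Folding maps each observation time to its phase within a trial period; at the true period the scattered samples line up into one clean cycle. A minimal sketch on synthetic data (the variable names are illustrative):

```python
import numpy as np

def fold(times, period):
    """Fold observation times to a trial period: phase in [0, 1)."""
    return np.mod(times, period) / period

# toy irregularly sampled sinusoid with true period 0.75 d
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 100, 500))
flux = np.sin(2 * np.pi * t / 0.75)

phase = fold(t, 0.75)
order = np.argsort(phase)
folded_flux = flux[order]  # at the true period this traces a single cycle
```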

  31. Period Estimation Challenges. Sensitivity of the period (e.g. 0.5608998 d vs 0.5609995 d). Range of potential periods: 0.25 d - 50 d. Computational cost.

  32. Period Estimation. Data are non-uniformly sampled. Aliasing to the sampling frequency; aliasing to integer multiples of the true frequency. Unknown "true" period. Period-finding algorithms assume the data are periodic.

  33. Period Estimation. Use the Lomb-Scargle periodogram to generate period candidates. If a period candidate is ~1 d, check for aliasing. Check for asymmetry. Imperfect, but automatic and consistent.
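A sketch of the candidate-generation step using `scipy.signal.lombscargle` on synthetic, irregularly sampled data (not the authors' pipeline; note that scipy expects angular frequencies):

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 50, 400))  # non-uniform sampling, as in real surveys
true_period = 1.37
y = np.sin(2 * np.pi * t / true_period) + 0.1 * rng.normal(size=t.size)

# scan trial periods over the slide's 0.25 d - 50 d range
periods = np.linspace(0.25, 50.0, 20000)
ang_freqs = 2 * np.pi / periods  # lombscargle takes angular frequencies
power = lombscargle(t, y - y.mean(), ang_freqs)

best = periods[np.argmax(power)]
# 'best' should land near 1.37 d; ~1 d candidates would then be
# checked for sampling aliases, as the slide describes
```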

  34. But Is It Periodic? A period finder returns a period for any data, periodic or not. Check the shape; we need to formalize the notion of "shape". Variance ratio: move a sliding window along the folded time series and compare local variances to the global variance. For this application, this is a reliable estimate of "having shape".
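One way to realize the variance-ratio idea, simplified to non-overlapping phase windows instead of a sliding window: a folded periodic curve has small local variance relative to its global variance, while folded noise does not.

```python
import numpy as np

def variance_ratio(phase, flux, n_bins=20):
    """Mean local variance over phase windows, divided by global variance.
    A small ratio suggests the folded curve has a coherent 'shape'."""
    order = np.argsort(phase)
    f = flux[order]
    windows = np.array_split(f, n_bins)  # non-overlapping windows for brevity
    local = np.mean([w.var() for w in windows])
    return local / f.var()

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 100, 2000))
periodic = np.sin(2 * np.pi * t / 0.8) + 0.05 * rng.normal(size=t.size)
noise = rng.normal(size=t.size)

phase = np.mod(t, 0.8) / 0.8
r_periodic = variance_ratio(phase, periodic)  # small: has "shape"
r_noise = variance_ratio(phase, noise)        # near 1: no "shape"
```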

  35. Eliminate Non-Cepheids, RRLs, EBs. Now we believe we are left with only periodic variables. A model learned from OGLE cannot make useful predictions on other kinds of data, so it is critical to remove as many data points as possible that are not in one of the three classes.

  36. SVM and Linear Classifiers

  37. Eliminate Non-Cepheids, RRLs,EBs

  38. Future Work. Period finding [thesis]. Estimation of non-Cepheid/RRL/EB classes. Ranking of predictions according to confidence [come up with a correct probabilistic model]. Improved elimination round [use an SVM for the elimination process]. Non-periodic: use motifs for feature extraction [thesis]. Early warnings, sequential statistics.

  39. An Application for Similarity Searches: a search engine. Work with Matthias Lee (WIT), Patrick Ohiomoba (Harvard).

  40. Search engines • Search Technology is something that we're very familiar with • Everything's searchable • Google • Yahoo • Spotlight • Amazon • iTunes

  41. Traditional search/indexing technology is highly dependent on searching one-dimensional structures like text. Search: "2008 Presidential Election". Metadata/tags:

  42. This doesn't work so well on non-text objects that are harder to tag ahead of time: facial recognition, GIS/rich media, astronomical time series. Shazam: a cool iPhone application for searching music.

  43. Similarity. We have an appropriate similarity measure for searching our light curves. What kinds of similarity queries can we make? K-nearest neighbors; range search.

  44. An Application for Similarity Searches in Massive Time Series Databases • K-Nearest Neighbors • e.g. Find the three nearest zipcars to the IIC • Range Search • e.g. Find all the zipcars within 700 yards of the IIC

  45. An Application for Similarity Searches in Massive Time Series Databases. We have a distance metric/similarity measure on our light-curve set; aren't we done? We can now presumably do all of the similarity searches we've talked about. Brute-force approach. K-nearest neighbors: find the distance from the target to every other time series, holding on to the k best matches. Range search: compare the target to every other time series; if the distance is less than the range, hold on to it.
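The brute-force queries described above can be sketched directly; the data and distance here are illustrative stand-ins for the light-curve set and its similarity measure:

```python
import heapq
import numpy as np

def knn(query, data, dist, k=3):
    """Brute-force k-nearest-neighbours: O(N) distance computations,
    keeping the k best matches in a heap."""
    heap = []  # max-heap via negated distances
    for i, x in enumerate(data):
        d = dist(query, x)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -d > heap[0][0]:  # closer than the current worst of the k
            heapq.heapreplace(heap, (-d, i))
    return sorted((-nd, i) for nd, i in heap)  # (distance, index), ascending

def range_search(query, data, dist, radius):
    """Brute-force range search: everything within 'radius' of the query."""
    return [(dist(query, x), i) for i, x in enumerate(data)
            if dist(query, x) <= radius]

euclid = lambda a, b: float(np.linalg.norm(a - b))
rng = np.random.default_rng(4)
data = rng.normal(size=(1000, 8))
q = data[0] + 0.01  # a near-duplicate query
```

Both queries touch every one of the N objects, which is exactly the O(N) cost the next slide complains about.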

  46. An Application for Similarity Searches in Massive Time Series Databases. The problem is that our similarity queries are O(N) in complexity, where N is the number of light curves. If N is large (say 10 million), searches will take a very long time.

  47. An Application for Similarity Searches in Massive Time Series Databases. Hmm... each query seems to do a lot of repeated work (distance computations). Can we precompute some of that work and save time? Yes we can! Metric-space indexes.
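The slides don't say which metric-space index is used; one simple precomputation scheme (pivot tables, in the spirit of LAESA) stores each object's distance to a few pivots offline, then uses the triangle inequality at query time to skip most distance computations without ever missing a true match:

```python
import numpy as np

euclid = lambda a, b: float(np.linalg.norm(a - b))

rng = np.random.default_rng(5)
data = rng.normal(size=(5000, 8))
pivots = data[rng.choice(len(data), 5, replace=False)]

# offline precomputation: N x P table of object-to-pivot distances
pivot_d = np.linalg.norm(data[:, None, :] - pivots[None, :, :], axis=2)

def range_search_pruned(q, radius):
    q_to_pivots = np.linalg.norm(pivots - q, axis=1)
    # triangle inequality: |d(q, p) - d(x, p)| <= d(q, x), so this is a
    # lower bound on d(q, x); anything with lower bound > radius is out
    lower = np.max(np.abs(pivot_d - q_to_pivots[None, :]), axis=1)
    candidates = np.nonzero(lower <= radius)[0]  # survivors only
    hits = [(euclid(q, data[i]), i) for i in candidates
            if euclid(q, data[i]) <= radius]
    return hits, len(candidates)

q = data[42] + 0.01
hits, n_checked = range_search_pruned(q, 0.05)
# exact answer, with far fewer than N full distance computations
```

Because the pivot bound never overestimates the true distance, pruning is safe: the result is identical to brute force, only cheaper.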
