Data quality and station selection

Data quality and station selection Rafael J. Fernandez-Moctezuma Notes for 5/15/07

Data quality • Kristin suggested we should take a closer look at data quality issues last week, in order to make a more educated choice of stations to model. • Using data currently in Latte, initial focus on US26 E.

Recall 69.29 69.3 69.31 Milepost Controller Station Station Lane ILD Ramp Lane ILD Ramp Lane ILD Lane ILD Lane ILD

Types of readings • Ignoring SWARM flags, we have used three main indicators: • How many speed readings are just NULL? • How many speed readings are just 0? • How many are truly “no traffic”? • Speed is NULL, Volume and Occupancy are 0. • We looked at detectors instead of stations, hoping to keep as much usable data as possible. Kristin’s initial notes are here: http://wiki.cecs.pdx.edu/view/Main/DataQuality

Methodology • Pick the second Wednesday of every month. • Look at the NULL, ZERO, and NO TRAFFIC numbers during a rush hour window (7-9) • No traffic of 100% is nonsense, around ~15% is suspicious (may be far out, such as Helvetia) • Tolerance is 10%. More than 10% all zero readings is suspicious • Look at the plots. If you see a flat line it may not be a good reading. You should see nice ups and downs. • (Optional): We talked about expanding the report to look at SWARM flags, Kristin feels this method is good enough.

Murray, January 10, 2007 – 7 to 9

Selected stations on US 26 E • 58.8, 61.25, 62.47, and 62.8 too far away to be interesting in rush hour. • 64.5 – 185th NB to EB • 64.6 – 185th SB to EB • 65.9 – Cornell only one lane • 67.4 – Murray • 68.55 – CH is not usable (gone in January, fishy after that) • 69.31 – Parkway • 70.9 – Canyon • 71.37 – Skyline • 73.62 – somewhere before count station is looney.

In progress • Picking stations from other segments using this methodology

Dave said we needed a name for the project. I jokingly said “Spackle”. Now Kristin calls it spackle. Some suggestions: Notion of filling in gaps Filler Stuffer Caulk Notion of managing something and “bringing it back” Herder Gaucho “Spackle”?

Nonlinear Travel Time Estimation Rafael J. Fernandez-Moctezuma Notes for 1/23/07

Purpose of today’s meeting • Show you some data you requested • Food for thought, present you a paper we should read • Things to do for clustering • Shift gears toward ‘gap’

Written in stone • The ‘Gap’ Project  OTREC round one, fill-in-the-blanks, estimation, speedmap • The ‘Clustering’ Project  Answer “What is similar?”, indexing, structure in data, etc.

How different are the ODOT and a posteriori estimates? (MSE) There were 9 novelties that could not be predicted under the 8 cluster – retrieve 3 model. Here are MSEs for the predictable values: MSE (ODOT, a posteriori) = 1.757955824 (900) MSE (ODOT, a posteriori) = 1.773927643 (891) MSE(cluster, a posteriori) = 1.277429479 (891) So when we can predict under this model, we do better (overall). Do we really? I just showed you plots that are not “too hot” (i.e., the clustering approach, visually, seems to be under predicting) and I highlighted points that makes us question this conclusion. Compare boith approaches: MSE(cluster,ODOT) = 0.982103535 (891)

“Clustering” error is slightly noisier, with less “magnitude”. STDEV( a posteriori – odot ) = 1.331133702 STDEV( a posteriori – cluster) = 1.128615412

In-progress / To do • Without clustering, is taking the average of travel times of similar data points a good approach? • I think this is key before I answer “How many relevant neighbors are you losing?”)

Tests to be performed / refined • Does this make sense to you? • For I = 1 to 10 • For j = 1 to N • Find the I nearest neighbours of j and average their TT • A plot of I vs. MSE should give a good parameter for “i-nn”. This should be done on “train” data. • Then cluster and measure how many of those i-nn we are loosing. This should be the average of individual loses (I think).

Reading • I think we should read the TODS paper (Bozkaya and Oszoyoglu, 1999) and further refine where the clustering thing is going. A referenced work in it (Burkhard and Keller, 1973) seems to be doing essentially what I came up with (darn!) • This does not imply I should deviate my attention from what you need in Latte) which I think may benefit from these approaches.

Shift gears to GAP • I fished out the data, it will be on barista tonight (don’t worry about the load) – We have January, February, March 2007 for all detectors. We can pick as many segments as we want. • Dave’s words were “Let’s get the data first and we’ll then figure out the next step.” I’ve done bookkeeping tasks on code I know we will use  • Let’s spend a few minutes “formalizing” next steps.

Rethinking classes • Settled on the following: 00:00 – 06:00 06:00 – 10:00 10:00 – 12:00 12:00 – 15:30 15:30 – 19:00 19:00 – Matches several plots (eyeballed), As well as intuitive regimes (free flow, morning peak, morning activity, lunchtime, afternoon peak)

Additional dataset • I-5 N • Milepost 299.7 (Macadam Ave) • September 2006, weekdays only. • Considering entire daily regime. • Sanity check. Downstream from 295.18. • Note: fewer datapoints (sensors were down)

Currently • Still working on aggregates (my aggregates are consistently lower than portal, since I do not throw away data.) I think we should dig deeper into the aggregate “filter” and understand what their motivation was. • Writing down the strategy we discussed last Friday (characterizing incorporating weather information, incidents, etc.)

Contents • Suggested material • Description of selected dataset • First exploration • 3D projections • 2D projections • Hypothesis for classification • 3D projections • 2D projections • Comments

Suggested material related to the Decision Tree project • J.R. Quinlan. Induction of Regression Trees. http://www.cs.toronto.edu/~roweis/csc2515-2003/readings/quinlan.pdf • David J.C. MacKay’s K-Mean (and soft K-mean) clustering demo (Octave required, could port to MATLAB) http://www.inference.phy.cam.ac.uk/mackay/itprnn/code/kmeans/ • Tom M. Mitchell. Machine Learning. McGraw Hill, 1997. • Ian Nabney’s Netlab. http://www.ncrg.aston.ac.uk/netlab/

Initial dataset • I-5 N • Milepost 295.18 (Capital Hwy) • September 2006, weekdays only. • Considering entire daily regime. • Working with 5 minute average – not sure if it is reasonable to build 20 second estimators first. 5 minute agg. Allows us to “fast prototype” a strategy. • Looking at occupancy, speed, volume, and estimated travel time from PORTAL. • Why one milepost to begin with?

First hypothesis: Time as a discriminating value Somewhat arbitrary, they seem to match behavior on average. 00:00 – 05:00 05:00 – 10:00 10:00 – 16:00 16:00 – 23:00 23:00 – 24:00 Again, not written in stone – but will help us reason about the dynamics of the space. Image from PORTAL

Data quality and station selection