Analysis of Real-World Data

Analysis of Real-World Data: Static Stability Factor and the Risk of Rollover • April 11, 2001

References • Federal Register, June 1, 2000 • Description of the original linear regression analysis • Federal Register, January 12, 2001 • Description of the updated linear regression analysis • Comparison with logistic regression analysis

Need to Specify • Vehicles • Calendar years • States • Crash types • Variables • Statistical model

Criteria for Selecting Vehicles • Reliable estimate of the Static Stability Factor (SSF) • Model years 1988 and later • Sources include: • Vehicles tested by the agency • Passenger cars tested by General Motors

Vehicles Selected • 100 vehicle model groups, including: • 36 cars • 30 SUVs • 13 vans • 21 pickup trucks

Criteria for Selecting Calendar Years • Vehicle Identification Numbers (VINs) for that year had been decoded and included in the State Data System (SDS) • Wanted multiple years to maximize data available for analysis

Calendar Years Selected • 1994-1997 for the original linear regression analysis • 1994-1998 for the updated linear regression analysis and the logistic regression analysis

Criteria for Selecting States • Part of the SDS • Provided 1994-1998 calendar year data • Include VIN on the crash file • Identify rollover occurrence even if it is not the first harmful event in the crash

States Selected • Florida • Maryland • Missouri • North Carolina • Pennsylvania • Utah

Other SDS VIN States • VIN available for fatalities only: Kansas • VIN added in 1998: Georgia • Incomplete rollover information: New Mexico and Ohio

Criteria for Selecting Crashes • Single-vehicle crashes of study vehicles • Excluded crashes with other participants: pedestrian, pedalcyclist, animal, or train • Excluded certain unusual situations: no driver, parked vehicle, pulling a trailer, or emergency use (ambulance, fire, police, or military)

Crashes Selected • 241,036 single-vehicle crashes, including • 48,996 rollovers • This is 0.20 rollovers per single-vehicle crash, consistent with the national estimate from the General Estimates System for these calendar years and vehicle groups

Criteria for Selecting Variables • Variables describing purpose of study • Rollover (yes or no) • SSF (study values range from 1.00 to 1.53) • Confounding factors • Environmental and driver factors that describe how the vehicle was used • Want variables correlated with rollover risk, including travel speed

Variables Selected • Rollover • SSF • Dichotomous variables based on: • Environmental factors (light condition, weather, urbanization, speed limit, road grade, road curve, road condition, surface condition) • Driver factors (sex, age, insurance coverage, alcohol/drug use) • Number of occupants in the vehicle

Summary of Available Data • Six states • Five calendar years (1994-1998) • 100 vehicle groups with a reliable estimate of SSF • 14 confounding variables, including 10 available in all six states • 241,036 single-vehicle crashes, including • 48,996 rollovers

Limitations • Pennsylvania dropped key road use variables (grade and curve) from its electronic file in 1998, so 1998 Pennsylvania data were not used here • Some variables were not available for all six states (urbanization, road condition, insurance coverage, and number of occupants in vehicle) • These could not be used in the analysis of combined data • They were used in the logistic analysis of individual states • Reporting practices vary by state

Statistical Models • Linear model of summarized data • Logistic models of individual crashes

Preparing Data for the Linear Model • Limited to state-vehicle groups with at least 25 observations • 518 state-vehicle groups used in analysis • Proportion of crashes involving each variable calculated for each state-vehicle group • Values ranged from 0 to 1 • For example: • Rollover risk described by rollovers per single-vehicle crash • Urbanization described by proportion of crashes on rural roads
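The summarization step described on this slide can be sketched roughly as follows. This is an illustration only, not the agency's code: the record keys ("state", "vehicle_group", "rollover", "rural") are hypothetical names, and only one explanatory variable is shown.

```python
from collections import defaultdict

MIN_OBSERVATIONS = 25  # threshold stated on the slide


def summarize_groups(crashes, min_obs=MIN_OBSERVATIONS):
    """Collapse crash records into one summary row per state-vehicle group.

    `crashes` is an iterable of dicts with hypothetical keys:
    "state", "vehicle_group", "rollover" (bool), "rural" (bool).
    """
    counts = defaultdict(lambda: {"n": 0, "rollover": 0, "rural": 0})
    for crash in crashes:
        cell = counts[(crash["state"], crash["vehicle_group"])]
        cell["n"] += 1
        cell["rollover"] += int(crash["rollover"])
        cell["rural"] += int(crash["rural"])

    summary = {}
    for key, cell in counts.items():
        if cell["n"] < min_obs:
            continue  # group has too few observations to be used
        summary[key] = {
            "n": cell["n"],
            # Each value is a proportion between 0 and 1.
            "rollover": cell["rollover"] / cell["n"],
            "rural": cell["rural"] / cell["n"],
        }
    return summary
```

Each surviving row then plays the role of one summary data point in the linear model.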

Specifying Linear Model Form • Dependent variable = LOG(rollover risk) • Rollover risk set at 0.0001 for state-vehicle groups with no rollovers so they can be included in model • Five dummy variables used to capture state-to-state differences in reporting practices • Missouri used as baseline case • Linear regression of the rollover variable as a function of the summarized explanatory variables and the state dummy variables
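A minimal sketch of how the dependent variable and the state dummy variables might be built. It assumes LOG means the natural logarithm, which the slide does not state, and the function names are invented for illustration.

```python
import math

RISK_FLOOR = 0.0001  # value the slide assigns to groups with no rollovers
STATES = ["Florida", "Maryland", "Missouri", "North Carolina", "Pennsylvania", "Utah"]
BASELINE = "Missouri"  # baseline case named on the slide


def log_rollover_risk(rollovers, crashes):
    """Return log(rollovers per crash), flooring zero risk so the log is defined."""
    risk = rollovers / crashes
    if risk == 0:
        risk = RISK_FLOOR
    return math.log(risk)  # natural log is an assumption


def state_dummies(state):
    """Return five 0/1 indicators, one per non-baseline state, in STATES order."""
    return [1 if state == s else 0 for s in STATES if s != BASELINE]
```

A Missouri group gets all zeros, so the five state coefficients are read as differences from Missouri.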

Fitting the Linear Model • Each summary data point was weighted by the sample size, capped at 250 as a trade-off between two considerations • Sample size affects reliability of estimates • Model should fit over entire range of SSF • Stepwise procedure used forward variable selection and a significance level of 0.15 for entry into and removal from the model
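The capped weighting can be illustrated with a small sketch. It is deliberately simplified: it fits a weighted line with a single predictor, whereas the analysis described here used several explanatory variables, state dummies, and stepwise selection. The function names are hypothetical.

```python
WEIGHT_CAP = 250  # cap stated on the slide


def capped_weight(n, cap=WEIGHT_CAP):
    """Weight a summary point by its sample size, but never by more than the cap."""
    return min(n, cap)


def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x (single predictor)."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope
```

With the cap, a group with 1,200 crashes counts the same as one with 250, so a few high-volume vehicles cannot dominate the fit across the SSF range.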

Results of the Linear Model • Model selected six confounding factors (DARK, FAST, CURVE, MALE, YOUNG, and DRINK) and all five state dummies • R² = 0.88 for the model of rollover risk as a function of state, road use variables, and SSF • SSF variable coefficient was: • Important in terms of the size of the estimated effect • Highly significant in the model (P<0.0001)

Predictions from the Linear Model • Model describes rollover risk as a function of the explanatory variables and can be used to: • Estimate rollover risk as a function of the SSF for any mix of road-use conditions • Adjust the observed rollover rate for each summary data point to account for differences in vehicle use • Next graph shows results for average conditions observed in the study data as a whole • Rollover risk is estimated as 0.20 in both the adjusted and the unadjusted data

Fit of Linear Model

Interpreting the Linear Model • Estimated rollover risk given a single-vehicle crash is halved when the SSF increases by 0.21 • For example, a vehicle with an SSF of 1.00 has twice the estimated rollover risk of a vehicle with an SSF of 1.21
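The halving statement can be turned into arithmetic. Because the model is linear in the log of rollover risk, a fixed SSF increase multiplies the risk by a fixed factor. The sketch below works backward from the rounded 0.21 figure. The coefficient it prints is implied by that rounding and a natural-log assumption, and is not a value reported in the presentation.

```python
import math

HALVING_STEP = 0.21  # SSF increase that halves estimated risk, per the slide


def risk_ratio(delta_ssf, halving_step=HALVING_STEP):
    """Ratio of estimated rollover risk after an SSF change of delta_ssf."""
    return 0.5 ** (delta_ssf / halving_step)


def implied_coefficient(halving_step=HALVING_STEP):
    """SSF coefficient on the natural-log scale implied by the halving step."""
    return math.log(0.5) / halving_step


if __name__ == "__main__":
    print(round(implied_coefficient(), 2))  # implied slope, about -3.3
    print(risk_ratio(0.21))                 # one halving step
    print(round(risk_ratio(0.53), 2))       # across the study SSF range, 1.00 to 1.53
```

Under these assumptions, the vehicle at the top of the study SSF range (1.53) would have roughly one sixth of the estimated rollover risk of the vehicle at the bottom (1.00), other factors being equal.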

Specifying Logistic Model Forms • Variables used • Individual explanatory variables or • Scenario risk variable • Approach used with states • Model each state, and average the results or • Model pooled data with dummy variables to capture state-to-state reporting differences

Concept of Scenario Risk • Data divided into cells defined by explanatory variables • For each cell, scenario risk is rollovers per single-vehicle crash • For each crash, scenario risk is adjusted to reflect rollovers per single-vehicle crash for all other crashes in the cell • Idea is to use scenario risk in the logistic model in place of all the explanatory variables
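One plausible reading of the adjustment above is a leave-one-out rate: each crash receives the rollover rate of the other crashes in its cell, so its own outcome does not leak into its predictor. The slide does not give the formula, so the sketch below is an assumption, and the input layout is hypothetical.

```python
from collections import defaultdict


def leave_one_out_scenario_risk(crashes):
    """Return one scenario risk per crash, computed from the other crashes in its cell.

    `crashes` is a list of (cell_key, rolled_over) tuples, where cell_key
    identifies the combination of explanatory variables (hypothetical layout).
    Crashes that are alone in their cell get None.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [crash count, rollover count]
    for cell, rolled in crashes:
        totals[cell][0] += 1
        totals[cell][1] += int(rolled)

    risks = []
    for cell, rolled in crashes:
        n, r = totals[cell]
        if n < 2:
            risks.append(None)  # no other crashes to estimate from
        else:
            risks.append((r - int(rolled)) / (n - 1))
    return risks
```

For a cell with five crashes and one rollover, the rollover crash gets a scenario risk of 0/4 and each of the other four gets 1/4.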

Fitting the Logistic Models • Models from individual states were based on the explanatory variables available in that state • Models from pooled data were limited to the explanatory variables available in all six states

Results of the Logistic Models • The models from the six individual states and the two models based on pooled data all fit the data well • These models were consistent in showing a large and significant effect for SSF

Predictions from the Logistic Models • Logistic models describe the change in the log(odds) of rollover as a function of the change in the SSF • Results can be used to predict the absolute rollover risk as a function of the SSF for a given set of conditions • Here, estimates of average SSF and odds of rollover are based on the data as a whole • The four summary models produce similar results
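The conversion from a change in log(odds) to an absolute risk can be sketched as below. The baseline SSF, baseline risk, and SSF coefficient passed to the function are placeholders chosen by the caller. None of them is taken from the fitted models, whose coefficients are not given in this presentation.

```python
import math


def probability_from_log_odds(log_odds):
    """Inverse logit: convert log(odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predicted_risk(ssf, baseline_ssf, baseline_risk, ssf_coefficient):
    """Predict rollover risk at `ssf` by shifting the baseline log(odds).

    baseline_risk is the rollover probability at baseline_ssf under the chosen
    conditions; ssf_coefficient is the change in log(odds) per unit of SSF.
    All three are caller-supplied assumptions.
    """
    baseline_log_odds = math.log(baseline_risk / (1.0 - baseline_risk))
    shifted = baseline_log_odds + ssf_coefficient * (ssf - baseline_ssf)
    return probability_from_log_odds(shifted)
```

For example, `predicted_risk(1.0, 1.2, 0.20, -3.0)` uses a purely illustrative coefficient of -3.0 and returns a risk above the 0.20 baseline, because the SSF is lower than the baseline value.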

Comparison of Linear and Logistic Models • Linear and logistic models both suggest SSF has a large effect on rollover risk • Next graph compares results of linear model with results of logistic model from pooled data with individual explanatory variables

Predictions from the Models

Conclusions • Advantages of linear model of summary data • All summary data can be shown • Simpler to explain • Advantages of logistic analysis • Includes full range of values and interactions because not restricted to averages for each vehicle group • Better for measuring effects of explanatory variables because most were significant in the models • In this analysis, logistic analysis appeared to confirm the general pattern of the linear results
