A Case Study of Bayesian Modeling on a Real World Problem

Presentation Transcript


  1. A Case Study of Bayesian Modeling on a Real World Problem RAM Energy Energester/Enziro Bob Mattheys, Malcolm Farrow, Giles Oatley, Garen Arevian, Souvik Banerjee

  2. ISS – Intelligent Systems Solutions • Group of researchers/academics • Working with CAS (Centre for Adaptive Systems) • Remit: • Provide Technology Transfer and Expertise to Industry • Assist NE SMEs and stimulate business growth • Obtain funding, e.g. SMART Awards, GONE, etc.

  3. ISS Projects • RAM Energy – Intelligent Data Analysis • Neptune Engineering – Intelligent Diagnostics • HASS – Back-office system/DBase • Hart Biological – Back-office system/DBase, process manufacturing • Etc.

  4. RAM Energy • Founded 2000 • Clients in Oil/Gas, Energy, Process, Manufacturing, Haulage Industry • Products Energester + Enziro • Ester-based synthetic lubricants and greases, enzymatic cleaning solutions, absorbents and blasting media • Better lubrication, heat dissipation and vibration reduction than oil or grease in isolation and conventional additives

  5. RAM Energy • Problem • Demonstrate effectiveness and cost efficiency • Data collected by RAM Energy • very large • major differences across the various sectors • Assist RAM Energy in structuring their data collection and storage in general • Heavy haulage industry

  6. RAM Energy • Trials • RAM Energy carried out selected trials with clients. These included: • Monitored consumption prior to Energester use • Monitored consumption post Energester use • Use of control vehicles (no Energester use) • Temperature data collected

  7. RAM Energy Haulage • Data collected via diesel receipts • Information consisted of • Card number (allocated to regn number) • Vehicle registration • Date • Fuel • Mileage

  8. RAM Energy • Analysis • Performed using Excel spreadsheets • Discrete mpg (mileage since last fill/diesel input) • Some cumulative mpg (using total mileage/total diesel input to date) • Attempt to normalise using mean temperature records • Some regression analysis
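The two mpg figures on this slide can be sketched as follows. This is a minimal illustration, not RAM Energy's spreadsheet logic: it assumes fuel is recorded in litres, that mpg means miles per imperial gallon, and that the tank is filled at each receipt. The function names and the `(odometer_miles, litres_added)` record layout are hypothetical.

```python
LITRES_PER_IMPERIAL_GALLON = 4.54609  # assumption: fuel recorded in litres


def discrete_mpg(fills):
    """mpg for each interval between successive fills.

    fills: list of (odometer_miles, litres_added) tuples in date order.
    The litres added at a fill are taken as the fuel used since the
    previous fill (assumes the tank is filled each time).
    """
    result = []
    for (prev_miles, _), (miles, litres) in zip(fills, fills[1:]):
        gallons = litres / LITRES_PER_IMPERIAL_GALLON
        result.append((miles - prev_miles) / gallons)
    return result


def cumulative_mpg(fills):
    """Total miles over total fuel added since the first recorded fill."""
    total_miles = fills[-1][0] - fills[0][0]
    total_litres = sum(litres for _, litres in fills[1:])
    return total_miles / (total_litres / LITRES_PER_IMPERIAL_GALLON)
```

The discrete figure is sensitive to any single bad odometer or fuel entry, whereas the cumulative figure smooths such errors but reacts slowly to a real change such as the introduction of an additive.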

  9. RAM Energy Results

  10. RAM Energy Problems • Missing data consisted of • Driver information (who?) • Loading information (full/empty) • Length of journey • Type of journey (long haul vs short haul) • Urban or motorway conditions • Etc.

  11. RAM Energy Conclusion • Results very poor and inconclusive

  12. Database • Excel sheets were converted to an Access database with deletion of unnecessary rows and columns. • The Access database was then imported into SQL Server for data query and subsequent analysis

  13. Data Cleansing • Brief outline of most obvious problems with the data • 1. Card Number • 2. Registration Number • 3. Date • 4. Fuel Added • 5. Mileage

  14. Card Number • There were duplicate Card Numbers for (presumably) the same Card, e.g. • 85944 and 0085944 • In a few cases, for a given Registration Number, there appear additional Card Numbers, e.g. for ‘N151EUB’ there are the Card Numbers: • 38195 0038195 56408
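A minimal sketch of how the leading-zero duplicates could be collapsed, assuming card numbers are stored as text and that leading zeros carry no meaning. The helper names are hypothetical; the original cleansing was done in Access and SQL Server, not Python.

```python
from collections import defaultdict


def normalise_card(card):
    """Strip whitespace and leading zeros so '0085944' matches '85944'."""
    stripped = str(card).strip().lstrip("0")
    return stripped or "0"  # keep a single zero for an all-zero entry


def cards_by_registration(rows):
    """Group normalised card numbers by registration number.

    rows: iterable of (registration, card_number) pairs.
    Registrations that still map to more than one card after
    normalisation need manual review.
    """
    grouped = defaultdict(set)
    for registration, card in rows:
        grouped[registration].add(normalise_card(card))
    return dict(grouped)
```

For the 'N151EUB' example above, 38195 and 0038195 collapse to one card, while 56408 remains as a genuinely different number to be investigated.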

  15. Registration Number • Registration numbers seemed to be always entered correctly • However, the field Reg Entered did not always tally with this • RAM recommendation to ignore

  16. Date • Dates were entered very consistently • preserved the ordering • distance between dates • the actual date • An important question was: CAN WE PRESUME THE DATE IS ALWAYS ENTERED CORRECTLY? • If this was so, then this provided us with a convenient check on the Mileage, as Date and Mileage should both increase together.

  17. Fuel • Outlier identification • Very small and very large values easily detected over large dataset • Take mean of the sample and flag as outliers data more than 3 or 4 SDs away from the mean • Very small values e.g. 0 or 1 assumed as bogus values • 9999, 999, etc. taken to be bogus values • Some small and large values mistyped, with either the decimal place occurring too soon (e.g. 38.6 instead of 386) or extra digits added (e.g. 3860 instead of 386)
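The outlier rule on this slide can be sketched as below. The set of sentinel values and the function name are assumptions chosen for illustration, and the threshold defaults to 3 SDs (the slide allows 3 or 4).

```python
from statistics import mean, stdev

# Assumption: placeholder values treated as bogus, following the slide's examples.
BOGUS_VALUES = {0, 1, 999, 9999}


def flag_fuel(values, n_sd=3.0):
    """Label each fuel value as 'bogus', 'outlier' or 'ok'.

    Bogus placeholders are excluded before the mean and SD are
    computed, so they do not distort the threshold.
    """
    plausible = [v for v in values if v not in BOGUS_VALUES]
    centre, spread = mean(plausible), stdev(plausible)
    flags = []
    for v in values:
        if v in BOGUS_VALUES:
            flags.append("bogus")
        elif spread > 0 and abs(v - centre) > n_sd * spread:
            flags.append("outlier")
        else:
            flags.append("ok")
    return flags
```

Note that a large outlier inflates both the mean and the SD, so in small samples it can mask itself or smaller errors such as 38.6 entered for 386; the rule works best over a large dataset, as the slide says.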

  18. Fuel • Difficult errors • e.g. 693392: could it be 69392? What if 693399? • Data must be flagged as erroneous

  19. Mileage • Some values were entered as {0,1,999,9999,2,3,5,10,111,1111,123,789, etc} • If we can presume that the Date is a sensible value, then in a dataset where there are only a few missing or obviously incorrect values for the Mileage, these values can be amended as follows

  20. Mileage We do not know if the day 13 entry is wrong, or day 14. So we can look ahead:

  21. Mileage Or

  22. Mileage Collapsed to:

  23. Mileage • Small and very large values could be ignored • Problem was determining whether any of the remaining data was valid – data validation • Evaluating the degree of correlation between the increasing Date, and the supposed increasing Mileage • Useful approaches for estimating rank-orderedness and correlation between lists • Spearman’s coefficient of rank correlation • Kendall’s Tau
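Both rank measures can be computed directly, as in the sketch below. It assumes dates have already been converted to numbers (for example day counts) and uses the simple tau-a form of Kendall's tau with no correction for ties. The function names are illustrative, not taken from the RAM Energy Data Validator.

```python
def average_ranks(xs):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)
```

A vehicle whose mileage rises with every date scores 1.0 on both measures; each out-of-order odometer entry pulls the scores below 1, which gives a per-vehicle indication of how trustworthy the mileage column is.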

  24. Data Cleansing

  25. RAM Energy Data Validator

  26. Bayesian - Approach • In the Bayesian approach to statistical inference, express uncertain beliefs about things in terms of probability • E.g. that there is a 50% chance that the average fuel consumption of a vehicle will be less than 30mpg • Can use probabilities in this way to describe uncertainty about things we do not know • E.g. amount of fuel in a vehicle’s tank at 10.00am yesterday

  27. Bayesian - Approach • Once we accept this view of probability, the principle for learning from data is simple • Before we see the data, we have a probability distribution based on our knowledge up to that point • prior distribution • When we see the data our probability distribution changes, in the light of new information in the data • posterior distribution.

  28. Bayesian - Approach • Calculation used to get from the prior distribution to the posterior distribution • Uses Bayes’ theorem • Hence Bayesian statistics • Very straightforward interpretation of the results when using this method • Posterior distribution tells us how likely it is that various things are true, after we have used the evidence in the data
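For reference, the prior-to-posterior calculation takes the following standard form. The symbols are notation introduced here, not taken from the slides: $\theta$ stands for the unknown quantities (for example, a vehicle's fuel consumption) and $y$ for the observed data.

```latex
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, d\theta'}
```

Here $p(\theta)$ is the prior distribution, $p(y \mid \theta)$ is the likelihood of the data, and $p(\theta \mid y)$ is the posterior distribution; the denominator only rescales the result so that it integrates to one.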

  29. Bayesian - Approach • Different observers can have different prior beliefs and this means that their posterior distributions will also be different • make prior distribution represent very little information • in practice prior tends to have little effect on posterior • One advantage of this approach is that it is straightforward to calculate what we expect various things to be after seeing the data • For example, can calculate a posterior probability distribution for the cost savings of applying the fuel additive to a whole vehicle fleet

  30. Bayesian - Model • The basic model used is a regression, with fuel used as the dependent variable and distance travelled as one of the explanatory variables • Each observation corresponds to the time between two successive additions of fuel to the fuel tank • Expect zero fuel to be used if zero distance were travelled, but the amount of fuel used is not necessarily proportional to the distance travelled • For example, fuel efficiency may be greater on longer journeys

  31. Bayesian - Model • Simplest form of the model, assume that fuel used is proportional to distance travelled • Constant of proportionality which is the slope of the line on a graph • Various other forms of relationship were also investigated. • While distance travelled is most obvious explanatory variable, there are several other variables and factors which must be taken into account
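A minimal sketch of the simplest (proportional) form, under assumptions the slides do not state: a normal prior on the slope, independent normal errors, and a known error standard deviation. With those assumptions the posterior for the slope is available in closed form. The function and parameter names are hypothetical, and the authors' actual model and fitting method are not given in the transcript.

```python
import math


def slope_posterior(distances, fuel, prior_mean, prior_sd, noise_sd):
    """Posterior mean and SD of the slope in fuel = slope * distance + error.

    Assumes a Normal(prior_mean, prior_sd^2) prior on the slope and
    independent Normal(0, noise_sd^2) errors with noise_sd known.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = sum(d * d for d in distances) / noise_sd ** 2
    post_precision = prior_precision + data_precision
    weighted_sum = sum(d * f for d, f in zip(distances, fuel)) / noise_sd ** 2
    post_mean = (prior_precision * prior_mean + weighted_sum) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)
```

With a very wide prior (large `prior_sd`) the posterior mean is close to the least-squares slope through the origin, which illustrates the earlier point that a weak prior tends to have little effect on the posterior.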

  32. Bayesian - Factors • Vehicle Types • Type of vehicle has effect • Individual vehicles of same type may also have different characteristics • Effect of individual vehicles (within a type) was regarded as a random effect • Vehicles seen as a sample from all vehicles of that type

  33. Bayesian - Factors • Drivers • Driver identified by card number • Drivers closely associated with vehicles • In this case, difficult to separate effects of vehicles from the effects of drivers • However, if this were not the case, then it would be possible to make inferences about individual drivers as well as individual vehicles

  34. Bayesian - Factors • Time of year • Fuel efficiency may be affected by ambient temperature/meteorological variables • Ideally use meteorological data • Obtained data for this purpose • But, as a first step, a simple substitute is to use the time of year, e.g. month

  35. Bayesian - Factors • Presence of fuel additive • The main question of interest is, “How does the use of the fuel additive affect fuel consumption?”
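One way to write an additive effect into the proportional model is shown below. This is an illustrative form only; the transcript does not give the authors' equation, and every symbol is introduced here. For observation $i$, $f_i$ is the fuel used (litres), $d_i$ the distance travelled (miles), and $a_i$ equals 1 when the additive is in use and 0 otherwise.

```latex
f_i = (\beta + \delta\, a_i)\, d_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Under this form $\beta$ is the baseline consumption in litres per mile and $\delta$ is the change due to the additive, also in litres per mile, so a negative $\delta$ means a saving. Vehicle type, the random effect of individual vehicles and time of year would enter as further terms, omitted here for brevity.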

  36. Bayesian - Complications • Fuel • How full the fuel tank was before or after fuel was added • Precisely how much fuel was used between fills • True tank content regarded as a latent or “hidden” variable • Such variables can be built into a Bayesian analysis

  37. Bayesian - Complications • Data entry errors • Graph of odometer readings against date for a single vehicle shows the general pattern - spurious values • This built into the model by allowing certain prior probabilities for errors of different types • The analysis can thus “recognise” errors by calculating posterior probabilities that a reading is an error of the various types • Those values which have large posterior probabilities of being erroneous are, in effect, ignored by the rest of the analysis.
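The idea of “recognising” errors through posterior probabilities can be illustrated with a two-component sketch. It assumes a valid odometer reading is normally distributed around the value expected from the date, while an erroneous reading is equally likely to be anywhere in a wide range. All names and the choice of distributions are assumptions for illustration; the error types used in the actual model are not listed in the transcript.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a Normal(mu, sd^2) distribution at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prob_error(reading, expected, sd_valid, prior_error, low, high):
    """Posterior probability that a reading is a data entry error.

    Valid readings: Normal(expected, sd_valid^2).
    Erroneous readings: Uniform(low, high).
    prior_error: prior probability (between 0 and 1) of an error.
    """
    if not low <= reading <= high:
        raise ValueError("reading lies outside the assumed error range")
    like_valid = normal_pdf(reading, expected, sd_valid)
    like_error = 1.0 / (high - low)
    numerator = prior_error * like_error
    return numerator / (numerator + (1.0 - prior_error) * like_valid)
```

A reading close to the expected value gets a posterior error probability near 0 and contributes fully to the analysis; a reading such as 9999 on a vehicle expected to be near 120,000 miles gets a probability near 1 and is, in effect, ignored.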

  38. Bayesian - Conclusions • Prototype Bayesian models were successfully run • Demonstrated feasibility of approach for this problem • However: • Need to overcome problems of missing data • Uncertainty over when additive would be expected to have an effect • Pattern of this effect • Confounding of additive effect with the effects of other factors such as the changing seasons

  39. Bayesian Results Posterior probability density for the effect of the additive, in litres per mile

  40. Conclusions • Recommendations: • Design of better trials and data acquisition • Collection of ambient temperatures, etc. • Future Directions • Fraud detection • Efficiency of individual drivers/vehicles • Patterns of work, optimisation
