
How do we know if AI is right? Challenges in the testing of AI systems

This article explores the challenges in testing AI systems, such as statistical results, lack of a "right answer," adversarial inputs, and the complexity of AI's interactions with other software modules.


Presentation Transcript


1. How do we know if AI is right? Challenges in the testing of AI systems. Jukka K Nurminen, Professor, Data-Intensive Computing in Natural Sciences. jukka.k.nurminen@helsinki.fi

  2. AI Research Focuses on Algorithms, Less on Testing and Other Real-Use Issues • "However, to date very little work has been done on assuring the correctness of the software applications that implement machine learning algorithms." Xie, X., Ho, J. W. K., Murphy, C., Kaiser, G., Xu, B., & Chen, T. Y. (2011). Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software, 84(4), 544–558. https://doi.org/10.1016/J.JSS.2010.11.920 • "Software testing is well studied, as is machine learning, but their intersection has been less well explored in the literature." Breck, Eric, et al. "What's your ML Test Score? A rubric for ML production systems." NIPS Workshop on Reliable Machine Learning in the Wild, 2016. https://static.googleusercontent.com/media/research.google.com/fi//pubs/archive/45742.pdf

  3. Failures in software systems • Bad investment decisions: Knight Capital was the largest trader in U.S. equities. Due to a computer trading "glitch" in 2012, it took a $440M loss in less than an hour. • Fatal cancer treatment: a bug in the code controlling the Therac-25 radiation therapy machine was directly responsible for at least five patient deaths in the 1980s when it administered excessive quantities of beta radiation. • Biased decisions: AIs from IBM, Microsoft, and the Chinese company Megvii could correctly identify a person's gender from a photograph 99 per cent of the time, but only for white men; for dark-skinned women, error rates rose to roughly 35 per cent. • Fatal accidents: an Uber self-driving car hit a pedestrian who was walking a bicycle across the road.

  4. Software Lifetime Costs: Development Is Only a Small Part • For classic software, maintenance cost dominates • Testing cost is about the same size as development cost • How is it for AI software? (Software Life-Cycle Costs, Schach 2002)

  5. AI is still mainly in research labs (and news headlines), although some companies are very active and advanced. When major deployments start to happen, efficient software processes for AI are likely to attract interest. BUT we are not there yet: AI is still experimental, so lifecycle-support problems are not yet visible.

  6. AI does not always give the right answer: how do we deal with statistical results? [Figure: classifier outputs for a series of images: "Cat", "Cat", "Dog", "Cat"]

  7. We do not know the "right answer"

  8. We do not agree on the "right answer" => AI ethics. http://moralmachine.mit.edu/

  9. Challenges of testing machine learning models • Statistical results: the outcome is a level of confidence, not pass or fail. • Oracle problem: we do not have the "right answer" (metamorphic testing, cited on slide 2, is one workaround; see the sketch below). • A buggy ML program does not crash or produce an error message; it just fails to learn or act properly. • The borderline between bug and feature is vague. • Is the training/testing material representative? Is it biased? • Would another neural net architecture do better? • Would another model do better?
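To make the oracle problem concrete, here is a minimal sketch of a metamorphic test in Python, assuming scikit-learn is available (the library and the toy dataset are our choices, not the slide's). Instead of comparing outputs against a known right answer, we check a relation that must hold between two runs: permuting the order of the training samples must not change a kNN classifier's predictions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Train one classifier on the data as-is and one on a reordered copy.
clf_a = KNeighborsClassifier(n_neighbors=3).fit(X, y)
perm = np.random.default_rng(seed=0).permutation(len(X))
clf_b = KNeighborsClassifier(n_neighbors=3).fit(X[perm], y[perm])

# Metamorphic relation: training-set order must not affect predictions.
agree = (clf_a.predict(X) == clf_b.predict(X)).all()
print("metamorphic relation holds:", agree)

A disagreement here would flag a bug even though no single prediction was ever checked against ground truth.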

  10. Testing of an ML model (supervised learning case) • We have a set of feature vectors and a label for each • Split the original dataset into training, dev, and test sets; typical splits are 60% / 20% / 20%, or 70% / 30% when no separate dev set is used • Select classifier type, network architecture, and hyper-parameters • Train the classifier with the training data only • Test with the dev set (and finally the test set) • Not always: for very large (~1M sample) datasets the dev and test sets should be much smaller fractions (1% = 10k). A split sketch follows below.
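A minimal sketch of the 60/20/20 split in Python, assuming scikit-learn (our library choice; any split utility would do). The second call halves the held-out 40% into dev and test:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 60% train; the remaining 40% is split evenly into dev and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=500).fit(X_train, y_train)   # training data only
print("dev accuracy:  %.2f" % clf.score(X_dev, y_dev))    # used while tuning
print("test accuracy: %.2f" % clf.score(X_test, y_test))  # touched once, at the end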

  11. Adversarial input: inputs crafted so that a small, often imperceptible change makes the model misclassify (see the toy example below). http://www.cleverhans.io/security/privacy/ml/2017/02/15/why-attacking-machine-learning-is-easier-than-defending-it.html
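As a toy illustration, here is a sketch of the fast gradient sign method (FGSM, Goodfellow et al. 2015), one of the attacks discussed at the link above, applied to a plain logistic regression so the gradient can be written by hand. The dataset, epsilon value, and library are our assumptions; with a linear model the perturbation provably pushes the predicted probability toward the wrong class, though whether it crosses the decision boundary depends on epsilon.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
keep = y < 2                      # binary task: digit 0 vs digit 1
X, y = X[keep] / 16.0, y[keep]    # scale pixels to [0, 1]

clf = LogisticRegression(max_iter=1000).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

x, label = X[0], y[0]                          # a sample of digit 0
p = 1.0 / (1.0 + np.exp(-(w @ x + b)))         # model's P(y=1) for x
grad_x = (p - label) * w                       # gradient of log-loss w.r.t. x
x_adv = np.clip(x + 0.2 * np.sign(grad_x), 0.0, 1.0)   # FGSM step, epsilon = 0.2

print("P(y=1) on clean input:       %.3f" % clf.predict_proba([x])[0, 1])
print("P(y=1) on adversarial input: %.3f" % clf.predict_proba([x_adv])[0, 1])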

  12. Evtimov et al., "Robust Physical-World Attacks on Deep Learning Models" (LISA-CNN, based on AlexNet). https://arxiv.org/pdf/1707.08945

  13. The ML model is only a part of a bigger software system. In production systems, ML code is often less than 5% of the total code. Google crash course on machine learning: https://developers.google.com/machine-learning/crash-course/production-ml-systems

  14. ML models and other software modules have complex interactions • Unexpected output can cause problems elsewhere in the system • Changes in any module can influence other modules

  15. Autonomous Driving • Today's car: ~100 control units, ~100 million lines of code • Future: multiple AI systems working together • Each situation is unique • AI is not able to be 100% sure of its outcome • If Uber/Volvo fixes this, can a similar problem exist in other car brands? • How should authorities test these things? https://www.youtube.com/watch?v=Cuo8eq9C3Ec http://moralmachine.mit.edu/

  16. New kinds of tests are needed. Breck, Eric, et al. "What's your ML Test Score? A rubric for ML production systems." NIPS Workshop on Reliable Machine Learning in the Wild, 2016. https://static.googleusercontent.com/media/research.google.com/fi//pubs/archive/45742.pdf

  17. Self-evaluation of ML capabilities • Four sets of tests: data, model, infrastructure, monitoring • Score 0-5 in each • 0 = more of a research project than a productized system • 5 = exceptional levels of automated testing and monitoring • ML Test Score = the minimum over the four sections (see the snippet below). Breck, Eric, et al. "What's your ML Test Score? A rubric for ML production systems." NIPS Workshop on Reliable Machine Learning in the Wild, 2016. https://static.googleusercontent.com/media/research.google.com/fi//pubs/archive/45742.pdf
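The aggregation rule is worth spelling out: taking the minimum means one neglected area caps the whole score. A tiny sketch in Python, with invented per-area numbers:

# Per-area scores (0-5); the values here are made-up examples.
scores = {"data": 3, "model": 2, "infrastructure": 4, "monitoring": 1}
print("ML Test Score:", min(scores.values()))   # -> 1: monitoring is the bottleneck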

  18. Whitebox Testing of Neural Networks • Code coverage testing provides little value, but with new tools we can see inside the neural net • DeepXplore, DLFuzz, … • Testing: can we detect poorly trained parts of the network? • Maintenance: can we detect when a trained network is used differently from its training? • Needed for retraining decisions, detection of adversarial attacks, … A coverage sketch follows below.
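One concrete white-box metric is neuron coverage, which DeepXplore proposed as an analogue of code coverage: the fraction of neurons whose scaled activation exceeds a threshold on at least one test input. A minimal numpy sketch, assuming a toy random two-layer network as a stand-in for a real trained model (the normalization and threshold are common choices, not the only ones):

import numpy as np

rng = np.random.default_rng(seed=0)
W1, W2 = rng.normal(size=(10, 8)), rng.normal(size=(8, 4))   # toy 2-layer MLP

def activations(x):
    h1 = np.maximum(0.0, x @ W1)      # ReLU hidden layer
    h2 = np.maximum(0.0, h1 @ W2)     # ReLU output layer
    return np.concatenate([h1, h2])   # all 12 neuron outputs

def neuron_coverage(test_inputs, threshold=0.25):
    acts = np.array([activations(x) for x in test_inputs])
    # Scale each neuron's activations to [0, 1] over the test set,
    # then count neurons that ever exceed the threshold.
    lo, hi = acts.min(axis=0), acts.max(axis=0)
    scaled = (acts - lo) / np.where(hi > lo, hi - lo, 1.0)
    return (scaled > threshold).any(axis=0).mean()

tests = rng.normal(size=(50, 10))
print("neuron coverage: %.0f%%" % (100 * neuron_coverage(tests)))

Low coverage points at parts of the network the test inputs never exercise, which feeds directly into the "poorly trained parts" question on the slide.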

  19. Data Problem Testbench • Add artificial errors to the data and see how they influence system operation • An error generator (built-in error types plus user-defined new error types) feeds the ML system • Allows comparing how the results change as a function of the data problems • New modules can be plugged in easily • A sketch follows below
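A minimal sketch of this idea in Python: two example error generators (Gaussian noise and dropped values, both invented here as stand-ins for the testbench's built-in types) share one signature, so user-defined generators plug in the same way, and the model's accuracy is compared across them.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def gaussian_noise(X, rng, scale=0.5):
    return X + rng.normal(scale=scale, size=X.shape)

def dropped_values(X, rng, frac=0.2):
    X = X.copy()
    X[rng.random(X.shape) < frac] = 0.0   # crude stand-in for missing data
    return X

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=500).fit(X_tr, y_tr)

rng = np.random.default_rng(seed=0)
print("clean accuracy: %.2f" % model.score(X_te, y_te))
for name, make_errors in [("noise", gaussian_noise), ("drops", dropped_values)]:
    print("%s accuracy: %.2f" % (name, model.score(make_errors(X_te, rng), y_te)))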

  20. IVVES ITEA Project Proposal • University of Helsinki & VTT + 11 industrial partners from Finland + an international consortium (Germany, France, Sweden, Netherlands, Spain, Canada) • Interesting problems and challenges wanted! • Case-study domains: automotive, health, banking, telecom • Work packages: WP1 Case Studies; WP2 Validation Techniques for ML (model quality, data quality, data creation); WP3 Testing Techniques for Complex Evolving Systems (ML-based testing, risk-based testing, online testing & monitoring); WP4 Data Analytics in Engineering (data analytics in development, data analytics in QA, data collection); WP5 Framework & Methodology for DevOps; WP6 Standardization, Dissemination and Exploitation

  21. Thank you!
