
Workshop Ib: Experimental research in practice



  1. Workshop Ib: Experimental research in practice
     Roland Geraerts, 2 May 2014

  2. Bad repeatability
     Why can it be hard to reproduce papers’ claims?

  3. Bad repeatability (1)
     • Problem
       • The results cannot be reproduced easily
     • Cause
       • Details of the method are lacking
         • Parts of the method are not described
         • Degenerate cases are missing
         • References to other papers (without mentioning details)
         • Parameters don’t get assigned values (usually weights)
       • Source code is not available
       • The experimental setup is not clear
         • Tested hardware (e.g. which PC/GPU, the number of cores used)
         • Statistical setup (e.g. number of runs, seed)
         • Details of the scenario(s) are missing

  4. Bad repeatability (2)
     • Problem
       • The results cannot be reproduced easily
     • Cause
       • Low significance caused by a low number of runs
       • Hard problems can be hard to implement
     • Solution
       • Let someone else implement the method/paper
       • Provide the source code

  5. Data collection errors
     What kind of errors occur during the collection of (raw) data?

  6. Data collection errors (1)
     • Problem
       • Errors occur during the collection of raw data
         • E.g., copy/paste values from GUIs into Excel sheets or text files
     • Cause
       • The data collection process was not automated
         • There is a GUI but not a command-line (console) version
       • Variables aren’t assigned the right values (how to verify?)
       • The precision of the stored numbers is too low
       • Statistics are computed wrongly (e.g. how to compute the SD)
       • Only the execution of a part of the algorithm is recorded
       • The visualization part is not strictly separated from the execution part of the algorithm
         • E.g. while the method performs its computations, the results are being written to a log file and sent to the GPU for visualization purposes

  7. Data collection errors (2)
     • Solution
       • Automate the process using a console version called from a batch file (see the sketch below)
       • For small experiments, pass the arguments in the batch file
       • Otherwise, build a load/save mechanism
       • Create an API that supports setting up experiments
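     As an illustration, the following is a minimal sketch of such a console runner; all names (runExperiment, results.csv, the three arguments) are hypothetical and only stand in for a real experimental setup. It reads the experiment parameters from the command line and appends one CSV line per invocation, so a batch file can drive many runs without any copy/paste from a GUI.

       // Minimal sketch of a console experiment runner (hypothetical names).
       // A batch file can call this repeatedly with different arguments.
       #include <cstdlib>
       #include <fstream>
       #include <iomanip>
       #include <iostream>

       // Stand-in for the algorithm under test; returns the measured quantity.
       double runExperiment(int scenario, unsigned seed, int numRuns)
       {
           return scenario + seed + numRuns;   // placeholder result
       }

       int main(int argc, char* argv[])
       {
           if (argc != 4) {
               std::cerr << "Usage: runner <scenario> <seed> <numRuns>\n";
               return 1;
           }
           const int      scenario = std::atoi(argv[1]);
           const unsigned seed     = static_cast<unsigned>(std::atoi(argv[2]));
           const int      numRuns  = std::atoi(argv[3]);

           const double result = runExperiment(scenario, seed, numRuns);

           // Append one CSV line per invocation, with enough digits of precision.
           std::ofstream log("results.csv", std::ios::app);
           log << std::setprecision(12)
               << scenario << ',' << seed << ',' << numRuns << ',' << result << '\n';
           return 0;
       }

     A batch file could then contain one line per configuration, e.g. runner 1 42 100, and the whole experiment becomes a single script that can be rerun at any time.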

  8. Data collection errors (3): Time measurement errors
     • Problem
       • Time is measured wrongly
     • Cause
       • Lack of timer accuracy
         • C++: Don’t use time.h
       • The timer is started/stopped inside the method, which is especially problematic if the parts take less than 1 ms to compute
       • Intervening network/CPU/GPU processes

  9. Data collection errors (4): Time measurement errors
     • Solution
       • Use accurate timers
         • C++: Use QueryPerformanceCounter(…) instead (but be careful of 0.3 s jumps), or, with C++11, std::chrono::high_resolution_clock
       • Run fast methods many times and take the average; watch out for non-deterministic behavior
       • Take the average of several runs, even for deterministic algorithms
       • Only measure the running time of the algorithm itself (see the timing sketch below)
         • Switch off the network
         • Switch off the virus scanner
         • Stop the e-mail program
         • Disable update functionality
         • Use only 1 core
         • Don’t work on your thesis while running the experiments on the same machine; and yes, this happens
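     A minimal C++11 timing sketch along these lines, assuming a hypothetical methodUnderTest() that finishes in well under a millisecond: the timer is started and stopped outside the method, the method is run many times, and the average time per run is reported.

       // Minimal timing sketch using std::chrono::high_resolution_clock.
       // The method is run many times and the average is reported.
       #include <chrono>
       #include <iostream>

       void methodUnderTest();   // stand-in for the (fast) algorithm being measured

       int main()
       {
           using clock = std::chrono::high_resolution_clock;

           const int numRuns = 1000;          // many runs for a method that takes < 1 ms
           const auto start = clock::now();   // start/stop OUTSIDE the method
           for (int i = 0; i < numRuns; ++i)
               methodUnderTest();
           const auto stop = clock::now();

           const std::chrono::duration<double, std::milli> total = stop - start;
           std::cout << "average: " << total.count() / numRuns << " ms per run\n";
           return 0;
       }

       void methodUnderTest()
       {
           // Placeholder workload; replace with the real algorithm.
           volatile double sink = 0.0;
           for (int i = 0; i < 10000; ++i)
               sink = sink + i * 0.5;
       }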

  10. Bad figures
      When do figures convey information badly?

  11. Bad figures
      • Problem
        • The figures convey information badly
      • Cause
        • The figures are hard to read (e.g. too small or bitmapped)
        • Axes haven’t been labeled
        • The y-axis doesn’t start at 0, which amplifies (random) differences
        • The number precision/format is wrong
          • Don’t display 100,000.001
          • Don’t display 0.0005 s, or 0.1 0.15 0.2 …
        • The meaning is not conveyed clearly
        • Some colors/patterns don’t do well on black & white printers
      • Solution
        • Use e.g. GNUplot (set all labels and export to vector: EPS or PDF)
        • Use vector images as much as possible (e.g. use IPE)
        • Explain all phenomena

  12. Conclusions are too general
      When are drawn conclusions too general?

  13. Conclusions are too general (1)
      • Problem
        • The conclusions drawn are often too general
      • Cause
        • Only one instance is tested, e.g.
          • environment / moving entity
        • Only one problem setting is tested
        • A favorable setup is used, e.g.
          • a few axis-aligned rectangular obstacles
          • polygonal convex obstacles
          • 1 fixed query
        • Deterministic experiments do suffer from the ‘variance problem’

  14. Conclusions are too general (2)
      • Solution
        • Try to sample the problem space as well as possible
        • Don’t try to bias any method
        • Use a favorable setup (to show certain properties) and a ‘normal’ one
        • Also choose worst-case scenarios
        • Tune all methods equally
        • Compare against the state-of-the-art instead of old methods only
        • Dare to show the weakness(es) of your method

  15. Statistical weaknesses
      When are the statistics less reliable?

  16. Statistical weaknesses
      • Problem
        • Statistics are done badly
      • Cause
        • Results have been collected on different sets of hardware
        • Too few runs
        • Not all running times are mentioned (e.g. initialization)
        • Only averages are mentioned
      • Solution
        • Use the same machine (and don’t change the setup)
        • Use e.g. GNUplot and set all (relevant) labels
        • Use other measures, e.g. (see the sketch below)
          • SD
          • Boxplot
          • Student’s t-test: statistical hypothesis test
          • ANOVA: analysis of variance
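      As a small illustration of reporting more than the average alone, the sketch below (with made-up running times) computes the mean and the sample standard deviation over a set of runs:

        // Minimal sketch: report the mean and the sample standard deviation
        // of the measured running times instead of the average alone.
        #include <cmath>
        #include <iostream>
        #include <vector>

        int main()
        {
            // Hypothetical running times of one method over several runs (in ms).
            const std::vector<double> times = { 10.2, 10.0, 10.3, 9.6, 10.1 };
            const double n = static_cast<double>(times.size());

            double sum = 0.0;
            for (double t : times) sum += t;
            const double mean = sum / n;

            double sq = 0.0;
            for (double t : times) sq += (t - mean) * (t - mean);
            const double sd = std::sqrt(sq / (n - 1));   // sample SD uses n - 1

            std::cout << "mean = " << mean << " ms, SD = " << sd << " ms\n";
            return 0;
        }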

  17. Statistically significant?

  18. So your method is statistically significant
      • Even when a method is found to be statistically significantly better (see the t-test sketch below), this does not have to mean anything in practice…
      • …due to the programmer’s bias.
      • Suppose different methods run in 10.2, 10.0, 10.3, and 9.6 seconds (with appropriate SDs etc.). While the latter one might be better, in reality it does not have to be…
      • …since the third one might be the only one that wasn’t optimized (and might well be faster once it is).
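      To make “statistically significant” concrete, here is a sketch of Welch’s t-test applied to the running times of two hypothetical methods (all numbers are made up). The resulting t value still has to be compared against a t-distribution table for the computed degrees of freedom; and, as the slide stresses, even a significant difference says nothing about whether both implementations were tuned equally.

        // Sketch of Welch's t-test on the running times of two methods.
        // Compare the resulting t against a t-table for the given degrees of freedom.
        #include <cmath>
        #include <iostream>
        #include <vector>

        struct Stats { double mean, var, n; };

        Stats summarize(const std::vector<double>& xs)
        {
            const double n = static_cast<double>(xs.size());
            double sum = 0.0;
            for (double x : xs) sum += x;
            const double mean = sum / n;
            double sq = 0.0;
            for (double x : xs) sq += (x - mean) * (x - mean);
            return { mean, sq / (n - 1), n };   // sample variance
        }

        int main()
        {
            // Hypothetical running times (ms) of two methods over five runs each.
            const std::vector<double> a = { 10.2, 10.4, 10.1, 10.3, 10.2 };
            const std::vector<double> b = {  9.6,  9.9,  9.7, 10.0,  9.8 };

            const Stats sa = summarize(a), sb = summarize(b);
            const double se = std::sqrt(sa.var / sa.n + sb.var / sb.n);
            const double t  = (sa.mean - sb.mean) / se;

            // Welch–Satterthwaite approximation of the degrees of freedom.
            const double df = std::pow(sa.var / sa.n + sb.var / sb.n, 2) /
                              (std::pow(sa.var / sa.n, 2) / (sa.n - 1) +
                               std::pow(sb.var / sb.n, 2) / (sb.n - 1));

            std::cout << "t = " << t << ", df = " << df << '\n';
            return 0;
        }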

  19. Ways to bias your results (1)
      • Run the code with different choices of
        • Hardware (CPU, GPU, memory, cache, #cores, #threads)
        • Language (C++/C#, 32/64-bit, different optimizations)
        • Software libraries (own code/Boost/STL)
      • Implementation is done by different people

  20. Ways to bias your results (2): Some code optimizations
      • Enable optimizations in your compiler
        • Run in release mode!
        • Visual Studio:
          • full optimization
          • inline function expansion
          • enable intrinsic functions
          • etc.
      • Compile the code with a 64-bit compiler
        • 2-15% improvement of running times due to
          • usage of a larger instruction set
          • not having to simulate 32-bit code
        • However, watch code that deals with memory and loops
          • use memsize-types in address arithmetic

  21. Ways to bias your results (3): Some code optimizations
      • Unroll loops (see the sketch below)
        • Improves usage of parallel execution (e.g. SSE2)
      • Create small code
        • E.g. by improving the implementation; properly align data
        • Improves cache behavior
      • Avoid mixed arithmetic
      • Use STL
        • It is heavily optimized
      • Avoid disk usage, writing to a console, etc.
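      A toy sketch of the loop-unrolling point (a hypothetical example, not taken from the slides): the same sum computed with a plain loop, a manually 4-way unrolled loop, and the STL’s std::accumulate. Whether unrolling actually pays off depends on the compiler, the optimization flags and the hardware, so measure it as described earlier rather than assume it.

        // Toy sketch: straightforward sum versus a manually 4-way unrolled sum.
        #include <cstddef>
        #include <iostream>
        #include <numeric>
        #include <vector>

        double sumPlain(const std::vector<double>& v)
        {
            double s = 0.0;
            for (double x : v) s += x;
            return s;
        }

        double sumUnrolled(const std::vector<double>& v)
        {
            // Four independent accumulators give the compiler more room for
            // parallel (e.g. SSE2) execution.
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            std::size_t i = 0;
            const std::size_t n = v.size();
            for (; i + 4 <= n; i += 4) {
                s0 += v[i];
                s1 += v[i + 1];
                s2 += v[i + 2];
                s3 += v[i + 3];
            }
            for (; i < n; ++i) s0 += v[i];   // remainder
            return (s0 + s1) + (s2 + s3);
        }

        int main()
        {
            const std::vector<double> v(1 << 20, 1.0);
            std::cout << sumPlain(v) << ' '
                      << sumUnrolled(v) << ' '
                      << std::accumulate(v.begin(), v.end(), 0.0)   // STL version
                      << '\n';
            return 0;
        }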

  22. Ethics…
