This presentation covers information theory concepts relevant to software engineering, as presented by Richard Torkar of Blekinge Institute of Technology. It introduces information distance, the Normalised Compression Distance (NCD), and the Universal Test Distance (UTD), and shows how these metrics can be used to measure similarity between test cases. The talk examines whether NCD clusters tests along the same 'cognitive similarities' that human engineers perceive, and outlines future directions, including clustering trouble reports and statistical tests for cumulative voting.
Information theory concepts in software engineering • Richard Torkar • richard.torkar@bth.se
A YOUNG INSTITUTE • Founded in 1989 • One of three independent institutes of technology • Three campuses
ME • Richard Torkar • Former officer • PhD in software engineering at BTH (studied at University West and Chalmers) • REFASTEN project • Director for SWELL • Project manager for CONES • Programme manager for POKAL and EMSE • Participating in EASE and RUAG • Prof. Claes Wohlin's research group SERL
Partner with problems… • Millions of test cases constantly running (24/7) • Tests a system containing 25-30 large subsystems • Contractors and divisions all over the world use the same test bed
NORMALISED COMPRESSION DISTANCE [2] • Kolmogorov complexity • Cilibrasi and Vitányi used a compression algorithm [3] for approximating K • NCD is a non-negative number, 0 ≤ NCD ≤ 1 + e, where e depends on how well C approximates K
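As an aside not on the original slide: with a real-world compressor C, the NCD of two strings x and y is commonly computed as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). A minimal Python sketch, using zlib as an illustrative stand-in for C (the function names are mine, not from the talk):

import zlib

def c(data: bytes) -> int:
    # Compressed length under zlib, standing in for Kolmogorov complexity K.
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    # a non-negative number in [0, 1 + e].
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"abcabc" * 50, b"abcabc" * 50))   # near 0: identical strings
print(ncd(b"abcabc" * 50, b"zyxwvu" * 50))   # closer to 1: unrelated strings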
WHAT'S BEEN DONE? • Information distance (ID) • The ID between two binary strings x and y is the length of the shortest program that translates x into y and y into x. • ID is the universal distance metric • Minimal among computable distance functions • Uncovers all effective similarities
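For reference (my addition, following Li et al. [1]): up to logarithmic additive terms, the information distance and its normalised variant can be written as

E(x, y) = \max\{K(x \mid y),\ K(y \mid x)\}, \qquad \mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\ K(y \mid x)\}}{\max\{K(x),\ K(y)\}}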
What to try? • H_cog: Ordering tests based on their ∆VAT distance cannot be distinguished from how a human would order the tests based on their 'cognitive similarity'.
Defs • A complete VAT trace of a test is a string with all the information about the actual execution of a test for all the variation points in the VAT model. • The Universal Test Distance, denoted ∆VAT, in the VAT model is the information distance between the complete VAT traces of two tests.
UNIVERSAL TEST DISTANCE • Universal Test Distance (UTD): the information distance of the complete VAT traces of n tests (where n ≥ 2) • Should discover any similarities [1] between tests… • But ID is non-computable!
USING NCD AS A TEST DISTANCE • Does NCD uncover "meaningful" distances? • Three engineers ordered 25 tests applied to the triangle problem • Tests coded in Bacon (Ruby) • Execution trace of each test saved • NCD matrix (distance tree) calculated
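A rough sketch of the NCD-matrix step (my reconstruction, not the study's actual Bacon/Ruby tooling; the trace contents below are placeholders and the node names follow the XY_A1_A2_A3 legend given later):

import zlib
from itertools import combinations

def ncd(x: bytes, y: bytes) -> float:
    # Normalised Compression Distance with zlib standing in for K.
    cx, cy, cxy = (len(zlib.compress(s, 9)) for s in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Map each test to the bytes of its recorded execution trace (placeholders).
traces = {
    "SE_3_3_3": b"...trace of the short-int equilateral test...",
    "SS_3_4_5": b"...trace of a short-int scalene test...",
    "SX_1_2_9": b"...trace of a short-int invalid-triangle test...",
}

# Pairwise distance matrix; feeding it to a hierarchical-clustering or
# quartet-tree tool yields the distance tree compared against the engineers.
for (a, ta), (b, tb) in combinations(traces.items(), 2):
    print(f"{a} - {b}: {ncd(ta, tb):.3f}")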
RESULTS • Humans and NCD classified the tests in the same way (rooted non-binary trees) • NCD: • Argument permutations grouped together • Float cases close to their integer counterparts • Clear division between valid and invalid triangles
CONCLUSIONS • NCD can cluster tests by cognitive similarity • The differences we see are mainly explained by "white-boxness" (traces include implementation details) • Input data alone is not sufficient • NCD calculations are costly • Could be used as a way to smooth the search space?
WHAT WILL BE DONE? • If we can measure distance, basically any distance, then why not measure: • Scientific real-world propagation • Quality of alternative information sources • Quality of individual engineers… • Clustering trouble reports using the Calinski-Harabasz (CH) or silhouette index, and then doing root cause analysis on clusters instead of individual reports to get indications of faulty modules (a sketch follows below)
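One way the clustering step could look in practice (a sketch under my own assumptions, not an existing pipeline): given a precomputed distance matrix over trouble reports, e.g. pairwise NCDs of their texts, cut a hierarchical clustering at the number of clusters that maximises the silhouette index:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

def cluster_reports(dist: np.ndarray, max_k: int = 10):
    # dist is a symmetric n x n distance matrix (e.g. pairwise NCDs).
    links = linkage(squareform(dist, checks=False), method="average")
    best = None
    for k in range(2, min(max_k, len(dist) - 1) + 1):
        labels = fcluster(links, t=k, criterion="maxclust")
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(dist, labels, metric="precomputed")
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best  # (silhouette score, number of clusters, cluster labels)

Root cause analysis would then be performed per cluster rather than per report.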
OTHER TODOs • Statistical tests for cumulative voting • Semi-automated systematic literature reviews via abstract clustering • http://www.torkar.se
NODE DESCRIPTIONS • For p. 12: • XY_A1_A2_A3 • X = S/L: short/long integer arguments • X = F: Float arguments • Y = E: Equilateral triangle • Y = S: Scalene triangle • Y = I: Isosceles triangle • Y = X: Invalid triangle
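As a worked example of this legend (illustrative, not from the slides): a node labelled SS_3_4_5 would be a test with short-integer arguments 3, 4 and 5 forming a scalene triangle, while FX_1_2_9 would be a float-argument test whose sides cannot form a valid triangle.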
References
[1] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, "The similarity metric," in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03), Baltimore, MD, January 2003. Society for Industrial and Applied Mathematics, Philadelphia, PA, pp. 863-872.
[2] R. L. Cilibrasi and P. M. B. Vitányi, "The Google similarity distance," IEEE Transactions on Knowledge and Data Engineering, pp. 370-383, March 2007.
[3] P. M. B. Vitányi and M. Li, "Minimum description length induction, Bayesianism, and Kolmogorov complexity," IEEE Transactions on Information Theory, vol. 46, pp. 446-464, 2000.
[4] P. M. B. Vitányi, F. J. Balbach, R. L. Cilibrasi, and M. Li, "Normalized information distance," pp. 45-82 in: Information Theory and Statistical Learning, F. Emmert-Streib and M. Dehmer, Eds., Springer-Verlag, New York, 2008.