1 / 22

ACM Workshop on Software Quality Szeged, Hungary 4th September 201 1

WOSQ2011. SNEED. A Process for Assessing Data Quality Harry M. Sneed ANECON GmbH, Wien University of Regensburg, Bayern OÖ-Fachhochschule Hagenberg, Austria. ACM Workshop on Software Quality Szeged, Hungary 4th September 201 1. WOSQ2011. SNEED-1. The Problem of Data Quality.

karik
Télécharger la présentation

ACM Workshop on Software Quality Szeged, Hungary 4th September 201 1

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. WOSQ2011 SNEED A Process for AssessingData QualityHarry M. Sneed ANECON GmbH, WienUniversity of Regensburg, BayernOÖ-Fachhochschule Hagenberg, Austria ACM Workshop on Software Quality Szeged, Hungary 4th September 2011

  2. WOSQ2011 SNEED-1 The Problem of Data Quality • The original data model was never thought through and does not satisfy the requirements on the data. • The original database design was once adequate, but over the years it has become obsolete as data requirements changed. • The database structure has depreciated under the weight of many minor alterations (new attributes, keys and Indexes) • The database is now used for other purposes than for which it was originally intended. • As a result of program faults more and more errors have crept unnoticed into the data and corrupted it. • An ever increasing number of dead data objects has accumulated over time and burdens the database. • Data redundancy leads to inconsistent information. • As a result of continual evolution, the data structure has become so complex that it can no longer be evolved.

  3. WOSQ2011 SNEED-2 The Purpose of a Data Audit • A Data Audit is a process by which the quality of the database is assessed and suggestions for improvement are made. • The Process encompasses two parallel activities: • Static Analysis of the Database Schema • Static Analysis of the Database Content • The static analysis of the database schema entails an inspection and measurement of the data structures and data type definitions. • The static analysis of the database content entails a validation of the actual data content against a specification of what that content should be. • The goal is to uncover discrepancies and deficiencies in the data and to propose means of eliminating or improving them.

  4. WOSQ2011 SNEED-3 Static Analysis of the Database Schema • The database schema is compared against the rules for data structuring, including the rules of data normalization. • The data type definitions are checked for compliance and incompliant types reported. • Structural deficiencies in the database schema are searched for and uncovered. • Size and complexity limits are controlled and excesses noted. • The size, complexity and quality of the database structure is measured to compare with other databases and with a benchmark. • Recommendations for the improvement of the database structure are suggested.

  5. WOSQT2011 Database Analyst SNEED-4 Setting the rules Defining themetrics Rules Metrics Database Schema Database Views Database Schema Auditor Deficiency Report Metric Report Figure 1: Static Database Analysis

  6. WOSQ2011 SNEED-5 Checking for Database Deficiencies • Typical relational data deficiencies: • There is no unique primary key • Foreign key to related table is missing • There are no views defined upon a table • No indexes are defined for a table • Key contains too many sub keys • Table contains more than allowed number of columns • Table has more than allowed number of indexes • Table has more than allowed number of views • Table row exceeds maximum allowed length • Table contains incompatible data types • Null Option is missing • Delete Option is missing • Data names do not conform to naming convention • Data schema is inadequately commented.

  7. WOSQ2011 SNEED-6 Sample Database Deficiency Report | | | | WARN: | 22 TimeStamp Data is not allowed | | 253| LEDG_DAT_INS DATETIME YEAR TO SECOND DEFA | | | | | WARN: | 18 Variable Character Data is not allowed | | 254| LEDG_USR_INS VARCHAR (20) DEFAULT USER, | | | | | WARN: | 17 Not null Option is missing | | 254| LEDG_USR_INS VARCHAR (20) DEFAULT USER, | | | | | WARN: | 17 Not null Option is missing | | 256| LEDG_USR_UPD VARCHAR (20), | | | | | PROB: | 05 Table has no View definition | | 258| EPSLEDGE | | | | | DEFI: | 08 Table has too many relationships (foreign keys) | | 258| EPSLEDGE | | | | | DEFI: | 11 Row Length exceeds maximum length allowed | | 258| EPSLEDGE | | | | | DEFI: | 12 Table has too few comments | | 258| EPSLEDGE | | | | +----------------------------------------------------------------------+ | Number of major Rule Violations = 11 | | Number of media Rule Violations = 11 | | Number of minor Rule Violations = 132 | | Total Number of Rule Violations = 154 | | Number of Database Schema Lines = 267 | | | | Rate of Rule Conformity = 0.651 | +----------------------------------------------------------------------+

  8. WOSQ2011 SNEED-7 Database Quality Metrics • data independence = number of views and indexes relative to the number of tables or record types • data accessibility = number of keys, indexes and selects relative to the number of tables, • data flexibility = number of attributes and keys relative to the number of tables, • data testability = number of test cases relative to the number of data attributes stored, • data storage efficiency = number of attributes stored relative to the size of the storage in bytes, • data conformity = number of schema deficiencies relative to the number of schema lines.

  9. WOSQ2011 SNEED-8 Database Complexity Metrics • data complexity = number of different data types and keys , relative to the sum of all data attributes stored, • view complexity = number of views relative to the number of tables, • access complexity = number of tables, keys and indexes relative to the number of stored attributes, • relational complexity = number of foreign keys and relations relative to the number of tables and keys in all, • structural complexity = number of structural elements - tables, relations and views – relative to the number of stored attributes, • storage complexity = number of attributes stored relative to the storage capacity.

  10. WOSQ2011 SNEED-9 Size Metrics of individual Data Tables +------------------------------------------------------------------------------+ | | | SQL DATA BASE METRIC REPORT | |SYSTEM: DATABASE | | DATE: 10.03.11 PAGE: 001 | +------------------------------------------------------------------------------+ | NR NR DATA PRIM FOR NR NR ROW DATA FUNC NR | | TABLE NAME LINES FLDS TYPES KEYS KEYS RELS INDX SIZE PTS PTS DEFIS| +------------------------------------------------------------------------------+ |[WEEKDAYS] 012 03 02 01 00 01 00 104 009 007 002 | |[ERRORLOGS] 016 05 02 01 00 01 00 017 011 007 002 | |[AUSWERTUNGTYP] 016 06 03 01 00 01 00 163 012 007 002 | |[FEATUREDMITGLIEDPOO 014 05 04 01 00 01 00 774 011 007 003 | |[FACHGRUPPEN] 028 10 06 01 00 01 00 101 016 007 002 | |[KONTAKTARTEN] 014 04 02 01 00 01 00 053 010 007 002 | |[CATEGORIES] 022 06 02 01 00 01 00 021 012 007 002 | |[AUSWERTUNGSPALTE] 012 03 02 01 00 01 00 101 009 007 002 | |[MENUCATEGORIES] 012 03 02 01 00 01 00 058 009 007 002 | |[SITEMAP] 024 12 04 01 00 01 00 1309 018 007 003 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[BUSINESSHOURS] 014 05 03 01 00 01 00 017 011 007 002 | |[ZERTIFIKATE] 034 10 05 01 00 01 00 755 016 007 003 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[REPORTINGHISTORY] 018 09 04 01 00 01 00 538 015 007 003 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[MITARBEITERFUNKTION 013 04 04 01 00 01 00 058 010 007 002 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[SCHLAGWORTE] 019 03 03 01 00 01 00 257 009 007 002 | |[PRODUKTE] 017 07 03 01 00 01 00 1205 013 007 003 | |[REGIONEN] 022 12 06 01 00 01 00 142 018 007 002 | |[REPORTTYPE] 012 03 03 01 00 01 00 058 009 007 002 | |[SUCHNAMEPATTERNS] 007 02 02 00 00 01 00 017 006 007 003 | |[STATISTIKOHNEPRODUK 018 09 04 01 00 01 00 523 015 007 003 | |[BANNER] 014 05 02 01 00 01 00 773 011 007 003 | |[STATISTIKOHNEEDITIE 018 09 04 01 00 01 00 523 015 007 003 |

  11. WOSQ2011 SNEED-10 Summary Report of Database Size Metrics +------------------------------------------------------------------------------+ | D A T A B A S E Q U A N T I T Y M E T R I C S | +------------------------------------------------------------------------------+ | Total Number of SQL Source Files ===> 01 | | Total Number of SQL Code Lines ===> 20107 | | Total Number of SQL Comment Lines ===> 2320 | | Total Number of SQL Procedure Statements ===> 7842 | | Total Number of SQL Statement Types ===> 3245 | | Total Number of SQL Access Operations ===> 3732 | | Total Number of SQL Select Operations ===> 1323 | | Total Number of SQL Insert Operations ===> 334 | | Total Number of SQL Update Operations ===> 73 | | Total Number of SQL Delete Operations ===> 43 | | Total Number of SQL Stored Procedures ===> 1497 | | Total Number of SQL Procedure Blocks ===> 558 | | Total Number of SQL Procedure Conditions ===> 1933 | | Total Number of SQL Procedure Loops ===> 09 | | Total Number of SQL Procedural Branches ===> 2439 | | Total Number of SQL Data Declarations ===> 266 | | Total Number of SQL Set Data Assignments ===> 1325 | | Total Number of SQL Output Operations ===> 1074 | | Total Number of SQL Input Operations ===> 749 | | Total Number of Stored Procedure Points ===> 4806 | | Total Number of SQL Tables ===> 62 | | Total Number of Data Attributes ===> 468 | | Total Number of different Data Types ===> 202 | | Total Number of primary Keys ===> 56 | | Total Number of Foreign Keys ===> 00 | | Total Number of DB-Indexes ===> 17 | | Total Number of Relationships ===> 62 | | Total Number of Database Views ===> 15 | | Total Number of Database View Attributes ===> 119 | | Total Length of all SQL Rows ===> 36028 | | Total Number of Data-Points ===> 10884 | | Total Number of Function-Points ===> 434 | | Total Number of Deficiencies ===> 311 | | Total Number of required Test Cases ===> 1377 | +------------------------------------------------------------------------------+

  12. WOSQ2011 SNEED-11 Sample Complexity and Quality Metrics +------------------------------------------------------------------------------+ | D A T A B A S E C O M P L E X I T Y M E T R I C S | +------------------------------------------------------------------------------+ | Data Complexity ===> 0.611 | | Access Complexity ===> 0.635 | | Relational Complexity ===> 0.609 | | Structural Complexity ===> 0.228 | | Storage Complexity ===> 0.120 | | | | Average Database Complexity ===> 0.440 | | | +------------------------------------------------------------------------------+ | D A T A B A S E Q U A L I T Y M E T R I C S | +------------------------------------------------------------------------------+ | Data Independence ===> 0.471 | | Data Accessability ===> 0.859 | | Data Testability ===> 0.518 | | Data Conformity ===> 0.984 | | Storage Efficiency ===> 0.200 | | | | Average Database Quality ===> 0.606 | | | +------------------------------------------------------------------------------+

  13. WOSQ2011 SNEED-12 Static Analysis of the Database Content • The database contents are checked against assertions on the valid value domains as well as against assertions on valid relationships. • Invalid data values are discovered thru comparison with valid values and recorded. • Missing data objects are reported. • Redundant data objects and data corpses are revealed and reported. • The degree of correctness is computed for each data table to compare with other tables. • The correctness metric of the database content is passed on to the metric database for further evaluation.

  14. Data Validation Process WOSQ2011 SNEED-13 Database Schema Database Analyst Validation script Specify the scripts Validation Script Compiler Comparison Tables Data Base Content Database Tables Database Content Auditor Validation Report Figure 2: Data Validation Process

  15. WOSQ2011 SNEED-14 Sample Assertion Script table: LAENDER; // This is test procedure is for checking the correctness of the Bundesländer when (new.BUNDESLAND_ID = "1") then replace new.SUMME by "11"; when (new.BUNDESLAND_ID = "2") then replace new.SUMME by "12"; when (new.BUNDESLAND_ID = "3") then replace new.SUMME by "13"; when (new.BUNDESLAND_ID = "4") then replace new.SUMME by "14"; when (new.BUNDESLAND_ID = "5") then replace new.SUMME by "15"; when (new.BUNDESLAND_ID = "6") then replace new.SUMME by "16"; when (new.BUNDESLAND_ID = "7") then replace new.SUMME by "17"; when (new.BUNDESLAND_ID = "8") then replace new.SUMME by "18"; when (new.BUNDESLAND_ID = "9") then replace new.SUMME by "19"; when (old.STAATSID = "14") then replace old.STAATSID by "13"; when (new.SUMME = "65") then replace new.SUMME by "66"; when (old.BUNDESLANDNR = "9") then skip; if ( new.BUNDESLAND_ID = old.BUNDESLANDNR ! "0" ); assertnew.STAAT_ID = old.STAATSID; assertnew.BUNDESLAND = "NIEDEROESTERREICH" if (old.SUMME = "33";) assertnew.BUNDESLAND = "OBEROESTERREICH" if (old.SUMME = "44"); assertnew.BUNDESLAND = old.BUNDESLANDif (other); assertnew.BUNDESLAND = old.BUNDESLAND; assertnew.SUMME = "100" ! "200" ! "300" ! "400"; assertnew.SUMME = {1.00:5.00}; assertnew.MITGLIEDER = "300" ! "656" ! "644" ! "92"; end;

  16. WOSQ2011 Sample Data Validation Report SNEED-15 • | Key Fields of Record(new,old) • +-----------------------------------------------------------------------------------------+ • | Ist :BUNDESLANDNR • | Soll:BUNDESLANDNR • +----------------------------------------------+-----------------------------------------+ • | Non-Matching Fields | Non-Matching Values • +----------------------------------------------+-----------------------------------------+ • | RecKey:1 | • | Ist : BUNDESLAND | NIEDERöSTERREICH • | Soll: BUNDESLAND | NIEDEROESTERREICH • +----------------------------------------------+-----------------------------------------+ • | RecKey:2 | • | Ist : STAATSID | 13 • | Soll: STAATSID | 12 • +----------------------------------------------+-----------------------------------------+ • | RecKey:2 | • | Ist : BUNDESLAND | OBERöSTERREICH • | Soll: BUNDESLAND | OBEROESTERREICH • +----------------------------------------------+-----------------------------------------+ • | RecKey:4 | • | Ist : STAATSID | 13 • | Soll: STAATSID | 14 • +----------------------------------------------+-----------------------------------------+ • | RecKey:5 | • | Ist : BUNDESLAND | KäRNTEN • | Soll: BUNDESLAND | KAERNTEN • +----------------------------------------------+-----------------------------------------+ • | RecKey:8 | • | Ist : STAATSID | 13 • | Soll: STAATSID | 18 • +----------------------------------------------+-----------------------------------------+

  17. WOSQ2011 Data Validation Metric Report SNEED-16 +----------------------------------------------------------------------------------------+ | New:MGNR SUBNR JAHR ART LFNR | | Old:MGNR SUBNR JAHR ART LFNR | +----------------------------------------------+-----------------------------------------+ | Non-Matching Fields | Non-Matching Values | +----------------------------------------------+-----------------------------------------+ | RecKey:000083 0000 0000 BAN 00002 | | | Ist: DATUM | 2002-04-29 | | Soll: DATUM | 2003-04-10 | +----------------------------------------------+-----------------------------------------+ | RecKey:000095 0000 0000 BAN 00002 | missing from the old File/Table | +----------------------------------------------+-----------------------------------------+ | RecKey:000112 0000 0000 BAN 00002 | | | Ist: DATUM | 2002-05-02 | | Soll: DATUM | 2003-04-29 | +----------------------------------------------+-----------------------------------------+ | RecKey:000083 0000 0000 BAN 00001 | missing from the new File/Table | +----------------------------------------------+-----------------------------------------+ | RecKey:000095 0000 0000 BAN 00001 | missing from the new File/Table | +----------------------------------------------+-----------------------------------------+ | RecKey:000112 0000 0000 BAN 00001 | missing from the new File/Table | +----------------------------------------------+-----------------------------------------+ +----------------------------------------------+-----------------------------------------+ | Total Number of old Records checked: 76 | | Number of old Records found in new File: 73 | | Number of old Records with duplicate Keys: 00 | | Number of old Records not in new Table: 03 | | Total Number of new Records checked: 76 | | Number of new Records found in old File: 75 | | Number of new Records with alternate Keys: 00 | | Number of new Records not in old File: 01 | | Total Number of Fields checked: 225 | | Total Number of non-Matching Fields: 02 | | Percentage of matching Fields: 99 % | | Percentage of matching Records: 99 % | +----------------------------------------------------------------------------------------+

  18. WOSQ2011 Database Size Comparison SNEED-17

  19. WOSQ2011 Database Complexity Evaluation SNEED-18

  20. WOSQ2011 Database Quality Evaluation SNEED-19

  21. 0,5 0,6 0,4 1 0 Complexity WOSQ2011 Database Dashboard SNEED-20 0,003 0,5 0,4 0,6 0,002 0,001 1 0 0 0,020 Quality Defect Density Defects per 1000 Data 0,1 0,8 0,7 0,7 0,05 0,6 0,9 0,6 0,8 1 0 0,5 1,0 0,5 1,00 Change rate Correctness Conformity Visualization of the Database Status with SoftEval

  22. WOSQ2011 SNEED-21 Conclusions • Databases are an important asset of corporate users, perhaps even more important than the software! • As with software the quality of data decreases over time as a result of usage and change. • The complexity of data increases over time as a result of change and extension. • Therefore, it is imperative to periodically clean up and restructure the data. • For that purpose one needs information on the state of the data. • Such information is provided by a database audit.

More Related