How big ?

How big ? • Jobcentre & client activity data - 11 million people • 80 Gbytes of data per month on DLT • Processing 20 tables & 250m records per month • Largest tables have 80 million records, 5 Gbytes • Key extract retrievals take 6 - 20 hours to run • Full SPSS extracts for clients - 5 Gbytes • Twenty five SIR databases - over 100 Gbytes • from 1 Mbyte to 15 Gbytes

Evaluation Database - Overview • Background & project requirements • Setting the ‘environment’ • Visual PQL • The ‘Data Dictionary’ • Program generators • Fuzzy matching • Use of the data • New developments • Questions

Background • Government program started in January 1998 • New Deal for Young People was set up to encourage and assist unemployed groups into full time sustainable employment. It gives unemployed people aged 18-24 the opportunity to develop their potential, gain skills and experience and find work. • Employment Service needed an evaluation database • March 98 - ORC issues first extract with 18000 clients • June 98 - program expanded to over 25’s • June 99 - expanded to cover all Jobcentre clients • October 00 - Source database migrates from six regional Ingres databases to single Oracle database

Contract extension -press notice • The new contract will run to May 2004, with an option to run a further two years to May 2006. • ORC International’s database tools are designed to help the ES evaluate the service it provides to all clients registered on the Government’s JobCentre computer system.They allow the ES to regularly monitor clients with sustainable jobs, the effectiveness of equal opportunities measures, and the relationship between job vacancies and the labour market skills base. • “It is our intention to create a project web site, which will allow multiple level access to different categories of users. This will include project documentation and progress reports, access to tabulations and small data extracts, and customer feedback areas, as well as links to other related sites.” • ORC International is part of Opinion Research Corporation, which was founded in 1938 with offices in the United States, Europe, Asia, Latin America and Africa, the Company provides integrated marketing services to both businesses and governments in more than 100 countries. http://www.orc.co.uk

Project Requirements • LMS (Labour Market System) is a multi-user transaction system used in Jobcentres • Needed new database with evaluative functionality • Linked to additional data-sets including clerical • Flexibility to change structure periodically • Combine cross regional records for the same client • Extracts provided to ES for statistical purposes

Setting the environment PROGRAM PQL CONNECT DATABASE 'CLMIDS’ PREFIX '<CLI_MIDS>' END PROGRAM SET DATABASE CLMIDS SET PROCFILE '<LMSPROC>’ | reset to main procedure file ( main procedure file held in separate .SR4 file ) Midlands database prefix retval = globals ( ’CLI_MIDS ’ , '\\urmston\d\50413\sirdb\client\') and so on for the procedure file & other databases

Visual PQL (1) execute dbms ‘CALL ddict.cprog ($’ + tablenam + ‘$,$’ + region + ‘$,$’ + extract + ‘$,$’ + newdata + ‘$,$’ + editdata + ‘$,$’ + dbnum + ‘$)’

Visual PQL (2)

Linking SIR to other software - Winzip open inf /dsnvar=fnamein /iostat=ierr1 /write /lrecl=300 ifthen ( ierr1 ne 0) . write 'File not found ’ write // 'Unzipping file using winzip pql escape "C:\Program Files\WinZip\WINZIP32.EXE -e zipfn fnamein " write 'Back to sir !!!!!! ' c Build in a loop to check that input file now unzipped and ready c wait and repeat if not else . write 'File opened OK ' fnamein / endif

The ‘Data Dictionary’ Tablename Field names Data types Start / End Columns Sort ids Include / exclude & date flags Data Dictionary Generate Schemas Edit flags

Automatic schema generation • Source field names, labels data types read from HTML into data dictionary database • Program to create SIR variable names • strip out underscore characters ‘_’ • trims field length to eight • if duplicated replaces eighth character with a number • Program sets column positions based on field types • Data types converted from Ingres / Oracle to SIR • Procedure for Date Integers and true date fields

Load programs - the report files (1) Client_action record Read 78224996 \ fixed_width_data \ 2001_05 \north: 19156652 east: 6949927 west: 11515552 south: 11774525 northwest: 11352025 midlands: 12147512 oracle: 5328791 dodgy: 12

Load program generation (2) • Develop standard file naming & directory location • Read pre -processing report files to pick up numbers of records to be loaded • Detect any missing report files using iostat values • Write loading program code to pql files • Read these pql files back into the procedure file

Load generator -finishing touches (3) open inf6 / dsn = 'Q:\ps\50413\ProgGen\ldmidlands1.txt' / read / lrecl = 250 open ouf6 / dsn = 'Q:\ps\50413\ProgGen\ldmidlands2.txt' / write / lrecl = 250 write ( ouf6 ) 'PROCEDURE INITLOAD.<6>:T’ | add the procedure header line / 'call initload.dropall' | add call to drop all databases // 'call initload.connmids' / | add call to connect midlands loop . read (inf6, iostat = ierr6) textline(a250) | copy the rest of the program . if (ierr6 ne 0) exit loop . write (ouf6) textline end loop write (ouf6) 'call initload.dropmids' | add module call to drop database / 'END PROCEDURE’ | add the END PROCEDURE close (inf6) | close the input and output files close (ouf6) pread 'Q:\ps\50413\ProgGen\ldmidlands2.txt’ | Pread the SIR pql

Load program - Midlands database (4) call initload.dropall | drop all connected databases call initload.connmids | connect the midlands database call initload.update (\\Urmston\d\50413\, sir_input_files\lms\2001_04\Midlands\edited\, d:\50413\sirdb\client\Midlands\log\2001_04\, Midlands_client2001_04ORCIDsort, edt, 0.75, 1, 1, 181450, 0) call initload.update (\\Urmston\d\50413\, sir_input_files\lms\2001_04\Midlands\edited\, d:\50413\sirdb\client\Midlands\log\2001_04\, Midlands_client_action2001_04ORCIDsort, edt, 0.75, 2, 1, 236835, 0) ……. | more call to load data call initload.dropmids | disconnect the midlands database

Fuzzy matching • Linking data by best combinations of Nino, name,dob • Stripping non significant text • blanks, apostrophes, hyphens • Methods grown organically on case by case basis • Variety of scoring methods eg surname matches • HODGSON 4 points direct matches • HODGESON 3points * 0.9 for misaligned -> 6.7/8 =84% • Flexible generic modules applicable to all match types • Reporting of false positive and negative matches • Manual review of near /doubtful matches

Generic fuzzy matching - issues • Key fields different sizes and names across applications • Some key fields absent or with high % missing values • Quality of key fields varies widely • Matching varies from a handful to millions of records • Bringing Access and SIR together - Visual PQL ? • How to assess ‘false’ positives when no other common fields in data sets being matched • -> set of core procedures with options to bypass ?

General Issues • Continual growth in database & extract size • Data Irregularities • Embedded carriage returns in text fields • Date formats (American /English) • The team • Keeping the routine /repeated processing ‘interesting’ • Mushrooming similar code - keeping it generic /’black box’ • SIR • several large tabfiles required - largest currently 11Gb • Some retrievals crash with sir.exe error 1 time in 5 -why? • P4 /Windows 2000 mixed performance

New Developments (1) - Project web site • Management tool for display of project statistics • Focus for collecting project documents ( Word ) • FAQ’s • Glossary of abbreviations & acronyms ? • which variable should I use for this ? • How is leaving date derived ? • Where can I get the latest data dictionary for NDYP ? • What’s the ORC variable name for expct_start_date ? • Small sampling /extraction tools • Links to other related government sites

New Developments (2) - Processing • Full benefits data set arrives on Monday • Increasingly complex extracts in SPSS and SAS • Moving more of the data processing to Linux • Generic fuzzy matching tools • Adding other data , deprivation index , surveys • Sir 2002 - reading an 800 byte record into a single string • Secondary indexing /lookups - replace Case 0 • Better linking of the processing zipping /spss /excel /HTML • Questions

Larger than the Evaluation Database ? Presented By Adrian Hodgson

How big ?

How big ?

Presentation Transcript