1 / 60

Sampling Data for Better Understanding

Sampling Data for Better Understanding Understanding your data is vital. When faced with millions of records, sampling is necessary, and this session will show some great techniques and tricks to better understand your data or Data Mart tables. Foxwoods Resort & Casino Ledyard Ct

saima
Télécharger la présentation

Sampling Data for Better Understanding

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sampling Data for Better Understanding Understanding your data is vital. When faced with millions of records, sampling is necessary, and this session will show some great techniques and tricks to better understand your data or Data Mart tables. Foxwoods Resort & Casino Ledyard Ct August 22. 2008 Jim Morrow

  2. Topics • Limiting the records processed • Counting Records • Determining The Domain of Data • Finding the biggest / smallest values • Dropping the Outliers • Skipping records in your sample

  3. WHERE RECORDLIMIT EQ 1000 IF RECORDLIMIT EQ 1000 Stops the processing of the table request after 1000 records have passed the where or if test. WHERE READLIMIT EQ 1000 IF READLIMIT EQ 1000 Stops the processing of the table request after 1000 records read from the data source. This is not used for FOCUS Data Bases..

  4. TABLE FILE EMPDATA PRINT PIN LASTNAME FIRSTNAME END PIN LASTNAME FIRSTNAME 000000010 VALINO DANIEL 000000020 BELLA MICHAEL 000000030 CASSANOVA LOIS 000000040 ADAMS RUTH 000000050 ADDAMS PETER 000000060 PATEL DORINA \/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/ /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ 000000370 WANG JOHN 000000380 ELLNER DAVID 000000390 GRAFF ELAINE 000000400 LOPEZ ANNE 000000410 CONTI MARSHALL NUMBER OF RECORDS IN TABLE= 41 LINES= 41

  5. TABLE FILE EMPDATA PRINT PIN LASTNAME FIRSTNAME WHERE RECORDLIMIT EQ 10 END NUMBER OF RECORDS IN TABLE= 10 LINES= 10

  6. TABLE FILE EMPDATA PRINT PIN LASTNAME FIRSTNAME WHERE LASTNAME CONTAINS 'E' WHERE RECORDLIMIT EQ 10 END PIN LASTNAME FIRSTNAME 000000020 BELLA MICHAEL 000000060 PATEL DORINA 000000070 SANCHEZ EVELYN 000000100 ANDERSON TIM 000000130 CVEK MARCUS 000000140 WHITE VERONICA 000000150 WHITE KARL 000000190 MEDINA MARK 000000220 LEWIS CASSANDRA 000000260 ROSENTHAL KATRINA NUMBER OF RECORDS IN TABLE= 10 LINES= 10

  7. -SET &&LIMIT = 'WHERE RECORDLIMIT EQ 2 '; EX FUN_EMPDATA_REC_2.FEX -DEFAULT &&LIMIT = ' ' TABLE FILE EMPDATA PRINT PIN LASTNAME FIRSTNAME &&LIMIT END NUMBER OF RECORDS IN TABLE= 2 LINES= 2

  8. FILENAME=EMPDATA, SUFFIX=FOC SEGNAME=EMPDATA, SEGTYPE=S1 FIELDNAME=PIN, ALIAS=ID, FORMAT=A9, INDEX=I, $ /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ \/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/ FIELDNAME=HIREDATE, ALIAS=HDAT, FORMAT=YMD, $ DEFINE AREA/A13=DECODE DIV (NE 'NORTH EASTERN' SE 'SOUTH EASTERN' CE 'CENTRAL' WE 'WESTERN' CORP 'CORPORATE' ELSE 'INVALID AREA');$ END DBA=DBAUSER,$ USER= , ACCESS=RW, $ USER=LIM5, ACCESS=R, RESTRICT=VALUE,NAME=SYSTEM, VALUE=RECORDLIMIT EQ 5,$

  9. TABLE FILE FUN_EMPDATA PRINT PIN LASTNAME FIRSTNAME END PIN LASTNAME FIRSTNAME 000000010 VALINO DANIEL 000000020 BELLA MICHAEL 000000030 CASSANOVA LOIS 000000040 ADAMS RUTH 000000050 ADDAMS PETER 000000060 PATEL DORINA \/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/ /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ 000000370 WANG JOHN 000000380 ELLNER DAVID 000000390 GRAFF ELAINE 000000400 LOPEZ ANNE 000000410 CONTI MARSHALL NUMBER OF RECORDS IN TABLE= 41 LINES= 41

  10. SET USER = LIM5 TABLE FILE FUN_EMPDATA PRINT PIN LASTNAME FIRSTNAME WHERE LASTNAME CONTAINS 'E' END PIN LASTNAME FIRSTNAME 000000020 BELLA MICHAEL 000000060 PATEL DORINA 000000070 SANCHEZ EVELYN 000000100 ANDERSON TIM 000000130 CVEK MARCUS NUMBER OF RECORDS IN TABLE= 5 LINES= 5ACCESS LIMITED BY PASSWORD

  11. FILTER FILE EMPDATA ADD NAME=LIMIT10,DESC=LIMITING RECORDS TO 1O IF RECORDLIMIT EQ 10 END -RUN SET FILTER=* IN EMPDATA ON -RUN ? FILTER -RUN TABLE FILE EMPDATA PRINT PIN LASTNAME FIRSTNAME WHERE LASTNAME CONTAINS 'E' END

  12. PIN LASTNAME FIRSTNAME 000000020 BELLA MICHAEL 000000060 PATEL DORINA 000000070 SANCHEZ EVELYN 000000100 ANDERSON TIM 000000130 CVEK MARCUS 000000140 WHITE VERONICA 000000150 WHITE KARL 000000190 MEDINA MARK 000000220 LEWIS CASSANDRA 000000260 ROSENTHAL KATRINA SET FILE FILTER NAME DESCRIPTION--- -------- ----------- ---------------------------------* EMPDATA LIMIT10LIMITING RECORDS TO 1O NUMBER OF RECORDS IN TABLE= 10 LINES= 10ACCESS LIMITED BY FILTERS

  13. Topics • Limiting the records processed • Counting Records • Determining The Domain of Data • Finding the biggest / smallest values • Dropping the Outliers • Skipping records in your sample

  14. TABLE FILE EMPDATACOUNT SALARY/I8C SUM.SALARYEND NUMBER OF RECORDS IN TABLE= 41 LINES= 1 TABLE FILE EMPDATA SUM CNT.SALARY/I8C SALARY END NUMBER OF RECORDS IN TABLE= 41 LINES= 1

  15. DEFINE FILE EMPDATACTR/I8C WITH PIN = 1;ENDTABLE FILE EMPDATASUM CTR SALARYEND CTR SALARY 41 $2,029,200.00 NUMBER OF RECORDS IN TABLE= 41 LINES= 1

  16. Topics • Limiting the records processed • Counting Records • Determining The Domain of Data • Finding the biggest / smallest values • Dropping the Outliers • Skipping records in your sample

  17. Output (Report, Hold File or Save file) Data Source “Summarization” Processing COMPUTE WHERE & IF DEFINE WHERE TOTAL & IF TOTAL DEFINE FILE CAR CONTINENT/A20 = IF COUNTRY EQ 'JAPAN' THEN 'ASIA' ELSE 'EUROPE'; CTR/I2 WITH LENGTH = 1; END TABLE FILE CAR PRINT CONTINENT CAR RCOST DCOST COMPUTE PROFIT/D7 = RCOST - DCOST; WHERE COUNTRY IN ('ENGLAND' , 'JAPAN' , 'FRANCE') WHERE TOTAL PROFIT GT 1000 BY COUNTRY END

  18. COUNTRY CONTINENT CAR RETAIL_COST DEALER_COST PROFIT ENGLAND EUROPE JAGUAR 8,878 7,427 1,451 EUROPE JAGUAR 13,491 11,194 2,297 EUROPE JENSEN 17,850 14,940 2,910

  19. AA 56 AA 92 AA 19 AA 27 BB 14 BB 13 BB 82 BB 50 CC 03 CC 62 CC 10 CC 12 FILE=NUMBERS, SUFFIX=FIX SEGNAME=SEG01, SEGTYPE=S0, $ FIELD=ID , USAGE=A03 , ACTUAL=A03, $ FIELD=NBR , USAGE=I02L , ACTUAL=A02, $

  20. TABLEF FILE NUMBERS PRINT * END NUMBER OF RECORDS IN TABLE= 12 LINES= 12

  21. TABLE FILE NUMBERS SUM MIN.NBR MAX.NBR FST.NBR LST.NBR END

  22. TABLE FILE NUMBERS SUM MIN.NBR MAX.NBR FST.NBR LST.NBR BY ID END

  23. TABLE FILE NUMBERS SUM MIN.NBR MAX.NBR FST.NBR LST.NBR SUM MIN.NBR MAX.NBR FST.NBR LST.NBR BY ID END

  24. DEFINE FILE EMPDATACTR/I8C WITH PIN = 1;ENDTABLE FILE EMPDATASUM CTR COMPUTE AVE_SALARY/P12.2C = SALARY / CTR;SALARY MIN.SALARY MAX.SALARYBY DEPTON TABLE SUMMARIZEEND

  25. MIN MAX DEPT CTR SALARY SALARY SALARY AVE_SALARY ACCOUNTING 5 $283,300.00 $26,400.00 $83,000.00 56,660.00 ADMIN SERVICES 2 $56,200.00 $25,400.00 $30,800.00 28,100.00 CONSULTING 3 $126,300.00 $35,900.00 $49,500.00 42,100.00 CUSTOMER SUPPORT 4 $198,400.00 $19,300.00 $62,500.00 49,600.00 MARKETING 11 $570,700.00 $32,300.00 $62,500.00 51,881.82 PERSONNEL 5 $216,800.00 $25,000.00 $80,500.00 43,360.00 PROGRAMMING & DVLPMT 4 $182,300.00 $40,900.00 $49,500.00 45,575.00 SALES 7 $395,200.00 $30,500.00 $115,000.00 56,457.14 TOTAL 41 $2,029,200.00 $235,700.00 $533,300.00 49,492.68 NUMBER OF RECORDS IN TABLE= 41 LINES= 8

  26. DEFINE FILE EMPDATA CTR/I8C WITH SALARY = 1; TOT_SAL/D12.2M = SALARY; END TABLE FILE EMPDATA HEADING " CTR <20 <CTR " " TOT_SAL <20 <TOT_SAL " " AVE.TOT_SAL <20 <AVE.TOT_SAL " " MIN.TOT_SAL <20 <MIN.TOT_SAL " " MAX.TOT_SAL <20 <MAX.TOT_SAL " SUM CTR NOPRINT TOT_SAL NOPRINT AVE.TOT_SAL NOPRINT MIN.TOT_SAL NOPRINT MAX.TOT_SAL NOPRINT SUM CTR SALARY AVE.SALARY MIN.SALARY MAX.SALARY BY DIV END

  27. Topics • Limiting the records processed • Counting Records • Determining The Domain of Data • Finding the biggest / smallest values • Dropping the Outliers • Skipping records in your sample

  28. TABLE FILE EMPDATA COUNT PIN/I8C BY DEPT END DEPT PIN COUNT ACCOUNTING 5 ADMIN SERVICES 2 CONSULTING 3 CUSTOMER SUPPORT 4 MARKETING 11 PERSONNEL 5 PROGRAMMING & DVLPMT 4 SALES 7 NUMBER OF RECORDS IN TABLE= 41 LINES= 8

  29. TABLE FILE EMPDATA PRINT COMPUTE MESG/A10 = 'HIGHEST'; RANKED BY HIGHEST 5 SALARY END RANK SALARY MESG 1 $115,000.00 HIGHEST 2 $83,000.00 HIGHEST 3 $80,500.00 HIGHEST 4 $79,000.00 HIGHEST 5 $70,000.00 HIGHEST RANK SALARY MESG 1 $19,300.00 LOWEST 2 $25,000.00 LOWEST 3 $25,400.00 LOWEST 4 $26,400.00 LOWEST 5 $30,500.00 LOWEST TABLE FILE EMPDATA PRINT COMPUTE MESG/A10 = 'LOWEST'; RANKED BY LOWEST 5 SALARY END

  30. TABLE FILE EMPDATA PRINT COMPUTE MESG/A10 = 'HIGHEST'; RANKED BY HIGHEST 5 SALARY ON TABLE HOLD AS HI_LIST FORMAT ALPHA END TABLE FILE EMPDATA PRINT COMPUTE MESG/A10 = 'LOWEST'; RANKED BY LOWEST 5 SALARY ON TABLE HOLD AS LO_LIST FORMAT ALPHA END TABLE FILE HI_LIST SUM SALARY BY MESG AS '' ACROSS RANK AS '' MORE FILE LO_LIST END

  31. 1 2 3 4 5 HIGHEST $115,000.00 $83,000.00 $80,500.00 $79,000.00 $70,000.00 LOWEST $19,300.00 $25,000.00 $25,400.00 $26,400.00 $30,500.00 NUMBER OF RECORDS IN TABLE= 41 LINES= 41(BEFORE TOTAL TESTS) NUMBER OF RECORDS IN TABLE= 41 LINES= 41(BEFORE TOTAL TESTS) NUMBER OF RECORDS IN TABLE= 10 LINES= 2

  32. Topics • Limiting the records processed • Counting Records • Determining The Domain of Data • Finding the biggest / smallest values • Dropping the Outliers • Skipping records in your sample

  33. -SET &HI = 2; -SET &LO = 2; DEFINE FILE EMPDATA TOT_CTR/I8 WITH PIN = 1; END TABLE FILE EMPDATA SUM TOT_CTR COMPUTE BOT/I8 = TOT_CTR - &LO; BY DIV PRINT COMPUTE CTR/I8 = IF DIV NE LAST DIV THEN 1 ELSE LAST CTR + 1; COMPUTE HILO/A4 = IF CTR LE &HI THEN 'HIGH' ELSE IF CTR GT BOT THEN 'LOW' ELSE 'N'; COMPUTE JOIN_KEY/A20 = DIV || EDIT(FST.SALARY); COMPUTE OUTLIER/A3 = 'YES'; BY DIV BY SALARY WHERE TOTAL HILO NE 'N'; ON TABLE HOLD AS HOLDOUTL FORMAT FOCUS INDEX JOIN_KEY END

  34. DIV TOT_CTR BOT SALARY FOCLIST CTR HILO SALARY JOIN_KEY OUTLIER CE 10 8 $25,000.00 1 1 HIGH $25,000.00 CE000000025000 YES CE 10 8 $25,400.00 2 2 HIGH $25,400.00 CE000000025400 YES CE 10 8 $33,300.00 3 3 N $33,300.00 CE000000033300 YES CE 10 8 $40,900.00 4 4 N $40,900.00 CE000000040900 YES CE 10 8 $43,000.00 5 5 N $43,000.00 CE000000043000 YES CE 10 8 $45,000.00 6 6 N $45,000.00 CE000000045000 YES CE 10 8 $49,500.00 7 7 N $49,500.00 CE000000049500 YES CE 10 8 $54,100.00 8 8 N $54,100.00 CE000000054100 YES CE 10 8 $62,500.00 9 9 LOW $62,500.00 CE000000062500 YES CE 10 8 $115,000.00 10 10 LOW $115,000.00 CE000000115000 YES CORP 8 6 $26,400.00 1 1 HIGH $26,400.00 CORP000000026400 YES CORP 8 6 $32,400.00 2 2 HIGH $32,400.00 CORP000000032400 YES CORP 8 6 $35,200.00 3 3 N $35,200.00 CORP000000035200 YES CORP 8 6 $55,500.00 4 4 N $55,500.00 CORP000000055500 YES CORP 8 6 $62,500.00 5 5 N $62,500.00 CORP000000062500 YES CORP 8 6 $62,500.00 6 6 N $62,500.00 CORP000000062500 YES CORP 8 6 $79,000.00 7 7 LOW $79,000.00 CORP000000079000 YES CORP 8 6 $83,000.00 8 8 LOW $83,000.00 CORP000000083000 YES NE 7 5 $19,300.00 1 1 HIGH $19,300.00 NE000000019300 YES NE 7 5 $32,300.00 2 2 HIGH $32,300.00 NE000000032300 YES NE 7 5 $39,000.00 3 3 N $39,000.00 NE000000039000 YES NE 7 5 $43,600.00 4 4 N $43,600.00 NE000000043600 YES NE 7 5 $52,000.00 5 5 N $52,000.00 NE000000052000 YES /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/ Eliminated by WHERE TOTAL HILO NE 'N'; Eliminated by WHERE TOTAL HILO NE 'N'; Eliminated by WHERE TOTAL HILO NE 'N';

  35. JOIN CLEAR * JOIN JOIN_KEY WITH SALARY IN EMPDATA TO JOIN_KEY IN HOLDOUTL AS JEMPOUT END EMPDATA 01 S1 ************** *PIN **I *LASTNAME ** *FIRSTNAME ** *MIDINITIAL ** *************** ************** I SEG02 02 I KU .............. :SALARY : :FOCLIST : :CTR : :JOIN_KEY :K :............: I SEG01 03 I KLU .............. :DIV : :TOT_CTR : :BOT : :............:

  36. DEFINE FILE EMPDATA JOIN_KEY/A20 WITH SALARY = DIV || EDIT(SALARY); END TABLE FILE EMPDATA " WITH OUTLIERS REMOVED" SUM MIN.SALARY MAX.SALARY BY DIV WHERE OUTLIER NE 'YES' END

  37. DEFINE FILE EMPDATA TOT_CTR/I8 WITH PIN = 1; END TABLE FILE EMPDATA SUM TOT_CTR COMPUTE BOT/I8 = TOT_CTR - &LO; BY DIV PRINT COMPUTE CTR/I8 = IF DIV NE LAST DIV THEN 1 ELSE LAST CTR + 1; COMPUTE HILO/A4 = IF CTR LE &HI THEN 'HIGH' ELSE IF CTR GT BOT THEN 'LOW' ELSE 'N'; COMPUTE JOIN_KEY/A20 = DIV || EDIT(FST.SALARY); COMPUTE OUTLIER/A3 = 'YES'; BY DIV BY SALARY WHERE TOTAL HILO NE 'N'; ON TABLE HOLD AS HOLDOUTL FORMAT FOCUS INDEX JOIN_KEY END JOIN CLEAR * JOIN JOIN_KEY WITH SALARY IN EMPDATA TO JOIN_KEY IN HOLDOUTL AS JEMPOUT END DEFINE FILE EMPDATA JOIN_KEY/A20 WITH SALARY = DIV || EDIT(SALARY); END TABLE FILE EMPDATA " WITH OUTLIERS REMOVED" SUM MIN.SALARY MAX.SALARY BY DIV WHERE OUTLIER NE 'YES' END

  38. WITH OUTLIERS REMOVED DIV MIN MAX SALARY SALARY CE $33,300.00 $54,100.00 CORP $35,200.00 $62,500.00 NE $39,000.00 $52,000.00 SE $49,500.00 $50,500.00 WE $33,000.00 $54,100.00 NUMBER OF RECORDS IN TABLE= 41 LINES= 41(BEFORE TOTAL TESTS) NUMBER OF RECORDS IN TABLE= 20 LINES= 5

  39. Topics • Limiting the records processed • Counting Records • Determining The Domain of Data • Finding the biggest / smallest values • Dropping the Outliers • Skipping records in your sample

  40. DEFINE FILE EMPDATA REC_NBR/I5 WITH PIN = LAST REC_NBR + 1; TAKE/I5 WITH PIN = IMOD(REC_NBR,10,'I5'); END TABLE FILE EMPDATA PRINT REC_NBR TAKE LASTNAME SALARY END

  41. DEFINE FILE EMPDATA REC_NBR/I5 WITH PIN = LAST REC_NBR + 1; TAKE/I5 WITH PIN = IMOD(REC_NBR,10,'I5'); END TABLE FILE EMPDATA PRINT REC_NBR TAKE LASTNAME SALARY WHERE TAKE EQ 1 END

  42. SET DMPRECISION = 3 -RUN -SET &DIV = 10000; -SET &PER = 100 * (1/&DIV); -SET &PER = &PER || '%'; DEFINE FILE EMPDATA REC_NBR/I5 WITH PIN = LAST REC_NBR + 1; TAKE/I5 WITH PIN = IMOD(REC_NBR,&DIV,'I5'); END TABLE FILE EMPDATA " SAMPLE &PER OF FILE " PRINT PIN LASTNAME FIRSTNAME WHERE TAKE EQ 1 END SAMPLE 0.01% OF FILE PIN LASTNAME FIRSTNAME 000000010 VALINO DANIEL NUMBER OF RECORDS IN TABLE= 1 LINES= 1

  43. -SET &NUM = 2; -SET &DIV = 3; -REPEAT REPEND FOR &I FROM 0 TO 9 STEP 1 SET DMPRECISION = &I -RUN -SET &QUOT = &NUM / &DIV; -TYPE DMPRECISION IS &I &NUM / &DIV = &QUOT -REPEND DMPRECISION IS 0 2 / 3 = 1 DMPRECISION IS 1 2 / 3 = 0.7 DMPRECISION IS 2 2 / 3 = 0.67 DMPRECISION IS 3 2 / 3 = 0.667 DMPRECISION IS 4 2 / 3 = 0.6667 DMPRECISION IS 5 2 / 3 = 0.66667 DMPRECISION IS 6 2 / 3 = 0.666667 DMPRECISION IS 7 2 / 3 = 0.6666667 DMPRECISION IS 8 2 / 3 = 0.66666667 DMPRECISION IS 9 2 / 3 = 0.666666667

  44. The PRDNOR and PRDUNI functions generate reproducible random numbers: {PRDNOR|PRDUNI}(seed, outfield) PRDNOR generates reproducible double-precision random numbers that are normally distributed with an arithmetic mean of 0 and a standard deviation of 1. PRDUNI generates reproducible double-precision random numbers uniformly distributed between 0 and 1 The RDNORM and RDUNIF functions generate random numbers: {RDNORM|RDUNIF}(outfield) RDNORM generates double-precision random numbers that are normally distributed with an arithmetic mean of 0 and a standard deviation of 1. RDUNIF generates double-precision random numbers uniformly distributed between 0 and 1 Reference: FOCUS for S/390 Using Functions Version 7.2 DN1001140.1101

More Related