230 likes | 363 Vues
Discover innovative sorting strategies in data management with insights from Jerry Le Breton and Doug Lean. This presentation highlights the importance of understanding your data before sorting, addressing common sorting methods in SAS and SQL. Learn how to minimize unnecessary sorting, utilize PROC SORT effectively, and maintain data order. Explore techniques to handle duplicates, cases, and specific variable sorting to enhance performance. Adopt smarter data handling practices for better efficiency and clarity in your operations.
E N D
Smarter Sorts Beyond the Obvious Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS) SAUSAG 69 – 20 Feb 2014
Sorting –The Obvious First • Why Sort ? • “Data and information is almost always presented in a sorted or structured way”
Sorting - The Obvious First procsortdata=claims; by claim client; Sort puts your records in order - BY the values of the variables you list. • Its important to know your data • How many variables • How many distinct data values for each SAUSAG 69 – 20 Feb 2014
Sorting – Do You Need To? procsortdata=claims; by claim; Proc tabulate ...; classclaim; ... • Some PROCS do their own sorting: • TABULATE • MEANS • REPORT • SQL • (which can run out of memory for really big data sets) An unnecessary SORT SAUSAG 69 – 20 Feb 2014
Sorting – Do You Need To? • Only use PROC SORT before REPORT, TABULATE, MEANS if there’s another reason later. • For PROC MEANS substitute BY with CLASS e.g. PROC MEANS NWAY; CLASS x y z; Is similar to PROC SORT; BY x y z; PROC MEANS; BY x y z; And saves significant time by avoiding the SORT SAUSAG 69 – 20 Feb 2014
Sort Only What You Need Sort just the rows you want… procsortdata=claims out=Sorted_claims; where client =: 'A'; by claim; … and just the columns you want… procsortdata=claims(keep = c:) out=Sorted_claims; by claim; Leaving out unwanted rows and columns can produce dramatic performance improvements. SAUSAG 69 – 20 Feb 2014
Sorting – Proc Sort vsProc SQL /* SORT Procedure */ procsortdata=claims; byclient claim; run; /* SQL Procedure */ procsql; create table claims as select * from claims order by client claim; quit; • Both will sort your data. • No significant performance difference. • Choose according to clarity, functional requirement and efficiency. • Make it as clear and simple as possible! SAUSAG 69 – 20 Feb 2014
Sorted Status of a Data Set procsortdata=claims; by claim client; Sort status is saved as part of a SAS data set. Sort Information Sortedby CLAIM CLIENT Validated YES Character Set ANSI So SAS won’t waste time re-sorting if it’s already in the required order. SAUSAG 69 – 20 Feb 2014
Setting Sorted Status of a Data Set dataclient_claims (sortedby = client ); merge clients claims; by client ; If you know a data set is sorted, say so with the SORTEDBY= option!. Sort Information Sortedby CLIENT Validated NO Character Set ANSI So SAS won’t waste time re-sorting later. SAUSAG 69 – 20 Feb 2014
Presorted or Notsorted procsortdata=claims out=sorted presorted; by claim; • PRESORTED option for when data probably sorted! • SAS will check and only sort if necessary. procprintdata=grouped_claims; by claim NOTSORTED; • No need to sort if data is grouped BY the required variable – it doesn’t matter its NOTSORTED • (you just have to say so). SAUSAG 69 – 20 Feb 2014
Sorting and Maintaining Order procsortdata=claims; by claim ; • By default, SAS maintains the original order of records within a BY group. procsortdata=claims noequals; by claim ; • Using the NOEQUALS option means SAS won’t necessarily retain the original ordering. • More efficient but, directly affects the results of using NODUPKEY SAUSAG 69 – 20 Feb 2014
Sorting Duplicates procsortdata=claims out=no_duplicates nodupkey; by claim; NODUPKEY effectively keeps the first record of any duplicates. procsortdata=claims out=no_duplicates dupout=dups nodupkey; by claim; DUPOUT= puts the duplicates to a separate table. SAUSAG 69 – 20 Feb 2014
Separating Unique & Duplicate Rows procsortdata=claimsout=sorted ; by claim; run; dataunique_claims dup_claims; set sorted; by claim; iffirst.claim and last.claimthen outputunique_claims; else outputdup_claims; run; It works, but needs an extra pass of the data. SAUSAG 69 – 20 Feb 2014
Separating Unique & Duplicate Rows- the smarter way NOUNIQUEKEY ensures no records with a unique key are written to the OUT= table. procsortdata=claims out=duplicates uniqueout=uniques nouniquekey; by claim; run; …and the UNIQUEOUT= option directs the unique records to a separate table SAUSAG 69 – 20 Feb 2014
Sorting – Case Insensitive procsortdata=names out=simply_sorted; by name; Upper case letters are before lower case in the ASCII collating sequence. data names2; set names; upcase_name = upcase(name); procsortdata=names2 out=upcase_sorted(keep=name); byupcase_name; Creating an upper (or lower) case copy of the variable is the old solution. SAUSAG 69 – 20 Feb 2014
Sorting – Case Insensitive - Smarter SORTSEQ option specifies the collating sequence (ASCII/EBCDIC/other languages) or, LINGUISTIC option modifies the current collating sequence. procsortdata=names out=linguistic_sorted sortseq=linguistic; by name; The affect is to make the sort case insensitive. SAUSAG 69 – 20 Feb 2014
Sorting – Case Insensitive – with SQL PROC SQL allows the use of functions in the Order By (and other) clauses. procsql; createtablesql_sortedas select * from names orderbyupcase(name); The result is different from Proc SORT using the sorteq=linguistic. SAUSAG 69 – 20 Feb 2014
Sorting Out Spaces procsortdata=names out=simply_sorted; by name; A standard sort is obviously no use. datanames_temp; set names; temp_name = upcase(compress(name)); run; procsortdata=names_temp out=temp_sorted(keep=name); bytemp_name; Creating another variable for sorting, without spaces, is the old solution.
Sorting Out Spaces procsql; createtablesql_sortedas select * from names orderbyupcase(compress(name)); Proc SQL can do it too. procsortdata=namesout=alt_handling_sorted sortseq = linguistic(alternate_handling = shifted); by name; Proc SORT can too! This sub-option of the LINGUISTIC sortseq option, effectively ignores spaces as well as being case-insensitive. SAUSAG 69 – 20 Feb 2014
Sorting by Numbers Sorting text with numeric prefixes e.g. student id and name … procsortdata=students out=simply_sorted; by student; … results in nothing useful! SAUSAG 69 – 20 Feb 2014
Sorting by Numbers datastudents_temp; set students; student_num = input(scan(student,1), 2.); run; procsortdata=students_temp out=temp_sorted(keep=student); bystudent_num; An extra data step can create a numeric variable to sort with (as can SQL of course) procsql; createtablesql_sortedas select * from students orderby input(scan(student,1), 2.); SAUSAG 69 – 20 Feb 2014
Sorting by Numbers procsortdata=students out=num_collation_sorted sortseq = linguistic (numeric_collation=on); by student; The numeric_collation sub-option of the LINGUISTIC sortseq option, sorts by the numeric values that prefix the variable values. SAUSAG 69 – 20 Feb 2014
Questions? • Did you learn something new from this presentation? SAUSAG 69 – 20 Feb 2014