COP 3540 Data Structures with OOP Chapter 7 - Part 1 Advanced Sorting
Advanced Sorting • Two sorts we will cover first: • Shell Sort – an O(n(log₂n)²) sort in general, one that 'can approach' O(n) performance! • Partitioning – an O(n) operation that is the heart of the O(n log₂n) QuickSort. • Then, we'll cover the QuickSort itself.
Recall how the Insertion Sort worked. • We marked a spot in the array and assumed all elements 'to the left' of it were sorted. • We extracted the element at that spot. • We then • compared the extracted element with the elements 'to the left' and • 'inserted' it into its proper place, • shifting elements to the right as needed to make room for the inserted element and fill the vacated spot.
Approach that helped us: • We helped ourselves by • starting with a single element to the left – so we knew 'that' element was sorted (certainly sorted unto itself). • Then we proceeded: • Slowly the sorted elements to the 'left' of the marked element grew in number, as new numbers found their proper place in the subarray to the left, while the unsorted elements to the right diminished in number. (A minimal code sketch follows.)
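A minimal Java sketch of the insertion sort being recalled here (a standalone method; the array name and method name are illustrative, not from the text's class):

// A minimal insertion sort sketch (illustrative names, not the text's class).
public static void insertionSort(long[] a)
{
   for (int outer = 1; outer < a.length; outer++) // elements left of 'outer' are sorted
   {
      long temp = a[outer];                       // extract the marked element
      int inner = outer;
      while (inner > 0 && a[inner-1] >= temp)     // shift larger elements right
      {
         a[inner] = a[inner-1];
         inner--;
      }
      a[inner] = temp;                            // insert into the vacated spot
   }
}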
Potential Problems with the Insertion Sort • Now, what happens if the new number to be sorted is very small (or very large) and our sort is ascending (or descending)? • This may require a large number of 'copies' to the right to make room for this new element. • It can require a number of copies close to 'n', in fact. • The average number of copies is n/2. • For n elements to be sorted, at an average of n/2 copies per element, we have n·(n/2), or n²/2, copies. • (For n = 1,000 that is already about 500,000 copies.) • That may result in a very inefficient sort. • This is why the insertion sort is an O(n²) sort. • It is this number of copies (comparing and shifting) that degrades its performance.
Shell Sort Approach • We want to reduce these numbers of large shifts. • Shell sort does this by sorting a very small subset of numbers at a time – like three or four: • where the numbers themselves might be large distances apart (as in a large array), • and it sorts them with respect to each other. • By sorting a small number of numbers, very small (or very large) values can be put much more nearly 'in place' much more quickly than with other approaches. • How is this done?
Shell Sort uses the notion of a 'computed Gap' • The Shell Sort uses a computed 'gap' between numbers, represented by an 'h', as the distance between the numbers in each subset to be sorted. • 1. It sorts all numbers (say, in the array of numbers) with the same 'h' (gap) – • like numbers eight apart, or four apart… • sorting these numbers with respect to each other. • 2. Then, after doing this, the algorithm reduces the gap (the distance) to a smaller number, like maybe 4 apart. • 3. Ultimately the gap reaches size 1; then the algorithm '1-sorts' the array, which is just an insertion sort.
Example • Consider: sort three elements at a time with respect to each other, where the numbers are some 'h' distance apart. • For array size n = 10 and gap size h = 4, we have four sub-arrays (we call this a 4-sort): • Indices (0,4,8), (1,5,9), (2,6) and (3,7). Each set is sorted with respect to itself. (Note: all ten elements participate!) • The sub-arrays are interleaved, but, again, each is sorted with respect to itself. • (Note: the integers are not yet in their final spots. A concrete sketch follows.)
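A concrete sketch of one such h-sort pass, under the assumptions above (the helper name hSort and the sample data are mine, not the text's):

// h-sort: insertion-sort each interleaved subarray whose elements are h apart.
public static void hSort(long[] a, int h)
{
   for (int outer = h; outer < a.length; outer++)
   {
      long temp = a[outer];
      int inner = outer;
      while (inner > h-1 && a[inner-h] >= temp)   // shift within the subarray
      {
         a[inner] = a[inner-h];
         inner -= h;
      }
      a[inner] = temp;
   }
}

// Example: with a = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0}, hSort(a, 4) yields
// {1, 0, 3, 2, 5, 4, 7, 6, 9, 8}: the subarrays at indices (0,4,8),
// (1,5,9), (2,6), (3,7) are each sorted, but the whole array is not.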
Consider the Improved Performance! • Recall again the Insertion Sort: • very efficient for arrays that are nearly sorted (fewer swaps and less movement), and yet • very inefficient (due to shifts and copies) if the data are very unsorted. • Particularly true for very large / very small numbers. • Shell sort does 'n-sorting': • it capitalizes on the initial position of elements, especially if they are far from where they might ultimately end up. • It brings numbers more quickly to their final position (or nearer to it). • The algorithm moves elements that may be very far apart much closer to their final position more quickly, thus reducing copying, shifting, and swapping! • Shell Sort can approach O(n) performance: much better than O(n²)!
What about Larger Arrays? Gap Size? • We use a carefully researched algorithm to compute an optimum gap size. • Don Knuth developed a 'recursive' relationship: • h = 3*h + 1 to generate the starting gap, and then subsequent gaps at • h = (h-1)/3. • (Note the 'recursion' in the formulas themselves: each uses the current value of h to compute the new value of h.) • These h-values are referred to as the • interval sequence or gap sequence • and are recursively computed as functions of h. • In more detail:
Don Knuth's algorithm will start with a 3-sort; that is, sort three numbers some distance apart. As Don Knuth's research reveals (the algorithm is a few slides ahead), for an array of size > 364 and < 1093, 3-sort with a gap size of 364; after that pass, use a gap size of 121; then gap size = 40; steadily decreasing…

Develop the initial gap size recursively by computing h: h is determined by repeatedly computing h = h*3 + 1 until h <= nElems/3 is false. Computing h this way, we see that h increases from 1 to 4 to 13 to 40 to 121 to 364 to….

Gap sizes:
h      3*h+1
1      4
4      13
13     40
40     121
121    364
364    1093
1093   3280

Once the original gap is determined, the sort continues, and the algorithm steadily reduces the gap h from 364 to 121 … until h = 1. So for array size > 364 and < 1093, the initial gap = 364, etc.
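A small sketch of generating this interval sequence (variable names follow the shellSort() code shown shortly; nElems = 1000 is just an example):

int nElems = 1000;                // example array size
int h = 1;
while (h <= nElems/3)             // grow the sequence: 1, 4, 13, 40, 121, 364
   h = h*3 + 1;                   // loop stops with h = 364 for nElems = 1000
while (h > 0)
{
   System.out.println("gap = " + h);  // prints 364, 121, 40, 13, 4, 1
   h = (h-1) / 3;                 // shrink back down the same sequence
}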
Algorithm (covered in previous slide) • The algorithm first uses a short loop to generate the first (initial) value of h. • Then, once we have an initial value of h: • additional values of h are recursively computed, depending on the size of the array to be sorted. • The gap then starts with the largest h-value. • For a 1000-element array, our initial gap size is 364. • After each pass, we successively decrease the gap using the formula h = (h-1)/3, as shown.
Note: • As it turns out, the algorithm actually sorts the first two elements of each group for a given gap first; then it goes back and sorts the full three-element groups. This results in better performance. • You will see this if you look carefully at the algorithm.
public void shellSort()
{
   int inner, outer;
   long temp;

   int h = 1;                                // find initial value of h
   while (h <= nElems/3)                     // COMPUTE GAP SIZE
      h = h*3 + 1;                           // (1, 4, 13, 40, 121, 364, ...)
   // Value of h depends on original size of array, nElems:
   // the loop stops at the first h exceeding nElems/3.

   while (h > 0)                             // for a 1000-element array, h starts at 364
   {
      for (outer = h; outer < nElems; outer++)  // h-sort the structure...
      {                                         // outer runs from h to nElems-1, by one
         temp = theArray[outer];
         inner = outer;
         while (inner > h-1 && theArray[inner-h] >= temp)
         {
            theArray[inner] = theArray[inner-h];
            inner -= h;
         } // end while
         theArray[inner] = temp;
      } // end for
      h = (h-1) / 3;                         // computes new gap: decreases h
   } // end while (h > 0)
} // end shellSort()
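A hedged usage sketch: the method above references the fields theArray and nElems, so it needs a small wrapper class (the class name and test data below are illustrative, not from the text):

public class ShellSortDemo
{
   private long[] theArray;
   private int nElems;

   public ShellSortDemo(long[] data)
   {
      theArray = data;
      nElems = data.length;
   }

   // ... shellSort() from above goes here ...

   public static void main(String[] args)
   {
      long[] data = {20, 10, 60, 40, 30, 50, 70, 0};
      new ShellSortDemo(data).shellSort();
      System.out.println(java.util.Arrays.toString(data));
      // prints [0, 10, 20, 30, 40, 50, 60, 70]
   }
}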
Google: Shell Sort Applet • Google: applet Lafore • You will get a number of applet choices. • Select and enjoy
Demo of Shell Sort • Do n = 12 and notice how the gap varies across the bars. • You can see when h goes from 4 to 1. • You can see when it compares two in the interval … then three; then 1-sorts. • Do a 100-element sort. • It starts with h = 40. See how it compares two of the three in each interval until only intervals of two are left. • There is a larger number of intervals when it goes to h = 13. • It goes to h = 4, with more intervals yet. • Finally, h = 1. • Do this.
Shell Sort - Evaluation • Good for medium-sized arrays, up to a few thousand items. • Shell Sort – O(n(log₂n)²) – is not as fast as the Quick Sort's O(n log₂n) (coming soon). • Not so good for large files, but • easy to implement and • requires very little extra space. • All sorts have a 'worst case' performance. • For Shell Sort, the • worst case is not much worse than the average performance, so this is good! • (The worst case is very different from the average case in a Quick Sort.)
Final Remarks on Shell Sort • Other gap sequences are available. • Many alternatives exist; you can experiment… • Ultimately, the sequence must end with a gap of 1. • This forces the last pass to be an insertion sort. • Guideline: • Gaps should be relatively prime. • Note that the Shell Sort numbers presented are not all prime (4, 40, …). • This led to some earlier inefficiencies. • Experiments on Shell Sort yield performance mostly between O(n^(3/2)) and O(n^(7/6)) • – or from almost O(n²) down to almost O(n)! • Quite a difference, and the difference is realized as n increases, which makes sense.
Partitioning • Partitioning is key to QuickSort thinking. • Partitioning divides data into two groups depending on the value of a key. • E.g., divide students into two groups: GPA < 3.0; GPA >= 3.0. • (Incidentally, why is a GPA of 3.0 important??) • We select a Pivot Value: • the value used to separate the data items into two groups: • we end up with data < pivot value and data >= pivot value.
Pivot Values • Note: the pivot can be any key value. • It need not be a midpoint or a value 'half-way.' • It would be nice if the pivot were the half-way point, but we have no way of knowing… • Later we will see how the choice of the pivot impacts performance! • The pivot value is used to separate the array into a left side and a right side. • Ideally, we'd 'like' the sub-arrays to be roughly the same size, and we will work toward that reality.
Run the Partition Algorithm to Build Sub-Arrays • Once a pivot value is selected, we run the partition algorithm. • Once run, • data less than the pivot 'belongs' to the left side of the array (whatever number of elements may be on the left), and • data >= the pivot value belongs to the right side, however many elements are on the right side. • Note: once partitioning is run, the data is NOT sorted, • but the items are a lot 'closer' to their final positions… • and the array is partitioned based on the pivot value.
The Partitioning Algorithm • Pick a pivot value… (more later) • Start with an index at the left end of the partition. • Let's call it the left scan. • Move toward the right. • Compare each element to the pivot value. • If an element is less than the pivot value, leave it alone; move to the right. • Advance to the right until an element is >= the pivot value, then Stop. • Start with an index at the rightmost position on the right side. • Let's call it the right scan. • Move toward the left. • Compare each element to the pivot value. • If an element is >= the pivot value, leave it alone; move to the left. • Advance to the left until an element is < the pivot value, then Stop. • Swap the two values. • Iterate (back on the left, then the right) until the left and right scans are looking at the same entry. • (A code sketch follows.)
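A sketch of these scan-and-swap steps in Java (a standalone version; the method name, parameters, and bounds guards are modeled on Lafore's partitionIt() but are assumptions here):

// Partition a[left..right] around pivot; returns the boundary index.
public static int partition(long[] a, int left, int right, long pivot)
{
   int leftPtr  = left - 1;     // left scan starts just before the partition
   int rightPtr = right + 1;    // right scan starts just after it
   while (true)
   {
      while (leftPtr < right && a[++leftPtr] < pivot)
         ;                       // left scan: advance until an element >= pivot
      while (rightPtr > left && a[--rightPtr] >= pivot)
         ;                       // right scan: retreat until an element < pivot
      if (leftPtr >= rightPtr)   // scans have met or crossed: done
         break;
      long temp = a[leftPtr];    // swap the two out-of-place values
      a[leftPtr] = a[rightPtr];
      a[rightPtr] = temp;
   }
   return leftPtr;               // first index of the right (>= pivot) group
}

// Example: partition({5, 9, 2, 8, 3, 7}, 0, 5, 6) returns 3,
// leaving {5, 3, 2 | 8, 9, 7} -- partitioned, but not sorted.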
Partition.html • Google: applet Lafore • Run with n = 12 using various orderings… • Run with n = 40. Notice the partition first, and then the final ordering… • Note: after running the partitioning algorithm, the data are not totally sorted – but they are a good bit closer.
Partitioning and the Pivot Value • Note: partitioning is not stable. • As elements on one side are moved to the other side of the pivot value, they are NOT necessarily left in the same relative positions in the 'new' partition! • In fact, they tend to end up in reverse order. • Further, the number of elements on each side need not be the same either – it depends on the pivot value. • Very likely, there is NOT the same number of elements on each side of the pivot.
One (of several) Problems with Partitioning • 1. What if a poor pivot value were chosen such that all elements were < the pivot value? • The left scan's index would keep advancing, never finding an element >= the pivot. • We would end up with an array-index-out-of-bounds exception. • Ditto the other way (all elements >= the pivot value). See the code below.

while (leftPtr < right && theArray[++leftPtr] < pivot)
   ; // nop

• Clearly – as in any program that is to be robust – there must be checks (like leftPtr < right above) to keep the scans in bounds.
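For completeness, the right scan needs the matching guard (a sketch; rightPtr and left follow the same naming pattern as above):

while (rightPtr > left && theArray[--rightPtr] >= pivot)
   ; // nop -- the rightPtr > left test keeps the scan
     //        from running off the left end of the array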
Efficiency of the Partition • The algorithm is pretty efficient, too. • It runs in O(n) time. • The pointers move in from opposite ends, scanning and swapping at a constant rate. • If n were doubled to 2n, the algorithm would take roughly twice as long. • Thus the algorithm operates in O(n) time – meaning time is proportional to the number of items being partitioned.
Efficiency of the Partitioning Algorithm • Non-random data yields worse results! • If the data is inversely ordered, then every pair will be swapped: n/2 swaps – the maximum possible! • (Repeat such n-element passes many times, as a sort built on partitioning might with poorly chosen pivots, and the total work can approach n²/2. Poor!) • Random data yields fewer than n/2 swaps. • Some elements will already be in the right place. • On average, for random data, about half of the maximum number of swaps will take place. • Regardless of random / non-random data, either way a single partitioning pass runs in time proportional to n.