Array Data Structures & Algorithms

Array Data Structures & Algorithms One-Dimensional and Multi-Dimensional Arrays, Searching & Sorting, Parameter passing to Functions

Array Data Structures & Algorithms • Concepts of Data Collections • Arrays in C • Syntax • Usage • Array based algorithms

Concepts of Data Collections Sets, Lists, Vectors, Matrices and beyond

Sets & Lists • The human language concept of a collection of items is expressed mathematically using Sets • Sets are not specifically structured, but structure may be imposed on them • Sets may be very expressive but not always easily represented • Example: The set of all human emotions. • Lists are representations of sets that simply list all of the items in the set • Lists may be ordered or unordered. • Some useful lists include data values and the concept of position within the list. • Example representations: { 0, 5, -2, 4 } and { TOM, DICK }, with positions labelled First and Second

Vectors, Matrices and beyond • Many examples of lists arise in mathematics and they provide a natural basis for developing programming syntax and grammar • Vectors are objects that express the notion of direction and size (magnitude) • The 3-dimensional distance vector D has components { Dx, Dy, Dz } along the respective x, y and z axes, with magnitude $D = \sqrt{D_x^2 + D_y^2 + D_z^2}$ • We use descriptive language terminology such as • The x’th component of D • ... Or, D-sub-x

Vectors, Matrices and beyond • Higher dimensional objects are often needed to represent lists of lists. • Matrices are one example that sometimes can be represented in tabular form • 2-dimensional tables (row, column) • 3- and higher are hard to visualize, but are meaningful • Beyond this, mathematics works with objects called tensors and groups (and other entities), and expresses the access to object members (data values) using properties and indices based on topologies (loosely put, shape structures)

Vectors, Matrices and beyond • In this course we focus on an introduction to the basic properties and algorithms associated with vectors and matrices, and lists (later) • These objects are mathematically defined as ordered, structured lists • The ordering derives from the property that each element of the list exists at a specific position • Enumerated starting at 0 and incrementing by 1 to a maximum value • The structure determines the mechanism and method of access to the list and its member elements

Arrays in C Syntax Memory structure Usage Functions

Arrays in C • Syntax • How to declare and reference 1D arrays using subscript notation • Memory structure • How is RAM allocated – the meaning of direct access through subscripting • Usage • Some simple illustrative examples • Functions • How to deal with arrays as function arguments

Arrays in C - Syntax • Consider the array declarations • int StudentID [ 1000 ] ; float Mark [ 1000 ] ; char Name [ 30 ] ; • Each of the declarations defines a storage container for a specific maximum number of elements • Up to 1000 Student (integer) ID values • Up to 1000 (real) Marks • A Name of up to 30 characters (29 if the Name is stored as a C string, since one element is needed for the terminating '\0')

Arrays in C - Syntax • Each array is referred to by its declared name • float A [ 100 ] ; ... where A refers to the entire collection of 100 storage locations • On the other hand, each separate element of A is referenced using the subscript notation • A[0] = 12.34 ; /* assign 12.34 to the first element */ /* NOTE: subscripts always start from 0 */ • A[K] = 0.0 ; /* assign 0 to the (K+1)’th element */ • Note: Although a bit clumsy in human natural language, we can change our use of language so that A[K] always refers to the K’th element (not K+1), always starting from 0 as the 0’th element.

Arrays in C - Syntax • It is not necessary to initialize or reference all elements of an array in a program • Unlike scalar variables declared as primitive data types, these uninitialized, non-referenced array elements are not flagged as warnings by the compiler • There are good reasons to declare arrays of larger size than might be required in a particular execution run of a program • At the outset, design the program to accommodate various sizes of data sets (usually acquired as input), up to a declared maximum size • This avoids the need to modify and recompile the program each time it is used.

Arrays in C - Syntax • This is a good time to introduce another compiler pre-processor directive, #define: • #define is used to define constant expression symbols in C programs. The value of such symbols is that they localize positions of program modification. • Example: • #define MAX_SIZE 1000 • int main ( ) { int SID [ MAX_SIZE ] ; float Mark [ MAX_SIZE ] ; ..... } • #define directives are normally located at the beginning of the program source code, after #include directives, and before function prototypes and the main function. • In the example, by using the defined symbol MAX_SIZE, changes to the SID and Mark array sizes can be accomplished by simply changing the value assigned to MAX_SIZE and recompiling.
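The idea of a declared maximum size with a smaller "logical" size in use can be sketched as follows. This is only an illustration: the function name load_marks and its parameters are hypothetical and do not come from the slides.

```c
#define MAX_SIZE 1000   /* capacity of the arrays; change here and recompile */

/* Copy at most MAX_SIZE marks from src into the fixed-capacity array dst.
   Returns the number of elements actually stored (the logical size).
   load_marks is a hypothetical helper used only for illustration. */
int load_marks(float dst[], const float src[], int count)
{
    int n = count;
    int k;

    if (n < 0)
        n = 0;              /* nothing to copy */
    if (n > MAX_SIZE)
        n = MAX_SIZE;       /* never write past the declared capacity */

    for (k = 0; k < n; k++)
        dst[k] = src[k];
    return n;
}
```

A caller would declare float Mark [ MAX_SIZE ] ; once and then work only with the first n elements returned by the function.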

Arrays in C – Memory structure • Now consider the declaration • int A [ 9 ] ; • The entire allocation unit is called A, the array name • There must be 9 integer-sized allocations in RAM • Each element is located contiguously (in sequence and “touching”) [Figure: RAM layout of array A, elements A[0] through A[8]]

Arrays in C – Memory structure • Arrays are often called direct access storage containers • The reference to A[K] is translated by the compiler to • First, calculate the relative address offset (RAO): K * sizeof(int) • Second, add the RAO to the base address of A, or simply &A[0] • In byte addresses: address of A[K] == address of A[0] + K * sizeof(int) • Direct Access :: The cost of the address computation is always the same (constant), and it provides the actual RAM address location where the data is stored. • The sizeof operator is a compile-time operator (not an executable instruction) that determines the RAM storage size allocated to a data type or data structure. When sizeof is applied to a primitive data type, it provides the size of allocated storage, in bytes. Try running a program with statements such as: printf( "The size of int is %zu bytes\n", sizeof(int) ) ; [Figure: RAM layout showing elements A[0] ... A[K], each of size sizeof(int), with the RAO measured from A[0]]
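The address calculation described above can be checked with a short sketch. The function names relative_offset and address_matches are assumptions introduced for this illustration; the addresses are compared as byte (char) pointers so that the K * sizeof(int) offset is explicit.

```c
#include <stddef.h>

/* Relative address offset (RAO), in bytes, of element K of an int array. */
size_t relative_offset(size_t K)
{
    return K * sizeof(int);
}

/* Returns 1 if A[K] is stored exactly RAO bytes past A[0], otherwise 0.
   K must be a valid subscript of the array passed in. */
int address_matches(const int A[], size_t K)
{
    const char *base = (const char *)&A[0];   /* base address, in bytes  */
    const char *elem = (const char *)&A[K];   /* address of A[K], bytes  */

    return elem == base + relative_offset(K);
}
```

For int A [ 9 ] ; the call address_matches( A, 5 ) is expected to return 1, because array elements are contiguous.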

Arrays in C – Usage • Referencing arrays is straightforward using the subscript notation • B = A[ 5 ] ; /* assign 6th element of A to B */ • A [ J ] < A [ K ] /* relational expression */ • B = 0.5*( A[J] - A[J-1] ); /* finite difference */ • printf( "%d %d %d\n", A[0], A[mid], A[N-1] ) ; • scanf ( "%d%lf%lf", &N, &R[K], R ) ; /* Note: R is assumed to be an array of double; the name R alone is the address of R[0] */

Arrays in C – Average vs Median • Problem: Input N real numbers and find their average and median. • Assume the values are already sorted from lowest to highest • Assume no more than 1000 values will be input • Solution: • Declarations • float A [1000], Sum, Ave, Median ; int N, Mid ;

Arrays in C – Average vs Median • Declarations • float A [1000], Sum = 0.0, Ave, Median ; int N, Mid, K ; • Input Data • printf( "Enter number of values in list " ) ; scanf( "%d", &N ) ; • /* Enter all real values into array A */ for( K=0; K < N; K++ ) { scanf( "%f", &A[K] ) ; /* NOTE: & must be used */ Sum += A[K] ; }

Arrays in C – Average vs Median • Compute Average and Median • Ave = Sum / (float) N ; /* real division */ • Mid = N / 2 ; /* (integer) midpoint of list */ Median = A [ Mid ] ; /* for even N, the median is conventionally 0.5*( A[Mid-1] + A[Mid] ) */ • Report results • printf( "Average = %f\n", Ave ); printf( "Median = %f\n", Median );
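The computation can also be packaged as two small functions. This is a minimal sketch under the slide's assumptions (the values are already sorted, and N is at least 1); the names average and median_sorted are hypothetical, and the even-N case uses the conventional mean of the two middle values rather than the single element A[Mid].

```c
/* Mean of the first N values of A (N must be at least 1). */
double average(const float A[], int N)
{
    double sum = 0.0;
    int k;

    for (k = 0; k < N; k++)
        sum += A[k];
    return sum / N;                 /* real division: sum is a double */
}

/* Median of the first N values of A, which must already be sorted in
   ascending order (N must be at least 1). */
double median_sorted(const float A[], int N)
{
    int mid = N / 2;                /* integer midpoint of the list */

    if (N % 2 == 1)
        return A[mid];              /* odd N: the single middle value */
    return 0.5 * (A[mid - 1] + A[mid]);   /* even N: mean of the middle pair */
}
```

For the sorted list { 1, 2, 3, 10 } the average is 4 while the median is 2.5, which shows how a single large value pulls the average away from the median.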

Arrays in C – Related arrays • Problem: Obtain student marks from testing and store the marks along with ID numbers • Assume marks are float and IDs are int data types • Solution – Related arrays • Define two arrays, one for IDs and one for marks • int SID [ 100 ] ; float Mark [ 100 ] ; • Coordinate input of data (maintain relationships), with N no larger than 100 • for( K = 0 ; K < N ; K++ ) scanf( "%d%f", &SID[K], &Mark[K] ) ;

Arrays in C – Functions • Passing arrays as parameters in functions requires some care and some understanding. We begin with an example. • Calculate the dot product of two 3-vectors U and V. • Components: U[0], U[1], U[2] V[0], V[1], V[2] • Mathematics: The dot product is defined as DotProd( U, V ) ::= U[0]*V[0] + U[1]*V[1] + U[2]*V[2] • Since the dot product operation is required often, it would make a useful function. [Figure: vectors U and V and their dot product U . V]

Arrays in C – Functions • Solution function: • double DotProd3 ( double U[3], double V[3] ) { return U[0] * V[0] + U[1] * V[1] + U[2] * V[2] ; } • Note the arguments, which document that arrays of type double with three (3) elements are expected. C does not check this size: each parameter is treated as a pointer to double, so it is up to the caller to pass arrays of at least 3 elements. • Note that the limitation to 3 elements is reflected in the design of the function name: DotProd3

Arrays in C – Functions • Extend this to dot product of N-dimensional vectors: • double DotProdN ( double U[ ], double V[ ], int N ) { double DPN = 0.0 ; int K ; for( K = 0 ; K < N ; K++ ) DPN += U[K] * V[K] ; return DPN ; } • Note the array arguments do not specify a maximum array size. • This provides flexibility of design since now the function can handle any value of N. It is up to the programmer to ensure that the actual input arrays and N conform to the assumptions.

Arrays in C – Functions • An alternative to the same code is to use pointer references: • double DotProdN ( double * U, double * V, int N ) { double DPN = 0.0 ; int K ; for( K = 0 ; K < N ; K++ ) DPN += U[K] * V[K] ; return DPN ; } • Note the array arguments are now expressed as pointer references. • This maintains the same flexibility as previously.

Arrays in C – Functions • A final alternative to the same code is to use pointer references altogether: • double DotProdN ( double * U, double * V, int N ) { double DPN = 0.0 ; int K ; for( K = 0 ; K < N ; K++, U++, V++ ) DPN += *U * *V ; return DPN ; } • The U and V variables are address pointers to the array components. • U++ and V++ perform the action of updating the pointers by an amount equal to the size of the array data type (in this case double is usually 8 bytes), thus pointing to the next array component in sequence. • Pointers are not the same as int’s ! If A is an int (say, 5), then A++ advances A to the next (or successor) value in sequence (ie. 6). On the other hand, if P is a pointer (say, int *, with value &A[K]), then P++ advances P to the next address in sequence for its type, which is usually the next element of an array (ie. &A[K+1]).

Arrays in C – Functions • The previous examples have illustrated the various ways that are used to pass array arguments to functions. • double DotProd3 ( double U[3], double V[3] ); • double DotProdN ( double U[ ], double V[ ], int N ); • double DotProdN ( double * U, double * V, int N ); • There are important points to understand • In standard C all three parameter forms are equivalent: an array parameter is adjusted to a pointer to its first element, so only a pointer (much smaller in size than the entire array) is passed, whether or not a size is written • An explicit size (eg. double U[3]) is not checked by the compiler; it serves as documentation of what the function expects • If arrays are declared within the function body, they are almost always allocated within the stack frame
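One way to see how an array argument is actually passed is to let a function modify its parameter and then inspect the caller's array. The sketch below uses a hypothetical function, scale3, that is not part of the slides. In standard C the change is visible in the caller, which indicates that the function received the address of the caller's array and not a private copy of its elements.

```c
/* The parameter is written with an explicit size, but it still refers to
   the caller's array: the elements are not copied into the function.
   scale3 is a hypothetical helper used only for illustration. */
void scale3(double U[3], double factor)
{
    int k;

    for (k = 0; k < 3; k++)
        U[k] *= factor;         /* writes into the caller's storage */
}
```

After double V [ 3 ] = { 1.0, 2.0, 3.0 } ; scale3( V, 2.0 ) ; the caller's array holds 2.0, 4.0 and 6.0.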

Arrays in C – Functions • C compilers may perform code and storage optimization • Create the most efficient executable code • Create the most efficient use of RAM storage • If array arguments were copied into the stack frame of the called function, a significant amount of space would be wasted due to avoidable duplication of storage allocations • It also follows that time would be wasted, since it would be necessary to copy data from arrays declared in the calling code to arrays declared in the called function. • Pointers solve most of these problems (with a small, but acceptable, increase in processing time) • Optimization is a difficult problem and is still the subject of much research

Array Based Algorithms Searching Sorting Very practical !

Array Based Algorithms • Searching • How to locate items in a list • Simplicity versus speed and list properties • Sorting • Putting list elements in order by relative value • Promoting efficient search

Search Algorithms • Searching is a fundamentally important part of working with arrays • Example: Given a student ID number, what is the Mark they obtained on the test? Do this for all students who enquire. • Constructing a good, efficient algorithm to perform the search is dependent on whether the IDs are in random order or sorted. • Random order – use sequential search • Sorted order – use divide-and-conquer approach

Search Algorithms - Random • If a list is stored in random order, one possible search technique is to look at the list elements in random order • int srchID, K ; • printf( "Enter your SID " ) ; scanf( "%d", &srchID ) ; • for( K=rand() % N ; srchID != SID[ K ] ; K=rand() % N ) ; • printf( "SID = %d, Mark = %f\n", SID[K], Mark[K] ); • PROBLEM 1: There is no guarantee that rand() will ever produce the matching position and exit the for loop; if the item does not exist, the loop never terminates. • PROBLEM 2: It is possible that an array element position will be accessed that has not had data stored (an uninitialized data access, which may give wrong results or stop the program with an error).

Search Algorithms - Linear • If a list is stored in random order a better search technique is to look at the list elements in order, from the beginning of the list until the element is found, or the list elements are exhausted • The Linear Search algorithm starts at the beginning of the list and proceeds in sequence: • K = 0 ; /* start at beginning of list */ do { /* search the list */ if ( srchID == SID[ K ] ) break ; /* exit loop if found */ K++ ; /* move to next position */ } while ( K < N ) ; /* stop at end of list */ • These can be combined in a single for structure: • int srchID, K, N = 100 ; /* Assume 100 elements */ • printf( "Enter your SID " ) ; scanf( "%d", &srchID ) ; • /* Perform the search */ for( K=0; K<N && srchID != SID[ K ] ; K++ ) ; • if( K<N ) printf( "SID = %d, Mark = %f\n", SID[K], Mark[K] ); • Note that the loop is controlled by two conditions: • (1) K<N • This demands that K be initialized at the beginning of the list (K=0) • It also ensures that traversal of the list stops at the end of the logical list (since the actual array size may be larger) • (2) srchID != SID[K] • This ensures that as soon as the value is found, the loop is exited immediately, thereby avoiding unnecessary work
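The same loop can be wrapped in a reusable function. This is a sketch, not the course's official version: the name linear_search and the convention of returning -1 for "not found" are assumptions made here.

```c
/* Sequential (linear) search: returns the position of srchID within the
   first N elements of SID, or -1 if it is not present. */
int linear_search(const int SID[], int N, int srchID)
{
    int K;

    /* Both loop conditions from the slide: stay inside the logical list,
       and stop as soon as the value is found. */
    for (K = 0; K < N && srchID != SID[K]; K++)
        ;                           /* empty body: the header does the work */

    return (K < N) ? K : -1;
}
```

Because K < N is tested before SID[K] is read, the function is also safe for an empty list (N equal to 0).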

Search Algorithms - Linear • Since this search approach considers each element of the list in sequence, it is called Sequential, or Linear, search. • In any situation where the list being searched does not contain the value being searched for, all N elements must be considered • This is called the “worst case” scenario • However, when the element will be found ... • The “best case” occurs when we actually find the value looked for in the first position considered • The “average case” refers to the statistical average obtained over many searches of the list • This is approximately N/2 elements considered, assuming the value is equally likely to be at any position • The term Linear derives from the fact that the actual runtime performance of the algorithm, in terms of numbers of distinct CPU operations, is expressed as a formula like: T = A * N + B • This is just the equation for a line (recall: y = mx + c)

Search Algorithms - Linear • In cases where the list is sorted, the search algorithm can be improved upon .... • Assume that SID is sorted in ascending order • /* Perform the search */ for( K=0; K<N && srchID > SID[ K ] ; K++ ) ; • if( K<N && srchID == SID[ K ] ) printf( "SID = %d, Mark = %f\n", SID[K], Mark[K] ); • Always examine code carefully to ensure that the logic properly accounts for all circumstances that can arise, both positive and negative. • To appreciate the improvement, consider cases where srchID will not be located in SID. For a statistical spread of srchID values (some big, some small), on average only N/2 comparisons are required to eliminate the need to search further. The previous algorithm always requires N comparisons to ensure srchID is NOT in the list. This version of the algorithm, on average, only needs N/2 comparisons. The complexity of both algorithms is still O(N), however, because we ignore the coefficient of N.
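A function form of the sorted-list version might look like the sketch below. The name sorted_linear_search and the -1 return value for "not found" are assumptions for illustration. The loop stops at the first element that is not smaller than srchID, so an absent value is usually rejected without scanning the whole list.

```c
/* Linear search of a list sorted in ascending order: returns the position
   of srchID within the first N elements of SID, or -1 if it is absent. */
int sorted_linear_search(const int SID[], int N, int srchID)
{
    int K;

    /* Skip every element smaller than srchID; stop at the first element
       that is equal to or larger than it, or at the end of the list. */
    for (K = 0; K < N && srchID > SID[K]; K++)
        ;

    /* The loop may stop on a larger value, so equality must be confirmed. */
    return (K < N && srchID == SID[K]) ? K : -1;
}
```

Searching { 10, 20, 30, 40, 50 } for 25 stops at position 2 (the value 30) and reports -1 after three comparisons instead of five.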

Search Algorithms - Efficiency • The time complexity (efficiency, or cost) of an algorithm is important in programming design • Problems can be divided into several categories, depending on how the cost of solving them grows with the size of the input • P (solvable in polynomial time) • NP (a proposed solution can be verified in polynomial time; every problem in P is also in NP) • NP-complete (the hardest problems in NP; no polynomial-time algorithm is known for any of them) • NP-hard (at least as hard as every problem in NP, and not necessarily in NP themselves) • Undecidable (no algorithm is guaranteed to finish in finite time for every input) • “NP” stands for nondeterministic polynomial time; “polynomial” refers to a running time bounded by a polynomial in the input size N, in contrast to faster-growing functions such as exponentials

Search Algorithms - Efficiency • Many problems are characterized by parameters that describe the nature of the data set, or the number of operations that must be performed to complete the algorithm • Search problems assume a data set of size N • For the Sequential Search algorithm (over N values), count the operations for each statement:
K = 0 ;                                      /* 1 store */
do {
    if ( Value == SearchValue ) break ;      /* 2 fetch, 1 compare, 1 branch */
    K++ ;                                    /* 1 fetch, 1 increment, 1 store */
} while ( K < N ) ;                          /* 2 fetch, 1 compare, 1 branch */
/* Report result */                          /* R (constant) operations */
• In the worst case, all N loops are completed • Cost :: N * ( 5 fetch + 1 store + 1 increment + 2 compare + 2 branch ) + R + 1 store

Search Algorithms - Efficiency • Assume that the behaviour of an algorithm (ie. how long it takes to execute) is described by a function F(N) that counts the number of operations required to complete the algorithm. • Consider the polynomial in N: • $F(N) = a_K N^K + a_{K-1} N^{K-1} + \dots + a_1 N + a_0$ • As N increases to very large values, the terms with smaller powers (exponents) of N become less relevant (regardless of the size of their coefficients) • Rather than using complicated polynomial formulas to describe algorithmic time cost, a more convenient notation has been developed: the BIG “O” (Order) notation. • The Big O notation in this case is expressed as Time Complexity ::= $O(N^K)$ for F(N) • Alternatively, we say: the Order of F(N) is $N^K$ • Note that we ignore the leading coefficient ($a_K$) of $N^K$, regardless of its magnitude.
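As a worked illustration, the sequential search cost counted earlier can be written as a first-degree polynomial. This is only a sketch under two assumptions that are not in the slides: every fetch, store, increment, compare and branch costs one time unit, and the reporting cost is taken as a hypothetical R = 10.

```latex
F(N) = N\,(5 + 1 + 1 + 2 + 2) + R + 1 = 11N + 11
\quad\Rightarrow\quad
F(1000) = 11\,011, \qquad F(N) \in O(N)
```

Here K = 1, the leading coefficient is 11, and the constant term is 11; Big O notation discards both, leaving O(N).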

Search Algorithms - Binary • Let us now consider a list V of N values where the elements V[K] are sorted in ascending order (from smallest to largest) • V[0] < V[1] < ..... < V[N-2] < V[N-1] • Problem: Find if/where the search value VS is in V • We develop the Binary Search algorithm • Our design technique is based on the principle of Divide and Conquer • Also called Binary Subdivision

Search Algorithms – Binary • Our strategy involves the idea of sub-list. A sub-list is simply a smaller part of a list. • By dividing a list into two sub-lists (where each sub-list contains contiguous values that are sorted) it is possible to quickly eliminate one of the sub-lists with a single comparison. • Thereafter, we focus attention on the remaining sub-list, but we reapply the same divide and conquer technique. [Figure: list A[0] ... A[N-1] divided into two SUBLISTs, A[0] ... A[Mid] and A[Mid+1] ... A[N-1]]

Search Algorithms - Binary • We assume that all data has been input to array V, N has been initialized with the number of input values, and the search value VS has been input. • We use the following declarations • float V [10000] , VS ; int Lo, Hi, Mid, N ; • Binary Search algorithm uses the variables Lo and Hi as array subscripts • Lo refers to the first value position in a sub-list • Hi refers to the last value position in a sub-list • Mid is used as the midpoint subscript: • Mid = (Lo+Hi)/2 [Figure: a SUBLIST delimited by Lo and Hi with midpoint Mid; the elements outside the sub-list are marked "ignore"]

Search Algorithms - Binary • Binary Search algorithm (assumes N is at least 1): • Lo = 0; Hi = N-1 ; /* Use full list as sub-list */ • do { • Mid = ( Lo + Hi ) / 2 ; /* int division */ • if( VS == V[Mid] ) break ; • if( VS > V[Mid] ) Lo = Mid + 1 ; • else Hi = Mid - 1 ; • } while ( Lo <= Hi ) ; • printf( "Search value %f ", VS ) ; if ( VS == V[Mid] ) printf( "found at position %d\n", Mid ); else printf( "not found\n" ) ; [Figure: SUBLIST with Lo, Mid and Hi marked; VS is compared with V[Mid] and may lie below it, at it, or above it]
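The algorithm can be sketched as a function that returns the position found. The name binary_search and the -1 "not found" result are assumptions made for this illustration. A while loop replaces the do-while so that an empty list (N equal to 0) is handled without reading V, and the midpoint is written as Lo + (Hi - Lo) / 2, which gives the same value as (Lo + Hi) / 2 here but cannot overflow for very large subscripts.

```c
/* Binary search of V[0..N-1], sorted in ascending order.
   Returns the position of VS, or -1 if VS is not in the list. */
int binary_search(const float V[], int N, float VS)
{
    int Lo = 0;             /* first position of the current sub-list */
    int Hi = N - 1;         /* last position of the current sub-list  */

    while (Lo <= Hi) {
        int Mid = Lo + (Hi - Lo) / 2;   /* integer midpoint */

        if (VS == V[Mid])
            return Mid;                 /* found */
        if (VS > V[Mid])
            Lo = Mid + 1;               /* discard the lower sub-list */
        else
            Hi = Mid - 1;               /* discard the upper sub-list */
    }
    return -1;                          /* sub-list is empty: not found */
}
```

Each pass through the loop discards about half of the remaining sub-list with a single comparison against V[Mid].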

Search Algorithms - Binary • To understand the complexity assume a data set of 256 values. We consider the worst case scenario.
Size of data set    Step #
256                 1
128                 2
64                  3
32                  4
16                  5
8                   6
4                   7
2                   8
1                   9
• Once we split the sub-list to size 1 we clearly determine that the list cannot contain the search value. • $N = 256 = 2^8$, so it has taken 8+1 = 9 steps to prove that VS does not exist in V. • In general, for a list of size $N = 2^K$, it takes K+1 steps, or $O(\log_2 N)$

Search Algorithms - Binary • In general, for a list of size $N = 2^K$, it takes K+1 steps, or $O(\log_2 N)$ time complexity • K is just the logarithm (base-2) of N • The efficiency of the Binary Search algorithm is logarithmic, or O( log N ). • Some people prefer to say $O(\log_2 N)$, but the two are equivalent: logarithms to different bases differ only by a constant factor, which Big O notation ignores
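The K+1 step count can be reproduced by simulating an unsuccessful search. The sketch below uses a hypothetical function, worst_case_probes, and assumes the search value is larger than every list element, so the upper sub-list is kept after each probe; no array is needed because only the subscripts matter.

```c
/* Counts the probes (values of Mid examined) made by binary search on a
   list of N elements when the search value is larger than every element. */
int worst_case_probes(int N)
{
    int Lo = 0, Hi = N - 1;
    int probes = 0;

    while (Lo <= Hi) {
        int Mid = (Lo + Hi) / 2;    /* int division, as in the slides */
        probes++;
        Lo = Mid + 1;               /* value is larger: keep upper sub-list */
    }
    return probes;
}
```

For N = 256 the function counts 9 probes, matching the 8+1 steps in the table, and for N = 1024 it counts 11.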

Search Algorithms • The relative efficiencies, or complexities, of the various search algorithms are clearly established when N is large, and for the worst case scenarios. • Random O( > N ? ) (no guaranteed upper bound, since the search may never terminate) • Sequential (Linear) O( N ) • Divide & Conquer (Binary) O( log N ) • In the best case, any search algorithm may be successful after only one (1) probe • Usually, one is interested in worst case and average case in choosing an algorithm. [Figure: plot of Time versus N]

Sorting Algorithms .... Putting things in order .... .... order things Putting in ....

Sorting Algorithms • From our discussion of binary search we understand the need for the data to be ordered within lists in order to promote fast searching using Binary Search • It is also important to understand that not every list should be sorted – study each case carefully • Sorting algorithms are designed to perform the ordering required • There are literally hundreds of specialized sorting algorithms, with varying efficiencies • We will focus on a sorting algorithm called Selection Sort

Sorting Algorithms - Selection • Selection Sort relies on being able to find the largest (or smallest) element in a sublist • Each time the proper value is found, it is exchanged with the element at the end of the sublist • We re-apply this technique by shrinking the size of the sublist, until there is no remaining sublist (or a sublist of size 1 element, which is already sorted). • We consider the example list: 82 45 31 72 56 62 87 90

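Sorting Algorithms - Selection • Steps 1 to 4 of the example:
Start     82 45 31 72 56 62 87 90
Step 1    82 45 31 72 56 62 87 90    (90 is already last in the sub-list of 8)
Step 2    82 45 31 72 56 62 87 90    (87 is already last in the sub-list of 7)
Step 3    62 45 31 72 56 82 87 90    (82 exchanged with 62)
Step 4    62 45 31 56 72 82 87 90    (72 exchanged with 56)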

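Sorting Algorithms - Selection • Steps 5 to 8 of the example:
Before    62 45 31 56 72 82 87 90
Step 5    56 45 31 62 72 82 87 90    (62 exchanged with 56)
Step 6    31 45 56 62 72 82 87 90    (56 exchanged with 31)
Step 7    31 45 56 62 72 82 87 90    (45 is already last in the sub-list of 2)
Step 8    31 45 56 62 72 82 87 90    (sub-list of 1 element, nothing to do)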

Sorting Algorithms - Selection • From the example we note that the final step 8 is not actually required, since a sub-list of one (1) element is automatically sorted (by definition). • Hence, it took 7 steps to complete the sort. In general, for a list of size N elements, it takes N-1 steps. • Each step consists of two parts: • First, search an unordered sub-list for the largest element • The size of the sub-list is N-K for step K (starting from K=0) • Second, exchange (swap) the largest element with the last element in the sub-list • Upon completion of each step it should be noted that the sorted sub-list grows by one element while the unsorted sub-list shrinks by one element.
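The two-part step described above can be sketched as follows. The function name selection_sort and the use of int data are assumptions for illustration; this is the largest-to-the-end variant used in the example, not the only way to write Selection Sort.

```c
/* Selection Sort (ascending order): repeatedly find the largest element of
   the unsorted sub-list A[0..last] and exchange it with A[last]. */
void selection_sort(int A[], int N)
{
    int last, k, maxpos, tmp;

    /* N-1 steps: a sub-list of one element is already sorted. */
    for (last = N - 1; last > 0; last--) {
        /* Part 1: search the unsorted sub-list for the largest element. */
        maxpos = 0;
        for (k = 1; k <= last; k++)
            if (A[k] > A[maxpos])
                maxpos = k;

        /* Part 2: exchange (swap) it with the last element of the sub-list. */
        tmp = A[maxpos];
        A[maxpos] = A[last];
        A[last] = tmp;
    }
}
```

Applied to the example list 82 45 31 72 56 62 87 90, the function leaves the array as 31 45 56 62 72 82 87 90.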
