An Introduction to C

An Introduction to C Prepared for UCSD summer 2009 class

Overview
• Low-level programming language
• Language of choice when speed and efficiency are at a premium
• A compiled language
• Arguments are passed by value; pass by reference is achieved by passing pointers
• Very similar syntax to R (in fact, R borrowed much of its syntax from C)

When to use C over R
• WEAKNESSES
  • Not a statistics environment
  • Limited graphical support (i.e., no plot() commands)
  • Pointless for small problems, or when you don’t need to optimize speed
• STRENGTHS
  • Custom estimators that take a long time to converge
  • Working with huge data sets

Hello.c --- Our first C program

    /* A standard program to print out a greeting */
    #include <stdio.h>

    int main(void){
        printf("\nHello world");
        return(0);
    }

    > gcc -o hello hello.c
    > ./hello

• Note the commenting conventions ‘//’ vs. ‘/* */’. // is just like # in R and comments out the rest of the line; /* ... */ is for long comments that can span several lines.
• #include <stdio.h> plays the same role as the library() command in R.
• Every executable starts in a function called “main”, which must always return a number of type int.
• Functions are always defined by the type of thing they output (int), their name (main), and their input (void in this case, which means no input).
• Brackets: Note that, like R, C uses different kinds of brackets. Round brackets ‘()’ are for functions and conditions, squiggly brackets ‘{}’ group blocks of code (function bodies and control flow), and square brackets ‘[]’ are for indexes. This is essentially the same as in R.
• All statements end with semicolons. Unlike R, a line break does not end a statement, so a long statement may span several lines as long as a semicolon closes it.
• In C, you must return a value explicitly with return(); unlike R, the value of the last expression is not returned automatically.
• gcc is the GNU C compiler, -o tells it the name of the output file (hello, or hello.exe on Windows), and the final argument is the source code. On Unix-like systems run the result with ./hello; on Windows, typing hello is enough.

Working with Vectors
• A container of objects, in some order
• Can be a list of characters, integers, doubles, or structs
• In R, just like the sequences you would get from functions like seq() and rep()
• Static vs. dynamic allocation: when to do what
• What is a memory leak?

Vector.c --- Static Allocation of Vector

    #include <stdio.h>

    int main(void){
        double numlist[4];

        numlist[0] = 3;
        numlist[1] = 4;
        numlist[2] = 5;
        numlist[3] = 10;

        printf("\n numlist[0] == %f", numlist[0]);
        printf("\n numlist[1] == %f", numlist[1]);
        printf("\n numlist[2] == %f", numlist[2]);
        printf("\n numlist[3] == %f", numlist[3]);
        printf("\n\nWhy will numlist[4] give an error?");
        return(0);
    }

• First, notice that you declare your variables up front. This is considered good form for compiled procedural language programming.
• To declare a variable, give the type (double here) and a name (numlist here). If it is a vector and you are statically allocating it, add square brackets and the size.
• Doubles are basically the same thing as “numeric” in R.
• Values are assigned to the container in order. Notice that indexing starts at 0, so the four valid slots are numlist[0] through numlist[3]. This is different from R, so be very careful here.
• numlist[4] lies outside the array. C does not check bounds for you, so using it is undefined behavior: the program may crash, or it may silently read or overwrite unrelated memory.
• Printing the results works just like sprintf() in R. %f shows a floating-point number, which is what a double is.
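One way to make the numlist[4] question concrete is a bounds-checked read. The sketch below is only an illustration: checked_get() is a hypothetical helper, not a standard C function, and it assumes the caller passes the true number of slots in the vector.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the C library): copy v[i] into *out
   only if i is a valid index for a vector with n slots.
   Returns 0 on success and -1 if i is out of range. */
int checked_get(const double *v, size_t n, size_t i, double *out) {
    if (i >= n) {
        return -1;      /* valid indexes run from 0 to n-1 */
    }
    *out = v[i];
    return 0;
}
```

With double numlist[4], a call like checked_get(numlist, 4, 3, &x) succeeds, while checked_get(numlist, 4, 4, &x) returns -1 instead of touching memory outside the array.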

Vector.c --- Dynamic Allocation of Vector

    #include <stdio.h>
    #include <stdlib.h>

    int main(void){
        double *numlist;

        numlist = (double *) malloc(10*sizeof(double));
        if(numlist == NULL){
            return(1);
        }

        numlist[0] = 3;
        numlist[1] = 4;
        numlist[2] = 5;
        numlist[3] = 10;

        printf("\n numlist[0] == %f", numlist[0]);
        printf("\n numlist[1] == %f", numlist[1]);
        printf("\n numlist[2] == %f", numlist[2]);
        printf("\n numlist[3] == %f", numlist[3]);
        printf("\n\nWhat does numlist[4] hold here?");

        free(numlist);
        return(0);
    }

• To dynamically allocate, declare the variable as a pointer with a * (more on this later). At this point it has no space reserved; it can only point to a double.
• malloc() reserves space and returns a pointer to it, or NULL if the request fails. The (double *) cast is optional in C (it is required in C++). Remember that you must reserve space equal to the number of slots (10 slots here) x the amount of space for each slot (sizeof(double) in this case).
• Here numlist[4] is inside the 10 reserved slots, so reading it is not out of bounds, but it was never assigned, so its value is garbage.
• calloc() and realloc() are similar functions: calloc() initializes everything to 0, and realloc() resizes an existing allocation.
• Always free the space afterwards; controlling when memory is reserved and released is the whole point. Memory that is allocated but never freed is a memory leak.
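The calloc() and realloc() functions mentioned above could be wrapped as in the following sketch. The helper names make_zero_vector() and resize_vector() are made up for this illustration. It also assumes a platform where all-zero bytes represent 0.0 (true for the usual IEEE 754 doubles) and that new_n is greater than 0.

```c
#include <stdlib.h>

/* Hypothetical helper: reserve n doubles, all set to 0.
   Returns NULL if the allocation fails. */
double *make_zero_vector(size_t n) {
    return calloc(n, sizeof(double));
}

/* Hypothetical helper: resize *vec to new_n slots (new_n > 0 assumed).
   Returns 0 on success. On failure it returns -1 and leaves *vec
   untouched, so the caller can still use or free the old block. */
int resize_vector(double **vec, size_t new_n) {
    double *tmp = realloc(*vec, new_n * sizeof(double));
    if (tmp == NULL) {
        return -1;
    }
    *vec = tmp;     /* the old values are preserved in the new block */
    return 0;
}
```

Assigning the result of realloc() to a temporary first is a deliberate choice: writing vec = realloc(vec, ...) directly would lose the only pointer to the old block if the call fails, which is exactly how a memory leak happens.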

Working with Arrays
• Arrays are basically matrices, though they can span more than two dimensions
• Can think of them as “vectors of vectors”
• Often these are stored as vectors that need to be unrolled (we will see this later)
• Dynamic allocation is still supported

Array.c --- Static Allocation of Array

    #include <stdio.h>

    int main(void){
        double randMat[3][3];

        randMat[0][0] = 1;
        randMat[1][2] = 3;

        printf("\nrandMat[0][0] = %f", randMat[0][0]);
        printf("\nrandMat[1][2] = %f", randMat[1][2]);
        return(0);
    }

• To create an array, declare the type, then the size of each dimension in its own pair of square brackets, like this.
• The only thing to note here is that the notation is different from R: not randMat[1,2], but randMat[1][2].
• A statically allocated array is released automatically, so there is nothing to free here.

Array.c --- Dynamic Allocation of Array

    #include <stdio.h>
    #include <stdlib.h>

    int main(void){
        double **randMat;

        randMat = (double **) malloc(3*sizeof(double *));
        randMat[0] = (double *) malloc(3*sizeof(double));
        randMat[1] = (double *) malloc(3*sizeof(double));
        randMat[2] = (double *) malloc(3*sizeof(double));

        randMat[0][0] = 1;
        randMat[1][2] = 3;

        printf("\nrandMat[0][0] = %f", randMat[0][0]);
        printf("\nrandMat[1][2] = %f", randMat[1][2]);

        free(randMat[0]);
        free(randMat[1]);
        free(randMat[2]);
        free(randMat);
        return(0);
    }

• Dynamic allocation as a pointer of pointers. Essentially, a 3x3 matrix is represented as 3 vectors, held in a vector. The outer vector (3 slots of type double *) must be allocated first, before any of the rows can be stored in it.
• Obviously you can do this allocation in a for() loop, but I haven’t gotten there yet.
• When freeing, you have to free each thing individually: every row first, then the outer vector. This is very important.
• This technique can be particularly effective for sparse matrices, since each row can be given its own length.
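The for() loop version mentioned above might look like the sketch below. alloc_matrix() and free_matrix() are hypothetical names chosen for this example; the rows are zero-filled with calloc(), and a failed allocation halfway through is cleaned up before returning.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an nrow x ncol matrix as a vector of
   row vectors, with every entry set to 0. Returns NULL on failure. */
double **alloc_matrix(size_t nrow, size_t ncol) {
    size_t i;
    double **m = malloc(nrow * sizeof(double *));
    if (m == NULL) {
        return NULL;
    }
    for (i = 0; i < nrow; i++) {
        m[i] = calloc(ncol, sizeof(double));
        if (m[i] == NULL) {
            /* free the rows that were already allocated, then the outer vector */
            while (i > 0) {
                free(m[--i]);
            }
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Hypothetical helper: free every row, then the outer vector. */
void free_matrix(double **m, size_t nrow) {
    size_t i;
    if (m == NULL) {
        return;
    }
    for (i = 0; i < nrow; i++) {
        free(m[i]);
    }
    free(m);
}
```

A caller would write double **randMat = alloc_matrix(3, 3); use randMat[1][2] as before, and finish with free_matrix(randMat, 3);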

Control Flow
• for(), if(), and while() are the ones we are concerned about
• Use {} brackets
• “Controls flow” in the sense that the program may not simply do the next command on the line
• if(): used to condition execution of a statement
• while(): used to loop over a chunk of code for as long as a condition holds
• for(): used to loop over a chunk of code in some order

Control.c --- Control Flow demonstration

    #include <stdio.h>

    int main(void){
        int i, a=4, b=10;
        int temp[10];

        if(b>a){
            printf("\nb>a is true");
        }
        if(b<a){
            printf("\nb<a is true");
        }

        while(a != b){
            printf("\na equals %i", a);
            a=a+1;
        }

        for(i=0;i<10;i++){
            temp[i] = i;
        }
        for(i=0;i<10;i++){
            printf("\nElement %i in temp = %i", i, temp[i]);
        }
        return(0);
    }

• if() statements evaluate booleans (i.e., true/false statements), which in C are just integers: 0 is false and anything else is true. Comparison operators include >, <, ==, !=, >=, <=. Note: ‘==’ (comparison) is not the same as ‘=’ (assignment).
• while() statements have the same syntax as if(), but an important point to note here is that the condition has to change at some point (a=a+1 here). Otherwise you will get an infinite loop.
• for() loops typically loop around vectors like this, doing something to each element. Note that 3 things are present in the syntax. First, a counter is initialized to a start value (i=0). Next, a condition is set, and the loop runs while the condition is true (i<10). Finally, a piece of code that increments or decrements the counter at the end of each pass is included (i++).
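To see how the three parts of a for() loop map onto a while() loop, here is a small sketch that computes the same sum both ways. The function names sum_for() and sum_while() are made up for this illustration.

```c
/* Add up 0 + 1 + ... + (n-1) with a for() loop. */
int sum_for(int n) {
    int i, total = 0;
    for (i = 0; i < n; i++) {   /* initialize; condition; increment */
        total = total + i;
    }
    return total;
}

/* The same loop written with while(): the initialization moves above
   the loop and the increment moves to the end of the loop body. */
int sum_while(int n) {
    int i = 0, total = 0;
    while (i < n) {
        total = total + i;
        i++;
    }
    return total;
}
```

Both functions return 45 for n = 10. Forgetting the i++ in the while() version is the classic way to end up with an infinite loop.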

Functions
• Just like functions in R, think of these as black boxes
• Usually output something after getting some (multiple) inputs
• In fact, with pointers you can have multiple outputs (we will see this soon)
• Use ‘()’ brackets, match arguments to call
• One question to think about: what is the computer actually doing here?

Function.c --- Temperature Converter

    #include <stdio.h>

    double convert(double fahrenheit);

    int main(void){
        printf("\n\t30 degrees fahrenheit == %f celsius",convert(30));
        printf("\n\t20 degrees fahrenheit == %f celsius",convert(20));
        return(0);
    }

    double convert(double fahrenheit){
        double celcius;
        celcius = (fahrenheit - 32)/1.8;
        return(celcius);
    }

• Put a header (prototype) for each function at the top of the code. For trivial programs this won’t matter, but it will matter a lot for larger programs.
• Call the function with the function name, and arguments in ‘()’ brackets.
• The line double convert(double fahrenheit){ defines a function. The first double declares the output type, ‘convert’ is the name, and “double fahrenheit” declares the input and its type. This line is equivalent to convert <- function(fahrenheit) in R, except that R declares neither the input type nor the output type.
• Just like in R: take the input, define a quantity like “celcius”, do some stuff with it, then call return() on it. Remember that what you return has to be of the output type you declared!

Pointers
• Consider the following line in R, where each matrix is NxN and huge
• superMat = superMat %*% anotherMat
• What is/could be happening here?
• Why might this be stupid?
• What is the alternative?
• Pass by Reference vs. Pass by Value (convert() was done pass by value)
• Two key operators: ‘*’ (dereference: the value stored at an address) and ‘&’ (address-of: the address of a variable)
• Matrices (arrays) are special because of this: they are always handed to functions as pointers

Pointer.c --- Temp Converter with Pointer

    #include <stdio.h>

    void convert(double *celcius);

    int main(void){
        double temperature = 80;
        printf("\n\tBefore conversion == %f fahrenheit",temperature);
        convert(&temperature);
        printf("\n\tAfter conversion == %f celsius\n\n",temperature);
        return(0);
    }

    void convert(double *celcius){
        double temp;
        temp = *celcius;
        temp = (temp - 32)/1.8;
        *celcius = temp;
    }

• Note that the address of temperature (&temperature) is being passed here. This means temperature is being passed by reference, so it can be modified by the function it is passed to.
• temp = *celcius takes the value stored at the address held in celcius (‘*’ is the dereferencing operator), and stores it in the variable temp.
• At the end, *celcius = temp dereferences celcius again, this time to store the new temperature at that address. Hence, the original variable “temperature” has been modified.
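The same trick is how a C function can hand back more than one result. The sketch below uses a hypothetical summarize() function that writes three outputs through pointers. It assumes the vector has at least one element and that all three output pointers are valid.

```c
/* Hypothetical helper: "returns" three results through pointers.
   Assumes n >= 1 and that mean, min and max point to valid doubles. */
void summarize(const double *x, int n, double *mean, double *min, double *max) {
    int i;
    double total = x[0];

    *min = x[0];
    *max = x[0];
    for (i = 1; i < n; i++) {
        total += x[i];
        if (x[i] < *min) *min = x[i];   /* write through the pointer */
        if (x[i] > *max) *max = x[i];
    }
    *mean = total / n;
}
```

The caller declares double m, lo, hi; and calls summarize(x, n, &m, &lo, &hi); passing the addresses so the function can fill them in, exactly as convert(&temperature) does above.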

LAPACK/BLAS
• Often you will want to apply linear algebra routines to matrices
• While you can write functions to do the calculations manually, this is definitely not advised
• LAPACK/BLAS standardizes the functions, and runs much more efficiently
• General rule: you need to read the documentation very carefully, and test with small examples
• Let’s work through a manual calculation of OLS using dgesv() and dgemm()
• You really need to have the documentation of the functions in front of you to understand this example as I walk through it. See handouts. For other functions, just Google them!

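ols.c --- Manual OLS with BLAS functions

    #include <stdio.h>

    /* Fortran BLAS/LAPACK routines. These prototypes assume the common
       trailing-underscore naming; check your library's documentation. */
    void dgemm_(char *transa, char *transb, int *m, int *n, int *k,
                double *alpha, double *a, int *lda, double *b, int *ldb,
                double *beta, double *c, int *ldc);
    void dgesv_(int *n, int *nrhs, double *a, int *lda, int *ipiv,
                double *b, int *ldb, int *info);

    int main(void){
        int i,info, ipiv[2];
        char trans = 't', notrans ='n';
        double alpha = 1.0, beta=0.0;
        int ncol=2;
        int nrow=5;
        int one=1;
        double XprimeX[4];
        double X[10] = {1,1,1,1,1,0.3,-0.2,0.4,-0.5,0.3};
        double Y[5] = {0.7,-0.5,0.9,-1.1,0.7};
        double XXinv[4] = {1,0,0,1};
        double XXinvX[10];
        double coef[2];

        printf("\n\nX = ");
        for(i=0;i<5;i++) printf("\n%f %f", X[i],X[i+5]);
        printf("\n\nY = ");
        for(i=0;i<5;i++) printf("\n%f", Y[i]);

• Everything here should be pretty straightforward. We just define a few variables, planning to solve for the beta hats in OLS given X and Y.
• The only new thing so far is that X is a 5x2 matrix, but I am storing it as a vector of length 10, one column after the other (the first five entries are the column of ones, the last five are the regressor). This column-by-column layout is what the Fortran routines expect, and it is very common when working with LAPACK/BLAS.
• Because these are Fortran routines, every argument, even a plain number like nrow, is passed as a pointer.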
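To make the "matrix stored as a vector" idea concrete, here is a small sketch of the index arithmetic for column-by-column storage. get_colmajor() and set_colmajor() are hypothetical helpers written for this illustration; they are not part of LAPACK/BLAS.

```c
/* Hypothetical helper: element (i, j) of a matrix with nrow rows that is
   stored column by column in a plain vector. i and j count from 0. */
double get_colmajor(const double *a, int nrow, int i, int j) {
    return a[j * nrow + i];     /* skip j whole columns, then move down i rows */
}

/* Hypothetical helper: store value at element (i, j) of the same layout. */
void set_colmajor(double *a, int nrow, int i, int j, double value) {
    a[j * nrow + i] = value;
}
```

With the X defined above, get_colmajor(X, 5, 1, 1) picks out -0.2, the second observation of the regressor, which is why the print loop can show row i as X[i] and X[i+5].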

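ols.c --- continued

        // compute X'X
        dgemm_(&trans,&notrans,&ncol,&ncol,&nrow,&alpha,X,&nrow,X,&nrow,&beta,XprimeX,&ncol);
        printf("\n\nX'X = ");
        for(i=0;i<2;i++) printf("\n%f %f",XprimeX[i], XprimeX[i+2]);

        // compute (X'X)^-1 by solving (X'X) Z = I
        dgesv_(&ncol,&ncol,XprimeX,&ncol,ipiv,XXinv,&ncol,&info);
        printf("\n\n(X'X)^-1 = ");
        for(i=0;i<2;i++) printf("\n%f %f",XXinv[i], XXinv[i+2]);

        // compute (X'X)^-1 X'
        dgemm_(&notrans,&trans,&ncol,&nrow,&ncol,&alpha,XXinv,&ncol,X,&nrow,&beta,XXinvX,&ncol);

        // compute (X'X)^-1 X'Y
        dgemm_(&notrans,&notrans,&ncol,&one,&nrow,&alpha,XXinvX,&ncol,Y,&nrow,&beta,coef,&ncol);

        printf("\n\nB0 = %f", coef[0]);
        printf("\nB1 = %f\n\n", coef[1]);
        return(0);
    }

• dgemm_() computes alpha*A*B + beta*C; the first two arguments say whether A and B should be transposed first.
• dgesv_() solves a linear system and overwrites its inputs: XXinv starts out as the identity matrix and comes back holding the inverse, while XprimeX comes back holding its LU factors. A nonzero info signals that something went wrong.
• The last argument of each dgemm_() call is the number of rows of the result, which is ncol (2) for coef.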
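In the spirit of "test with small examples", one way to sanity check the LAPACK/BLAS result is to compute the same two coefficients with the textbook formulas for a regression with one regressor and an intercept. This is a sketch in plain C, and simple_ols() is a hypothetical name. It is a cross-check that only works for this one-regressor case, not a replacement for the matrix version.

```c
/* Hypothetical cross-check: OLS with an intercept and one regressor.
   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1*xbar.
   Returns 0 on success, -1 if n < 2 or all x values are identical. */
int simple_ols(const double *x, const double *y, int n, double *b0, double *b1) {
    int i;
    double xbar = 0.0, ybar = 0.0, sxx = 0.0, sxy = 0.0;

    if (n < 2) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        xbar += x[i];
        ybar += y[i];
    }
    xbar /= n;
    ybar /= n;

    for (i = 0; i < n; i++) {
        sxx += (x[i] - xbar) * (x[i] - xbar);
        sxy += (x[i] - xbar) * (y[i] - ybar);
    }
    if (sxx == 0.0) {
        return -1;      /* slope is not identified */
    }
    *b1 = sxy / sxx;
    *b0 = ybar - (*b1) * xbar;
    return 0;
}
```

Fed the second column of X and the Y from ols.c, this should give B0 of roughly 0.0039 and B1 of roughly 2.268. If the LAPACK/BLAS program prints something different, one of the dimension arguments is probably wrong.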

Data Input and Output
• Up to now, we have just manually created data. What if we want to read data from a file?
• Core idea: create a file pointer, open the file in the mode you need (read or write), and then read from or write to it
• Even better: have some error checking on the reading
• In many cases: you will start with a large memory space that you read data into, and that you will need to resize afterwards
• Common mistake here: incorrect casting of what you are reading (for example, a format specifier that does not match the type of the variable)
• Two programs are attached. We first generate some random data in a file. Then, we read the data in another program.

writedata.c --- Writes random numbers into file

    #include <stdlib.h>
    #include <stdio.h>

    int main(void){
        FILE *fp;
        double data[10];
        int i = 0;

        for(i=0;i<10;i++){
            data[i] = ( (double)rand() / ((double)(RAND_MAX)+(double)(1)));
        }

        fp = fopen("data.txt","w");
        if(fp == NULL){
            printf("\nUnable to open data.txt for writing\n");
            return(1);
        }
        for(i=0;i<10;i++){
            fprintf(fp, "%2.3f \n", data[i]);
        }
        fclose(fp);
        return(0);
    }

• You will need some C libraries here for the file and random number functions.
• FILE *fp declares a file pointer. At this point no file has been opened yet.
• rand() generates a random integer from 0 to RAND_MAX, so dividing by RAND_MAX + 1 gives a random number from 0 up to (but not including) 1.
• fopen() opens a file and attaches it to the file pointer you created, here in “w” mode so you can write to it. It returns NULL if the file cannot be opened.
• fprintf() works just like printf(), except you have to specify the file pointer it is printing to.
• Always close your files after you are done with them!!

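readdata.c --- Reads random numbers from file

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void) {
        int MAXVOTES = 10000;
        FILE *fp;
        double *numlist;
        int i = 0;

        numlist = (double *) malloc(MAXVOTES*sizeof(double));
        if(numlist == NULL) {
            printf("\nUnable to allocate memory\n");
            exit(1);
        }

        if((fp = fopen("data.txt","r"))==NULL) {
            printf("\nUnable to open file DATA.TXT: %s\n", strerror(errno));
            exit(1);
        } else {
            /* fscanf() returns the number of values it managed to read */
            while (i < MAXVOTES && fscanf(fp,"%lf", &numlist[i]) == 1) {
                i++;
            }
        }
        fclose(fp);

        if(i > 0) {
            numlist = (double *) realloc(numlist, i*sizeof(double));
        }
        printf("\nAllocation OK, %i votes allocated.\n", i);
        free(numlist);
        return(0);
    }

• Open the required libraries again, including the ones for error handling (errno.h for errno, string.h for strerror()).
• Usually you will allocate a ton of space before reading data because you don’t know how much space you need. Declaring that limit up front is a good idea, like MAXVOTES. The loop also stops at MAXVOTES so it can never write past the end of the space.
• Very typical error recovery: try opening the file in read mode. If it fails, print an error message and quit.
• fscanf() is the reverse of fprintf(): it reads an observation from a data file, with the same syntax, and returns how many values it read. The while() loop therefore reads data until there is no more to be read. Checking feof() before each read is a common mistake: it only becomes true after a read has already failed, so the loop would count one entry too many.
• Use %lf to read into a double; %f reads into a float. Getting this wrong is the incorrect-casting mistake, and it silently produces garbage.
• Notice I used ‘i’ to count the number of entries, then I resized the array. If you only have 10 entries, you don’t need a container that can contain 10,000! In larger programs, store the result of realloc() in a temporary pointer and check it for NULL before overwriting numlist.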
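An alternative to reserving MAXVOTES slots up front is to start small and grow the space as the data comes in. The sketch below is one way to do that; read_doubles() is a hypothetical helper, and the starting capacity of 16 and the doubling rule are arbitrary choices made for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read whitespace-separated doubles from an open
   file until no more can be parsed. Stores how many were read in *count
   and returns a malloc()ed vector the caller must free(), or NULL if
   memory runs out. */
double *read_doubles(FILE *fp, size_t *count) {
    size_t cap = 16, n = 0;
    double value;
    double *buf = malloc(cap * sizeof(double));

    if (buf == NULL) {
        return NULL;
    }
    while (fscanf(fp, "%lf", &value) == 1) {
        if (n == cap) {
            /* out of room: double the capacity, keeping the old block on failure */
            double *tmp = realloc(buf, 2 * cap * sizeof(double));
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[n++] = value;
    }
    *count = n;
    return buf;
}
```

A caller would open data.txt with fopen() as above, call read_doubles(fp, &n), and later free() the returned vector. Doubling the capacity keeps the number of realloc() calls small even for large files, at the cost of some unused slots at the end.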
