Using R in Astronomy
Getting started quickly with R
Juan Pablo Torres-Papaqui
Departamento de Astronomía, Universidad de Guanajuato (DA-UG), México
papaqui@astro.ugto.mx
División de Ciencias Naturales y Exactas, Campus Guanajuato, Sede Valenciana
Why R?
R is a free language and environment for statistical computing and graphics. It is superior to other graphics/analysis packages commonly used in astronomy (e.g. SuperMongo, PGPLOT, gnuplot), and features a very wide range of statistical analysis and data plotting functions. Furthermore, R is already the most popular of the leading software packages for statistical analysis, as measured by a variety of indicators, and is rapidly growing in influence. "R is really important to the point that it's hard to overvalue it. It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems." -- Google research scientist, quoted in the New York Times.
Key features
• It's a mature, widely used (around 2-3 million users) and well-supported free and open-source software project, committed to a twice-yearly update schedule (it runs on GNU/Linux, Unix, MacOS and Windows)
• Excellent graphics capabilities
• Vector arithmetic, plus many built-in basic & advanced statistical and numerical analysis tools
• Intelligent handling of missing data values (denoted by "NA")
• Highly extensible, with around 3000 user-contributed packages available
• It's easy to use and has excellent online help and associated documentation
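As a quick sketch of the NA handling mentioned above (the vector here is invented for illustration):

```r
x <- c(1, 2, NA, 4)  # A vector with one missing value
mean(x)              # Returns NA: missing values propagate by default
mean(x, na.rm=TRUE)  # Returns 2.333..., ignoring the NA
is.na(x)             # FALSE FALSE TRUE FALSE: locate the missing entries
```

Most summary functions (sd(), median(), sum(), ...) accept the same na.rm argument.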
Download
The R Project for Statistical Computing: http://www.r-project.org/
Getting started
The online help pages can be accessed as follows:
help.start()            # Load HTML help pages into browser
help(topic)             # Show help page for "topic"
?topic                  # Short form for "help(topic)"
help.search("keyword")  # Search help pages for "keyword"
?help                   # For more options
help(package=base)      # List contents of package "base"
q()                     # Quit R
Getting started
Some extra packages are not loaded by default, but can be loaded as follows:
library("package")
library()                # List available packages to load
library(help="package")  # List package contents
Getting started
Commands can be separated by a semicolon (";") or a newline. All characters after "#" are treated as comments (even on the same line as commands). Variables are assigned using "<-" (although "=" also works):
x <- 4.5
y <- 2*x^2  # Use "<<-" for global assignment
Vectors
x <- c(1, 4, 5, 8)  # Concatenate values & assign to x
y <- 2*x^2          # Evaluate expression for each element in x
> y                 # Typing the name of a variable evaluates it
[1]   2  32  50 128
If x is a vector, y will be a vector of the same length. Vector arithmetic in R means that an expression involving vectors of the same length (e.g. x+y, in the above example) is evaluated element-wise, without the need to loop explicitly over the values:
> x+y
[1]   3  36  55 136
> 1:10  # Same as seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10
> a <- 1:10
a[-1]         # Print whole vector *except* the first element
a[-c(1, 5)]   # Print whole vector except elements 1 & 5
a[1:5]        # Print elements 1 to 5
a[length(a)]  # Print last element, for any length of vector
head(a, n=3)  # Print the first n elements of a
tail(a, n=3)  # Print the last n elements of a
Demonstrations of certain topics are available:
demo(graphics)  # Run graphics demonstration
demo()          # List topics for which demos are available
?demo           # List help for "demo" function
Typing the name of a function lists its contents; typing a variable evaluates it, or you can use print(x). A great feature of R functions is that they offer a lot of control to the user, but provide sensible hidden defaults. A good example is the histogram function, which works very well without any explicit control. First, generate a set of 10,000 Gaussian-distributed random numbers:
data <- rnorm(1e4)  # Gaussian distributed numbers with mean=0 & sigma=1
hist(data)          # Plots a histogram, with sensible bin choices by default
See ?hist for the full range of control parameters available, e.g. to specify 7 bins (R treats the number of breaks as a suggestion and may adjust it):
hist(data, breaks=7)
Assuming you have a file ("file.dat") containing the following data, with either spaces or tabs between fields:
r x y
1 4.2 14.2
2 2.4 64.8
3 8.76 63.4
4 5.9 32.2
5 3.4 89.6
setwd("C:/Users/juanchito/Documents/Astronomy School/")  # Set working directory (Windows-style example path)
Now read this file into R:
> inp <- read.table("file.dat", header=T)  # Read data into data frame
> inp            # Print contents of inp
> inp[0]         # Show the type of object that inp is:
data frame with 0 columns and 5 rows
> colnames(inp)  # Print the column names for inp:
[1] "r" "x" "y"
This is simply a vector that can easily be changed:
> colnames(inp) <- c("radius", "temperature", "time")
> colnames(inp)
[1] "radius"      "temperature" "time"
Note that you can refer to R objects, names etc. using the first few letters of the full name, provided that is unambiguous, e.g.:
> inp$r  # Print the radius column
but note what happens if the information is ambiguous or if the column doesn't exist:
> inp$t       # Could match "inp$temperature" or "inp$time"
NULL
> inp$wibble  # "wibble" column doesn't exist
NULL
Alternatively, you can refer to the columns by number:
> inp[[1]]  # Print data as a vector (use "[[" & "]]")
[1] 1 2 3 4 5
> inp[1]    # Print data as a data.frame (only use "[" and "]")
Writing data out to a file (if no filename is specified, which is the default, the output is written to the console):
> write.table(inp, quote=F, row.names=F, col.names=T, sep="\t:\t")
> write.table(inp, quote=F, row.names=F, col.names=T)
radius temperature time
1 4.2 14.2
2 2.4 64.8
3 8.76 63.4
4 5.9 32.2
5 3.4 89.6
Saving data
?save
save(inp, t, file="data.RData")
load(file="data.RData")  # Read back in the saved data
# This will load the saved R objects into the current session, with the same
# properties, i.e. all the variables will have the same names and contents
Note that when you quit R (by typing q()), it asks if you want to save the workspace image; if you specify yes (y), it writes out two files to the current directory, called .RData and .Rhistory. The former contains the contents of the saved session (i.e. all the R objects in memory) and the latter is a list of all the command history (it's just a simple text file you can look at). At any time you can save the history of commands using:
savehistory(file="my.Rhistory")
loadhistory(file="my.Rhistory")  # Load history file into current R session
Useful functions:
?plot       # Help page for plot command
?par        # Help page for graphics parameter control
?Devices    # or "?device" (R can output in postscript, PDF, bitmap, PNG, JPEG and more formats)
dev.list()  # List open graphics devices
colours()   # or "colors()": list all available colours
?plotmath   # Help page for writing maths expressions in R
Tip: to generate png, jpeg etc. files, I find it's best to create PDF versions in R, then convert them in gimp (having specified strong antialiasing, if asked, when loading the PDF file into gimp); otherwise the figures are "blocky" (i.e. suffer from aliasing).
To create an output file copy of a plot for printing or including in a document etc.:
dev.copy2pdf(file="myplot.pdf")  # } Copy contents of current graphics device
dev.copy2eps(file="myplot.eps")  # } to a PDF or Postscript file
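Alternatively, you can open a file device explicitly before plotting, so everything is drawn straight into the file (a minimal sketch using base R's pdf() device; the filename and plot are just examples):

```r
pdf("myplot.pdf", width=6, height=4)  # Open a PDF graphics device
plot(1:10, (1:10)^2, type="b")        # All subsequent plotting goes to the file
dev.off()                             # Close the device to finish writing the file
```

The png(), jpeg() and postscript() devices work the same way.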
Adding more datasets to a plot:
x <- 1:10; y <- x^2
z <- 0.9*x^2
plot(x, y)                     # Plot original data
points(x, z, pch="+")          # Add new data, with different symbols
lines(x, y)                    # Add a solid line for original data
lines(x, z, col="red", lty=2)  # Add a red dashed line for new data
curve(1.1*x^2, add=T, lty=3, col="blue")  # Plot a function as a curve
text(2, 60, "An annotation")   # Write text on plot
abline(h=50)                   # Add a horizontal line
abline(v=3, col="orange")      # Add a vertical line
Error bars
source("errorbar.R")  # Source code for the errorbar function
# Create some data to plot:
x <- seq(1, 10); y <- x^2
xlo <- 0.9*x; xup <- 1.08*x; ylo <- 0.85*y; yup <- 1.2*y
Plot the graph, but don't show the points (type="n"), then plot the errors:
plot(x, y, log="xy", type="n")
errorbar(x, xlo, xup, y, ylo, yup)
?plot gives more info on plotting options (as does ?par). To use different line colours, styles, thicknesses & errorbar types for different points:
errorbar(x, xlo, xup, y, ylo, yup, col=rainbow(length(x)), lty=1:5, lwd=1:3)
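The errorbar.R file itself is not reproduced in these slides; a minimal sketch of such a function, built on base R's arrows() (which draws flat-capped bars when angle=90 and code=3), might look like this — the name and argument order simply follow the usage above:

```r
# Hypothetical sketch of an errorbar() function for asymmetric X & Y errors
errorbar <- function(x, xlo, xup, y, ylo, yup, length=0.03, ...) {
  points(x, y, ...)                                             # Central values
  arrows(xlo, y, xup, y, angle=90, code=3, length=length, ...)  # X error bars
  arrows(x, ylo, x, yup, angle=90, code=3, length=length, ...)  # Y error bars
}
```

Extra graphical parameters (col, lty, lwd, ...) are passed straight through via "...", which is what allows the per-point styling shown above.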
Plotting a function or equation
curve(x^2, from=1, to=100)            # Plot a function over a given range
curve(x^2, from=1, to=100, log="xy")  # With log-log axes
Plot the chi-squared probability vs. reduced chi-squared for 10 and then 100 degrees of freedom:
dof <- 10
curve(pchisq(x*dof, df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function")
abline(v=1, lty=2, col="blue")  # Plot dashed line for reduced chisq=1
dof <- 100
curve(pchisq(x*dof, df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function")
abline(v=1, lty=2, col="blue")  # Plot dashed line for reduced chisq=1
Define & plot a custom function:
myfun <- function(x) x^3 + log(x)*sin(x)
#--Plot dotted (lty=3) red line, with 3x normal line width (lwd=3):
curve(myfun, from=1, to=10, lty=3, col="red", lwd=3)
#--Add a legend, inset slightly from top left of plot:
legend("topleft", inset=0.05, "My function", lty=3, lwd=3, col="red")
Plotting maths symbols:
demo(plotmath)  # Shows a demonstration of many examples
?plotmath       # Manual page with more information
Basic stuff
x <- rnorm(100)   # Create 100 Gaussian-distributed random numbers
hist(x)           # Plot histogram of x
summary(x)        # Some basic statistical info
d <- density(x)   # Compute kernel density estimate
d                 # Print info on density analysis ("?density" for details)
plot(d)           # Plot density curve for x
plot(density(x))  # Plot density curve directly
Plot histogram in terms of probability & also overlay density plot:
hist(x, probability=T)
lines(density(x), col="red")  # Overlay smoothed density curve
#--Standard stuff:
mean(x); median(x); max(x); min(x); sum(x); weighted.mean(x)
sd(x)   # Standard deviation
var(x)  # Variance
mad(x)  # Median absolute deviation (robust sd)
Testing the robustness of statistical estimators
?replicate  # Repeatedly evaluate an expression
Calculate the standard deviation of 100 Gaussian-distributed random numbers (NB the default sd is 1; you can specify it with rnorm(100, sd=3) or whatever):
sd(rnorm(100))
Now, let's run N=5 Monte Carlo simulations of the above test:
replicate(5, sd(rnorm(100)))
and then compute the mean of the N=5 values:
mean(replicate(5, sd(rnorm(100))))
Obviously 5 is rather small for a number of MC sims (repeat the above command & see how the answer fluctuates). Let's try 10,000 instead:
mean(replicate(1e4, sd(rnorm(100))))
With a sample size of 100, the measured sd closely matches the actual value used to create the random data (sd=1). However, see what happens when the sample size decreases:
mean(replicate(1e4, sd(rnorm(10))))  # Almost a 3% bias low
mean(replicate(1e4, sd(rnorm(3))))   # >10% bias low
You can see the bias more clearly with a histogram or density plot:
a <- replicate(1e4, sd(rnorm(3)))
hist(a, breaks=100, probability=T)  # Specify lots of bins
lines(density(a), col="red")        # Overlay kernel density estimate
abline(v=1, col="blue", lwd=3)      # Mark true value with thick blue line
This bias is important to consider when estimating the velocity dispersion of a stellar or galaxy system with a small number of tracers. The bias is even worse for the robust sd estimator mad():
mean(replicate(1e4, mad(rnorm(10))))  # ~9% bias low
mean(replicate(1e4, mad(rnorm(3))))   # >30% bias low
However, in the presence of outliers, mad() is robust compared to sd(), as seen in the following demonstration. First, create a function to add a 5 sigma outlier to a Gaussian random distribution:
outlier <- function(N, mean=0, sd=1) {
  a <- rnorm(N, mean=mean, sd=sd)
  a[1] <- mean + 5*sd  # Replace 1st value with 5 sigma outlier
  return(a)
}
Now compare the performance of mad() vs. sd():
mean(replicate(1e4, mad(outlier(10))))
mean(replicate(1e4, sd(outlier(10))))
You can also compare the median vs. the mean:
mean(replicate(1e4, median(outlier(10))))
mean(replicate(1e4, mean(outlier(10))))
...and without the outlier:
mean(replicate(1e4, median(rnorm(10))))
mean(replicate(1e4, mean(rnorm(10))))
Kolmogorov-Smirnov test, taken from the manual page (?ks.test):
x <- rnorm(50)  # 50 Gaussian random numbers
y <- runif(30)  # 30 uniform random numbers
# Do x and y come from the same distribution?
ks.test(x, y)
Correlation tests
Load in some data:
dfile <- "table1.txt"
A <- read.table(dfile, header=T, sep="|")
cor.test(A$z, A$Tx)         # Specify X & Y data separately
cor.test(~ z + Tx, data=A)  # Alternative format when using a data frame ("A")
cor.test(A$z, A$Tx, method="spearman")  # } Alternative methods
cor.test(A$z, A$Tx, method="kendall")   # }
plot(A$z, A$Tx)
Spline interpolation of data
x <- 1:20
y <- jitter(x^2, factor=20)  # Add some noise to vector
sf <- splinefun(x, y)        # Perform cubic spline interpolation
# Note that "sf" is a function:
sf(1.5)  # Evaluate spline function at x=1.5
plot(x, y); curve(sf(x), add=T)  # Plot data points & spline curve
Scatter plot smoothing
#--Using above data frame ("A"):
plot(Tx ~ z, data=A)
#--Return X & Y values of locally-weighted polynomial regression:
lowess(A$z, A$Tx)
#--Plot smoothed data as line on graph:
lines(lowess(A$z, A$Tx), col="red")
Regression
dfile <- "http://www.astro.ugto.mx/~papaqui/download/winter/table1.txt"
A <- read.table(dfile, header=T, sep="|")
names(A)                 # Print column names in data frame
m <- lm(Tx ~ z, data=A)  # Fit linear model
summary(m)               # Show best-fit parameters & errors
Plot data & best-fit model:
plot(Tx ~ z, A)
abline(m, col="blue")  # Add best-fit straight line model
Plot residuals:
plot(A$z, resid(m))
abline(h=0, col="red")
Basics
cat        # Type function name without brackets to list contents
args(cat)  # Return arguments of any function
body(cat)  # Return main body of function
Create your own function:
> fun <- function(x, a, b, c) (a*x^2) + (b*x^2) + c
> fun(3, 1, 2, 3)
[1] 30
> fun(5, 1, 2, 3)
[1] 78
A more complicated example of a function. First, create some data:
set.seed(123)                    # Allow reproducible random numbers
a <- rnorm(1000, mean=10, sd=1)  # 1000 Gaussian random numbers
b <- rnorm(100, mean=50, sd=15)  # Smaller population of higher numbers
x <- c(a, b)  # Combine datasets
hist(x)       # Shows outlier population clearly
sd(x)         # Strongly biased by outliers
mad(x)        # Robustly estimates sd of main sample
mean(x)       # Biased
median(x)     # Robust
Now create a simple sigma clipping function (you can paste the following straight into R):
sigma.clip <- function(vec, nclip=3, N.max=5) {
  mean <- mean(vec); sigma <- sd(vec)
  clip.lo <- mean - (nclip*sigma)
  clip.up <- mean + (nclip*sigma)
  vec <- vec[vec < clip.up & vec > clip.lo]  # Remove outliers
  if ( N.max > 0 ) {
    N.max <- N.max - 1
    # Note the use of recursion here (i.e. the function calls itself):
    vec <- Recall(vec, nclip=nclip, N.max=N.max)
  }
  return(vec)
}
Now apply the function to the test dataset:
new <- sigma.clip(x)
hist(new)    # Only the main population remains!
length(new)  # } Compare numbers of values
length(x)    # } before & after clipping
Factors are a vector type which treats the values as character strings that form part of a base set of values (called "levels"). The result of plotting a factor is to produce a frequency histogram of the number of entries in each level (see ?factor for more info).
Basics
a <- rpois(100, 3)  # Create 100 Poisson distributed random numbers (lambda=3)
hist(a)             # Plot frequency histogram of data
a[0]                # List type of vector ("numeric")
b <- as.factor(a)   # Change type to factor
List the type of vector b ("factor"): this also lists the "levels", i.e. all the different types of value:
b[0]
Now, plot b, to get a bar chart showing the frequency of entries in each level:
plot(b)
table(b)  # A tabular summary of the same information
A more practical example. Read in a table of data (from Sanderson et al. 2006), which refers to a sample of 20 clusters of galaxies with Chandra X-ray data:
file <- "http://www.astro.ugto.mx/~papaqui/download/winter/table1.txt"
A <- read.table(file, header=T, sep="|")
By default, "read.table()" will treat non-numeric values as factors. Sometimes this is annoying, in which case use "as.is=T"; you can specify the type of all the input columns explicitly -- see "?read.table".
A[1,]           # Print 1st row of data frame (will also show column names)
plot(A$det)     # Plot the numbers of each detector type
plot(A$cctype)  # Plot the numbers of cool-core and non-cool core clusters
The "det" column is the detector used for the observation of each cluster: S for ACIS-S and I for ACIS-I. The "cctype" denotes the cool core type: "CC" for cool core, "NCC" for non-cool core.
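The factor conversion just described can be demonstrated without downloading anything, using read.table()'s text= argument (the column names here are invented). Note that since R 4.0 the default for stringsAsFactors changed to FALSE, so it is set explicitly here to reproduce the older behaviour described above:

```r
txt <- "name|flux
src1|1.2
src2|3.4"
demo.fac <- read.table(text=txt, header=TRUE, sep="|", stringsAsFactors=TRUE)
class(demo.fac$name)  # "factor": non-numeric column converted to a factor
demo.chr <- read.table(text=txt, header=TRUE, sep="|", as.is=TRUE)
class(demo.chr$name)  # "character": strings left as-is
```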
Now, plot one factor against another, to get a plot showing the number of overlaps between the two categories (the cool-core clusters are almost all observed with ACIS-S):
plot(cctype ~ det, data=A, xlab="ACIS Detector", ylab="Cool-core type")
You can easily explore the properties of the different categories by plotting a factor against a numeric variable. The result is a boxplot (see "?boxplot") of the numeric variable for each category. For example, compare the mean temperature of cool-core & non-cool core clusters:
plot(Tx ~ cctype, data=A)
and compare the redshifts of clusters observed with ACIS-S & ACIS-I:
plot(z ~ det, data=A)
You can define your own factor categories by partitioning a numeric vector of data into discrete bins using "cut()", as shown here:
Converting a numeric vector into factors
Split the cluster redshift range into 2 bins of equal width (nearer & further clusters), and assign each z point accordingly; create a new column of the data frame with the bin assignments (A$z.bin):
A$z.bin <- cut(A$z, 2)
A$z.bin        # Show levels & bin assignments
plot(A$z.bin)  # Show bar chart
Plot the mean temperatures of nearer & further clusters. Since this is essentially a flux-limited sample of galaxy clusters, the more distant clusters (higher redshift) are hotter (and hence more X-ray luminous) -- an example of the common selection effect known as Malmquist bias:
A$Tx.bin <- cut(A$Tx, 2)
plot(z ~ Tx.bin, data=A)
plot(Tx ~ z, data=A)  # More clearly seen!
Check if the redshifts of (non) cool-core clusters differ:
plot(cctype ~ z.bin, data=A)  # Equally mixed
Now, let's say you wanted to colour-code the different CC types. First, create a vector of colours by copying the cctype column (you can add this as a column to the data frame, or keep it as a separate vector):
colour <- A$cctype
Now all you need to do is change the names of the levels:
levels(colour)  # Show existing levels
levels(colour) <- c("blue", "red")  # Specify blue=CC, red=NCC
colour[0]  # Show data type (& levels, since it is a factor)
Now change the type of the vector to character (i.e. not a factor):
col <- as.character(colour)
col[0]  # Confirm change of data type
col     # } Print values and
colour  # } note the difference
Now plot mean temperature vs. redshift, colour-coded by CC type:
plot(Tx ~ z, A, col=col)
Why not add a plot legend:
legend(x="topleft", legend=c(levels(A$cctype)), text.col=levels(colour))
Now for something a little bit fancier:
plot(Tx ~ z, A, col=col, xlab="Redshift", ylab="Temperature (keV)")
legend(x="topleft", c(levels(A$cctype)), col=levels(colour), text.col=levels(colour), inset=0.02, pch=1)