BIO-454 Bio Computing

BIO-454 Bio Computing • Lecture 13: The R Language • Reading Microarray Datasets in Different Ways • Dr. Mohammad Nassef, Computer Science Department, Faculty of Computers and Information, Cairo University

Outline • Revision of data frames • Revision of read.table and read.csv • Accessing datasets in offline mode • Example from XenaBrowser • Types of GEO Data • Different GEO Identifiers • Searching the GEO datasets • Accessing GEO datasets in online mode • Downloading and installing Bioconductor • Downloading and installing the GEOquery package • Using the getGEO function to download GEO data using GEO identifiers

Data Frames in R • A data frame is a list of named vectors that all have the same length. • Note that each row in the data frame gets a default row name. • Let’s construct a data frame with the win/loss results in the Football League in some year:
    teams <- c("Ahly", "Ismaieli", "Enby", "Tarsana", "Zamalek")
    w <- c(92, 89, 94, 72, 59)
    l <- c(70, 73, 77, 90, 102)

Data Frames in R
    teams <- c("Ahly", "Ismaieli", "Enby", "Tarsana", "Zamalek")
    w <- c(92, 89, 94, 72, 59)
    l <- c(70, 73, 77, 90, 102)
    df1 <- data.frame(teams, w, l)
    df1
         teams  w   l
    1     Ahly 92  70
    2 Ismaieli 89  73
    3     Enby 94  77
    4  Tarsana 72  90
    5  Zamalek 59 102

Data Frames in R • You can refer to the components of a data frame (or items in a list) by name using the $ operator:
    df1$w
    [1] 92 89 94 72 59

Data Frames in R • Let’s say you wanted to find the number of losses by Enby. • You can select elements of a vector by using a vector of Boolean (logical) values that specifies which items to return:
    df1$teams == "Enby"
    [1] FALSE FALSE  TRUE FALSE FALSE
• Then you can use this logical vector to pick the matching element of the losses vector:
    df1$l[df1$teams == "Enby"]
    [1] 77
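
The logical-indexing step above can be sketched as a self-contained snippet. It rebuilds the df1 data frame from the earlier slides; the variable names is_enby, losses_enby, losses_enby2 and row_enby are only illustrative and do not appear in the lecture.

```r
# Rebuild the data frame from the slides
teams <- c("Ahly", "Ismaieli", "Enby", "Tarsana", "Zamalek")
w <- c(92, 89, 94, 72, 59)
l <- c(70, 73, 77, 90, 102)
df1 <- data.frame(teams, w, l)

# Logical mask: one TRUE/FALSE value per row
is_enby <- df1$teams == "Enby"

# Use the mask to keep only the matching element of the losses vector
losses_enby <- df1$l[is_enby]

# The same lookup using row/column indexing on the data frame itself
losses_enby2 <- df1[is_enby, "l"]

# which() turns the mask into row positions
row_enby <- which(is_enby)

print(losses_enby)
print(row_enby)
```

Both forms return the same value; the mask-based form is convenient because the same mask can be reused to pull the matching entries out of any other column.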

The read.table() and read.csv() functions • read.table() and read.csv() are the ‘workhorse functions’ of R!

The read.table() function • The read.table() function reads in a text file and converts it to a data frame. • It can be used for reading in external files. It works for any connection: • external hard drive • over the Internet • anywhere data can be accessed • read.table() is designed to read whitespace-separated external text (ASCII) files with free-formatted data and to create a data frame. • With the sep argument, it will let you read in any type of delimited ASCII file. • It can read in both numeric and character values. • It is one of the easiest and most reliable ways of getting data into R.

The read.table() function • Examples from http://www.ats.ucla.edu/stat/r/modules/raw_data.htm • Reading complete data, space delimited, variable names in first row:
    test <- read.table("http://www.ats.ucla.edu/stat/data/test.txt", header = TRUE)
    test

The read.table() function • The default delimiter in read.table() is whitespace (one or more spaces or tabs). • This can create problems if there are missing data. • By default, the function will not work unless every data line has the same number of values. • If there are missing data, the data lines will have different numbers of values, and you will receive an error. • If there are missing values, the easiest way to fix this problem is to change the type of delimiter. • In the read.table() function, the sep argument is used to specify the delimiter.

Example of File with Missing Values

The read.table() function • Reading a file with missing data using the same call will throw an error:
    test <- read.table("http://www.ats.ucla.edu/stat/data/test.txt", header = TRUE)
    Error: line 2 did not have 6 elements

Example of File with Missing, but Comma-separated Values

The read.table() function • Reading a file with missing, but comma-separated data:
    test <- read.table("http://www.ats.ucla.edu/stat/data/test.txt", header = TRUE, sep = ",")
    test
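
The missing-value problem can be reproduced without any download by passing inline text to read.table() through its text argument. This is a minimal sketch: the two tiny datasets (space_text and comma_text) are made up for illustration and are not the UCLA files used in the slides.

```r
# Space-delimited data: the middle value of the second data line is missing,
# so that line has only two fields instead of three
space_text <- "a b c\n1 2 3\n4  6\n"

# With the default whitespace delimiter the read fails; capture the message
result <- tryCatch(
  read.table(text = space_text, header = TRUE),
  error = function(e) conditionMessage(e)
)
print(result)

# The same data, comma-delimited: the empty field between the two commas
# is kept in place and read as NA
comma_text <- "a,b,c\n1,2,3\n4,,6\n"
fixed <- read.table(text = comma_text, header = TRUE, sep = ",")
print(fixed)
```

With an explicit delimiter, an empty field still occupies its position in the line, so every line keeps the same number of fields and the missing entry becomes NA.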

The read.csv() function • Another very common type of file is the comma-delimited file. • Excel can save a worksheet as a comma-delimited (CSV) file. • Such a file can be read by the read.table() function using the sep option. • It can also be read by the read.csv() function, which was written specifically for comma-delimited files.

The read.csv() function • Reading complete data, comma delimited, variable names in first row:
    test.csv <- read.csv("http://www.ats.ucla.edu/stat/data/test.csv", header = TRUE)
    test.csv
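
As a small illustration of the relationship between the two functions, the sketch below reads the same inline comma-delimited text with read.csv() and with read.table(). The text in csv_text is invented for this example.

```r
# A tiny comma-delimited dataset (made up for illustration)
csv_text <- "id,score,grade\n1,90,A\n2,85,B\n"

# read.csv() assumes a header row and a comma separator by default
via_csv <- read.csv(text = csv_text)

# read.table() needs both stated explicitly
via_table <- read.table(text = csv_text, header = TRUE, sep = ",")

# For simple input like this, the two results should match
same <- isTRUE(all.equal(via_csv, via_table))
print(same)
```

The two functions differ in a few other defaults (for example, how short lines and comment characters are handled), so the results are not guaranteed to match for every file.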

Microarray Data: Repositories and Analysis Tools • NCBI – GEO: Gene Expression Omnibus (NIH) • XenaBrowser • FireBrowse • MSigDB/GSEA (Broad Institute) • Oncomine (U. Michigan)

Data available in XenaBrowser • You can find a large number of datasets at the following link: https://xenabrowser.net/datapages/ • Each link refers to some disease, and by pressing a link, you can find the different datasets available about that disease. • As an example, you can find a link for Ovarian cancer:

One of the links referring to Ovarian Cancer datasets

Selecting one link to view its available datasets • By pressing the Ovarian Cancer link, you can find the different available datasets: https://xenabrowser.net/datapages/?cohort=TCGA Ovarian Cancer (OV) • As you can see, Microarray Gene Expression datasets are among them.

n = Number of Samples

How to access Microarray Gene-Expression Datasets? • Microarray Gene-Expression Datasets (MGED) can be accessed from different sources on the web. • Offline Mode: You can manually download the dataset file to your PC, and then read it locally as many times as needed using the R functions covered above. • Online Mode: You can use special functions, provided by dedicated libraries, that download dataset files directly from some well-known sources.

Method 1: Offline Mode • You can visit the websites of online repositories (such as XenaBrowser) to download one of the Microarray Gene-Expression Datasets. • Here, Colon Cancer (COAD) will be used as an example. • After that, you can use the following code to read the local dataset file into a data frame inside your R session. • The file includes a header, and the first column contains the gene names (row names). • The data can be stored inside a data frame called coad:
    coad <- read.table("AgilentG4502A_07_3", sep = '\t')
    dim(coad)
    [1] 17815 175
    class(coad)
    [1] "data.frame"

Exploring part of the dataset • Displaying first 6 rows/columns:
    coad[1:6, 1:6]
• As you can see, the first column includes the gene names, and each row got a default row number (1, 2, 3 … etc). • Moreover, the first row contains the actual sample titles, but the default titles assigned to columns are (V1, V2, V3 … etc).

Reading Header Row as Column Names • So, the actual sample titles loaded from the file should be the column names:
    coad <- read.table("AgilentG4502A_07_3", header = TRUE, sep = '\t')
    dim(coad)
    [1] 17814 175
    class(coad)
    [1] "data.frame"

Reading Header Row as Column Names • Displaying first 6 rows/columns:
    coad[1:6, 1:6]
• As you can see, the columns are now named after the samples, but the first column still includes the gene names, and each row still has a default row number.

Reading Gene Identifiers as Row Names • So, gene names should be the row names:
    coad <- read.table("AgilentG4502A_07_3", header = TRUE, sep = '\t', row.names = 1)
    dim(coad)
    [1] 17814 174
    class(coad)
    [1] "data.frame"

Reading Gene Identifiers as Row Names • Displaying first 6 rows/columns:
    coad[1:6, 1:6]
• As you can see, the gene names are now the row names instead of default row numbers, and the first column now holds the expression values of the first sample.
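
The effect of the header and row.names arguments on the dimensions can be checked on a toy file. This is a sketch under stated assumptions: the file contents, gene names and sample titles below are invented and only mimic the layout described in the slides (tab-separated, sample titles in the first row, gene names in the first column).

```r
# Write a tiny tab-separated file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c(
  "sample\tTCGA-XX-0001-01\tTCGA-XX-0002-11",
  "GENE1\t0.5\t-1.2",
  "GENE2\t1.1\t0.3",
  "GENE3\t-0.7\t2.4"
), path)

# 1) No header: the title row is read as data, columns are named V1, V2, V3
raw <- read.table(path, sep = "\t")

# 2) header = TRUE: one row fewer, columns named after the samples
with_header <- read.table(path, header = TRUE, sep = "\t")

# 3) row.names = 1: one column fewer, gene names become the row names
with_names <- read.table(path, header = TRUE, sep = "\t", row.names = 1)

print(dim(raw))
print(dim(with_header))
print(dim(with_names))
print(colnames(with_header))
```

Note that read.table() rewrites column names into syntactically valid R names by default (check.names = TRUE), so the hyphens in the sample titles appear as dots. The character positions inside each title stay the same.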

Validation of Reading the Full dataset • How can you know the name of the last loaded gene?
    row.names(coad)[length(row.names(coad))]
• Or simply:
    row.names(coad)[nrow(coad)]
• To display the last read row:
    coad[nrow(coad), ]
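
One possible way to extend this validation is to compare what was loaded against the raw lines of the file. The sketch below uses an invented toy file; the names expr, n_lines, rows_match, last_gene_in_file and last_gene_loaded are illustrative only.

```r
# Toy tab-separated file: one header line plus three gene rows
path <- tempfile(fileext = ".txt")
writeLines(c(
  "sample\tS1\tS2",
  "GENE1\t0.5\t-1.2",
  "GENE2\t1.1\t0.3",
  "GENE3\t-0.7\t2.4"
), path)

expr <- read.table(path, header = TRUE, sep = "\t", row.names = 1)

# Number of data rows should equal the number of lines minus the header
file_lines <- readLines(path)
n_lines <- length(file_lines)
rows_match <- nrow(expr) == n_lines - 1

# The last gene in the file should be the last row name that was loaded
last_gene_in_file <- strsplit(file_lines[n_lines], "\t", fixed = TRUE)[[1]][1]
last_gene_loaded <- tail(row.names(expr), 1)

print(rows_match)
print(last_gene_loaded)
```

A mismatch here would suggest that some lines were merged or dropped while reading. By default read.table() treats quote characters and # specially, so names containing them are a common cause.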

Sample Barcode! • How to know which sample is normal and which is cancer? • This can be known from the sample barcode at the header row.

Status vector of samples

Sample Barcode Breakdown https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/

Knowing Samples’ status: Normal vs. Cancer • How to know which sample is normal and which is cancer? • We can parse the barcode of each sample in order to determine its status:
    # Discriminate between the samples by parsing the barcode of each sample, and
    # put the sample labels ("cancer" or "normal") into a vector called status
    status <- ifelse(as.numeric(substr(colnames(coad), 14, 15)) >= 11, "normal", "cancer")
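
The barcode parsing can be tried on a few sample titles without loading the dataset. The barcodes below are hypothetical and only follow the layout in which characters 14 and 15 hold the two-digit sample-type code; the threshold of 11 is the one used in the slide.

```r
# Hypothetical sample titles as they would appear in colnames(coad)
barcodes <- c("TCGA.XX.0001.01", "TCGA.XX.0001.11",
              "TCGA.XX.0002.01", "TCGA.XX.0002.11")

# Characters 14-15 hold the sample-type code
sample_type <- as.numeric(substr(barcodes, 14, 15))

# Same rule as in the slide: codes of 11 or above are labelled "normal"
status <- ifelse(sample_type >= 11, "normal", "cancer")

# Count how many samples fall into each class
counts <- table(status)

print(sample_type)
print(status)
print(counts)
```

According to the TCGA barcode documentation, codes 01 to 09 denote tumor samples and codes 10 to 19 denote normal samples, so a dataset that contains code 10 (blood-derived normal) would need a threshold of 10 rather than 11.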

Datasets at NCBI-GEO • Please refer to the following link: https://www.ncbi.nlm.nih.gov/geo/

Types of GEO Data • GEO Sample (GSM) • GEO Series (GSE) (lists of GSM sample files that together form a single experiment) • GEO Dataset (GDS)

What is the difference between a Series and a DataSet? • A GEO Series (GSExxx) is an original submitter-supplied record that summarizes a study. • These data are reassembled by GEO staff into curated GEO Datasets (GDSxxx). • A DataSet represents a collection of biologically- and statistically-comparable Samples processed using the same Platform. • Information reflecting experimental variables is provided through DataSet subsets. • Both Series and DataSets are searchable using the GEO DataSets interface, but only DataSets form the basis of GEO's advanced data display and analysis tools including gene expression profile charts and DataSet clusters. • Not all submitted data are suitable for DataSet assembly and we are experiencing a backlog in DataSet creation, so not all Series have a corresponding DataSet record(s).

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62932

Three important files …

Soft-formatted File GSE62932_family.soft.gz Contains the following: • Experiment Information • GEO Sample Identifiers: GSM……. • Submitter Info • List of supplementary files • Submission Info • Probe Info • Probe Expression Values for each individual sample

Series Matrix File GSE62932_series_matrix.txt.gz • Contains general info, then a matrix of probes and samples
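
A series matrix file can also be read in offline mode with read.table(). The following is a minimal sketch, assuming the usual layout in which the general-information lines start with "!" and the expression table begins with an ID_REF header row; the toy text, accession GSE0000 and sample identifiers GSM0001/GSM0002 are invented for illustration.

```r
# Toy text imitating the layout of a series matrix file
matrix_text <- paste(
  "!Series_title\t\"Toy series\"",
  "!Series_geo_accession\t\"GSE0000\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM0001\"\t\"GSM0002\"",
  "\"probe_1\"\t7.1\t6.8",
  "\"probe_2\"\t5.4\t5.9",
  "!series_matrix_table_end",
  sep = "\n"
)

# comment.char = "!" makes read.table() skip the information lines,
# so only the probe-by-sample table is parsed
expr <- read.table(text = matrix_text, header = TRUE, sep = "\t",
                   comment.char = "!", row.names = 1)

print(dim(expr))
print(expr)
```

For a downloaded file, the text argument would be replaced by the path to the local file. Real files can contain quote characters inside the information lines, so this approach should be checked against the actual row count.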

Microarray Matched File GSE62932_microarray_matched.csv.gz • Contains a matrix of genes and samples • Remember that this experiment was interested in 414 genes only!
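
A gzip-compressed CSV file can be read without unpacking it first by passing a gzfile() connection to read.csv(). This is a sketch only: the exact column layout of the real matched file is not shown in the slides, so the toy file below (genes in the first column, one column per sample) is an assumption.

```r
# Create a small gzip-compressed CSV file in a temporary location
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, "w")
writeLines(c(
  "gene,GSM0001,GSM0002",
  "GENE1,7.1,6.8",
  "GENE2,5.4,5.9"
), con)
close(con)

# Read it back through a gzfile() connection; the first column
# (assumed to hold gene names) becomes the row names
matched <- read.csv(gzfile(path), row.names = 1)

print(dim(matched))
print(matched)
```

read.csv() opens and closes the connection itself, so no explicit close() is needed after reading.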
