Pharmacogenomics and Bioinformatics

M. Saleet Jafri

What is pharmacogenomics? • Pharmacogenomics is the use of genomic and sequence data from hosts and pathogens to identify potential drug targets • Involves a variety of techniques/disciplines such as sequence analysis, protein structure, genomics, microarray analysis and others • These fields rely heavily on bioinformatics • Usually focuses on medical or agricultural applications

Human Genome Project Project goals are to • identify all the approximately 20,000-25,000 genes in human DNA, • determine the sequences of the 3 billion chemical basepairs that make up human DNA, • store this information in databases, • improve tools for data analysis, • transfer related technologies to the private sector, and • address the ethical, legal, and social issues (ELSI) that may arise from the project. From http://www.ornl.gov/hgmis/

Human Genome Project Progress - Several types of genome maps have already been completed, and a working draft of the entire human genome sequence was announced in June 2000, with analyses published in February 2001. - An important feature of this project is the federal government's long-standing dedication to the transfer of technology to the private sector. By licensing technologies to private companies and awarding grants for innovative research, the project is catalyzing the multibillion-dollar U.S. biotechnology industry and fostering the development of new medical applications. From http://www.ornl.gov/hgmis/

Human Genome Project • Seven organisms were originally chosen for sequencing. • E. coli • Yeast • Fly • Worm • Arabidopsis • Mouse • Human • Why were these chosen?

Genome Projects As of January 2005 many more had been sequenced • 25 non-plant eukaryotes • 5 plants • 213 microbes completed • 21 Archaea • 274 microbes in progress • 1431 viruses in progress • 833 non-virus organisms with at least one nucleotide sequence submitted • Why were these chosen?

Genome Projects • Chosen by funding agencies • Four main categories • Medical applications • Evolutionary significance • Environmental impact • Food production

How are genomics used for drug target identification? • The basic idea is to look for genes unique to the pathogen that are crucial for its survival. This would be the drug target. • If this is a pathogen in the host, the gene would be in the pathogen and not in the host. • If this was in the environment, the gene should be as specific as possible for the pathogen to avoid harming other organisms that might be beneficial.
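The selection rule on this slide can be sketched as a set operation. This is only an illustration: the gene-family labels and the function name `candidate_targets` are hypothetical, and in practice "present in the host" would be decided by sequence similarity rather than by matching labels.

```python
def candidate_targets(pathogen_genes, host_genes, essential_genes):
    """Return pathogen genes that are essential and have no host counterpart."""
    unique_to_pathogen = set(pathogen_genes) - set(host_genes)
    return sorted(unique_to_pathogen & set(essential_genes))


# Hypothetical gene-family labels, for illustration only.
pathogen = {"famA", "famB", "famC", "famD"}
host = {"famA", "famE"}
essential = {"famA", "famC", "famD"}

print(candidate_targets(pathogen, host, essential))  # ['famC', 'famD']
```

Here famA is essential but shared with the host, and famB is unique but not essential, so neither is reported.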

How can this be done? • To do this genomics, proteomics and bioinformatics are involved. • In any of these cases bioinformatics tools are necessary.

Genome Sequencing and Comparison • As mentioned earlier, many pathogens (viruses, bacteria, and other microorganisms) have been sequenced. • Once they are sequenced, they are annotated. Annotation is the process by which the functions of the different proteins (genes) are determined. • In this way, an understanding of the organism's metabolism is gained.

Malaria • Malaria is caused by parasites of the genus Plasmodium, with Plasmodium falciparum being the most lethal. • Its genome has been sequenced. • It is a pathogen that digests proteins for food. It does not contain any amino acid producing genes in its genome, i.e. it does not make its own amino acids. • Purines are recycled, but there are no genes for purine synthesis. • Has many ATP-dependent solute transporters and one novel multifunctional transporter.

How is annotation done? • Annotation is the process of predicting the function of genes in a genome. • First all the genes have to be found. This is done by finding the open reading frames (ORFs). • ORFs are located with gene finding or gene prediction software.

Gene Prediction • Analysis by sequence similarity can only reliably identify about 30% of the protein-coding genes in a genome • 50-80% of new genes identified have a partial, marginal, or unidentified homolog • Frequently expressed genes tend to be more easily identifiable by homology than rarely expressed genes

Gene Finding • Process of identifying potential coding regions in an uncharacterized region of the genome • Still a subject of active research • There are many different gene finding software packages and no one program is capable of finding everything

Eukaryotes vs Prokaryotes • Eukaryotic DNA is wrapped around histones, which might result in repeated patterns (periodicity of 10) for histone binding. The promoter regions might be near these sites so that they remain hidden. • Prokaryotes have no introns. • Promoter regions and start sites are more highly conserved in prokaryotes • Different codon use frequencies

Gene finding is species-specific • Codon usage patterns vary by species • Functional regions (promoters, splice sites, translation initiation sites, termination signals) vary by species • Common repeat sequences are species-specific • Gene finding programs rely on this information to identify coding regions

The genetic code

Codon usage

Identifying ORFs • Simple first step in gene finding • Translate genomic sequence in six frames. Identify stop codons in each frame • Regions without stop codons are called "open reading frames" or ORFs • Locate and tag all of the likely ORFs in a sequence • The longest ORF from a Met codon is a good prediction of a protein encoding sequence. • SOFTWARE: NCBI ORF Finder
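The six-frame scan described on this slide can be sketched in a few lines. This is a minimal illustration, not how NCBI ORF Finder is implemented: the function name `find_orfs`, the `min_codons` cutoff, and the choice to start each ORF at the first ATG after the previous stop are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) for ATG..stop ORFs in six frames.

    start and end are 0-based offsets into the strand being scanned
    (the reverse complement for strand "-"); end is exclusive.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first Met after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
# [('+', 2, 2, 14, 'ATGAAATTTTAG')]
```

Because each ORF begins at the first ATG following a stop codon, the reported ORF is the longest one from a Met codon in that stretch. ORFs that run off the end of the sequence without a stop are not reported in this sketch.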

ORF Finder input

ORF finder results

Tests of the Predicted ORF • Check if the third base in the codons tends to be the same one more often than by chance alone. • Check whether the codons used in the ORF are the same as those used in other genes (this requires codon usage frequencies). • Compare the amino acid sequence for similarity with other known amino acid sequences.
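The first two checks both start from simple counts over the ORF's codons. A minimal sketch follows; the function names are hypothetical, and turning the counts into a significance test or a comparison with a reference codon usage table is left out.

```python
from collections import Counter


def split_codons(orf):
    """Split an in-frame sequence into complete codons, dropping any remainder."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def codon_counts(orf):
    """Count how often each codon occurs in the ORF."""
    return Counter(split_codons(orf))


def third_base_frequencies(orf):
    """Fraction of codons ending in each base."""
    counts = Counter(codon[2] for codon in split_codons(orf))
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


orf = "ATGGCCGCCGCGTAA"
print(codon_counts(orf))
print(third_base_frequencies(orf))
```

A strong skew in the third-base frequencies, or codon counts that resemble those of known genes from the same organism, would both count in favor of the prediction.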

Problems with ORF finding • A single-character sequencing error can hide a stop codon or insert a false stop codon, preventing accurate identification of ORFs • Short exons can be overlooked • Multiple transcripts or ORFs on complementary strand can confuse results

Pattern-based gene finding • ORF finding based on start and stop codon frequency is a pattern-based procedure • Other pattern-based procedures recognize characteristic sequences associated with known features and genes, such as ribosome binding sites, promoter sites, histone binding sites, etc. • Statistically based.

Content-based gene finding • Content-based gene finding methods rely on statistical information derived from known sequences to predict unknown genes • Some evaluative measures include: "coding potential" (based on codon bias), periodicity in the sequence, sequence homogeneity, etc.

A standard content-based alignment procedure • Select a window of DNA sequence from the unknown. The window is usually around 100 base pairs long • Evaluate the window's potential as a gene, based on a variety of factors • Move the window over by one base • Repeat procedure until end of sequence is reached; report continuous high-scoring regions as putative genes
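The window procedure above can be sketched as follows. The scoring function here is GC fraction, used purely as a stand-in for a real coding-potential measure; the function names, the threshold, and the default window size are assumptions for illustration.

```python
def gc_fraction(window):
    """Placeholder score: fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


def high_scoring_regions(seq, window=100, threshold=0.6, score=gc_fraction):
    """Slide a window one base at a time and merge consecutive high-scoring windows.

    Returns (start, end) pairs, end exclusive, covering every window whose
    score is at least the threshold.
    """
    regions = []
    current_start = None
    last = len(seq) - window  # start index of the final window
    for i in range(last + 1):
        if score(seq[i:i + window]) >= threshold:
            if current_start is None:
                current_start = i
        elif current_start is not None:
            regions.append((current_start, i - 1 + window))
            current_start = None
    if current_start is not None:
        regions.append((current_start, last + window))
    return regions


seq = "AT" * 10 + "GC" * 10 + "AT" * 10
print(high_scoring_regions(seq, window=10, threshold=0.8))  # [(18, 42)]
```

In the example the GC-rich block occupies positions 20 to 39, but the reported region is (18, 42): the boundaries are blurred by a few bases on each side, which is the window-size uncertainty inherent in this approach.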

Combining measures • Programs rarely use one measure to predict genes • Different values are combined (using probabilistic methods, discriminant analysis, neural net methods, etc.) to produce one "score" for the entire window

Drawbacks to window-based evaluation • A sequence length of at least 100 b.p. is required before significant information can be gained from the analysis • Results in a +/- 100 b.p. uncertainty in the start site of predicted coding regions, unless an unambiguous pattern can also be found to indicate the start.

Most are web-based, but... • Submit sequence; input sequence length may be limited • Select parameters, if any • Interpret results • Most software is first or second generation; results come in non-graphical formats. • GeneMark, GenScan, Glimmer

How is annotation done? • Function is predicted by comparing the DNA sequences of the genes to known genes in a database. If the sequences are similar, then a similar function is assumed. • The comparison is done using sequence comparison tools such as BLAST

Database Searching for Similar Sequences • Database searching for similar sequences is ubiquitous in bioinformatics. • Databases are large and getting larger • Need fast methods

Types of Searches • Sequence similarity search with query sequence • Alignment search with profile (scoring matrix with gap penalties) • Search with position-specific scoring matrix representing ungapped sequence alignment • Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search • Search query sequence for patterns representative of protein families From Bioinformatics by Mount

DNA vs Protein Searches • DNA sequences consist of 4 characters (nucleotides) • Protein sequences consist of 20 characters (amino acids) • Hence, it is easier to detect patterns in protein sequences than DNA sequences • Better to convert DNA sequences to protein sequences for searches.

Database Searching Efficacy • To evaluate searching methods, selectivity and sensitivity need to be considered. • Selectivity is the ability of the method not to find members known to be of another group (i.e. false positives). • Sensitivity is the ability of the method to find members of the same protein family as the query sequence.
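Both measures can be computed from a search result when the true family membership is known. In this sketch, sensitivity is TP/(TP+FN) and selectivity is taken as TN/(TN+FP); that second formula is one common reading of "not finding false positives" and is an assumption, as are the function name and the sequence identifiers.

```python
def search_performance(hits, family, database):
    """Return (sensitivity, selectivity) for one database search.

    hits: sequences the search reported
    family: true members of the query's protein family
    database: every sequence that was searched
    """
    hits, family, database = set(hits), set(family), set(database)
    tp = len(hits & family)             # family members found
    fn = len(family - hits)             # family members missed
    fp = len(hits - family)             # non-members reported
    tn = len(database - hits - family)  # non-members correctly left out
    sensitivity = tp / (tp + fn)
    selectivity = tn / (tn + fp)
    return sensitivity, selectivity


# Hypothetical identifiers, for illustration only.
database = {f"s{i}" for i in range(10)}
family = {"s0", "s1", "s2", "s3"}
hits = {"s0", "s1", "s2", "s8"}

print(search_performance(hits, family, database))
```

With these inputs the search finds three of four family members (sensitivity 0.75) and wrongly reports one of six non-members (selectivity 5/6).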

Protein Searches • Easier to identify protein families by sequence similarity rather than structural similarity. (same structure does not mean same sequence) • Use the appropriate gap penalty scorings • Evaluate results for statistical significance.

History • Historically, dynamic programming was used for database sequence similarity searching. • Computer memory, disk space, and CPU speed were limiting factors. • Speed is still a factor due to the larger databases and increased number of searches. • FASTA and BLAST allow fast searching.

History • The PAM250 matrix was used for a long time. It corresponds to a period of time where only 20% of the amino acids have remained unchanged. • BLOSUM has replaced PAM250 in most applications. BLAST uses the BLOSUM62 matrix. FASTA uses the BLOSUM50 matrix.

Search Tools • Similarity Search Tools • Smith-Waterman Searching • Heuristic Search Tools • FASTA • BLAST
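To make the dynamic programming approach behind Smith-Waterman searching concrete, here is a minimal sketch that computes only the best local alignment score. The match, mismatch, and linear gap values are illustrative assumptions; real protein searches would score with a substitution matrix such as BLOSUM62 and usually with affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6
```

The nested loops visit every pair of positions, so the cost grows with the product of the two sequence lengths. That cost, repeated for every database entry, is why heuristic tools such as FASTA and BLAST are used for routine searches.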

Malaria Vaccine • A German and American team used reverse genetics, i.e. they used the sequenced genome, deduced the candidate genes, and then knocked out a particular gene (Uis3). • This gives 30-day immunity in mice, which is better than vaccines made by traditional methods

Microarray Data Analysis Gene chips allow the simultaneous monitoring of the expression level of thousands of genes. Many statistical and computational methods are used to analyze this data. These include: • statistical hypothesis tests for differential expression analysis • principal component analysis and other methods for visualizing high-dimensional microarray data • cluster analysis for grouping together genes or samples with similar expression patterns • hidden Markov models, neural networks and other classifiers for predictively classifying sample expression patterns as one of several types (diseased, i.e. cancerous, vs. normal)

What is Microarray Data? Although microarrays let us simultaneously monitor the expression of thousands of genes, microarray data has some liabilities. Each microarray is very expensive, the statistical reproducibility of the data is relatively poor, and there are a lot of genes and complex interactions in the genome. Microarray data is often arranged in an n x m matrix M with rows for the n genes and columns for the m biological samples in which gene expression has been monitored. Hence, mij is the expression level of gene i in sample j. A row ei is the gene expression pattern of gene i over all the samples. A column sj is the expression level of all genes in a sample j and is called the sample expression pattern.
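The matrix layout can be illustrated with a small nested list. The values and helper names below are hypothetical; note that Python indices start at 0, whereas the slide's i and j presumably start at 1.

```python
# Hypothetical expression levels: 3 genes (rows) x 4 samples (columns).
M = [
    [2.0, 2.5, 0.5, 0.4],
    [1.0, 1.1, 0.9, 1.0],
    [0.2, 0.3, 3.0, 2.8],
]


def gene_pattern(matrix, i):
    """Row i: expression of gene i across all samples."""
    return matrix[i]


def sample_pattern(matrix, j):
    """Column j: expression of every gene in sample j."""
    return [row[j] for row in matrix]


print(gene_pattern(M, 0))    # [2.0, 2.5, 0.5, 0.4]
print(sample_pattern(M, 2))  # [0.5, 0.9, 3.0]
```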

Types of Microarrays • cDNA microarray • Nylon membrane and plastic arrays (by Clontech) • Oligonucleotide silicon chips (by Affymetrix) • Note: Each new version of a microarray chip is at least slightly different from the previous version. This means that the measures are likely to change. This has to be taken into account when analyzing data.

cDNA Microarray • The expression level eij of gene i in sample j is expressed as a log ratio, eij = log(rij/gi), where rij is its measured expression level in this sample and gi is its expression level in a control. • When this data is visualized, eij is color coded on a scale running from red (rij >> gi) to green (rij << gi), with mixtures in between.
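A short sketch of the log ratio follows. The slide does not state the base of the logarithm; base 2 is assumed here because it makes a two-fold change equal to one unit, and the function name is hypothetical.

```python
import math


def log_ratio(r, g):
    """Base-2 log ratio of sample expression r to control expression g."""
    return math.log2(r / g)


print(log_ratio(8.0, 2.0))  # 2.0   four-fold up in the sample
print(log_ratio(2.0, 8.0))  # -2.0  four-fold down in the sample
print(log_ratio(5.0, 5.0))  # 0.0   no change
```

Taking the log makes increases and decreases symmetric around zero, which is what allows a balanced red-to-green color scale.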

Nylon Membrane and Plastic Arrays (by Clontech) • A raw intensity and a background value are measured for each gene. • The analyst is free to choose the raw intensity or can adjust it by subtracting the background intensity.
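The background adjustment amounts to a per-gene subtraction. In this sketch the function name is hypothetical, and clamping negative results to a floor of zero is an added assumption, not something the slide specifies.

```python
def background_corrected(raw, background, floor=0.0):
    """Subtract each gene's background from its raw intensity, clamped at floor."""
    return [max(r - b, floor) for r, b in zip(raw, background)]


print(background_corrected([120.0, 80.0, 30.0], [20.0, 25.0, 40.0]))
# [100.0, 55.0, 0.0]
```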

Oligonucleotide Silicon Chips (by Affymetrix) • These arrays produce a variety of numbers derived from 16-20 pairs of perfect match (PM) and mismatch (MM) probes. • There are several statistics related to gene expression that can be derived from this data. The most commonly used one is the average difference (AVD), which is derived from the differences of PM-MM in the 16-20 probe pairs. • The next most commonly used method is the log absolute value (LAV), which comes from the ratios PM/MM in the probe pairs. • Note: The Affymetrix gene-chip software has an absent/present call for each gene on a chip. According to Jagota, the method is complex and arbitrary so they usually ignore it.
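The two statistics can be sketched from a list of probe pairs. These are plain averages under stated assumptions: the vendor software applies additional steps that are not reproduced here, and reading LAV as the mean of log10(PM/MM) is one plausible interpretation of the slide rather than a confirmed definition. The function names are hypothetical.

```python
import math


def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs."""
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)


def mean_log_ratio(pm, mm):
    """Mean of log10(PM / MM) over the probe pairs (assumed reading of LAV)."""
    logs = [math.log10(p / m) for p, m in zip(pm, mm)]
    return sum(logs) / len(logs)


# Hypothetical intensities for two probe pairs.
pm = [1000.0, 100.0]
mm = [100.0, 10.0]
print(average_difference(pm, mm))  # 495.0
print(mean_log_ratio(pm, mm))      # 1.0
```

The example shows why the two can disagree: both pairs have the same PM/MM ratio, but the brighter pair dominates the average difference.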

For What Do We Use Microarray Data? • Genes with similar expression patterns over all samples – We can compare the expression patterns ei and ei' of two genes i and i' over all samples. • If we use cluster analysis, we can separate the genes into groups of genes with similar expression patterns (trees). • This will allow us to find which unknown genes have altered expression in a particular disease by comparing the pattern to genes known to be affiliated with a disease. • It can also find genes that fit a certain pattern such as a particular pattern of change with time. • It can also characterize broad functional classes of new genes from the known classes of genes with similar expression.
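Comparing two gene expression patterns needs a similarity measure. Pearson correlation is a common choice and is used here as an assumption; the slide does not name a measure, and the example values are hypothetical. Cluster analysis would then group genes based on such pairwise similarities.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same shape as gene_a, twice the level
gene_c = [8.0, 6.0, 4.0, 2.0]  # opposite trend

print(pearson(gene_a, gene_b))  # close to 1
print(pearson(gene_a, gene_c))  # close to -1
```

Correlation compares the shape of the pattern rather than its absolute level, so gene_a and gene_b count as similar even though one is expressed at twice the level of the other. The sketch fails with a division by zero if a pattern is constant across samples.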

For What Do We Use Microarray Data? • Genes with unusual expression levels in a sample – In contrast to standard statistical methods where we ignore outliers, here outliers might have particular importance. Hence, we look for genes whose expression levels are very different from the others. • Genes whose expression levels vary across samples – We can compare gene expression levels of a particular gene or set of genes in different samples. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment.
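One way to compare a gene's expression between two groups of samples is a two-sample t-statistic. The sketch below uses Welch's form, which does not assume equal variances; the choice of test, the function name, and the values are assumptions for illustration, and converting the statistic to a p-value is omitted.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for the difference in mean expression."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical expression levels of one gene in two tissue groups.
diseased = [5.0, 6.0, 7.0]
normal = [1.0, 2.0, 3.0]
print(welch_t(diseased, normal))
```

Because thousands of genes are tested at once on a microarray, any per-gene test like this would need a correction for multiple comparisons before genes are declared differentially expressed.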

For What Do We Use Microarray Data? • Samples that have similar expression patterns – We might want to compare the expression patterns of all genes between two samples. We might cluster the genes into groups of genes with similar expression patterns to help with the comparison. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment. • Tissues that might be cancerous (diseased) – We can take the gene expression pattern of a sample and compare it to library expression patterns that indicate diseased or not diseased tissue.
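Comparing a sample against library patterns can be sketched as a nearest-pattern lookup. Euclidean distance, the function name, and the reference values are all assumptions; the classifiers mentioned earlier in the deck (hidden Markov models, neural networks) are more elaborate alternatives.

```python
import math


def nearest_label(sample, library):
    """Return the label of the library pattern closest to the sample.

    library maps a label to a reference sample expression pattern.
    """
    return min(library, key=lambda label: math.dist(sample, library[label]))


# Hypothetical reference patterns over three genes.
library = {
    "normal": [1.0, 1.0, 1.0],
    "diseased": [3.0, 0.2, 2.5],
}

print(nearest_label([2.8, 0.3, 2.4], library))  # diseased
```

`math.dist` requires Python 3.8 or later.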

Statistical Methods Can Help • Experimental Design – Since using microarrays is costly and time consuming, we want to design experiments to use the minimal number of microarrays that will give a statistically significant result. • Data Pre-processing – It is sometimes useful to preprocess the data prior to visualization. An example of this is the log ratio mentioned earlier. It is often necessary to rescale data from different microarrays so that they can be compared. This is due to chip-to-chip variation in intensity. Another type of preprocessing is subtracting the mean and dividing by the standard deviation.
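The last preprocessing step can be sketched as follows. This version divides by the population standard deviation, which is the usual z-score convention and is an assumption here; the function name and values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center values on their mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical intensities from one chip.
chip = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize(chip))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this step every chip has mean 0 and unit spread, so values from chips with different overall intensity can be compared directly.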
