GenSel4

GenSel4 August 2011

Command Line • GenSel can be run from commandline • For example gensel4 (provided path set appropriately) • GenSel can be run from the BIGS web interface • GenSel jobs can be submitted to the queue on the HPC using the bigscli command from the unix interface of BIGS • Usage • genselinput_file_name • nohupgenselinput_file_name –s status_file_name

Genotypes & Phenotypes • Required for all analyses • trainPhenotypeFileName • markerFileName Used for analysis Read by GenSel4

Genotype File Structure • Space delimited unix file (dos2unix to convert) • header row plus one row for each animal • column for ID then a column for each genotype • One header row • Alphanumeric labels for each genotype/locus • One row for each animal • Alphanumeric ID followed by all the genotypes • -10, 0 or 10 for AA, AB or BB (no support for missing genotypes) • Ordered by genomic location if no map file • Read in binary format (end in .newbin) • Text files are converted to binary in the first analysis • Must be same number of columns in every row

Example Genotype File Disk requirements for 5,000 bovine 50k genotypes in text form are about 1Gb (and the same file in binary format is typically half the size) Species are designated by the first letters of Genus and species bt = Bos Taurus; hs=hom sapiens; oa=ovisarieslss=susscrofaetc This will later provide functionality for species specific genome browsing

Phenotype File Structure • Space delimited unix file • Separate phenotype file for each trait • Header row plus one row for each animal with phenotype • Alphanumeric animal ID must be in column 1 • Trait value must be in column 2 (label in header) • Remainder of file is arbitrary but defines model for trait • Recommend to at least involve a column of 1’s for the mean • Columns headed by alphanumerics – all rows have same no of columns • Columns headed by name ending in $ are class variables • Columns headed by other names are covariates • Columns ending in # are ignored • Column headed by rinverse specifies a weighted analysis

Example Phenotype File rinverse is only proportional (scalar variance factored out) covariates must be numbers! categorical traits must be numbered from 1 upwards trait in column 2 (not required for prediction) sensible to at least have the mean model does not need to be full rank

GenSel matches IDs • Only records with the same alphanumeric ID in the genotype and phenotype file are available for subsequent analysis • Start of analysis reports the number of animals in the genotype file, phenotype file and matching records

Genotypes & Map Files • GenSel now supports the use of a map file • A map file provides chromosome and basepair position information for at least one build • Can support any number of builds • A map file may provide multiple aliases for marker names • Every marker name from the genotype file must exist somewhere in the map file • Additional marker names can be in the map file.

Map File Structure Space delimited unix text file

Map File Options • The minimum requirements are • mapFileName • linkageMap (options depend upon your mapfile) • eg UMD or BTA for my example on last page • This will result in columns of the genotype file being sorted into genomic order to facilitate formation of contiguous marker windows – automatically formed in 1Mb sizes • Options include • addMapInfoToMarkers yes • Results in chromosome and base pair position added to output • outputMarkerHeaderName (options are aliases in your map file)

Filtering Genotypes • 4 methods to filter columns of the genotype file for analysis • Two approaches are always available • includeFileName or excludeFileName • These files contain a list of marker names as in the genotype file header that are to be included or excluded • Include takes precedence over exclude • Two other approaches are available if a map file is used • windowIncFileName or windowExclFileName • List of chromosome_names to include/exclude entire chromosome • List of chromosome_namestart_bpend_bp

Map files & SNP names • Sometimes the genotype file uses one marker name (eg database numeric ID), but the marker output file would benefit from having a different name (egrs number) • Given a map file, Predict can cross reference the different marker names so you can exchange marker results (.mrkRes) files with other users

Output File Name Conventions • Suppose GenSel is run using gensel4 demo.inp • The root for all output files will be “demo” • All options will produce output to demo.out# where # is the next available integer not already used • The first run produces demo.out1, the next demo.out2 etc • Most other options produce additional files that will have the same root name and the same suffix number as the .out file • demo.LD1, demo.mrkRes1, demo.ghat1, demo.winVar1 etc

Analysis Options • Many calculations are time consuming • Computing window Variance • Validating predictive accuracy in test data • Computing PEV and R2 • These are only done in some iterations according to the outputFrequency option • Default is 100 so these calculations occur for 1% iterations • Markov Chains use many random numbers • The seed option (default 1234) can be used to alter sampling

Print • analysisType Print • This can be used to get a printout of the X matrix, ordered by map position if a map file is used, for just those animals in the genotype and phenotype file • The output contains the covariates on a 0, 1, 2 scale, before centering, not on the -10, 0 , 10 scale used in the marker genotype file

LD • analysisType LD • This computes the pairwise squared correlation between every pair of markers in the filtered genotype file • Also computes the minor allele frequencies (MAF) • The output file will be very large if you don’t filter it • Only squared correlations exceeding minLDoutput are stored • minLDoutput (default 0.1)

StepWise • analysisTypeStepWise • Computes (unweighted) forward and reverse submodels after first fitting all the fixed effects • R2 is defined as the proportion of sums of squares after the fixed effects • Three options control the model • inputMaxRsquared (default 0.8) will stop the analysis • inputMaxMarkers (default 100) will stop the analysis • alphaValue (default 0.05) controls significance

Bayes • analysisType Bayes • bayesTypeBayesB • Metropolis-Hastings • Gibbs Sampling • bayesTypeBayesA (Actually just BayesB with pi=0) • bayesTypeBayesC • bayesTypeBayesCPi (Actually BayesC but with pi estimation) • bayesType RBR (Robust Bayesian Regression) • Really Bayes B but with pi, Scale and df (genetic) estimation • FindScale options (no, yes) or for BayesCPi(thruPi)

Bayes Priors • Priors and associated degrees of freedom are required for the genetic and residual variance • genVariance (default 1) • degreesFreedomEffectVar (default 4) • resVariance (default 1) • nuRes (default 10) • Better estimates of genVariance and resVariance should be used • From knowledge of heritability and phenotypic sd

Bayes Options • All analysisType Bayes jobs have extra options • burnIn is the number of iterations in the chain to discard • Probably doesn’t need to be very many (eg 1,000) • chainLength is the number of iterations in the chain • Typically use 41,000 or more (this includes burnIn) • Mixture models (BayesB, BayesC, BayesCPi, RBR) assume a fraction (1-pi) of markers have an effect and pi have 0 effect • Option is for example probFixed 0.95

Bayes Options • BayesB(and therefore BayesA= B0) used to use a Metropolis-Hastings rather than a Gibbs sampler • MHG did 100 MH iterations • Our fast version used a different proposal distribution and required no more than 10 MH iterations • You can specify numMHIter • Long developed an alternative sampler that does not use MH • You select this option using numMHIter 0 • It is faster – the same speed as BayesC

Bayes Options • The 1 Mb windows formed using a map file can be used to compute the variance of the window • This is turned on using windowBV yes • Note the number of markers in each window varies with SNP density along the genome (many markers for chromunk) • This provides posterior distributions of windows so that the previous Permute and Bootstrap options are no longer needed or supported • In the absence of a map file, the columns in the genotype file are assumed to be consecutive, and the number of markers in a window are defined by the windowWidth option • The default is 5 • Automatically get graphs of posteriors and table of variances

Note window Variances typically don’t sum to 100 due to nonzero covariances

Predict • analysisType Predict • markerSolFilename defines the name of a .mrkRes file from a previous training analysis • windowWidth defines the number of markers in a consecutive window from which the overlapping window variances are computed • windowBV yes will result in a file full of ghats with a row for each animal and a column for each overlapping window

GenerateData • Randomly chooses 1-probFixed proportion of loci to be QTL • Samples QTL effects and residual effects according to normal distributions with mean 0 and variance determined by varGenotypic and varResidual • Outputs the simulated genotypes and phenotypes • Phenotypes will be categorical if isCategorical yes with as many categories as specified by numCat (default 2) • Categories will be equal sizes unless specified by the option PortOfCat (eg 0.70:0.20) if numCat 3

Validation • There are two options for validation • Validation can be done jointly with the training analysis • trainPhenotypeFileName • testPhenotypeFileName • If no testPhenotypeFileName, training data is used • This will produce ghat, PEV and R2 for validation animals • Validation can be done in a later session from training • This will produce ghat but no PEV or R2 • All columns of phenotype file are copied into the ghat file to facilitate downstream analysis

Graphing Posteriors • Various posterior distributions will be output if desired using the key word plotPosteriors yes • Samples used in the graphs are in .mcmcSamples which can be produced without graphing if mcmcSamples yes • Requires that gnuplot is installed on the machine in a location accessible using the defined path

Categorical Options • All analysisType Bayes will do categorical analyses if the option isCategorical yes is used • Categories must start from 1, and be ordered without missing categories

Required Libraries • Many routines use matvec libraries • Most matrix and vector computations use Eigen3 • GSL is no longer used • Boost is used (only for format statements) • Limited use of STL • Graphics options require gnuplot • Environment must include paths to gnuplot (/opt/local/bin)

R version • We are developing an R version that will allow you to run any or all of the options from R • Also allow you access to variables created during the analysis • Hope to allow you to replace existing procedures with your own for prototyping new methods or features

Planned Developments • Addition of partial least squares (PLS), Bayesian Lasso • Addition of further random factors beyond the genotypes • Using pedigree, genomic or identity variance-covariance matrices • Extension to multiple trait analysis • Implementation using CUDA graphics processors

GenSel4

GenSel4

Presentation Transcript