Workflow

A step by step workflow for how to run JWAS is shown in this section. The workflow below is used to demonstrate a three-trait Bayesian linear mixed model fitting fixed effects (x1, x3), random effects (x2), direct genetic effects (ID), maternal genetic effects (dam) and genomic information.

Workflow

Method-Specific Guides

For method-specific workflows beyond the general setup in this page, see:

Available Models

Given the data and model equations, several different types of models are available in JWAS as shown below. In the table below, "X" denotes the type of available data, and "Y<=A" denotes that Y individuals is a subset of A individuals.

Linear Mixed Models (LMM)	phenotypes (Y)	pedigree (A)	genotypes (G)	notes
Conventional LMM	X
Pedigree-based LMM	X	X		Y<=A
Complete Genomic LMM	X	optional	X	Y<=G
Incomplete Genomic LMM	X	X	X	Y<=A,G<=A

Note

Incomplete Genomic LMM is also called "single-step" methods in animal breeding.
Pedigree information may be used in Complete Genomic LMM for extra polygenic effects to account for genetic variance not explained by the genomic data (e.g., SNPs).
Pedigree-based LMM (none of the individuals in the pedigree are genotyped) and Complete Genomic LMM (all individuals in the pedigree are genotyped) are special cases of Incomplete Genomic LMM (part of the individuals in the pedigree are genotyped).

Get Data Ready

By default, input data files are comma-separated values (CSV) files, where each line of the file consists of one or more fields, separated by commas. Other field separators such as space (' ') or tab ('\t') can be used if you supply the keyword argument, e.g, CSV.read(...,delim='\t') or add_genotypes(...,separator='\t')

Click on the buttons inside the tabbed menu to see the data:

ID,y1,y2,y3,x1,x2,x3,dam

a1,-0.06,3.58,-1.18,0.9,2,m,NA

a2,-0.6,4.9,0.88,0.3,1,f,NA

a3,-2.07,3.19,0.73,0.7,2,f,NA

a4,-2.63,6.97,-0.83,0.6,1,m,a2

a5,2.31,3.5,-1.52,0.4,2,m,a2

a6,0.93,4.87,-0.01,05,2,f,a3

a7,-0.69,3.1,-1.47,0.5,2,f,a3

a8,-4.69,7.31,-1.09,0.3,2,m,a6

a9,-2.81,7.18,0.76,0.4,2,m,a6

a10,1.92,1.78,-0.88,0.2,1,m,a7

ID,Sire,Dam

a1,0,0

a2,0,0

a3,0,0

a4,a1,a2

a5,a1,a2

a6,a1,a3

a7,a1,a3

a8,a4,a6

a9,a4,a6

a10,a5,a7

ID,m1,m2,m3,m4,m5

a1,1,2,1,1,0

a2,2,1,1,1,1

a3,1,1,0,1,1

a4,2,2,0,1,0

a5,1,1,2,1,1

a6,2,1,0,0,0

a7,0,2,1,2,1

a8,2,2,0,0,0

a9,2,1,0,1,0

a10,0,2,2,2,1

markerID,chromosome,position

m1,1,16977

m2,1,434311

m3,1,1025513

m4,2,70350

m5,2,101135

Step 1: Load Packages

using JWAS,CSV,DataFrames

The JWAS package is loaded, as well as the CSV and DataFrame packages for reading text files.

Step 2: Read Data

Read Phenotypic Data

phenotypes = CSV.read("phenotypes.txt",DataFrame,delim = ',',header=true,missingstring=["NA"])
first(phenotypes)

output:

6×8 DataFrames.DataFrame
│ Row │ ID │ y1    │ y2   │ y3    │ x1  │ x2 │ x3 │ dam       │
├─────┼────┼───────┼──────┼───────┼─────┼────┼────┼───────────┤
│ 1   │ a1 │ -0.06 │ 3.58 │ -1.18 │ 0.9 │ 2  │ m  │ missing   │
│ 2   │ a2 │ -0.6  │ 4.9  │ 0.88  │ 0.3 │ 1  │ f  │ missing   │
│ 3   │ a3 │ -2.07 │ 3.19 │ 0.73  │ 0.7 │ 2  │ f  │ missing   │
│ 4   │ a4 │ -2.63 │ 6.97 │ -0.83 │ 0.6 │ 1  │ m  │ a2        │
│ 5   │ a5 │ 2.31  │ 3.5  │ -1.52 │ 0.4 │ 2  │ m  │ a2        │
│ 6   │ a6 │ 0.93  │ 4.87 │ -0.01 │ 5.0 │ 2  │ f  │ a3        │

The phenotypic data is read on line 1. On line 2, the first several rows of the phenotypic data are shown.

Read Pedigree Data

pedigree   = get_pedigree("pedigree.txt",separator=",",header=true)

link to documentation for get_pedigree

The pedigree data is read on line 1.

Read Genomic Data

genotypes  = get_genotypes("genotypes.txt",G,method="BayesC",separator=",",header=true) #G is optional

link to documentation for get_genotypes

On line 1, the genomic information is read on line 1 with the genotype file. and variance G (a 3x3 matrix). In Bayesian analysis, the G is the mean for the prior assigned for the genomic variance with degree of freedom df, defaulting to 4.0. If G is not provided, a value is calculated from responses (phenotypes).

Dense loading shown above is the default and primary workflow (storage=:dense).

If you want marker annotations to change BayesC prior inclusion probabilities, pass annotations=... to get_genotypes. JWAS automatically prepends an intercept column after marker filtering. See Annotated BayesC for the method description, including:

single-trait dense / fast_blocks / streaming annotated BayesC
dense 2-trait annotated BayesC with the 4-state joint prior, including fast_blocks=true and multi_trait_sampler=:I|:II|:auto

For the dense 2-trait method specifically, including startup Pi, supported sampler modes, and output interpretation, see Multi-Trait Annotated BayesC.

For very large marker panels, an optional streaming backend is available:

prefix = prepare_streaming_genotypes(\"genotypes.txt\",separator=',',header=true)
genotypes = get_genotypes(prefix,G,method=\"BayesC\",storage=:stream) #G is optional

Streaming mode is currently an MVP path (single-trait BayesC, fast_blocks=false, unit residual weights, double_precision=false, exact genotype/phenotype ID order).

Step 3: Build Model Equations

model_equation = "y1 = intercept + x1 + x3 + ID + dam + genotypes
                  y2 = intercept + x1 + x2 + x3 + ID + genotypes
                  y3 = intercept + x1 + x1*x3 + x2 + ID + genotypes"
model=build_model(model_equation,R) #R is optional

link to documentation for build_model

The model equation for a 3-trait analysis is defined on the first 3 lines.

The effects fitted in the model for trait y1 are the intercept, x1, x3, direct genetic effects (ID), maternal genetic effects (dam), and molecular marker effects (genotypes).
The effects fitted in the model for trait y2 are the intercept, x1, x2, x3, direct genetic effects (ID), and molecular marker effects (genotypes).
The effects fitted in the model for trait y3 are the intercept, x1, the interaction between x1 and x3, x2, direct genetic effects (ID), and , and molecular marker effects (genotypes).

On the last line, the model is built given the model equation and residual variance R (a 3x3 matrix). In Bayesian analysis, R is the mean for the prior assigned for the residual variance with degree of freedom df, defaulting to 4.0. If R is not provided, a value is calculated from responses (phenotypes). By default, all effects are treated as fixed and classed as factors (categorical variables) rather than covariates (quantitative variables).

Step 4: Set Factors or Covariate

set_covariate(model,"x1")

link to documentation for set_covariate

On line 1, the effect x1 is defined to be a covariate (a quantitative variable) rather than a factor (a categorical variable).

Step 5: Set Random or Fixed Effects

set_random(model,"x2",G1) #G1 is optional
set_random(model,"ID dam",pedigree,G2) #G2 is optional

link to documentation for set_random

On line 1, the x2 class effect is defined as random with variance G1(a 2x2 matrix). On line 2, direct genetic effects and maternal genetic effects are fitted as ID and dam with G2 (a 4x4 matrix) and the inverse of the numerator relationship matrix defined from pedigree. In Bayesian analysis, G1 and G2 are the means for the priors assigned for the variances with degree of freedom df, defaulting to 4.0. If G1 or G2 is not provided, a value is calculated from responses (phenotypes).

Step 6: Run Bayesian Analysis

outputMCMCsamples(model,"dam")
out=runMCMC(model,phenotypes)

link to documentation for outputMCMCsamples
link to documentation for runMCMC

On line 1, MCMC samples from runMCMC for x2 is saved to a file, where each row represents one sample from the MCMC. On line 2, a multi-trait BayesC analysis is performed with model and phenotypes as had been defined in step 1-6. MCMC samples for marker effects, location parameters specified on line 1, and all variance components from this analysis are saved every output_samples_frequency iterations to files. Set output_marker_effect_samples=false in runMCMC to skip the large marker-effect MCMC sample files while keeping final marker-effect summaries and smaller sample files such as marker-effect variances and pi. This option is not supported with causal_structure, because SEM post-processing requires marker-effect MCMC sample files.

For large marker panels, runMCMC(...; fast_blocks=...) enables the generalized BayesR3 strategy for supported model paths. fast_blocks can be true, a numeric block size, or explicit block starts such as [1, 501, 975]. The default block mode is the exact sequential sweep. Set independent_blocks=true only when you intentionally want the approximate independent-block mode for block-level thread parallelism. See BayesR3 Block Strategy for algorithm details and speed/memory tradeoffs. For standard original BayesC (fast_blocks=false) at very large N and P, see Handling Large Genotype Data Without Loading the Full Matrix into Memory for memory/speed tradeoffs across storage approaches. For feature-level details of the opt-in streaming backend (storage=:stream), including API and benchmarks, see Streaming Genotype Backend (BayesC MVP). For complete SEM usage, reference behavior, and implementation notes, see SEM: Beginner to Advanced.

Several steps above can be skipped if no related information is available, e.g., step 4 is skipped if all effects are classed as factors. Several detailed examples are available in the examples section. Here is the link to documentation for all Public functions.

check results

Posterior means (estimate) and standard deviations (SD) of location parameters, most variance components, and marker effects are saved as the variable out and in text files. They can be listed and obtained as

keys(out)

# output:
#
# Base.KeyIterator for a Dict{Any,Any} with 7 entries. Keys:
#   "polygenic effects covariance matrix"
#   "Model frequency"
#   "residual covariance matrix"
#   "marker effects"
#   "marker effects variance"
#   "location parameters"
#   "Pi"

out["residual variance"]

# output:
#
#Covariance	Estimate	SD
#y1_y1	1.65265	0.29405
#y1_y2	-0.0290279	0.02347
#y1_y3	-0.252009	0.048289
#y2_y1	-0.0290279	0.02347
#y2_y2	0.977405	0.009732
#y2_y3	0.0451994	0.095828
#y3_y1	-0.252009	0.048289
#y3_y2	0.0451994	0.095828
#y3_y3	0.363878	0.049278

In addition, MCMC samples from posterior distributions of marker effects, all variance components, and location parameters specified in step 7, are saved to text files in your working directory. Users can compute the posterior distributions of parameters of interest using these MCMC samples files. A list of output files are shown below.

Output files:

Below is a list of files containing estimates and standard deviations for variables of interest.

file name	description
EBV_y1.txt	estimated breeding values for trait named "y1"
EBV_y2.txt	estimated breeding values for trait named "y2"
EBV_y3.txt	estimated breeding values for trait named "y3"
genetic_variance.txt	estimated genetic variance-covariance for all traits
heritability.txt	estimated heritability
location_parameters.txt	estimated non-genetic effects
pi_genotypes.txt	estimated pi
polygeniceffectscovariance_matrix.txt	estimated variance-covariance between polygenic effects (y1ID, y2ID, y3ID, y1dam)
markereffectsgenotypes.txt	estimated marker effects for all traits
residual_variance.txt	estimated residual variance-covariance for all traits

Below is a list of files containing MCMC samples for variables of interest.

file name	description
MCMCsamplesEBV_y1.txt	MCMC samples from the posterior distribution of breeding values for trait named "y1"
MCMCsamplesEBV_y2.txt	MCMC samples from the posterior distribution of breeding values for trait named "y2"
MCMCsamplesEBV_y3.txt	MCMC samples from the posterior distribution of breeding values for trait named "y3"
MCMCsamplesgenetic_variance.txt	MCMC samples from the posterior distribution of genetic variance-covariance for all traits
MCMCsamplesheritability.txt	MCMC samples from the posterior distribution of heritability
MCMCsamplesmarkereffectsgenotypes_y1	MCMC samples from the posterior distribution of marker effects for trait named "y1"
MCMCsamplesmarkereffectsgenotypes_y2	MCMC samples from the posterior distribution of marker effects for trait named "y2"
MCMCsamplesmarkereffectsgenotypes_y3	MCMC samples from the posterior distribution of marker effects for trait named "y3"
MCMCsamplesmarkereffectsvariances_genotypes.txt	MCMC samples from the posterior distribution of marker effect variance for all traits
MCMCsamplespi_genotypes.txt	MCMC samples from the posterior distribution of pi
MCMCsamplespolygeniceffectsvariance.txt	MCMC samples from the posterior distribution of variance-covariance between y1ID, y2ID, y3ID, y1dam
MCMCsamplesresidual_variance.txt	MCMC samples from the posterior distribution of residual variance-covariance for all traits
MCMCsamplesy1.dam.txt	MCMC samples from the posterior distribution of dam effect for y1
MCMCsamplesy1.IDy2.IDy3.IDy1.damvariances.txt	MCMC samples from the posterior distribution of variance-covariance between y1ID, y2ID, y3ID, y1dam
MCMCsamplesy2.x2y3.x2variances.txt	MCMC samples from the posterior distribution of variance-covariance between y2x2 and y3x2