Skip Connection (Genotypes → Phenotypes) and Multi-Class Priors
This tutorial shows how to add a Layer 1 → Layer 3 skip/shortcut term so your 2→3 equation can include both:
- Middle-layer features (observed and/or partially missing omics), and
- a direct genotype effect on the phenotype.
This is useful when you want NNMM.jl to behave more like JWAS multi-class BayesC in the fully observed-omics case:
y = intercept + omics + genotypeswhile still keeping NNMM’s ability to sample missing omics in the middle layer.
Model overview
We assume a 3-layer NNMM:
Layer 1: Genotypes (X)
Layer 2: Omics/Intermediate traits (O, may contain missing values)
Layer 3: Phenotype (y)With skip connection
The 2→3 equation includes both marker classes:
y = intercept + O + XConceptually, NNMM fits two marker-effect “classes” in the 2→3 step:
- an omics class (effects for Layer 2 features), and
- a genotype-skip class (effects for Layer 1 markers directly on y).
Step-by-step example
This example uses the built-in simulated dataset (same data source as manual/tutorial.md).
1) Load data and write separate files
using NNMM
using NNMM.Datasets
using DataFrames, CSV
using Random
Random.seed!(42)
geno_path = Datasets.dataset("genotypes_1000snps.txt", dataset_name="simulated_omics_data")
pheno_path = Datasets.dataset("phenotypes_sim.txt", dataset_name="simulated_omics_data")
pheno_df = CSV.read(pheno_path, DataFrame)
# Layer 2 (omics)
omics_cols = vcat(:ID, [Symbol("omic$i") for i in 1:10])
omics_df = pheno_df[:, omics_cols]
omics_file = "omics_data.csv"
CSV.write(omics_file, omics_df; missingstring="NA")
# Layer 3 (phenotype)
trait_df = pheno_df[:, [:ID, :trait1]]
trait_file = "phenotypes.csv"
CSV.write(trait_file, trait_df; missingstring="NA")2) Define layers
layers = [
Layer(layer_name="genotypes", data_path=[geno_path]),
Layer(layer_name="omics", data_path=omics_file, missing_value="NA"),
Layer(layer_name="phenotypes", data_path=trait_file, missing_value="NA"),
]3) Define equations (including the skip term)
Key idea: the 2→3 equation explicitly includes + genotypes.
eq1 = Equation(
from_layer_name = "genotypes",
to_layer_name = "omics",
equation = "omics = intercept + genotypes",
traits = ["omic$i" for i in 1:10],
method = "BayesC",
estimatePi = true,
)
# Optional: set different priors/hyperparameters for each marker class in 2->3.
# The keys must match the layer names appearing in the equation ("omics" and "genotypes").
class_priors_23 = Dict(
"omics" => (
method = "BayesC",
Pi = 0.0,
estimatePi = true,
# You can also set (G, df_G, estimate_variance_G, ...) here if desired.
),
"genotypes" => (
method = "BayesC",
Pi = 0.0,
estimatePi = true,
),
)
eq2 = Equation(
from_layer_name = "omics",
to_layer_name = "phenotypes",
equation = "phenotypes = intercept + omics + genotypes",
traits = ["trait1"],
class_priors = class_priors_23,
activation_function = "linear",
)4) Run MCMC
results = runNNMM(layers, [eq1, eq2];
chain_length = 5000,
burnin = 1000,
output_folder = "nnmm_skip_example",
seed = 42,
)Which output should you compare to JWAS?
With observed omics available for an individual, the closest “JWAS-like” fitted value is:
results["EPV_Output_NonLinear"](uses observed Layer 2 values on output IDs, plus the genotype-skip term)
If you want a genotype-only prediction that does not assume omics will be available at prediction time, use:
results["EBV_NonLinear"](genotype-only total EBV when skip is present)
When a skip term is present, NNMM also reports the decomposition:
results["EBV_Indirect_NonLinear"]: mediated component (Genotypes → predicted MiddleLayer → y)results["EBV_Direct_Skip"]: direct component from the 2→3 genotype-skip marker class only
and by definition:
EBV_NonLinear = EBV_Indirect_NonLinear + EBV_Direct_SkipOmics-class scaling for BayesC priors (JWAS-like)
When NNMM sets default BayesC(-like) priors for a marker class, it needs to convert a target genetic variance (σ_g²) into a marker-effect variance (σ_a²).
For a marker class with design matrix X (columns x_j), BayesC can be written (conceptually) as:
\[g = X (\delta \odot a) \\ \delta_j \sim \mathrm{Bernoulli}(1-\pi) \\ a_j \sim \mathcal{N}(0, \sigma_a^2)\]
Assuming centered predictors and (approximately) independent columns, we have:
\[\mathrm{Var}(g) \approx \sum_j \mathrm{Var}(x_j)\,\mathrm{Var}(\delta_j a_j) = \sum_j \mathrm{Var}(x_j)\,(1-\pi)\,\sigma_a^2\]
so the usual scaling is:
\[\sigma_a^2 \approx \frac{\sigma_g^2}{(1-\pi)\,\sum_j \mathrm{Var}(x_j)}\]
Why nFeatures worked before (and when it still does)
If your predictors are standardized so Var(x_j) ≈ 1 for all features, then:
\[\sum_j \mathrm{Var}(x_j) \approx n_\text{features}\]
which recovers the older NNMM scaling:
\[\sigma_a^2 \approx \frac{\sigma_g^2}{(1-\pi)\,n_\text{features}}\]
Why omics needs ∑ var(feature_j) (unstandardized features)
Omics features are often not standardized and can have very different units/scales, so ∑ Var(x_j) can be far from nFeatures. Using nFeatures in that situation makes the prior (and therefore the posterior) depend on arbitrary measurement units, and it can prevent NNMM from matching JWAS multi-class BayesC in the complete-omics case.
To make the omics marker class behave more JWAS-like, NNMM now computes the omics analogue of JWAS’s genotype sum2pq as:
\[\sum_j \mathrm{Var}(\text{omics feature}_j)\]
and uses that value in the σ_g² → σ_a² conversion for omics.
Implementation notes
- The change is implemented in
src/markers/genotype_tools.jlingenetic2marker(::Omics, π). - Per-feature variances are computed ignoring
missingvalues, and the denominator is clamped away from 0 to avoid numerical issues. - If you pre-standardize omics so each feature has variance ≈ 1, the new scaling reduces to the old
nFeaturesscaling.
How this differs from the older “all-missing node” pattern
Before skip connections, a common workaround to represent a genotype shortcut was:
- Add a latent node (e.g.,
latent1) in Layer 2. - Set that node to be missing for all individuals.
- Fit:
- 1→2:
omics = intercept + genotypes(includinglatent1as an “omic” trait), and - 2→3:
phenotypes = intercept + omics(no direct+ genotypesterm).
- 1→2:
This creates an indirect genotype path genotypes → latent1 → phenotype rather than a direct genotypes → phenotype class of marker effects.
Statistical perspective (why “latent-only” is not the same as “skip”)
The “all-missing node” workaround can look similar to a skip term in a linear-Gaussian setting, but it is generally a different statistical model.
Consider the simplest case:
- Middle-layer latent node
zis missing for everyone. - Linear activation.
- Single trait.
No skip + one fully-missing latent node:
z = X a + e_z
y = w z + e_ySubstituting out z gives:
y = X (w a) + (w e_z + e_y)This resembles a direct genotype effect with β = w a, but with important differences:
- Factorized (rank-1) effect: the marker effect is the product
β = w a. The induced prior onβis not the same as putting BayesC directly onβ. - Weak identifiability / scaling:
a,w, and the latent residual variance can trade off (e.g., scalingaup andwdown leavesw asimilar), which can slow MCMC mixing. - Different error model: the effective noise includes
w e_z, so the likelihood is not the same as a standard skip model unless you impose strong constraints.
Skip (direct genotype class in 2→3):
y = Z_obs w_obs + X β + e_yHere β is a directly parameterized marker-effect class with its own prior/hyperparameters (via class_priors), and it is typically more identifiable and stable.
So, no-skip + fully-missing latent node is not guaranteed to produce the same posterior or predictions as skip (especially under BayesC-style priors and when variances are learned).
When some omics are partially missing
With partially missing omics nodes in the MiddleLayer, NNMM is a joint model that both:
- imputes missing omics using 1→2 (Genotypes → MiddleLayer), and
- fits 2→3 (MiddleLayer → Phenotypes) using the current/imputed MiddleLayer values.
Key difference:
- No skip (full mediation) forces genotype information to influence
yonly through the (partly imputed) MiddleLayer. If omics are very missing, prediction can become sensitive to how well that imputation is identified. - Skip (partial mediation) adds a direct
X βpath, so prediction can remain stable even when omics are missing at training time and/or prediction time.
Practical guidance
- If your goal is JWAS-like multi-class BayesC behavior and stable inference, prefer:
phenotypes = intercept + omics + genotypesin 2→3, and- do not add a fully-missing latent node unless you have a specific modeling reason.
- Use a fully-missing latent node only if you intentionally want an extra unobserved mediator dimension—and be aware it can introduce identifiability/mixing issues.
Pros of the skip connection (new approach)
- JWAS-like multi-class structure in 2→3:
y = intercept + omics + genotypes(two marker classes). - Separate priors per class via
class_priors(e.g., differentPi, different variance hyperparameters). - Uses observed omics directly when they are available (via
EPV_Output_NonLinear), without forcing genotype effects to go through a latent mediator node. - Still supports partially missing omics: missing Layer 2 values are sampled (for
activation_function="linear") while conditioning on phenotype residuals that include the skip term.
Pros of the all-missing node (older workaround)
- Keeps the model strictly in the original 2→3 form
phenotypes = intercept + omics(no extra marker class in 2→3). - If you truly want all genotype influence on
yto be mediated through the middle layer, the latent-node pattern is closer to that biological assumption.
Cons / tradeoffs
- Interpretability: with both
omicsandgenotypesin 2→3, they can explain overlapping variance; “direct” vs “mediated” effects depend on modeling assumptions and priors. - MCMC mixing: adding a second large marker block to 2→3 (genotype skip) can increase computation and autocorrelation; longer chains may be needed.
- Prediction target choice:
EPV_Output_NonLinear(omics available) andEBV_NonLinear(genotype-only) answer different questions; make sure you evaluate the one that matches your deployment setting.
Real-data reference script (repository)
For an end-to-end real-data comparison against JWAS multi-class BayesC, see:
test_real_data_codex.jl(usesTempTestData/nnmm_small_dataset)