Block BayesC (fast_blocks)
This page explains how JWAS implements block updates for marker sampling and how the block path changes speed and memory usage. The block BayesC implementation uses a strategy similar to the blocked update scheme described in the BayesR3 paper. JWAS keeps the exact sequential block sweep as the default and also provides an explicit approximate parallel mode through independent_blocks=true. For detailed non-block vs block memory accounting, see Memory Usage. For a real cluster timing benchmark at N=50,000 targeting P=2,000,000 and chain_length=2000, see Benchmark.
BayesR3 paper (Methods): https://www.nature.com/articles/s42003-022-03624-1
When This Path Is Used
Block updates are enabled with:
```julia
out = runMCMC(model, phenotypes; fast_blocks=true)

# or provide a fixed block size
out = runMCMC(model, phenotypes; fast_blocks=64)

# or provide explicit marker block starts
out = runMCMC(model, phenotypes; fast_blocks=[1, 501, 975, 1600])

# or opt into the approximate independent-block mode
out = runMCMC(model, phenotypes; fast_blocks=[1, 501, 975, 1600],
              independent_blocks=true)
```

- If `fast_blocks=true`, JWAS chooses `block_size = floor(sqrt(nObs))`.
- If `fast_blocks` is numeric, JWAS uses that fixed block size.
- If `fast_blocks` is a vector, JWAS treats the entries as explicit marker block starts. The vector must be sorted, unique, start at `1`, and stay within `1:nMarkers`.
- `independent_blocks=false` is the default. It keeps the exact sequential block sweep: after each block, JWAS updates the global corrected phenotype / residual before sampling the next block.
- `independent_blocks=true` is opt-in. It freezes the sweep-level corrected phenotype / residual, updates blocks independently using Julia threads, and merges all block deltas after the block barrier.
- In single-trait BayesA/B/C, JWAS calls `BayesABC_block!`.
- In single-trait BayesR, JWAS calls `BayesR_block!`.
- Dense annotated single-trait BayesC and BayesR use the same block machinery with marker-specific annotation priors.
- In multi-trait BayesA/B/C with unconstrained marker covariance (`Mi.G.constraint == false`), JWAS calls `MTBayesABC_block!` and honors `multi_trait_sampler=:I`, `:II`, or `:auto`.
- Dense annotated 2-trait BayesC uses the same multi-trait block sampler choices.
- If `Mi.G.constraint == true`, JWAS uses `megaBayesABC!` (non-block path).
- In the current implementation, numeric `fast_blocks` should satisfy `block_size < nMarkers` (chain-length scaling indexes the second block start).
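All three input forms reduce to a list of block start indices. A minimal sketch of that reduction, in plain Python with a hypothetical helper name `block_starts` (the actual JWAS internals may differ):

```python
import math

def block_starts(fast_blocks, n_obs, n_markers):
    """Reduce the three fast_blocks input forms to sorted block start indices.

    Illustrative sketch of the rules described above, not the JWAS source.
    """
    if fast_blocks is True:
        block_size = math.isqrt(n_obs)                    # floor(sqrt(nObs))
    elif isinstance(fast_blocks, int):
        block_size = fast_blocks                          # fixed numeric size
    else:
        starts = list(fast_blocks)                        # explicit starts
        assert starts == sorted(set(starts)), "must be sorted and unique"
        assert starts[0] == 1 and starts[-1] <= n_markers, "must stay in 1:nMarkers"
        return starts
    assert block_size < n_markers, "numeric block_size must be < nMarkers"
    return list(range(1, n_markers + 1, block_size))      # 1-based, like Julia

print(block_starts(True, 5000, 200)[:3])   # sqrt(5000) -> 70: [1, 71, 141]
print(block_starts(64, 200, 200))          # [1, 65, 129, 193]
```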
Single-Trait BayesC Without Blocks
In the standard BayesC update (BayesABC!), each marker j is updated one-by-one:
- Compute conditional posterior terms (`rhs`, `lhs`, `gHat`) from the current `yCorr`.
- Compute the marker inclusion probability `probDelta1`.
- Sample `δ[j]` (include/exclude) and sample/update `β[j]` and `α[j]`.
- Immediately update `yCorr` using marker column `x_j`.

So `yCorr` is updated after every marker.
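The per-marker loop can be sketched as follows. This is a simplified toy version under stated assumptions (unit weights, known `var_e`/`var_alpha`, fixed inclusion probability `pi_incl`); the exact JWAS formulas and variance updates differ in detail:

```python
import math
import random

def bayesc_marker_sweep(X, y_corr, alpha, delta, var_e, var_alpha, pi_incl):
    """One per-marker BayesC-style Gibbs sweep (toy version).

    Assumptions (not from JWAS source): unit weights, known variances,
    fixed inclusion probability pi_incl. X is a list of row lists.
    """
    n, p = len(y_corr), len(alpha)
    for j in range(p):
        xj = [row[j] for row in X]
        xpx = sum(v * v for v in xj)
        # rhs/lhs/gHat from the current yCorr (add marker j's own effect back in)
        rhs = sum(xj[i] * y_corr[i] for i in range(n)) + xpx * alpha[j]
        lhs = xpx + var_e / var_alpha
        g_hat = rhs / lhs
        # probDelta1: inclusion probability for the spike-and-slab indicator
        log_odds = (0.5 * (g_hat * rhs / var_e - math.log(lhs * var_alpha / var_e))
                    + math.log(pi_incl / (1.0 - pi_incl)))
        prob_delta1 = 1.0 / (1.0 + math.exp(-log_odds))
        old = alpha[j]
        if random.random() < prob_delta1:      # sample delta[j], then alpha[j]
            delta[j] = 1
            alpha[j] = random.gauss(g_hat, math.sqrt(var_e / lhs))
        else:
            delta[j], alpha[j] = 0, 0.0
        for i in range(n):                     # immediate yCorr update with x_j
            y_corr[i] += xj[i] * (old - alpha[j])
    return alpha, delta
```

Note the invariant: `yCorr + X * alpha` always reconstructs the original corrected phenotype, because every effect change is immediately reflected in `yCorr`.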
Single-Trait Block BayesC in JWAS
In the block version (BayesABC_block!), markers are partitioned into blocks.
Precomputation (once before MCMC)
For each block b, JWAS builds:
- `X_b` (block genotype matrix)
- `X_b'R^{-1}X_b` (`XpRinvX`, block Gram matrix)

`X_b'R^{-1}` is not persisted as a block matrix. Instead, JWAS computes `X_b'R^{-1}yCorr` on demand into a reusable block workspace vector each outer iteration.
Update flow inside each MCMC outer iteration
For each block:
- Build the block RHS once: `XpRinvycorr = X_b'R^{-1} yCorr`.
- Update markers inside the block using BayesC logic. Instead of touching the full `yCorr` for each marker, update the block RHS using columns of `X_b'R^{-1}X_b`.
- After finishing the block, update the full `yCorr` once: `yCorr += X_b * (α_old_block - α_new_block)`.
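The block bookkeeping can be sketched in plain Python (unit weights; `sample_effect` stands in for whatever per-marker rule is in force — this illustrates the RHS accounting, not the JWAS code):

```python
def block_sweep(X, XtX_blocks, blocks, y_corr, alpha, sample_effect):
    """One exact sequential block sweep (toy version, unit weights).

    blocks: list of (lo, hi) 0-based marker ranges; XtX_blocks[b] holds the
    precomputed block Gram matrix X_b' X_b. sample_effect(rhs_j, xpx_j, old_j)
    stands in for the per-marker sampling rule.
    """
    n = len(y_corr)
    for b, (lo, hi) in enumerate(blocks):
        XtX = XtX_blocks[b]
        # 1) build the block RHS once: X_b' yCorr
        rhs = [sum(X[i][lo + k] * y_corr[i] for i in range(n)) for k in range(hi - lo)]
        old = alpha[lo:hi]
        for k in range(hi - lo):
            j = lo + k
            rhs_j = rhs[k] + XtX[k][k] * alpha[j]    # add marker j's effect back in
            new = sample_effect(rhs_j, XtX[k][k], alpha[j])
            # 2) in-block RHS correction via the Gram column (length block_size)
            d = alpha[j] - new
            for k2 in range(hi - lo):
                rhs[k2] += XtX[k2][k] * d
            alpha[j] = new
        # 3) single full-yCorr update on block exit: yCorr += X_b (old - new)
        for i in range(n):
            y_corr[i] += sum(X[i][lo + k] * (old[k] - alpha[lo + k])
                             for k in range(hi - lo))
```

The per-marker work inside a block touches length-`block_size` vectors only; the length-`N` residual is touched once per block.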
Important implementation detail
In the current JWAS code:
- For `fast_blocks=true` or numeric `fast_blocks`, inner repeats are set to `nreps = block_size`.
- For `fast_blocks=true` or numeric `fast_blocks`, the outer `chain_length` is reduced by approximately a factor of `block_size`.
- For explicit block starts, one MCMC iteration means one full sweep over the supplied blocks, and JWAS does not rescale `chain_length`.
This keeps effective marker-update work on a similar scale while moving much of the per-marker work from nObs-length operations to block_size-length operations. The explicit block-start form uses sweep semantics because user-provided LD or pedigree-informed blocks can have different sizes.
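The chain-length bookkeeping can be made concrete. A small sketch, assuming a nominal chain length `L` and a fixed block size `b` (the `effective_updates` helper is illustrative, not a JWAS function):

```python
def effective_updates(chain_length, block_size):
    """Rough effective per-marker update count for the fixed-size block path.

    Sketch of the scaling described above: outer iterations shrink by roughly
    a factor of block_size, while each block gets nreps = block_size inner
    repeats per outer iteration.
    """
    outer = max(1, chain_length // block_size)   # rescaled outer iterations
    inner = block_size                           # nreps per block
    return outer * inner                         # approximate per-marker updates

# Non-block: 2000 per-marker updates. Block path with b = 64:
print(effective_updates(2000, 64))   # 31 * 64 = 1984, close to 2000
```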
Exact Sequential Blocks vs Independent Blocks
The default block sampler is exact for the current model because blocks are updated in sequence and the global corrected phenotype / residual is refreshed after each block. The independent-block mode changes only the inter-block schedule.
Single-Trait View
Let:
- `y*` be the phenotype after non-marker terms are removed
- `X = [X_1, X_2, ..., X_B]` be the marker matrix partitioned into blocks
- `alpha_b` be the marker effects in block `b`
- `W` be the diagonal observation-weight matrix represented by `Rinv`
The exact block update for block `b` uses:

```
s_b = X_b' W (r + X_b alpha_b)
    = X_b' W (y* - sum_{c != b} X_c alpha_c)
```

`independent_blocks=true` assumes the off-block weighted genotype crossproducts are negligible:

```
X_b' W X_c = 0   for b != c
```

Under that condition, the off-block terms vanish. For two marker blocks, this is the familiar intuition:

```
x1' W (y - x1 alpha1 - x2 alpha2) = x1' W (y - x1 alpha1)
```

when `x1' W x2 = 0`. If the off-block crossproducts are small but not zero, `independent_blocks=true` is an approximation.
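A small numeric check of this identity, using hand-built vectors for which `x1' W x2 = 0` holds exactly:

```python
def wdot(a, b, w):
    """Weighted dot product a' W b with diagonal W."""
    return sum(a[i] * w[i] * b[i] for i in range(len(a)))

# Two 'blocks' of one marker each; x1' W x2 = 0 by construction.
x1 = [1.0, -1.0, 0.0, 0.0]
x2 = [0.0,  0.0, 1.0, 1.0]
w  = [1.0,  1.0, 0.5, 2.0]
y  = [2.0,  1.0, 3.0, 0.5]
a1, a2 = 0.7, -0.3

assert abs(wdot(x1, x2, w)) < 1e-12   # off-block crossproduct vanishes

# Exact RHS (corrects for the other block) vs independent RHS (ignores it):
exact = wdot(x1, [y[i] - x1[i] * a1 - x2[i] * a2 for i in range(4)], w)
indep = wdot(x1, [y[i] - x1[i] * a1 for i in range(4)], w)
print(exact, indep)   # identical when x1' W x2 = 0
```

With correlated blocks (`x1' W x2 != 0`), the two quantities differ by `-x1' W x2 * a2`, which is exactly the leakage the approximation discards.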
Multi-Trait View
For multi-trait models, let:
- `Y*` be the matrix of corrected phenotypes after non-marker terms are removed
- `A_b` be the marker-effect matrix for block `b`
- `R^{-1}` be the residual precision matrix among traits
The within-block sampler still uses the existing multi-trait residual covariance logic through `R^{-1}`. The independent-block assumption is still about genotype-side block leakage:

```
X_b' W X_c = 0   for b != c
```

So the approximate parallel mode:
- freezes the trait-wise corrected phenotype vectors at the start of the marker sweep;
- updates each genotype block independently from that frozen state;
- applies all block effect deltas to the global corrected phenotypes after all blocks finish.
The prior is not changed by this option. Plain and annotated priors keep their existing marker-state logic.
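The freeze/update/merge schedule in the three bullets above can be sketched with Python threads (JWAS uses Julia threads; `update_block` is a stand-in for the real block sampler):

```python
from concurrent.futures import ThreadPoolExecutor

def independent_block_sweep(X, blocks, y_corr, alpha, update_block):
    """Approximate parallel sweep: freeze, update blocks independently, merge.

    update_block(X, lo, hi, frozen_y, old_alphas) -> new alpha values for the
    block. Sketch of the schedule only, not the JWAS implementation.
    """
    frozen = y_corr[:]                            # 1) freeze corrected phenotypes

    def work(bounds):
        lo, hi = bounds
        old = alpha[lo:hi]
        return lo, hi, old, update_block(X, lo, hi, frozen, old)  # 2) independent

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(work, blocks))    # barrier: all blocks finish

    for lo, hi, old, new in results:              # 3) merge all block deltas
        alpha[lo:hi] = new
        for i in range(len(y_corr)):
            y_corr[i] += sum(X[i][lo + k] * (old[k] - new[k])
                             for k in range(hi - lo))
```

Because every block reads the same frozen state, the result is schedule-independent, which is what makes the parallel dispatch safe (at the cost of the orthogonality approximation).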
Server Use
The independent-block path uses Julia threads over blocks. On a server, use:
```sh
export JULIA_NUM_THREADS=<num_cores>
export OPENBLAS_NUM_THREADS=1
julia --project=.
```

`OPENBLAS_NUM_THREADS=1` avoids oversubscribing the machine with nested BLAS threads while JWAS is already parallelizing over marker blocks.
Detailed Comparison with BayesR3
Reference paper: BayesR3 (Communications Biology, 2022), DOI: 10.1038/s42003-022-03624-1.
JWAS uses BayesC (not BayesR), but the block linear-algebra strategy closely follows the same blocked Gibbs pattern.
Step-by-step correspondence
| BayesR3 paper step | JWAS block BayesC implementation | Status |
|---|---|---|
| Partition markers into blocks | fast_blocks builds marker blocks | Same strategy |
| Build per-block RHS (rb = Vb'We) | block_rhs!(XpRinvycorr, XArray[i], yCorr, Rinv, unit_weights) | Same strategy |
| Within-block marker update uses current block RHS | BayesC per-marker rhs/lhs/gHat from XpRinvycorr | Same strategy |
| In-block RHS correction via block Gram column | BLAS.axpy!(..., view(XpRinvX[i],:,j), XpRinvycorr) | Same strategy |
| Update residual once on block exit | yCorr += X_b*(α_old_block-α_new_block) | Same strategy |
What is different
| Topic | BayesR3 paper | JWAS block BayesC | Practical implication |
|---|---|---|---|
| Marker prior model | BayesR mixture (multiple non-zero normal components plus zero component) | BayesC spike-slab style inclusion (δ∈{0,1} for this path) | Same acceleration idea, different posterior model |
| Marker state sampling | Multi-class mixture state | Binary include/exclude state | Not numerically identical to BayesR |
| Inner-repeat schedule | Uses nominal block size as fixed repeat count across blocks | nreps = current block_size | Last short block may receive fewer inner repeats |
| Outer-loop scheduling | Described as a fixed block sweep schedule | JWAS also rescales outer chain_length by block size | Compare effective updates, not just outer iterations |
| Scope | BayesR algorithm | JWAS block path is wired to BayesA/B/C and BayesR marker samplers, with multi-trait BayesC honoring sampler I/II dispatch | Strategy reused across Bayesian alphabet members |
Scheduling detail (explicit)
JWAS sets nreps equal to the current block size.
- Full blocks:
nrepsequals the nominal block size. - Final short block:
nrepsis smaller than full blocks.
In the BayesR3 description, nreps is treated as fixed by the nominal block size for all blocks, including the final short block.
This difference changes the number of within-block Gibbs sweeps for short blocks, but not the core residual/RHS block-update identity.
Algorithm Comparison
| Aspect | Standard BayesC (BayesABC!) | Exact block BayesC (BayesABC_block!) | Independent blocks (independent_blocks=true) |
|---|---|---|---|
| Update unit | One marker at a time | One block, then markers inside block | One block per thread/chunk, then markers inside block |
| `yCorr` / corrected phenotype updates | Every marker | Once per block, before the next block | Once after all blocks in the sweep finish |
| Inter-block schedule | Fully sequential | Sequential and exact | Parallelizable and approximate unless off-block crossproducts vanish |
| Main per-marker linear algebra size | nObs | block_size (inside block cache) | block_size (inside block cache) |
| Extra precompute | Minimal | X_b'R^{-1}X_b for all blocks (RHS computed on demand) | Same block precompute |
| Extra memory | Minimal | Stores block Gram matrices (XpRinvX) and block workspaces | Same plus block-local deltas and thread-local buffers |
| Chain behavior | Direct chain_length | Fixed-size blocks use inner repeats + outer chain scaling; explicit starts use full sweeps | Same chain semantics as the selected fast_blocks partition |
Computational Complexity
Use the notation:
- `N`: number of records (`nObs`)
- `P`: number of markers (`nMarkers`)
- `b`: nominal block size
- `B = ceil(P/b)`: number of blocks
- `L`: standard (non-block) chain length
Standard BayesC (non-block)
- Per MCMC iteration: `O(PN)` (marker-wise dot products and residual updates over `N` records)
- Total over `L` iterations: `O(LPN)`
JWAS block BayesC (current implementation)
Let block sizes be s_i with sum_i s_i = P.
Per outer iteration:
- Block RHS construction across all blocks: `O(NP)`
- In-block updates: `O(sum_i s_i^3)` (because `nreps = s_i` and in-block RHS updates are length `s_i`)
- Residual updates on block exit across all blocks: `O(NP)`

So per outer iteration: `O(NP + sum_i s_i^3)`. With near-uniform blocks (`s_i ≈ b`), this is `O(NP + P b^2)`.
JWAS rescales outer iterations to approximately m = floor(L/b), so the main total cost is:
O((L/b) * (NP + P b^2)) = O(LP(N/b + b))
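These terms can be checked numerically with toy sizes (pure Python; the counts are leading-order operation tallies, not measured FLOPs):

```python
def per_outer_ops(N, sizes):
    """Leading operation count for one outer iteration of the block path."""
    P = sum(sizes)
    rhs_build = N * P                      # block RHS construction, all blocks
    in_block  = sum(s**3 for s in sizes)   # nreps = s_i, length-s_i RHS updates
    residual  = N * P                      # block-exit residual updates
    return rhs_build + in_block + residual

N, P, b = 5_000, 50_000, 70
sizes = [b] * (P // b) + ([P % b] if P % b else [])  # 714 blocks of 70 + one of 20
block_ops    = per_outer_ops(N, sizes)
standard_ops = N * P                       # one non-block iteration: O(PN)
print(block_ops, standard_ops)             # 744910000 250000000
# Totals: non-block runs L outer iterations, block runs ~L/b of them,
# so compare L * standard_ops against (L // b) * block_ops.
```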
BayesR3 (block strategy and paper fit)
BayesR3 uses the same blocked-update strategy family (block RHS, in-block updates using a block Gram matrix, then a block-exit residual update), but it is a BayesR mixture model rather than BayesC. This changes constants (more mixture-state work per marker), not the core block linear-algebra pattern.
Operation-count view (dense blocked implementation):
- BayesR3 runs a nominal number of inner cycles `n` per block, and the paper recommends that `n` equal the nominal block size `b`.
- With `n = b`, the leading operation-count terms match the same block strategy family as JWAS: total work scales like `O(LP(N/b + b))`.
Paper runtime fit (Fig. 5):
- The BayesR3 paper reports an empirical timing model where processing time per SNP is proportional to `(N + b)/b = N/b + 1`.
- This is a fit to measured runtime for their implementation/hardware, not a formal asymptotic operation-count derivation. It effectively treats the in-block work (the `+b`-type term) as a small constant relative to the `N/b` term in that regime.
Practical differences in complexity interpretation
- JWAS and BayesR3 share the same blocked-update strategy family but are not identical in constants or scheduling details.
- JWAS uses `nreps = current block_size` for each block (including the final short block).
- BayesR3 describes using the nominal block repeat count for all blocks, including the final short block.
Numerical example (N=200,000, P=2,000,000)
Assume `fast_blocks=true`, so JWAS uses `b = floor(sqrt(N)) = 447`. Then:

- `B = ceil(P/b) = ceil(2,000,000/447) = 4,475` blocks
- Standard BayesC total scaling: `O(LPN) = O(L * 2,000,000 * 200,000)`
- JWAS block BayesC main scaling: `O(LP(N/b + b)) = O(L * 2,000,000 * (200,000/447 + 447))`
- BayesR3 paper timing fit: runtime per SNP `∝ (N + b)/b`, so total runtime `∝ L * 2,000,000 * (200,000/447 + 1)`
So the per-LP coefficients are:
- Standard BayesC: `200,000`
- JWAS block BayesC operation count: `~894.4` (from `N/b + b`)
- BayesR3 paper fit: `~448.4` (from `N/b + 1`)
This implies:
- JWAS block vs standard: `~224x` lower
- BayesR3 paper fit vs standard: `~446x` lower
- The apparent `~2.0x` gap between `~894` and `~448` is not an apples-to-apples complexity comparison: it is the difference between an operation-count model (`N/b + b`) and an empirical runtime fit (`N/b + 1`).
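The coefficients above follow from direct arithmetic:

```python
N, P, b = 200_000, 2_000_000, 447

standard  = N               # per-(L*P) coefficient, standard BayesC
block_ops = N / b + b       # operation-count model for the block path
paper_fit = N / b + 1       # BayesR3 empirical runtime fit

print(round(block_ops, 1))          # 894.4
print(round(paper_fit, 1))          # 448.4
print(round(standard / block_ops))  # 224
print(round(standard / paper_fit))  # 446
```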
Detailed Resource Model (Current fast_blocks Path)
Use:
- `N`: records
- `P`: markers
- `b`: nominal block size
- `s_i`: size of block `i`, with `sum_i s_i = P`
- `t`: bytes per floating-point value (`4` for `Float32`, `8` for `Float64`)
Memory formulas
The current block implementation (after removing the persistent `XRinvArray`) stores:

- dense `X`: `N*P` values
- block Gram matrices `XpRinvX`: `sum_i s_i^2` values
- marker summary `xpRinvx`: `P` values
- an optional extra `N*P` for `xRinvArray`, only for non-unit weights in the non-block marker path
- small reusable block workspaces (`O(b)`) for RHS and local deltas
Approximate totals:
- Unit weights: `Mem_block_unit ~= t * (N*P + sum_i s_i^2 + P) + O(b*t)`
- Non-unit weights: `Mem_block_nonunit ~= t * (2*N*P + sum_i s_i^2 + P) + O(b*t)`
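A direct transcription of these two formulas (pure Python, bytes; the small `O(b*t)` workspace term is ignored):

```python
def block_memory_bytes(N, P, sizes, t, unit_weights=True):
    """Approximate block-path memory per the formulas above (bytes).

    t: bytes per value (4 for Float32, 8 for Float64). Non-unit weights add
    one extra N*P array; the small O(b*t) workspace term is omitted.
    """
    x_copies = 1 if unit_weights else 2
    return t * (x_copies * N * P + sum(s * s for s in sizes) + P)

N, P, b = 5_000, 50_000, 70
sizes = [b] * (P // b) + ([P % b] if P % b else [])
print(block_memory_bytes(N, P, sizes, 4))         # 1014196000 bytes, unit weights
print(block_memory_bytes(N, P, sizes, 4, False))  # 2014196000 bytes, non-unit
```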
Runtime working set
Per block update, active high-volume buffers are:
- `X_b` view (`N x s_i`; no data copy)
- `XpRinvX[i]` (`s_i x s_i`)
- block RHS workspace (length `s_i`)
- `yCorr` (length `N`)
So peak additional block-local workspace is roughly:
`O(s_i^2 + s_i + N)` values
I/O and precompute considerations
For in-memory dense X, no out-of-core read is required during MCMC sweeps. The expensive setup term is building XpRinvX:
- precompute cost scales approximately as `O(N * sum_i s_i^2)` (`~O(NPb)` under near-uniform blocks)
This setup can dominate startup time for very large N,P, even when per-iteration sampling is fast.
Worked Large-Scale Example (N=500,000, P=2,000,000)
Assume `fast_blocks=true`, so `b = floor(sqrt(N)) = 707`. Then:

- `B = ceil(P/b) = 2,829`
- `sum_i s_i^2 = 1,413,937,788`
Memory-relevant terms:
| Term | Float32 | Float64 |
|---|---|---|
| `X` | ~4.00 TB (3.64 TiB) | ~8.00 TB (7.28 TiB) |
| `XpRinvX` | ~5.66 GB (5.27 GiB) | ~11.31 GB (10.53 GiB) |
| `xpRinvx` | ~8.0 MB | ~16.0 MB |
So for unit weights, block mode remains dominated by X, with XpRinvX as the main incremental memory term.
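The table values can be reproduced with a few lines (pure Python; decimal TB/GB/MB vs binary TiB as in the table):

```python
import math

N, P = 500_000, 2_000_000
b = math.isqrt(N)                      # floor(sqrt(500_000)) = 707
B = -(-P // b)                         # ceil(P / b) = 2,829 blocks
sizes = [b] * (P // b) + ([P % b] if P % b else [])
sum_sq = sum(s * s for s in sizes)     # 1,413,937,788

for t, name in [(4, "Float32"), (8, "Float64")]:
    x_tb  = t * N * P / 1e12           # dense X, decimal terabytes
    x_tib = t * N * P / 2**40          # dense X, binary tebibytes
    g_gb  = t * sum_sq / 1e9           # XpRinvX, decimal gigabytes
    v_mb  = t * P / 1e6                # xpRinvx, decimal megabytes
    print(name, round(x_tb, 2), round(x_tib, 2), round(g_gb, 2), round(v_mb, 1))
```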
What To Watch Closely
- `block_size` choice:
  - too small: less speedup (the `N/b` term remains large)
  - too large: `XpRinvX` memory and precompute cost rise (`~P*b` memory and `~NPb` setup trend)
- Effective chain length:
  - the current implementation rescales outer iterations and uses inner repeats (`nreps = block_size`)
  - compare runs by effective updates, not only outer-iteration count
- Final short block behavior:
  - the last block uses a smaller `nreps` (equal to its own size), so sweep symmetry differs slightly
- Independent blocks:
  - exact only when off-block weighted genotype crossproducts vanish
  - otherwise an explicit approximation for speed and parallelism
- Multi-trait block path specifics:
  - sampler I, sampler II, and `:auto` are supported for unconstrained covariance mode
  - extra temporaries (e.g., trait-wise old-alpha handling) can become noticeable at larger trait counts
- Numerical reproducibility:
  - mathematically equivalent refactors (e.g., in-place BLAS updates) can change floating-point roundoff
  - expect tiny non-bitwise differences, especially in `Float32`
- Weighting mode:
  - the block `XRinvArray` is no longer persisted
  - separate non-unit-weighted non-block `xRinvArray` materialization remains a distinct issue
- Scope constraints:
  - the current source notes this fast block option is intended for one genotype category
  - numeric `fast_blocks` should keep `block_size < nMarkers`
Example: Speed/Memory Tradeoff
Assume:
- `nObs = 5_000`
- `nMarkers = 50_000`
- `block_size = 70` (close to `sqrt(nObs)`)
- `nBlocks = 715`
Approximate memory (Float32):
| Item | Approx size |
|---|---|
| Genotype matrix `X` (`nObs x nMarkers`) | ~953.7 MiB |
| Extra `XpRinvX` in block mode | ~13.4 MiB |
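These sizes follow from simple byte counts (pure Python, `Float32`; the `XpRinvX` figure here treats every block as full size, a slight upper bound):

```python
nObs, nMarkers, block_size = 5_000, 50_000, 70
t = 4                                            # bytes per Float32 value

nBlocks = -(-nMarkers // block_size)             # ceil(50_000 / 70) = 715
x_mib    = t * nObs * nMarkers / 2**20           # genotype matrix X
gram_mib = t * nBlocks * block_size**2 / 2**20   # XpRinvX, all blocks full size
print(round(x_mib, 1), round(gram_mib, 1))       # 953.7 13.4
```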
Interpretation:
- Block mode can be much faster for large `nObs`, because heavy per-marker operations shift to block-level cached operations.
- Block mode uses more memory than non-block mode, mainly from `XpRinvX` (plus small block workspaces).
- Practical speedup is typically below the theoretical arithmetic speedup due to random branching, allocation overhead, and BLAS/runtime effects.
Practical Guidance
- Start with `fast_blocks=true` for large marker sets when RAM allows.
- If memory is tight, set a smaller numeric block size (e.g., `32` or `64`) and benchmark.
- If the speed gain is small, try a few block sizes and choose based on wall time plus memory headroom.
- If you have external LD, recombination, IBD, or pedigree-informed blocks, pass them as explicit starts with `fast_blocks=[...]`.
- Use `independent_blocks=true` only when you intentionally accept the independent-block approximation and want block-level thread parallelism.