Block BayesC (fast_blocks)
This page explains how JWAS implements block updates for BayesC marker sampling and how the block path changes speed and memory usage. The block BayesC implementation uses a strategy similar to the blocked update scheme described in the BayesR3 paper. For detailed non-block vs block memory accounting, see Memory Usage. For a real cluster timing benchmark at N=50,000 targeting P=2,000,000 and chain_length=2000, see Benchmark.
BayesR3 paper (Methods): https://www.nature.com/articles/s42003-022-03624-1
When This Path Is Used
Block updates are enabled with:
```julia
out = runMCMC(model, phenotypes; fast_blocks=true)
# or provide a fixed block size
out = runMCMC(model, phenotypes; fast_blocks=64)
```

- If `fast_blocks=true`, JWAS chooses `block_size = floor(sqrt(nObs))`.
- In single-trait BayesA/B/C, JWAS calls `BayesABC_block!`.
- In multi-trait BayesA/B/C with unconstrained marker covariance (`Mi.G.constraint == false`), JWAS calls `MTBayesABC_block!`.
- If `Mi.G.constraint == true`, JWAS uses `megaBayesABC!` (non-block path).
- In current multi-trait block mode, the update is sampler-I style (trait-wise `δ` updates from `BigPi[d0]`/`BigPi[d1]`) rather than the non-block sampler-I/II dispatcher.
- Current implementation note in source: this option is intended for one genotype category.
- In the current implementation, a numeric `fast_blocks` value should satisfy `block_size < nMarkers` (chain-length scaling indexes the second block start).
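The default block-size rule above can be sketched with simple arithmetic (illustrative Python, not JWAS's Julia source; the function name is hypothetical):

```python
import math

def default_block_size(n_obs: int) -> int:
    """block_size = floor(sqrt(nObs)), the fast_blocks=true default."""
    return math.isqrt(n_obs)  # integer floor of the square root

# e.g., 50,000 records -> blocks of 223 markers
print(default_block_size(50_000))  # -> 223
```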
Single-Trait BayesC Without Blocks
In the standard BayesC update (`BayesABC!`), each marker `j` is updated one at a time:

- Compute conditional posterior terms (`rhs`, `lhs`, `gHat`) from the current `yCorr`.
- Compute the marker inclusion probability `probDelta1`.
- Sample `δ[j]` (include/exclude) and sample/update `β[j]`, `α[j]`.
- Immediately update `yCorr` using marker column `x_j`.
So the full-length `yCorr` vector is updated once per marker.
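The per-marker flow above can be sketched as follows (a minimal Python illustration, not JWAS's Julia source; it assumes unit weights, known variances, and a fixed prior inclusion probability `pi_delta1` — all simplifications):

```python
import math
import random

def bayesc_marker_update(x_j, y_corr, alpha_j, sigma2_e, sigma2_a, pi_delta1, rng):
    """One BayesC-style spike-slab update for a single marker (simplified sketch).
    Mirrors the rhs/lhs/gHat -> probDelta1 -> sample-and-update-yCorr flow."""
    n = len(x_j)
    xpx = sum(v * v for v in x_j)
    # rhs adds the marker's own current effect back into the corrected phenotype
    rhs = sum(x_j[i] * y_corr[i] for i in range(n)) + xpx * alpha_j
    lhs = xpx + sigma2_e / sigma2_a
    g_hat = rhs / lhs
    # log Bayes factor for delta=1 vs delta=0, plus prior log-odds
    log_odds = (0.5 * math.log(sigma2_e / (lhs * sigma2_a))
                + 0.5 * g_hat * rhs / sigma2_e
                + math.log(pi_delta1 / (1.0 - pi_delta1)))
    prob_delta1 = 1.0 / (1.0 + math.exp(-log_odds))
    if rng.random() < prob_delta1:  # delta = 1: sample effect from its conditional
        new_alpha = g_hat + rng.gauss(0.0, math.sqrt(sigma2_e / lhs))
    else:                           # delta = 0: effect excluded
        new_alpha = 0.0
    # immediately apply the effect change to the full-length yCorr
    for i in range(n):
        y_corr[i] += x_j[i] * (alpha_j - new_alpha)
    return new_alpha
```

Note the last loop: every marker update touches all `nObs` entries of `yCorr`, which is exactly the cost the block path amortizes.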
Single-Trait Block BayesC in JWAS
In the block version (`BayesABC_block!`), markers are partitioned into blocks.
Precomputation (once before MCMC)
For each block `b`, JWAS builds:

- `X_b` (block genotype matrix)
- `X_b' R^{-1} X_b` (`XpRinvX`, the block Gram matrix)

`X_b' R^{-1}` is not persisted as a block matrix. Instead, JWAS computes `X_b' R^{-1} yCorr` on demand into a reusable block workspace vector each outer iteration.
Update flow inside each MCMC outer iteration
For each block:
- Build the block RHS once: `XpRinvycorr = X_b' R^{-1} yCorr`.
- Update markers inside the block using BayesC logic.
- Instead of touching the full `yCorr` for each marker, update the block RHS using columns of `X_b' R^{-1} X_b`.
- After finishing the block, update the full `yCorr` once: `yCorr += X_b * (α_old_block - α_new_block)`.
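The block flow above can be sketched as a skeleton (illustrative Python, not JWAS source; it assumes unit weights and `R^{-1} = I`, and `update_marker` stands in for the BayesC per-marker sampler):

```python
def block_bayesc_sweep(Xb, GramB, y_corr, alpha, update_marker, n_reps=1):
    """Block-update skeleton: build the block RHS once, correct it per marker
    via Gram-matrix columns, and touch the full yCorr only on block exit.
    GramB = Xb' Xb is precomputed once before MCMC."""
    n, s = len(Xb), len(GramB)
    alpha_old = list(alpha)
    # 1) build the block RHS once: rhs = Xb' yCorr
    rhs = [sum(Xb[i][j] * y_corr[i] for i in range(n)) for j in range(s)]
    for _ in range(n_reps):
        for j in range(s):
            # per-marker rhs adds the marker's own current effect back in
            new_a = update_marker(j, rhs[j] + GramB[j][j] * alpha[j], GramB[j][j])
            d = alpha[j] - new_a
            if d != 0.0:
                # 2) in-block RHS correction using a Gram column (axpy-like)
                for k in range(s):
                    rhs[k] += GramB[k][j] * d
                alpha[j] = new_a
    # 3) update the full-length yCorr only once, on block exit
    for i in range(n):
        y_corr[i] += sum(Xb[i][j] * (alpha_old[j] - alpha[j]) for j in range(s))
```

With a deterministic `update_marker`, this produces the same result (up to roundoff) as the marker-by-marker scheme, which is the sense in which the block path is a refactor rather than a different model.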
Important implementation detail
In the current JWAS code:
- inner repeats are set to `nreps = block_size`;
- the outer `chain_length` is divided by approximately `block_size` (outer iterations ≈ `floor(chain_length / block_size)`).

This keeps the effective per-marker update count on a similar scale while moving much of the per-marker work from `nObs`-length operations to `block_size`-length operations.
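This rescaling can be checked with simple arithmetic (illustrative Python; the function name is hypothetical and assumes outer iterations ≈ `floor(chain_length / block_size)`):

```python
def effective_marker_sweeps(chain_length: int, block_size: int) -> int:
    """Outer iterations ~ floor(L/b), each with nreps = b inner repeats,
    so effective per-marker sweeps stay close to the original chain length."""
    outer = chain_length // block_size
    return outer * block_size

print(effective_marker_sweeps(2000, 64))  # -> 1984 (31 outer * 64 inner, close to 2000)
```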
Detailed Comparison with BayesR3
Reference paper: BayesR3 (Communications Biology, 2022), DOI: 10.1038/s42003-022-03624-1.
JWAS uses BayesC (not BayesR), but the block linear-algebra strategy closely follows the same blocked Gibbs pattern.
Step-by-step correspondence
| BayesR3 paper step | JWAS block BayesC implementation | Status |
|---|---|---|
| Partition markers into blocks | fast_blocks builds marker blocks | Same strategy |
| Build per-block RHS (rb = Vb'We) | block_rhs!(XpRinvycorr, XArray[i], yCorr, Rinv, unit_weights) | Same strategy |
| Within-block marker update uses current block RHS | BayesC per-marker rhs/lhs/gHat from XpRinvycorr | Same strategy |
| In-block RHS correction via block Gram column | BLAS.axpy!(..., view(XpRinvX[i],:,j), XpRinvycorr) | Same strategy |
| Update residual once on block exit | yCorr += X_b*(α_old_block-α_new_block) | Same strategy |
What is different
| Topic | BayesR3 paper | JWAS block BayesC | Practical implication |
|---|---|---|---|
| Marker prior model | BayesR mixture (multiple non-zero normal components plus zero component) | BayesC spike-slab style inclusion (δ∈{0,1} for this path) | Same acceleration idea, different posterior model |
| Marker state sampling | Multi-class mixture state | Binary include/exclude state | Not numerically identical to BayesR |
| Inner-repeat schedule | Uses nominal block size as fixed repeat count across blocks | nreps = current block_size | Last short block may receive fewer inner repeats |
| Outer-loop scheduling | Described as a fixed block sweep schedule | JWAS also rescales outer chain_length by block size | Compare effective updates, not just outer iterations |
| Scope | BayesR algorithm | JWAS block path is wired to BayesA/B/C marker samplers, with multi-trait block updates implemented in sampler-I style | Strategy reused in a different Bayesian alphabet member |
Scheduling detail (explicit)
JWAS sets `nreps` equal to the current block size:

- Full blocks: `nreps` equals the nominal block size.
- Final short block: `nreps` is smaller (equal to that block's own size).

In the BayesR3 description, `nreps` is treated as fixed by the nominal block size for all blocks, including the final short block.
This difference changes the number of within-block Gibbs sweeps for short blocks, but not the core residual/RHS block-update identity.
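The per-block `nreps` schedule can be illustrated with a small partition helper (hypothetical Python, not JWAS source):

```python
def block_nreps(n_markers: int, block_size: int):
    """Per-block nreps under the current rule nreps = actual block size;
    only the final short block differs from the nominal size."""
    sizes = []
    start = 0
    while start < n_markers:
        sizes.append(min(block_size, n_markers - start))
        start += block_size
    return sizes

# 2,000,000 markers with nominal block size 707:
# 2828 full blocks of 707 markers plus one short block of 604
sizes = block_nreps(2_000_000, 707)
```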
Algorithm Comparison
| Aspect | Standard BayesC (BayesABC!) | Block BayesC (BayesABC_block!) | Practical effect |
|---|---|---|---|
| Update unit | One marker at a time | One block, then markers inside block | Better cache locality in block path |
| yCorr updates | Every marker | Once per block | Fewer full-length vector updates |
| Main per-marker linear algebra size | nObs | block_size (inside block cache) | Lower per-marker arithmetic cost |
| Extra precompute | Minimal | X_b'R^{-1}X_b for all blocks (RHS computed on demand) | More startup work |
| Extra memory | Minimal | Stores block Gram matrices (XpRinvX) and block workspaces | Higher memory than non-block, lower than older persisted-XRinvArray design |
| Chain behavior in current implementation | Direct chain_length | Inner repeats + outer chain scaling | Compare runs by effective updates, not only outer iterations |
Computational Complexity
Use the notation:

- `N`: number of records (`nObs`)
- `P`: number of markers (`nMarkers`)
- `b`: nominal block size
- `B = ceil(P/b)`: number of blocks
- `L`: standard (non-block) chain length

Standard BayesC (non-block)
- Per MCMC iteration: `O(PN)` (marker-wise dot products and residual updates over `N` records)
- Total over `L` iterations: `O(LPN)`
JWAS block BayesC (current implementation)
Let block sizes be `s_i` with `sum_i s_i = P`.
Per outer iteration:

- Block RHS construction across all blocks: `O(NP)`
- In-block updates: `O(sum_i s_i^3)` (because `nreps = s_i` and in-block RHS updates are length `s_i`)
- Residual updates on block exit across all blocks: `O(NP)`

So per outer iteration: `O(NP + sum_i s_i^3)`

- With near-uniform blocks (`s_i ≈ b`): `O(NP + P b^2)`

JWAS rescales outer iterations to approximately `m = floor(L/b)`, so the main total cost is:

`O((L/b) * (NP + P b^2)) = O(LP(N/b + b))`
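The accounting above can be written out as a simple operation-count model (illustrative Python; the functions and their unit constants are hypothetical, meant only to mirror the leading terms):

```python
def per_outer_iteration_cost(N, sizes):
    """Leading terms per outer iteration: O(NP) RHS builds, O(sum_i s_i^3)
    in-block work (nreps = s_i), and O(NP) residual updates on block exit."""
    P = sum(sizes)
    return 2 * N * P + sum(s ** 3 for s in sizes)

def total_block_cost(N, P, b, L):
    """Total ~ floor(L/b) outer iterations over near-uniform blocks of size b."""
    sizes = [b] * (P // b) + ([P % b] if P % b else [])
    return (L // b) * per_outer_iteration_cost(N, sizes)
```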
BayesR3 (block strategy and paper fit)
BayesR3 uses the same blocked-update strategy family (block RHS, in-block updates using a block Gram matrix, then a block-exit residual update), but it is a BayesR mixture model rather than BayesC. This changes constants (more mixture-state work per marker), not the core block linear-algebra pattern.
Operation-count view (dense blocked implementation):
- BayesR3 runs a nominal number of inner cycles `n` per block, and the paper recommends `n` equal to the (nominal) block size `b`.
- With `n = b`, the leading operation-count terms match the same block strategy family as JWAS: total work scales like `O(LP(N/b + b))`.
Paper runtime fit (Fig. 5):

- The BayesR3 paper reports an empirical timing model in which processing time per SNP is proportional to `(N + b)/b = N/b + 1`.
- This is a fit to measured runtime for their implementation and hardware, not a formal asymptotic operation-count derivation. It effectively treats the in-block work (the `+b`-type term) as a small constant relative to the `N/b` term in that regime.
Practical differences in complexity interpretation
- JWAS and BayesR3 share the same blocked-update strategy family but are not identical in constants and scheduling details.
- JWAS uses `nreps = current block_size` for each block (including the final short block).
- BayesR3 describes using the nominal block repeat count for all blocks, including the final short block.
Numerical example (N=200,000, P=2,000,000)
Assume `fast_blocks=true`, so JWAS uses `b = floor(sqrt(N)) = 447`. Then:

- `B = ceil(P/b) = ceil(2,000,000/447) = 4,475` blocks
- Standard BayesC total scaling: `O(LPN) = O(L * 2,000,000 * 200,000)`
- JWAS block BayesC main scaling: `O(LP(N/b + b)) = O(L * 2,000,000 * (200,000/447 + 447))`
- BayesR3 paper timing fit: runtime per SNP `∝ (N + b)/b`, so total runtime `∝ L * 2,000,000 * (200,000/447 + 1)`
So the per-`LP` coefficients are:

- Standard BayesC: `200,000`
- JWAS block BayesC operation count: `~894.4` (from `N/b + b`)
- BayesR3 paper fit: `~448.4` (from `N/b + 1`)
This implies:

- JWAS block vs standard: `~224x` lower
- BayesR3 paper fit vs standard: `~446x` lower
- The apparent `~2.0x` gap between `~894` and `~448` is not an apples-to-apples complexity comparison: it is the difference between an operation-count model (`N/b + b`) and an empirical runtime fit (`N/b + 1`).
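The coefficients in this example can be reproduced with a few lines (illustrative Python; same `N`, `P`, `b` as above):

```python
import math

N, P = 200_000, 2_000_000
b = math.isqrt(N)                 # 447, the fast_blocks=true default
opcount = N / b + b               # JWAS block operation-count coefficient
fit = N / b + 1                   # BayesR3 empirical runtime-fit coefficient
print(b, round(opcount, 1), round(fit, 1))   # -> 447 894.4 448.4
print(round(N / opcount), round(N / fit))    # -> 224 446 (speedup factors vs standard)
```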
Detailed Resource Model (Current fast_blocks Path)
Use:

- `N`: records
- `P`: markers
- `b`: nominal block size
- `s_i`: size of block `i`, with `sum_i s_i = P`
- `t`: bytes per floating-point value (`4` for `Float32`, `8` for `Float64`)
Memory formulas
The current block implementation (after removing the persistent `XRinvArray`) stores:

- dense `X`: `N*P` values
- block Gram matrices `XpRinvX`: `sum_i s_i^2` values
- marker summary `xpRinvx`: `P` values
- optional `xRinvArray`: an extra `N*P` values, only for non-unit weights in the non-block marker path
- small reusable block workspaces (`O(b)`) for the RHS and local deltas

Approximate totals:

- Unit weights: `Mem_block_unit ~= t * (N*P + sum_i s_i^2 + P) + O(b*t)`
- Non-unit weights: `Mem_block_nonunit ~= t * (2*N*P + sum_i s_i^2 + P) + O(b*t)`
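The memory formulas translate directly into a small estimator (illustrative Python; the function name is hypothetical and the `O(b*t)` workspace term is dropped as negligible):

```python
def block_memory_bytes(N, P, sizes, t=4, unit_weights=True):
    """Approximate fast_blocks memory: dense X, block Gram matrices XpRinvX,
    the xpRinvx summary, plus an extra N*P copy only for non-unit weights."""
    gram = sum(s * s for s in sizes)        # sum_i s_i^2 values
    x_copies = 1 if unit_weights else 2     # extra xRinvArray for non-unit weights
    return t * (x_copies * N * P + gram + P)
```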
Runtime working set
Per block update, the active high-volume buffers are:

- `X_b` view (`N x s_i`; no data copy)
- `XpRinvX[i]` (`s_i x s_i`)
- block RHS workspace (length `s_i`)
- `yCorr` (length `N`)

So the peak additional block-local workspace is roughly `O(s_i^2 + s_i + N)` values.
I/O and precompute considerations
For in-memory dense `X`, no out-of-core reads are required during MCMC sweeps. The expensive setup term is building `XpRinvX`:

- precompute cost scales approximately as `O(N * sum_i s_i^2)` (`~O(NPb)` under near-uniform blocks)

This setup can dominate startup time for very large `N`, `P`, even when per-iteration sampling is fast.
Worked Large-Scale Example (N=500,000, P=2,000,000)
Assume `fast_blocks=true`, so `b = floor(sqrt(N)) = 707`.

- `B = ceil(P/b) = 2,829`
- `sum_i s_i^2 = 1,413,937,788`
Memory-relevant terms:
| Term | Float32 | Float64 |
|---|---|---|
| X | ~4.00 TB (3.64 TiB) | ~8.00 TB (7.28 TiB) |
| XpRinvX | ~5.66 GB (5.27 GiB) | ~11.31 GB (10.53 GiB) |
| xpRinvx | ~8.0 MB | ~16.0 MB |

So for unit weights, block mode remains dominated by `X`, with `XpRinvX` as the main incremental memory term.
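These table entries can be reproduced with direct arithmetic (illustrative Python; Float32, i.e., 4 bytes per value):

```python
N, P, b = 500_000, 2_000_000, 707
full, last = P // b, P % b           # 2828 full blocks + one short block
B = full + (1 if last else 0)        # number of blocks
gram_vals = full * b * b + last * last
print(B, gram_vals)                  # -> 2829 1413937788
print(round(N * P * 4 / 1e12, 2), "TB X;",
      round(gram_vals * 4 / 1e9, 2), "GB XpRinvX")  # -> 4.0 TB X; 5.66 GB XpRinvX
```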
What To Watch Closely
- `block_size` choice:
  - too small: less speedup (the `N/b` term remains large)
  - too large: `XpRinvX` memory and precompute cost rise (`~P*b` memory and `~NPb` setup trend)
- Effective chain length:
  - the current implementation rescales outer iterations and uses inner repeats (`nreps = block_size`)
  - compare runs by effective updates, not only the outer-iteration count
- Final short block behavior:
  - the last block uses a smaller `nreps` (equal to its own size), so sweep symmetry differs slightly
- Multi-trait block path specifics:
  - the current path is sampler-I style for the unconstrained covariance mode
  - extra temporaries (e.g., trait-wise old-alpha handling) can become noticeable at larger trait counts
- Numerical reproducibility:
  - mathematically equivalent refactors (e.g., in-place BLAS updates) can change floating-point roundoff
  - expect tiny non-bitwise differences, especially in `Float32`
- Weighting mode:
  - the block `XRinvArray` is no longer persisted
  - the separate non-unit-weighted non-block `xRinvArray` materialization remains a distinct issue
- Scope constraints:
  - the current source notes this fast block option is intended for one genotype category
  - a numeric `fast_blocks` should keep `block_size < nMarkers`
Example: Speed/Memory Tradeoff
Assume:

- `nObs = 5_000`
- `nMarkers = 50_000`
- `block_size = 70` (close to `sqrt(nObs)`)
- `nBlocks = 715`
Approximate memory (Float32):
| Item | Approx size |
|---|---|
| Genotype matrix X (nObs x nMarkers) | ~953.7 MiB |
| Extra XpRinvX in block mode | ~13.4 MiB |
Interpretation:
- Block mode can be much faster for large `nObs`, because heavy per-marker operations are shifted to block-level cached operations.
- Block mode uses more memory than non-block mode, mainly from `XpRinvX` (plus small block workspaces).
- Practical speedup is typically below the theoretical arithmetic speedup due to random branching, allocation overhead, and BLAS/runtime effects.
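The two table values can be checked directly (illustrative Python; Float32, approximating all 715 blocks at the nominal size 70):

```python
n_obs, n_markers, bsize, t = 5_000, 50_000, 70, 4    # Float32: 4 bytes/value
n_blocks = -(-n_markers // bsize)                    # ceiling division -> 715
x_mib = n_obs * n_markers * t / 2**20                # dense genotype matrix X
gram_mib = n_blocks * bsize**2 * t / 2**20           # XpRinvX, nominal-size blocks
print(round(x_mib, 1), round(gram_mib, 1))           # -> 953.7 13.4
```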
Practical Guidance
- Start with `fast_blocks=true` for large marker sets when you have enough RAM.
- If memory is tight, set a smaller numeric block size (e.g., `32` or `64`) and benchmark.
- If the speed gain is small, try a few block sizes and choose based on wall time and memory headroom.