Streaming Genotype Workflow: A Conceptual Walkthrough
This page walks through the streaming genotype backend from start to finish, without code execution. For API details and constraints, see Streaming Genotype Backend and Handling Large Genotype Data.
1. The Problem
At very large scale (e.g., N=500,000 individuals × P=2,000,000 markers), a dense genotype matrix in memory requires terabytes of RAM (~4 TB for Float32). The streaming backend avoids ever materializing that matrix by keeping genotypes on disk in a compact format and decoding only what is needed, one marker at a time.
2. Input: Text Genotype File
You start with a text genotype file (e.g., CSV):
- First column: individual IDs
- Remaining columns: marker genotypes (0, 1, 2)
- Missing values: coded as 9.0 (or your chosen value)
Example format:
ID,m1,m2,m3,...,m100
ind_001,0,1,2,...,1
ind_002,1,9,0,...,2 ← 9 = missing
...3. One-Time Conversion: prepare_streaming_genotypes
This step reads the text file and writes a packed backend. It runs once, before any MCMC.
What it does:
- Choose conversion path via
conversion_mode::lowmem(default): out-of-core conversion:dense: in-memory conversion for smaller files:auto: chooses dense when estimatedN*P*4 <= auto_dense_max_bytes, else lowmem
- Scan/process genotypes and compute marker statistics.
- Quality control: replace missing with column means; drop fixed loci and markers below the MAF threshold.
- Center each marker column (optional).
- Precompute per-marker x′x (for unit-weight BayesC).
- Pack output:
- in low-memory mode: write temporary row-major spool, then transpose to marker-major
- in dense mode: pack directly from in-memory matrix
For large data, prefer :lowmem/:auto with appropriate tmpdir.
Output files:
prefix.jgb2— binary file containing 2-bit packed genotypes (0, 1, 2, missing) in marker-major layout; four individuals per byteprefix.meta— manifest (nObs, nMarkers, paths, etc.)prefix.obsid.txt,prefix.markerid.txt— individual and marker IDsprefix.mean.f32,prefix.xpRinvx.f32,prefix.afreq.f32— precomputed values
After QC, markers are stored in column order. For 50 individuals, each marker occupies ~13 bytes (4 bits per individual, rounded up).
4. Loading: get_genotypes(prefix; storage=:stream)
Instead of loading a dense N×P matrix, this step:
- Reads the manifest (the
prefix.metafile: a small text file listing paths to the packed data and metadata such as nObs, nMarkers, stride_bytes, centered) and validates file sizes. - Opens the
.jgb2file for reading. - Loads metadata into memory: means, xpRinvx, allele frequencies, IDs.
- Creates a
Genotypesstruct with:genotypes= empty matrix (no dense X)stream_backend= handle to the packed file and metadatastorage_mode=:stream
5. MCMC: Per Iteration, Per Marker
For each MCMC iteration, BayesC updates markers one at a time.
For each marker j:
Decode marker j into a temporary buffer:
- Seek to the marker's byte offset in the packed file.
- Read one packed row.
- Decode 2-bit codes → 0, 1, 2, or mean-imputed value.
- Apply centering if the backend was built with
center=true. - Write into a length-N buffer.
Update Gibbs:
- Use the same formulas as dense mode.
- Use
dot(buffer, yCorr)for x′y. - Use precomputed
xpRinvx[j]for x′x. - Sample δ, β, α.
- Update
yCorrwithyCorr += (old_α - new_α) * buffer.
Discard the buffer; it is reused for the next marker.
So the working set is O(N) per marker (one buffer plus yCorr, α, β, δ), not O(N×P).
6. Initial yCorr
The full computation yCorr = y - X*sol - X*α is done once, before the MCMC loop (using starting values for sol and α). After that, yCorr is maintained by incremental updates: each iteration adds/subtracts X*sol for fixed-effect updates, and each marker update applies yCorr += (old_α - new_α) * x_j in place.
Code: src/1.JWAS/src/MCMC/MCMC_BayesianAlphabet.jl, lines 108–124 (inside MCMC_BayesianAlphabet, before the iteration loop).
Dense: yCorr = y - X*sol - X*α (matrix multiply).
Streaming: streaming_mul_alpha! replaces X*α:
- Loop over markers j.
- For each j with α[j] ≠ 0: decode marker j into a buffer, add
α[j] * bufferinto the output vector. - No dense X is ever formed.
7. MVP Constraints
- Single-trait analysis only
- Method: BayesC only
fast_blocks=falseonly- Unit residual weights only
- Float32 only (
double_precision=false) - Complete genomic data only (no single-step)
- Phenotype IDs must match genotype IDs exactly and in the same order (no alignment)
outputEBVandoutput_heritabilityare disabled
8. End-to-End Flow
Text CSV → prepare_streaming_genotypes() → .jgb2 + metadata
↓
get_genotypes(prefix; storage=:stream)
↓
Genotypes struct (no dense X)
↓
runMCMC()
↓
Per iteration:
Per marker j:
decode_marker!(buffer, backend, j) → buffer of length N
bayesabc_update_marker!(buffer, yCorr, α, β, δ, ...)9. Memory Comparison
| Component | Dense (N×P) | Streaming |
|---|---|---|
| Genotype matrix | N×P×4 bytes | 0 |
| Per-marker buffer | 0 | N×4 bytes |
| xpRinvx | P×4 bytes | P×4 bytes |
| Total | ~4×N×P | ~O(N + P) |
At large N and P, streaming keeps memory roughly constant in N×P, at the cost of decode and I/O per marker per iteration.