Streaming Genotype Backend (BayesC MVP)

This page documents the new opt-in streaming genotype backend for large-data original BayesC in JWAS.

Status and Design

  • Dense loading remains the default and primary path.
    • get_genotypes(...; storage=:dense) (default)
    • Existing dense workflows are unchanged.
  • Streaming loading is additive and opt-in.
    • get_genotypes(...; storage=:stream)
    • Current status: experimental MVP for original BayesC.

Supported Scope (MVP)

  • Single-trait analysis only
  • Method: BayesC only
  • fast_blocks=false only
  • Unit residual weights only
  • double_precision=false only
  • Complete genomic data only (no single-step)
  • Exact genotype/phenotype ID match and order required
  • outputEBV and output_heritability are disabled in MVP streaming mode

Public API

prepare_streaming_genotypes(...)

One-time converter from text genotype file to packed marker-major 2-bit files.

prefix = prepare_streaming_genotypes(
    "genotypes.csv";
    output_prefix = "genotypes_stream",
    separator = ',',
    header = true,
    missing_value = 9.0,
    quality_control = true,
    MAF = 0.01,
    center = true,
    conversion_mode = :lowmem,   # :lowmem | :dense | :auto
    auto_dense_max_bytes = 2^30, # used only when conversion_mode=:auto
    tmpdir = nothing,          # optional temp workspace for conversion spool
    cleanup_temp = true,       # remove temporary conversion files after success
    disk_guard_ratio = 0.9,    # fail fast if required bytes exceed ratio * free bytes
)

Output artifacts with the chosen prefix:

  • prefix.jgb2: packed genotype payload (2-bit marker-major)
  • prefix.meta: metadata manifest
  • prefix.obsid.txt: individual IDs
  • prefix.markerid.txt: marker IDs (after QC)
  • prefix.mean.f32: per-marker means
  • prefix.xpRinvx.f32: per-marker x'x for unit-weight BayesC path
  • prefix.afreq.f32: per-marker allele frequencies

Conversion notes:

  • conversion_mode=:lowmem keeps conversion out-of-core (default).
  • conversion_mode=:dense uses an in-memory dense conversion path (faster for small files, not RAM-safe for large files).
  • conversion_mode=:auto chooses :dense when estimated dense bytes (N*P*4) fit under auto_dense_max_bytes; otherwise uses :lowmem.
  • In low-memory mode, conversion performs a staged write path with temporary row-major spool + transpose.
  • Temporary disk can approach one extra packed payload in low-memory mode; for very large N, P, place tmpdir on a high-capacity filesystem.

How to use conversion_mode

Use one of these patterns:

# 1) Large files (safest): always low-memory conversion
prefix = prepare_streaming_genotypes("genotypes.csv";
                                     conversion_mode=:lowmem,
                                     tmpdir="/scratch/jwas_tmp")

# 2) Small files (fast path): force dense in-memory conversion
prefix = prepare_streaming_genotypes("genotypes.csv";
                                     conversion_mode=:dense)

# 3) Hybrid default: auto-select dense for small jobs, lowmem otherwise
prefix = prepare_streaming_genotypes("genotypes.csv";
                                     conversion_mode=:auto,
                                     auto_dense_max_bytes=2^30) # 1 GiB threshold

Practical rule: start with conversion_mode=:auto; use :lowmem explicitly for very large jobs or constrained RAM environments.

get_genotypes(...; storage=:stream)

Load packed genotype backend without materializing dense N x P genotype matrix.

geno = get_genotypes(
    prefix,            # prefix or .meta/.jgb2 path
    1.0;
    method = "BayesC",
    estimatePi = true,
    storage = :stream,
)

Dense behavior is unchanged:

geno = get_genotypes("genotypes.csv", 1.0; method="BayesC")  # storage=:dense default

End-to-End Example

using JWAS, CSV, DataFrames

phenotypes = CSV.read("phenotypes.csv", DataFrame)

# one-time conversion
prefix = prepare_streaming_genotypes("genotypes.csv";
                                     separator=',',
                                     header=true,
                                     quality_control=true,
                                     center=true)

# streaming load
global geno = get_genotypes(prefix, 1.0;
                            method="BayesC",
                            estimatePi=true,
                            storage=:stream)

model = build_model("trait1 = intercept + geno", 1.0)

out = runMCMC(model, phenotypes;
              chain_length=500,
              burnin=100,
              output_samples_frequency=50,
              output_folder="results_stream",
              outputEBV=false,
              output_heritability=false,
              seed=314,
              memory_guard=:off)

Benchmark Snapshot

All numbers below were measured on:

  • Julia 1.11.7
  • Apple M1 (8 CPU threads available)
  • macOS arm64

1) Correctness: simulated_omics dense vs stream

Dataset:

  • /src/4.Datasets/data/simulated_omics
  • 3534 individuals
  • 1000 SNPs (927 markers after QC)

Same seed/settings, single-trait BayesC:

MetricValue
Marker effects correlation0.999999999978
Marker max abs diff1.10e-6
Marker mean abs diff1.65e-7
Residual variance abs diff4.29e-7

Interpretation:

  • Streaming and dense are numerically equivalent up to floating-point noise.

2) Large synthetic benchmark (N=10,000, P=5,000)

Synthetic CSV genotype file size: ~100 MB.

runMCMC benchmark (chain_length=25, burnin=5):

ModeReal TimeMax RSSPeak footprint
Dense (load + run)24.19s1.396 GB1.352 GB
Stream run (after prepare)20.70s0.931 GB0.727 GB
Stream prepare (one-time)11.99s1.367 GB1.144 GB

Interpretation:

  • Streaming reduced run-time memory significantly in this benchmark.
  • Conversion has one-time cost and memory use.
  • Absolute times are hardware- and IO-dependent.

2b) Conversion-phase benchmark (N=10,000, P=5,000)

Measured with /usr/bin/time -l:

StepReal TimeMax RSS
prepare_streaming_genotypes~11.8s~1.14 GB

Backend sizes:

ArtifactSize
packed payload (.jgb2)~12.5 MB
temporary row-major spool (during conversion)~12.5 MB

Interpretation:

  • Converter RAM stays bounded for this size and no dense matrix is materialized.
  • Conversion needs temporary disk in addition to final payload.

3) Target-scale feasibility estimator (N=500,000, P=2,000,000)

Using estimate_marker_memory(...):

PathEstimated marker-path memory
Dense (storage=:dense, Float32)3.64 TiB
Stream (storage=:stream, Float32)17.29 MiB

Interpretation:

  • Dense original BayesC is generally infeasible at this scale on normal nodes.
  • Streaming is designed to keep marker working memory near O(N+P).

Notes