Saturday, September 20, 2025

Reproducible Randomness in Bioinformatics (or: why -s 42 keeps saving your day)

If you’ve ever subsampled a giant FASTA only to realize you can’t recreate the exact same subset later, you’ve felt the sting of non-reproducible randomness. Tools like seqtk fix this with a wonderfully simple trick: seeding the random number generator.

Here’s the intuition:

  • Random ≠ unpredictable (for computers). When tools say “random,” they almost always mean pseudorandom—numbers produced by a deterministic algorithm.

  • Same seed → same stream. Initialize the pseudorandom number generator (PRNG) with a specific seed (e.g., -s 42) and you’ll get the same “random” sequence every time.

  • Determinism + streaming algorithms = scalable & reproducible. For tasks like sampling k sequences from a multi-FASTA, reservoir sampling lets you make a single pass over the data (great for huge files) while the seeded PRNG ensures repeatable picks.

Concretely, seqtk sample -s 42 input.fa 100 will always select the same 100 sequences in the same order (given the same input). Change the seed, change the sample—but it’s still perfectly reproducible for that new seed.

Why it matters:

  • Scientific rigor: you (and reviewers) can reproduce figures exactly.

  • Debugging: if something looks odd, re-run with the same seed to isolate variables.

  • Collaboration: share the seed so teammates can replicate your subset.

Below is a tiny Python script that mirrors seqtk sample behavior for FASTA files:

  • If you pass a fraction (e.g., 0.1), it does Bernoulli sampling: keep each record with probability p.

  • If you pass an integer (e.g., 100), it does reservoir sampling for exactly k sequences (or fewer if the file has < k records).

  • The -s/--seed flag pins the PRNG so runs are reproducible.

  • Works on stdin or .gz files, and wraps sequences neatly.


Python: fasta_sample.py (seeded, stream-safe, seqtk-style)

#!/usr/bin/env python3 """ fasta_sample.py — Seeded sampler for multi-FASTA, mirroring `seqtk sample -s`. Modes (mutually exclusive): 1) Fraction p in (0,1]: Bernoulli keep-with-probability p (approx. n*p outputs) 2) Integer k >= 1: Reservoir sample exactly k sequences (or all if <k present) Examples # Bernoulli sampling (≈10% of records), reproducible with seed 7: python fasta_sample.py -i input.fa.gz -s 7 0.10 > out.fa # Sample exactly 100 sequences (without replacement), deterministic with seed 42: python fasta_sample.py -i input.fa -s 42 100 > out.fa # Stream from stdin, no wrapping: zcat input.fa.gz | python fasta_sample.py --wrap 0 -s 11 0.25 > out.fa """ import sys, argparse, gzip, random from typing import Iterator, Tuple, BinaryIO, List # ---------- IO helpers ---------- def open_maybe_gzip(path: str) -> BinaryIO: if path == "-" or not path: return sys.stdin.buffer return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb") def fasta_iter(handle: BinaryIO) -> Iterator[Tuple[str, bytes]]: """Yield (header_str_without_gt, sequence_bytes_unwrapped).""" header = None seq_parts: List[bytes] = [] for line in handle: if not line: break line = line.rstrip(b"\r\n") if not line: continue if line.startswith(b">"): if header is not None: yield (header, b"".join(seq_parts)) header = line[1:].decode("utf-8", errors="replace") seq_parts = [] else: # Strip spaces/tabs just in case; FASTA lines may be wrapped seq_parts.append(line.replace(b" ", b"").replace(b"\t", b"")) if header is not None: yield (header, b"".join(seq_parts)) def wrap_bytes(seq: bytes, width: int = 60) -> Iterator[bytes]: if width is None or width <= 0: yield seq else: for i in range(0, len(seq), width): yield seq[i:i+width] # ---------- Sampling strategies ---------- def bernoulli_sample(handle: BinaryIO, p: float, rng: random.Random) -> List[Tuple[int, str, bytes]]: """ Keep each record with probability p (0<p<=1). Returns list of (idx, header, seq). We record idx to preserve input order deterministically. """ out = [] idx = 0 for header, seq in fasta_iter(handle): if rng.random() < p: out.append((idx, header, seq)) idx += 1 return out def reservoir_sample(handle: BinaryIO, k: int, rng: random.Random) -> List[Tuple[int, str, bytes]]: """ Reservoir sample k records without replacement. Returns list of (idx, header, seq). Uses standard Algorithm R; we store (idx, ...) and sort by idx at the end to preserve input order in the output (seqtk also outputs in input order). """ reservoir: List[Tuple[int, str, bytes]] = [] t = 0 # number seen for header, seq in fasta_iter(handle): if t < k: reservoir.append((t, header, seq)) else: j = rng.randrange(t + 1) # uniform in [0, t] if j < k: reservoir[j] = (t, header, seq) t += 1 # Preserve input order for readability/reproducibility reservoir.sort(key=lambda rec: rec[0]) return reservoir # ---------- CLI ---------- def parse_args(): ap = argparse.ArgumentParser( description="Seeded FASTA sampler (fraction or exact k), similar to `seqtk sample -s`." ) ap.add_argument("amount", help="If in (0,1], interpreted as fraction p. If >=1 (int), interpreted as exact count k.") ap.add_argument("-i", "--input", default="-", help="Input FASTA (.fa/.fasta[.gz]) or '-' for stdin. Default: '-'") ap.add_argument("-s", "--seed", type=int, default=None, help="Random seed for reproducibility (int).") ap.add_argument("--wrap", type=int, default=60, help="Line width for output sequence wrapping (0 = no wrap). Default: 60") return ap.parse_args() def main(): args = parse_args() # Determine mode: fraction or integer k amt_str = args.amount.strip() mode = None k = None p = None try: # First try integer k_val = int(amt_str) if k_val <= 0: sys.exit("Error: integer amount must be >= 1.") mode = "k" k = k_val except ValueError: # Not an int; try float fraction try: p_val = float(amt_str) except ValueError: sys.exit("Error: amount must be an integer (k) or float fraction p in (0,1].") if not (0.0 < p_val <= 1.0): sys.exit("Error: fraction p must be in (0,1].") mode = "p" p = p_val rng = random.Random(args.seed) with open_maybe_gzip(args.input) as fh: if mode == "p": sampled = bernoulli_sample(fh, p, rng) else: sampled = reservoir_sample(fh, k, rng) out = sys.stdout.buffer wrapw = args.wrap if args.wrap is not None else 60 for _, header, seq in sampled: out.write(b">" + header.encode("utf-8") + b"\n") for chunk in wrap_bytes(seq, wrapw): out.write(chunk + b"\n") if __name__ == "__main__": main()

Parity with seqtk sample -s

  • -s/--seed gives deterministic results: same input + same seed → same output.

  • Fraction mode (0<p<=1): independent Bernoulli decisions per record (like seqtk sample input.fa 0.25).

  • Count mode (k>=1): single-pass reservoir sampling (like seqtk sample input.fa 100), no replacement.

  • Streaming: makes one pass; doesn’t load the entire file into memory.

  • Order: outputs in input order for readability (we track indices and sort the final sample, which is reproducible).

Quick checks

# Reproducible: python fasta_sample.py -s 123 0.2 -i big.fa.gz | md5sum python fasta_sample.py -s 123 0.2 -i big.fa.gz | md5sum # same hash # Different seed → different but still reproducible streams: python fasta_sample.py -s 7 100 -i big.fa > a.fa python fasta_sample.py -s 42 100 -i big.fa > b.fa diff a.fa b.fa # likely differences

No comments: