If you’ve ever subsampled a giant FASTA only to realize you can’t recreate the exact same subset later, you’ve felt the sting of non-reproducible randomness. Tools like seqtk fix this with a wonderfully simple trick: seeding the random number generator.
Here’s the intuition:
- Random ≠ unpredictable (for computers). When tools say “random,” they almost always mean pseudorandom: numbers produced by a deterministic algorithm.
- Same seed → same stream. Initialize the pseudorandom number generator (PRNG) with a specific seed (e.g., `-s 42`) and you’ll get the same “random” sequence every time.
- Determinism + streaming algorithms = scalable & reproducible. For tasks like sampling `k` sequences from a multi-FASTA, reservoir sampling lets you make a single pass over the data (great for huge files) while the seeded PRNG ensures repeatable picks.
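The three points above can be checked in a few lines. This is a minimal illustration using Python's standard `random` module, not seqtk's own generator, so only the principle carries over. The function name `reservoir_sample` and the `seq0`, `seq1`, ... IDs are made up for the example.

```python
import random


def reservoir_sample(items, k, seed):
    """Pick up to k items uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)          # private PRNG, pinned by the seed
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)     # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)      # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item    # item i replaces a slot with probability k/(i+1)
    return reservoir


# Same seed, same stream of "random" numbers.
a = random.Random(42)
b = random.Random(42)
assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]

# Same seed and same input give the same sample, even from a one-shot iterator.
ids = [f"seq{i}" for i in range(1000)]
first = reservoir_sample(iter(ids), 10, seed=42)
second = reservoir_sample(iter(ids), 10, seed=42)
assert first == second
```

Only the `k` reservoir slots are held in memory, which is why the approach scales to files far larger than RAM.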
Concretely, `seqtk sample -s 42 input.fa 100` will always select the same 100 sequences in the same order (given the same input and the same seqtk version). Change the seed and the sample changes, but it’s still perfectly reproducible for that new seed.
Why it matters:
- Scientific rigor: you (and reviewers) can reproduce figures exactly.
- Debugging: if something looks odd, re-run with the same seed to isolate variables.
- Collaboration: share the seed so teammates can replicate your subset.
Below is a tiny Python script that mirrors `seqtk sample` behavior for FASTA files:
- If you pass a fraction (e.g., `0.1`), it does Bernoulli sampling: keep each record with probability `p`.
- If you pass an integer (e.g., `100`), it does reservoir sampling for exactly `k` sequences (or fewer if the file has fewer than `k` records).
- The `-s/--seed` flag pins the PRNG so runs are reproducible.
- Works on stdin or `.gz` files, and wraps sequences neatly.
Python: `fasta_sample.py` (seeded, stream-safe, seqtk-style)
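What follows is a minimal sketch of such a script, not a definitive implementation. Several things are assumptions of this sketch: the helper names (`read_fasta`, `sample_fraction`, `sample_count`, `write_fasta`, `open_input`, `parse_amount`), the `-w/--width` flag, the default seed of 11, and the rule that an integer-looking argument means count mode while anything else is parsed as a fraction. It uses Python's `random.Random`, so it mirrors the behavior of `seqtk sample` rather than its exact picks. The same seed should not be expected to select the same records in both tools.

```python
#!/usr/bin/env python3
"""fasta_sample.py: seeded FASTA subsampling (illustrative sketch)."""
import argparse
import gzip
import random
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_fraction(records, p, rng):
    """Bernoulli sampling: keep each record independently with probability p."""
    return (rec for rec in records if rng.random() < p)


def sample_count(records, k, rng):
    """Reservoir sampling (Algorithm R): up to k records, in input order."""
    reservoir = []  # (input_index, record) pairs
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append((i, rec))
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                reservoir[j] = (i, rec)
    reservoir.sort(key=lambda pair: pair[0])  # restore input order
    return [rec for _, rec in reservoir]


def write_fasta(records, out, width=60):
    """Write records with sequences wrapped at `width` characters per line."""
    for header, seq in records:
        out.write(f">{header}\n")
        for start in range(0, len(seq), width):
            out.write(seq[start:start + width] + "\n")


def open_input(path):
    """Open '-' as stdin, '*.gz' via gzip in text mode, anything else as text."""
    if path == "-":
        return sys.stdin
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def parse_amount(text):
    """Return ('count', k) for an integer string, else ('fraction', p)."""
    try:
        k = int(text)
    except ValueError:
        p = float(text)  # raises ValueError on non-numeric input
        if not 0 < p <= 1:
            raise ValueError(f"fraction must satisfy 0 < p <= 1, got {text!r}")
        return "fraction", p
    if k < 1:
        raise ValueError(f"count must be >= 1, got {text!r}")
    return "count", k


def main(argv=None):
    ap = argparse.ArgumentParser(description="Seeded FASTA subsampling (sketch)")
    ap.add_argument("fasta", help="FASTA file, .gz file, or '-' for stdin")
    ap.add_argument("amount", help="fraction 0<p<=1 (e.g. 0.1) or count k>=1 (e.g. 100)")
    ap.add_argument("-s", "--seed", type=int, default=11, help="PRNG seed (assumed default)")
    ap.add_argument("-w", "--width", type=int, default=60, help="sequence line width")
    args = ap.parse_args(argv)
    try:
        mode, value = parse_amount(args.amount)
    except ValueError as err:
        ap.error(str(err))
    rng = random.Random(args.seed)  # private generator, pinned by the seed
    with open_input(args.fasta) as handle:
        sampler = sample_count if mode == "count" else sample_fraction
        write_fasta(sampler(read_fasta(handle), value, rng), sys.stdout, args.width)


# Run the CLI only when arguments are given, so pasting the file into a
# plain REPL does not trigger argparse.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

With this sketch, `python fasta_sample.py -s 42 input.fa.gz 100 > subset.fa` keeps 100 records, and `gzip -dc input.fa.gz | python fasta_sample.py -s 42 - 0.1` keeps roughly 10% of a stream.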
Parity with `seqtk sample -s`
- `-s/--seed` gives deterministic results: same input + same seed → same output.
- Fraction mode (`0<p<=1`): independent Bernoulli decisions per record (like `seqtk sample input.fa 0.25`).
- Count mode (`k>=1`): single-pass reservoir sampling (like `seqtk sample input.fa 100`), no replacement.
- Streaming: makes one pass; doesn’t load the entire file into memory.
- Order: outputs in input order for readability (we track indices and sort the final sample, which is reproducible).
Quick checks
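One way to sanity-check the seeding claims without any input file is to sample record indices instead of records. This is a small sketch under that simplification. The helper `bernoulli_pick` is hypothetical and stands in for fraction mode.

```python
import random


def bernoulli_pick(n, p, seed):
    """Indices kept when each of n records is kept with probability p."""
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() < p]


same_a = bernoulli_pick(10_000, 0.25, seed=42)
same_b = bernoulli_pick(10_000, 0.25, seed=42)
other = bernoulli_pick(10_000, 0.25, seed=7)

assert same_a == same_b          # same seed: identical picks
assert same_a != other           # different seed: a different subset
assert same_a == sorted(same_a)  # picks come out in input order

# The kept fraction is close to p but not exactly p; the count varies by seed.
print(len(same_a) / 10_000)
```

Fraction mode therefore gives an approximate sample size, while count mode gives an exact one.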