2.2 years ago by
New Zealand
Fairly close is rtg samplereplay
from RTG Tools, which generates a "pseudo-genome" for an individual based on the genotypes in the VCF, e.g.:
# Make test data
$ cat <<EOF >ref.fa
>seq
ATAAATTCCC
EOF
# RTG likes to store sequence data in it's "SDF" format:
$ rtg format -o ref.sdf ref.fa
Formatting FASTA data
Processing "ref.fa"
Input Data
Files : ref.fa
Format : FASTA
Type : DNA
Number of sequences: 1
Total residues : 10
Minimum length : 10
Mean length : 10
Maximum length : 10
Output Data
SDF-ID : 9b0da935-33c8-421d-bf83-54fd8a13ce5f
Number of sequences: 1
Total residues : 10
Minimum length : 10
Mean length : 10
Maximum length : 10
# Make test VCF (contains embedded tabs)
$ cat <<EOF >ind.vcf
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ind1 Ind2
seq 2 . T C . . . GT 0/1 1/1
seq 5 . A G . . . GT 0/0 0/1
EOF
# Index the variants VCF
$ rtg bgzip ind.vcf && rtg index ind.vcf.gz
Creating index for: ind.vcf.gz (ind.vcf.gz.tbi)
# Generate the pseudo-genome for Ind1 and show us the FASTA
$ rtg samplereplay -t ref.sdf -i ind.vcf.gz -o Ind1.sdf --sample Ind1 && rtg sdf2fasta -i Ind1.sdf -o -
>seq_0
ATAAATTCCC
>seq_1
ACAAATTCCC
# Generate the pseudo-genome for Ind2 and show us the FASTA
$ rtg samplereplay -t ref.sdf -i ind.vcf.gz -o Ind2.sdf --sample Ind2 && rtg sdf2fasta -i Ind2.sdf -o -
>seq_0
ACAAATTCCC
>seq_1
ACAAGTTCCC
The rtg samplereplay
command doesn't randomly sample the alleles, it treats the input VCF as though it were phased and produces the two simulated chromosomes (it's usually used with rtg samplesim
which produces a VCF of phased simulated genotypes, and rtg readsim
to simulate NGS sequencing of the sample genome). Hope this helps with your use case.