Question: Create a pseudo-haploid fasta file based genotype calls for an individual (vcf file)
1
gravatar for Jautis
10 months ago by
Jautis280
United States
Jautis280 wrote:

Hi, I have a reference fasta file and a vcf file of SNP variant calls. For each individual in the vcf file, I want to create a new, "pseudo-haploid" fasta file where every base is randomly sampled from one of the individual's alleles.

This seems to be a different problem than GATK's FastaAlternateReferenceMaker which inserts the alternate allele (not caring about individual genotypes or ever retaining the reference allele).

Can anybody offer a tool or some advice? Thanks!

Short example:
    Sequence: ATAAATTCCC (10 bp long)
    VCF:
        POS    REF     ALT      Ind1    Ind2
        2      T       C        0/1     1/1
        5      A       G        0/0     0/1
   Output:
        Ind1: A **T** AAATTCCC     or  A **C** AAATTCCC
        Ind2: A*C*AA **A** TTCCC     or  A*C*AA **G** TTCCC
snp sequence assembly vcf • 415 views
ADD COMMENTlink modified 10 months ago by Len Trigg1.3k • written 10 months ago by Jautis280
0
gravatar for Len Trigg
10 months ago by
Len Trigg1.3k
New Zealand
Len Trigg1.3k wrote:

Fairly close is rtg samplereplay from RTG Tools, which generates a "pseudo-genome" for an individual based on the genotypes in the VCF, e.g.:

# Make test data
$ cat <<EOF >ref.fa
>seq
ATAAATTCCC
EOF
# RTG likes to store sequence data in it's "SDF" format:
$ rtg format -o ref.sdf ref.fa
Formatting FASTA data
Processing "ref.fa"

Input Data
Files              : ref.fa
Format             : FASTA
Type               : DNA
Number of sequences: 1
Total residues     : 10
Minimum length     : 10
Mean length        : 10
Maximum length     : 10

Output Data
SDF-ID             : 9b0da935-33c8-421d-bf83-54fd8a13ce5f
Number of sequences: 1
Total residues     : 10
Minimum length     : 10
Mean length        : 10
Maximum length     : 10

# Make test VCF (contains embedded tabs)
$ cat <<EOF >ind.vcf
##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Ind1    Ind2
seq 2   .   T   C   .   .   .   GT  0/1 1/1
seq 5   .   A   G   .   .   .   GT  0/0 0/1
EOF
# Index the variants VCF
$ rtg bgzip ind.vcf && rtg index ind.vcf.gz
Creating index for: ind.vcf.gz (ind.vcf.gz.tbi)

# Generate the pseudo-genome for Ind1 and show us the FASTA
$ rtg samplereplay -t ref.sdf -i ind.vcf.gz -o Ind1.sdf --sample Ind1 && rtg sdf2fasta -i Ind1.sdf -o -
>seq_0
ATAAATTCCC
>seq_1
ACAAATTCCC

# Generate the pseudo-genome for Ind2 and show us the FASTA
$ rtg samplereplay -t ref.sdf -i ind.vcf.gz -o Ind2.sdf --sample Ind2 && rtg sdf2fasta -i Ind2.sdf -o -
>seq_0
ACAAATTCCC
>seq_1
ACAAGTTCCC

The rtg samplereplay command doesn't randomly sample the alleles, it treats the input VCF as though it were phased and produces the two simulated chromosomes (it's usually used with rtg samplesim which produces a VCF of phased simulated genotypes, and rtg readsim to simulate NGS sequencing of the sample genome). Hope this helps with your use case.

ADD COMMENTlink written 10 months ago by Len Trigg1.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1807 users visited in the last hour