Question: Generate genotype specific FASTA sequences from VCF file and reference sequence
gravatar for gtho123
5.8 years ago by
New Zealand
gtho123220 wrote:

I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.

Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.

Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.

I then try to create individual VCF files for each genotype using this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference/sequence/ref.fasta -T SelectVariants  --variant ~/Path/to/complete/vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype

While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.

I then try and use this individual VCF file for each genotype like this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference/sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta

Here the Xs represent the location on the reference sequence of the regions of interest.

I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.

What am I doing wrong?


I will also post this on the GATK forum.



snp sequence gatk • 3.5k views
ADD COMMENTlink modified 5.8 years ago by Len Trigg1.5k • written 5.8 years ago by gtho123220
gravatar for gtho123
5.8 years ago by
New Zealand
gtho123220 wrote:

It turns out that when creating individual VCF files using the SelectVariants tool there is a optional flag -env which needs to be called too. The docs say that it is called to not "include loci found to be non-variant after the subsetting procedure". I find the interpretation of this confusing but it appears to give me the sequences I was trying to get.

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by gtho123220
gravatar for Len Trigg
5.8 years ago by
Len Trigg1.5k
New Zealand
Len Trigg1.5k wrote:

RTG Core includes quite a few simulation tools, one of which is used to replay sample variants from a VCF into a reference sequence. The intention is to allow the generation of the full genome of that sample for subsequent read simulation (so when dealing with diploid chromosomes for example you end up with two copies of those chromosomes), but it'll work for your case. e.g:

rtg format -o ref.sdf ref.fasta

rtg samplereplay --reference ref.sdf --input genotype.vcf \
  --sample genotype --output genotype_genome.sdf

rtg sdfsubseq --input genotype_genome.sdf --fasta \
  chrX:XXXXXXXX-XXXXXXXX >genotype_chrX.fasta



ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Len Trigg1.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2581 users visited in the last hour