Question: Does vcftools or GATK have an option to skip indels when making alternate sequence?
0
gravatar for severalorks
2.8 years ago by
severalorks90
severalorks90 wrote:

In my previous question, I asked: I would like to take a vcf file and a reference genome from the 1000Genomes project, and obtain a fasta file that lists the genomes for each individual in the vcf, according to the SNPs each individual has in the vcf file. Answers showed that GATK (FastaAlternativeReferenceMarker) and vcftools (vcf-consensus, http://www.1000genomes.org/faq/are-there-any-fasta-files-containing-1000-genomes-variants-or-haplotypes/) were able to do something similar to this. However, I'd like to skip indels when creating these sequences. Do existing tools have options to do this?

ADD COMMENTlink modified 2.8 years ago by Matt Shirley9.0k • written 2.8 years ago by severalorks90
1
gravatar for Matt Shirley
2.8 years ago by
Matt Shirley9.0k
Cambridge, MA
Matt Shirley9.0k wrote:

You might try just removing INDELs from your VCF file before passing to these tools, but you could also consider looking at the FastaVariant class in pyfaidx. See A: Make fasta file from SNPs in two vcf files for an example.

EDIT: The FastaVariant approach is currently too slow for whole chromosome access.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Matt Shirley9.0k

I've tried parsing through the vcf files using python to remove indels but I've run it for days and it hasn't finished, and I'm not sure if it's a bug in my script or because there's 84.4 million SNPs to go through. I've looked at FastaVariant; does it create an alternate sequence like FastaAlternateReferenceMaker in GATK?

ADD REPLYlink written 2.8 years ago by severalorks90
1

Yes, it creates a variant-substituted FASTA sequence that you can work with using the Fasta interface. It's only designed to handle SNPs, so in fact it does ignore INDELs (I thought this was a design flaw and was thinking about correcting it but maybe not...).

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k

I ran the script and created a FastaVariant object. It works well but it outputs 1 alternative genome. I was wondering if I can use it to get an alternative sequence for each individual (so if I want the genomes of 100 individuals, including HG00097, from the vcf file, can pyfaidx do this?)

ADD REPLYlink written 2.8 years ago by severalorks90
1

I think this script would do the trick:

I just realized I hadn't documented the use of the sample argument, so thanks for pointing this out!

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k

Thank you! I will try it

ADD REPLYlink written 2.8 years ago by severalorks90
1

I got this error:

Traceback (most recent call last):
  File "pyfaidx_test.py", line 11, in <module>
    sample_fasta.write(record.seq)
AttributeError: 'FastaRecord' object has no attribute 'seq'

What does it mean?

ADD REPLYlink written 2.8 years ago by severalorks90
1

Oops. I didn't test before submitting. I just edited the Gist, so please try again.

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k

What's the estimated run time for one individual? I've ran it for around 30 min and it hasn't given output yet for the first individual, HG00096. It's created the file but it's empty so far. I can continue running it for several more hours to see what happens, though I was wondering if I might be doing something if it's supposed to be faster.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by severalorks90

Also, just to understand how the code works better, does this give the same DNA sequence for each individual as what you wrote above? It hasn't finished running yet so I don't know:

from pyfaidx import FastaVariant
import vcf

samples = vcf.Reader(open('calls.vcf.gz', 'r')).samples
for sample in samples:
  with FastaVariant('reference.fa', 'calls.vcf.gz', sample=sample, het=True, hom=True) as consensus:
    with open(sample + '.fasta', 'w') as sample_fasta:
      sample_fasta.write('>' + sample)
      sample_fasta.write(str(consensus['20']))
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by severalorks90
1

That should give you chromosome 20 for each of the individuals, yes.

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k
1

Thanks! It took 33 seconds for str(consensus['20'][0:100000]) to finish running, so I estimate it'll take 6 hours for str(consensus['20']) to finish running.

ADD REPLYlink written 2.8 years ago by severalorks90

LOL. I'm glad the code actually works, and granted the module wasn't purpose built for this, but it's still a bit embarrassing...

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k
1

I hadn't ever tested fetching entire (large) chromosomes using FastaVariant, and it turns out that I really designed the class to support efficient random access of smaller subsequences. I'm testing your use case right now and you're correct - it's excruciatingly slow. The reason for this is that pysam is used to query every variant position on every chromosome for every individual. It will be much better to use a tool that is designed for this purpose. You might take a look at the recommendation from the 1000 Genomes team: http://www.1000genomes.org/faq/are-there-any-fasta-files-containing-1000-genomes-variants-or-haplotypes/

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k
1

I've tried the option for 1000genomes, along with many other options (at least 6, spending hours on each option) but the vcf-subset doesn't work for me as it gives me 'Broken VCF header- no columns names?', and Data Slicer crashes (it says page taking too long to respond) when I try to use it to get an entire chromosome for 1 individual. So I think pyfaidx is the best option i have right now, until I figure out why vcf-subset doesn't work.

ADD REPLYlink written 2.8 years ago by severalorks90

Actually 100000 takes 30 seconds, while 500000 takes 7 minutes, so 6 hrs is far off. I've tracked how much the time increases for each increment i of 10000 for consensus[0:i]. It might take a day to run, I will see.

ADD REPLYlink written 2.8 years ago by severalorks90
1

It's going to depend on SNP density.

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k
1

Maybe it would be faster if I was to take chunks of it, says str(consensus[0:10000]) + str(consensus[10000:20000]), and add them together. That way it sort of 'parallelizes' it.

EDIT: len(str(consensus['20'][100000:110000])) takes around 12 seconds len(str(consensus['20'][200000:210000])) takes around 12 seconds.

So maybe it could work? I'll test it out.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by severalorks90
1

Awesome!! I just ran it in parallel on the sun grid engine from 0 to a million and it all finished in less than 2 minutes!!

I ran the whole thing for 1 individual... finished in around 3-5 minutes. There were around 600 jobs. Next time I can use only 60 jobs and it'll probably finish in 30 min, and 6 jobs (10 mil each) for a couple of hours.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by severalorks90
1

Yes, the access times will be linear with the size of the query, independent of the position of the query in the genome. Glad you figured out a way to parallelize your job!

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k

In the phased vcf file it gives the information for each diploid chromosome for each individual. For example, for column HG00096, it has 0|0, or 0|1, etc., where 1 indicates the chromosome has the alternative, while 0 means it has the reference SNP. Does the pyfaidx FastaVariant object only create it for 1 of the 2 chromosomes in a set? Is it possible to use it to get both chromosomes in a set (say, both chromosome 1s?)

I guess it may have something to do with line 790 in the __init__.py script?

if sample.gt_type in self.gt_type and eval(self.filter):

gt_type depends on 'het' and 'hom', which seem to have something to do with heterozygous/homozygous? I'm not sure how to interpret them.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by severalorks90
1

I think what you're asking for is full respect of phasing information, and currently pyfaidx ignores phasing completely. You only get one sequence out, and it's SNP-substituted using either hetero/homozygous combination of SNPs. Really you seem to want haplotypes output separately, but unless your entire call set is completely phased you won't get "just" 2 versions of every chromosome. In fact, for unphased calls you'll get an exponential growth in the number of possible chromosomes. That's why I ignore phrasing currently...

ADD REPLYlink written 2.8 years ago by Matt Shirley9.0k

In this thread: VCF to FASTA

You linked to a program called vcf2fasta.cpp. Would that do what I need?

EDIT: I wrote a simple python script that gets phased fasta from vcf for 1 individual, and it takes 5 min to run, so if I run it in parallel for all individuals I should get fastas

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by severalorks90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 644 users visited in the last hour