Hi,
I am trying to use SnpSift to calculate case vs control statistics. The file I am working with is quite large, and the first time I ran SnpSift on it, it took several days to finish. I am in a bit of a time crunch, and it is unclear whether SnpSift will finish calculating the case vs control groups before I need the data. Looking at the SnpSift documentation, there doesn't appear to be any multi-threading option to speed things up. I realize that SnpSift has to do a calculation for each line of the file, which simply takes time.
However, I was wondering whether I could split the annotated VCF file that I created with SnpEff into smaller files. For example, if my starting annotated VCF is 1 terabyte, I could split it into ten 100-gigabyte files, run SnpSift on each of the 10 files in parallel, and then merge the outputs once they all finish. I admit this is not an ideal situation, but I am not sure what else to do.
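To make the plan concrete, here is roughly what I had in mind, sketched with bcftools. The file names, the chromosome, and the "+++---" case/control group string are all placeholders for my real data:

```bash
# Compress and index the SnpEff-annotated VCF so it can be sliced by region
bgzip -c annotated.ann.vcf > annotated.ann.vcf.gz
tabix -p vcf annotated.ann.vcf.gz

# Pull out one chromosome as its own chunk (repeat for each chromosome)
bcftools view -r chr1 annotated.ann.vcf.gz > chunk.chr1.vcf

# Run the case/control counts on just that chunk; "+++---"
# (3 cases, 3 controls) is a placeholder for my real sample groups
java -jar SnpSift.jar caseControl "+++---" chunk.chr1.vcf > chunk.chr1.cc.vcf
```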
Are there any flaws in this plan, or does anyone have other suggestions? I know there will be some formatting issues that I will have to deal with.
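For the merge step, my rough thinking is that splitting on chromosome boundaries never cuts a record in half, and every chunk shares the same header, so reassembly should reduce to concatenation. A sketch with bcftools concat, using the same placeholder names as above:

```bash
# Compress and index each per-chromosome SnpSift output
for f in chunk.chr*.cc.vcf; do
    bgzip -f "$f" && tabix -p vcf "$f.gz"
done

# Stitch the chunks back together in chromosome order; since every
# chunk came from the same source VCF, the headers are identical and
# a single copy is kept in the merged output
bcftools concat chunk.chr{1..22}.cc.vcf.gz chunk.chrX.cc.vcf.gz \
    -O z -o annotated.cc.vcf.gz
```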
Use a workflow manager, split the VCF per chromosome or region, run each region in parallel, then merge the per-region results.
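If a full workflow manager (Snakemake, Nextflow, etc.) is more than you have time to set up, GNU parallel can express the same per-chromosome fan-out in one command. A minimal sketch, reusing the placeholder file names and group string from your post:

```bash
# One SnpSift caseControl job per chromosome, at most 10 running at a time
parallel -j 10 '
    bcftools view -r {} annotated.ann.vcf.gz > chunk.{}.vcf &&
    java -jar SnpSift.jar caseControl "+++---" chunk.{}.vcf > chunk.{}.cc.vcf
' ::: chr{1..22} chrX chrY
```

Because the case/control counts are computed per record, per-region results are independent, so this parallelization does not change the output, only the wall-clock time.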