Hi,
I am trying to use SnpSift to calculate case vs control statistics. The file I am working with is quite large, and the first time I ran SnpSift on it, it took several days to finish. I am in a bit of a time crunch, and it is unclear whether SnpSift will finish calculating the case vs control groups before I need the data. Looking at the SnpSift documentation, there doesn't appear to be any multi-threading option to speed things up. I realize that SnpSift has to do a calculation for each line of the file, which simply takes time.
However, I was wondering whether I could split the annotated VCF file that I created with SnpEff into smaller files. For example, if my starting annotated VCF is 1 terabyte, I could split it into ten 100-gigabyte files, run SnpSift on each of the 10 files in parallel, and then merge the outputs once they all finish. I admit this is not an ideal situation, but I am not sure what else to do.
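To make the plan concrete, here is roughly what I had in mind, sketched with bcftools. The file names, the chromosome, and the "+++---" case/control group string are all placeholders for my real data:

```bash
# Compress and index the SnpEff-annotated VCF so it can be sliced by region
bgzip -c annotated.ann.vcf > annotated.ann.vcf.gz
tabix -p vcf annotated.ann.vcf.gz

# Pull out one chromosome as its own chunk (repeat for each chromosome)
bcftools view -r chr1 annotated.ann.vcf.gz > chunk.chr1.vcf

# Run the case/control counts on just that chunk; "+++---"
# (3 cases, 3 controls) is a placeholder for my real sample groups
java -jar SnpSift.jar caseControl "+++---" chunk.chr1.vcf > chunk.chr1.cc.vcf
```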
Are there any flaws in this plan, or does anyone have other suggestions? I know there will be some formatting issues that I will have to deal with.
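For the merge step, my rough thinking is that splitting on chromosome boundaries never cuts a record in half, and every chunk shares the same header, so reassembly should reduce to concatenation. A sketch with bcftools concat, using the same placeholder names as above:

```bash
# Compress and index each per-chromosome SnpSift output
for f in chunk.chr*.cc.vcf; do
    bgzip -f "$f" && tabix -p vcf "$f.gz"
done

# Stitch the chunks back together in chromosome order; since every
# chunk came from the same source VCF, the headers are identical and
# a single copy is kept in the merged output
bcftools concat chunk.chr{1..22}.cc.vcf.gz chunk.chrX.cc.vcf.gz \
    -O z -o annotated.cc.vcf.gz
```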
Use a workflow manager, split the VCF per chromosome or region, run each region in parallel, then merge the per-region results.
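If a full workflow manager (Snakemake, Nextflow, etc.) is more than you have time to set up, GNU parallel can express the same per-chromosome fan-out in one command. A minimal sketch, reusing the placeholder file names and group string from your post:

```bash
# One SnpSift caseControl job per chromosome, at most 10 running at a time
parallel -j 10 '
    bcftools view -r {} annotated.ann.vcf.gz > chunk.{}.vcf &&
    java -jar SnpSift.jar caseControl "+++---" chunk.{}.vcf > chunk.{}.cc.vcf
' ::: chr{1..22} chrX chrY
```

Because the case/control counts are computed per record, per-region results are independent, so this parallelization does not change the output, only the wall-clock time.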