Dear all,
I'm going to annotate a large number of variants (about 50 million) derived from whole genome sequencing of a given population. For the output, as you know, we can specify "Pick one line or block of consequence data per variant" or "Pick one line or block of consequence data per variant allele", as explained here. Could you let me know which one should be selected? Also, please kindly share any experience or comments on reducing the running time.
Thanks
P.S. Regarding the speed, Emily from Ensembl kindly suggested that I use a buffer size of 5000 and 4 forks, depending on the system. I'm looking for other experiences that you may have gained during your work.
For other people who might offer to help, note that I have already pointed Seta to the page on options for speeding up VEP. We have also talked about what to set the number of forks and the buffer size to. I advised her that the best number of forks is usually 4 and that the buffer size is 5000 by default, but what works best for her depends on the cores, memory and system she has available, so she should do a bit of testing with a smaller file.
Seta: if you have already received help and advice on something, it is generally useful to re-state that here, so that other people do not just give you exactly the same advice.
Run time can be reduced by forking (see manual) and by splitting the VCFs into chunks, running VEP on the chunks in parallel, and re-joining the results afterwards. This obviously requires more computational resources.
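To make the splitting idea a bit more concrete, here is a minimal Python sketch, not a tested pipeline. It cuts a VCF into chunks that each carry a copy of the header, and builds one VEP command line per chunk using the fork and buffer size values mentioned above. The function names and chunk file names are made up for illustration, and `--cache` assumes a local VEP cache is installed; adjust the options to your own setup.

```python
from typing import Iterable, Iterator, List


def split_vcf_lines(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks of VCF lines; every chunk starts with a copy of the header."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    header: List[str] = []   # lines starting with '#', found at the top of a VCF
    records: List[str] = []  # variant records collected for the current chunk
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        records.append(line)
        if len(records) == records_per_chunk:
            yield header + records
            records = []
    if records:  # last, possibly smaller, chunk
        yield header + records


def vep_command(chunk_path: str, out_path: str,
                fork: int = 4, buffer_size: int = 5000) -> List[str]:
    """Build a VEP command line for one chunk (hypothetical helper).

    Assumes a local cache (--cache); 4 forks and a buffer size of 5000
    are only starting points and should be tuned on a small test file.
    """
    return ["vep", "-i", chunk_path, "-o", out_path, "--cache",
            "--fork", str(fork), "--buffer_size", str(buffer_size)]


if __name__ == "__main__":
    example = ["##fileformat=VCFv4.2\n",
               "#CHROM\tPOS\tID\tREF\tALT\n",
               "1\t100\t.\tA\tG\n",
               "1\t200\t.\tC\tT\n",
               "2\t300\t.\tG\tA\n"]
    for i, chunk in enumerate(split_vcf_lines(example, 2)):
        # In a real run each chunk would be written to disk before calling VEP.
        print(len(chunk), " ".join(vep_command(f"chunk_{i}.vcf", f"chunk_{i}.txt")))
```

Each command could then be submitted as a separate job on a cluster or run with something like `subprocess.run`, and the per-chunk outputs concatenated at the end. For 50 million variants, splitting by chromosome or region on a compressed, indexed VCF is probably more practical than reading the whole file in Python.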