Hello all,
Does anyone have recent information on the Spark versions of GATK4 tools that are still in beta? I have human genomes in which I want to identify variants causing a rare disease, so I am following the best practices for germline SNP calling. If I understood correctly, multithreading in GATK4 requires going through Spark by adding an option such as --spark-master local[2]. Is that correct?
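To make the question concrete, here is a rough sketch of the kind of command I have in mind, using MarkDuplicatesSpark as the example. The file names and thread count are placeholders I made up, and I am assuming --spark-master can be passed directly to the tool in this way. The snippet only prints the command (a dry run) so it can be checked before anything is launched.

```shell
# Hypothetical file names and thread count; adjust to your own data.
INPUT_BAM="sample.bam"
OUTPUT_BAM="sample.markdup.bam"
THREADS=2

# Spark master URL for local mode with a fixed number of threads.
# Note there is no space between "local" and the bracket.
SPARK_MASTER="local[${THREADS}]"

# Assemble the command as a string and print it for review (dry run).
# The single quotes keep the shell from treating [2] as a glob pattern
# when the printed command is pasted into a terminal.
GATK_CMD="gatk MarkDuplicatesSpark -I ${INPUT_BAM} -O ${OUTPUT_BAM} --spark-master '${SPARK_MASTER}'"
echo "$GATK_CMD"
```

If the printed line looks right, it can be copied into the terminal and run as is.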
However, some of the tools needed to follow the best practices are still in beta, for example BQSRPipelineSpark (BETA) and HaplotypeCallerSpark (BETA).
Are they reliable enough to use, or should I stick to the classic versions (and then find another way to parallelize)?
I also saw that MarkDuplicates has been integrated into GATK4 as MarkDuplicatesSpark, which is no longer in beta, so I assume there is no problem using it? Otherwise, I had thought of using samtools markdup. However, in both cases it seems preferable to have the BAMs sorted by queryname, whereas mine are already sorted by coordinate. Under these conditions, which tool is best for marking duplicates? Here is what is written for MarkDuplicatesSpark, for example: "The tool is optimized to run on queryname-grouped alignments (that is, all reads with the same queryname are together in the input file). [With coordinate-sorted input, the tool will] spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances."
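In case it helps anyone answering, a quick way to confirm how a BAM is sorted is to read the SO field on the @HD line of its header. Below is a minimal sketch: sort_order is a helper name I made up, sample.bam is a placeholder, and the demonstration uses an invented header line so that it runs without samtools installed.

```shell
# Print the SO (sort order) value from the @HD line of a SAM header read on stdin.
sort_order() {
    sed -n 's/^@HD.*SO:\([A-Za-z]*\).*/\1/p'
}

# Real usage (requires samtools; sample.bam is a placeholder name):
#   samtools view -H sample.bam | sort_order

# Self-contained demonstration with a made-up header line.
DEMO_ORDER=$(printf '@HD\tVN:1.6\tSO:coordinate\n' | sort_order)
echo "$DEMO_ORDER"
```

A value of "coordinate" would correspond to my current situation, and "queryname" to the input the tool is optimized for.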
Thanks a lot in advance 😀