Question: addressing software versions
prasundutta87 wrote (3 months ago):

Hi,

I started using GATK v4.0.0.0 on some of my WGS samples a few days ago. A few samples failed with the error "HaplotypeCaller exception: contig must be non-null and not equal to *, and start must be >= 1". I checked online and found that this bug was fixed in a newer sub-version, so I downloaded GATK v4.0.1.2 and started running it on the failed samples.

Usually, new versions (as opposed to sub-versions) are released when there is a major change in the code or algorithm of a tool. It is common in the bioinformatics community for a new version to be released, bugs to be reported, and new sub-versions to follow within a few days or months, leaving users in a fix.

In my case, the underlying algorithm did not change; only some bugs were fixed. Should I rerun the updated sub-version on the successful samples as well, so that all samples are processed consistently, even though the final output should not change? Or should I just report that GATK 4.0 was used and not mention the sub-version at all? What is the best practice in this case?

genomax wrote (3 months ago):

If you are comparing multiple samples and reporting the findings in a single publication, then preferably all samples should be analyzed with identical version(s) of each software package to prevent any unseen bias. Reproducible research depends on someone else being able to reproduce your reported results, provided they use an identical version of the software. To facilitate that, no information should be considered insignificant. The best policy is to report accurate metadata for all data and for all informatics software and pipelines.
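As an illustration of that kind of bookkeeping, here is a minimal sketch (not part of any GATK tooling) that takes a record of which tool version was used for each sample and flags a mixed-version cohort. The sample names and the function names are hypothetical; the version strings are the two mentioned in the question.

```python
from collections import defaultdict


def group_samples_by_version(sample_versions):
    """Map each tool version to the sorted list of samples processed with it."""
    groups = defaultdict(list)
    for sample, version in sample_versions.items():
        groups[version].append(sample)
    return {version: sorted(samples) for version, samples in groups.items()}


def versions_are_consistent(sample_versions):
    """Return True when every sample was processed with the same version."""
    return len(set(sample_versions.values())) <= 1


# Hypothetical record of which GATK version produced each sample's output.
runs = {
    "sample_A": "4.0.0.0",
    "sample_B": "4.0.0.0",
    "sample_C": "4.0.1.2",
}

print(group_samples_by_version(runs))
print(versions_are_consistent(runs))
```

Keeping such a record alongside the results makes it straightforward to state the exact version(s) in the methods section, whichever way the rerun decision goes.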

prasundutta87 replied (3 months ago):

I agree with that. From a publication point of view, it definitely makes sense. I am just concerned about the time being wasted.
Devon Ryan wrote (3 months ago):

If you look at the release notes for version 4.0.0.0 (it annoys me that they use an extra digit in their versioning), you'll see that aside from this bugfix, they also fixed a bug relating to -mbq being ignored before. If you used that, then I would suggest rerunning the variant calling on all of the samples. If you didn't use that, then presumably the results would be identical (sans the problematic sample). If you want to be sure, run one of the non-problematic samples and compare the results.

For what it's worth, the best practice would be to rerun all of the samples...but the best practice isn't always the most sensible one.
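A minimal sketch of such a comparison, assuming the two runs of the same sample were written to separate VCF/gVCF files: header lines (those starting with "#") are skipped because they typically record the tool version and command line and will differ anyway, and only the records are compared. The function names are hypothetical, and the comparison is exact text matching, so any difference in an annotation value counts as a difference.

```python
import gzip


def read_records(path):
    """Return the non-header lines of a VCF/gVCF file (plain text or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if not line.startswith("#")]


def differing_records(path_old, path_new):
    """Return (records only in the old file, records only in the new file)."""
    old = set(read_records(path_old))
    new = set(read_records(path_new))
    return sorted(old - new), sorted(new - old)
```

Calling `differing_records` on the old and new output for one non-problematic sample and getting two empty lists back would support the expectation that the results are identical.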

prasundutta87 replied (3 months ago):

Luckily I ran with default parameters and had not set -mbq, and I am running the per-sample gVCF generation now. Theoretically the output should not change at all. But again, as genomax suggested, from a publication point of view this change of version will be difficult to sell. The reviewers will also question me if I mention that the sub-versions differed.