Picard MarkDuplicates error after using bamaddrg: "bin field of BAM record does not equal value computed..."
7.4 years ago
Mark Ebbert ▴ 90

Hi,

We have a lot of bams from data that was sequenced and aligned before we received it. All of the bams are missing the read group (@RG) tag on individual reads, even though read groups are defined in the header. We ran bamaddrg (https://github.com/ekg/bamaddrg) to add the read groups to all reads, and it appeared to run fine. The problem is that running Picard's MarkDuplicates on the bam with read groups added now throws the following error:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 1642900, Read name HS2000-1005_167:8:1103:3541:88508, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
        at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:452)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:643)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
        at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514)
        at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488)
        at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:413)
        at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
        at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

We do not get the error on the original bam. Another post suggested the option VALIDATION_STRINGENCY=LENIENT, which allows MarkDuplicates to proceed, but a fair number of reads trigger this error, and I don't know how they are treated; for our study we cannot simply ignore those reads. Perhaps we could run MarkDuplicates before adding read groups, but I'm concerned about unknown consequences. Reindexing does not solve the problem. I have also posted an issue on the bamaddrg GitHub page; hopefully the author will be able to make a suggestion.

Questions:

  1. Does anyone know why we would get this error after adding read groups? There are no other differences in the reads as far as I can tell.
  2. The error states that the bin is calculated based on alignment start and end. These values did not change! So why would the calculated bin change?
  3. I came across one solution to convert the bam > sam > bam so the bins are recalculated, but we have 1000 whole genomes. That seems a bit ridiculous. Any other suggestions?
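For context on question 2: the bin in the error message is the BAI indexing bin from the SAM specification, a pure function of the 0-based alignment start and end. If start and end really are unchanged, the recomputed bin must match, so a mismatch means the writer stored a different value than the spec prescribes. A minimal Python transcription of the spec's reg2bin function (a sketch for illustration, not the htsjdk code):

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin number for a 0-based, half-open interval [beg, end),
    as defined in the SAM specification's BAI indexing scheme."""
    end -= 1  # convert to inclusive end coordinate
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kb bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kb bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mb bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mb bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mb bins
    return 0  # whole-reference bin

# A short read at the start of a reference lands in the finest bin;
# a read spanning a 16 kb boundary falls back to a coarser bin.
print(reg2bin(0, 100))    # -> 4681
print(reg2bin(0, 16385))  # -> 585
```

Since the function depends only on start and end, any tool that rewrites records without touching coordinates should emit identical bins; if it doesn't, the writer's bin calculation is the suspect.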

Thanks in advance! Let me know if I can clarify anything.

Tags: SAM • picard • markduplicates • validation error
7.4 years ago
lomereiter ▴ 470

This might be a bug in the BamTools library, e.g. mishandling some edge case. For whatever reason, BamTools always recalculates the bin number when writing a record to the output file, and that is where the value can change.

Now, regarding your third question: the Picard source tree contains a tool specifically intended for fixing bin numbers, but it isn't included in the distribution, so you would have to compile it yourself.

But I would suggest an alternative for this particular situation: modify Picard slightly. MarkDuplicates doesn't use the bin values at all, so wrong values won't affect the result. Fixing them requires adding a single line in htsjdk/src/java/htsjdk/samtools/BAMFileReader.java, immediately after the line ++this.samRecordIndex;: mNextRecord.setIndexingBin(mNextRecord.computeIndexingBin());

That should do the trick, as it forces all tools to recompute bin number on the fly. I've uploaded the jars built with this patch here.
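What that one-line patch does can be sketched in plain Python (a toy Record class stands in for htsjdk's SAMRecord here; reg2bin follows the SAM spec): as each record is iterated, the stored bin is simply overwritten with one freshly computed from the alignment coordinates.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

def reg2bin(beg: int, end: int) -> int:
    """SAM-spec bin for a 0-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14: return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17: return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20: return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23: return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26: return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0

@dataclass
class Record:    # toy stand-in for htsjdk's SAMRecord
    start: int   # 0-based alignment start
    end: int     # 0-based, half-open alignment end
    bin: int     # bin as stored in the file (possibly stale)

def rebin_on_read(records: Iterable[Record]) -> Iterator[Record]:
    """Mirror the one-line htsjdk patch: overwrite the stored bin with
    the value recomputed from start/end as each record is read."""
    for rec in records:
        rec.bin = reg2bin(rec.start, rec.end)
        yield rec

# A record carrying a stale bin of 0 comes out with the correct bin.
fixed = next(rebin_on_read([Record(start=100, end=200, bin=0)]))
print(fixed.bin)  # -> 4681
```

Because downstream code only ever sees the recomputed value, the bad bins stored by bamaddrg become invisible, at the cost of a trivial amount of work per record.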


Super helpful, thanks lomereiter! We opted for the first option and it worked great.

