Question

Picard MarkDuplicates error after using bamaddrg: "bin field of BAM record does not equal value computed..."

1

Entering edit mode

9.7 years ago

Mark Ebbert ▴ 90

Hi,

We have a lot of bams from data that was sequenced and aligned before we received it. All of the bams are missing the read group (@RG) in the reads even though the read group is defined in the header. We ran bamaddrg (https://github.com/ekg/bamaddrg) to add the read groups to all reads and it appeared to run fine. The problem is that running Picard's MarkDuplicates now throws the following error on the bam with read groups added:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 1642900, Read name HS2000-1005_167:8:1103:3541:88508, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
        at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:452)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:643)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
        at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514)
        at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488)
        at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:413)
        at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
        at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

We do not get the error on the original bam. Another post suggested the option VALIDATION_STRINGENCY=LENIENT, which allows MarkDuplicates to proceed, but there are a fair number of reads that get this error. I don't know how these are being treated, but for our study we cannot ignore these reads. Perhaps we could MarkDuplicates before adding read groups, but I'm concerned about unknown consequences. Reindexing does not solve the problem. I also posted an issue on the bamaddrg github page. Hopefully the author will be able to make a suggestion.

Questions:

Does anyone know why we would get this error after adding read groups? There are no other differences in the reads as far as I can tell.
The error states that the bin is calculated based on alignment start and end. These values did not change! So why would the calculated bin change?
I came across one solution to convert the bam > sam > bam so the bins are recalculated, but we have 1000 whole genomes. That seems a bit ridiculous. Any other suggestions?

Thanks in advance! Let me know if I can clarify anything.

picard markduplicates SAM • 5.4k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 9.7 years ago by Mark Ebbert ▴ 90

0

Entering edit mode

Dear Colleagues,

I am new. I had a similar problem when I ran with Picard.

Some rows showed SAM validation errors. I think that is because I deleted rRNA in the gff file. But I am not sure. I still get the read count at the end of the pipeline. The result looks quite similar between biological replicates. If you have any ideas, please let me know. I want to make sure that is still fine, and there is no bias if I just ignore this warning.

Thank you very much

INFO    2022-12-02 11:59:05 MarkDuplicates  Tracking 183586 as yet unmatched pairs. 73098 records in RAM.
Ignoring SAM validation error: ERROR: Record 30996032, Read name A00709:495:HCHT3DSX5:2:2562:9372:35446, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
INFO    2022-12-02 11:59:16 MarkDuplicates  Read    31,000,000 records.  Elapsed time: 00:01:10s.  Time for last 1,000,000:   11s.  Last read position: PEKT02000007_C_auris_B8441:1,608,310
INFO    2022-12-02 11:59:16 MarkDuplicates  Tracking 168844 as yet unmatched pairs. 54936 records in RAM.

At the end: I have this result:

[MAIN] Done (54627293 reads mapped (96.34%), 2076605 reads not mapped (94236 discarded), 56609662 lines written)(elapsed: 674.934082s)

ADD REPLY • link updated 16 months ago by Ram 43k • written 16 months ago by Trinh • 0

Ram · Accepted Answer · 2014-08-21

This might be a bug in BamTools library, e.g. treating some edge case incorrectly. For some reason, it always calculates bin number upon writing a record to the output file, and here is the place where it may change.

Now regarding your 3rd question. Picard source code contains a tool specifically intended for fixing bin numbers but it's not included in the distribution, you have to compile it yourself.

But I would suggest an alternative for this particular situation: modify Picard tools a little. MarkDuplicates doesn't use bin values at all, so wrong values won't affect the result. Fixing them requires adding a single line in htsjdk/src/java/htsjdk/samtools/BAMFileReader.java after the line ++this.samRecordIndex;: mNextRecord.setIndexingBin(mNextRecord.computeIndexingBin());

That should do the trick, as it forces all tools to recompute bin number on the fly. I've uploaded the jars built with this patch here.