We have a lot of bams from data that was sequenced and aligned before we received it. All of the bams are missing the read group (@RG) in the reads even though the read group is defined in the header. We ran
bamaddrg (https://github.com/ekg/bamaddrg) to add the read groups to all reads and it appeared to run fine. The problem is that running Picard's MarkDuplicates now throws the following error on the bam with read groups added:
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 1642900, Read name HS2000-1005_167:8:1103:3541:88508, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:452) at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:643) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488) at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:413) at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183) at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
We do not get the error on the original bam. Another post suggested the option
VALIDATION_STRINGENCY=LENIENT, which allows MarkDuplicates to proceed, but there are a fair number of reads that get this error. I don't know how these are being treated, but for our study we cannot ignore these reads. Perhaps we could MarkDuplicates before adding read groups, but I'm concerned about unknown consequences. Reindexing does not solve the problem. I also posted an issue on the bamaddrg github page. Hopefully the author will be able to make a suggestion.
- Does anyone know why we would get this error after adding read groups? There are no other differences in the reads as far as I can tell.
- The error states that the bin is calculated based on alignment start and end. These values did not change! So why would the calculated bin change?
- I came across one solution to convert the bam > sam > bam so the bins are recalculated, but we have 1000 whole genomes. That seems a bit ridiculous. Any other suggestions?
Thanks in advance! Let me know if I can clarify anything.