Question: Mate-Fixing tools outputting identical bams for different samples?
5.7 years ago by
United States
rishi.z.sinha10 wrote:

Hello! I'm trying to analyze some RNA-Seq data, on which I'm eventually hoping to run a differential expression analysis.

Thus far, I've done everything up to alignment with TopHat, plus some primary filtering, but when I tried to run MarkDuplicates from Picard, I got this error:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 56030473, Read name HISEQ:157:H9UNAADXX:1:1108:17721:54212, Mate Alignment start should be 0 because reference name = *.
    at htsjdk.samtools.SAMUtils.processValidationErrors(
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(
    at htsjdk.samtools.BAMFileReader$
    at htsjdk.samtools.BAMFileReader$
    at htsjdk.samtools.SamReader$
    at htsjdk.samtools.SamReader$
    at picard.sam.MarkDuplicates.buildSortedReadEndLists(
    at picard.sam.MarkDuplicates.doWork(
    at picard.cmdline.CommandLineProgram.instanceMain(
    at picard.sam.MarkDuplicates.main(

To fix that, I tried both samtools fixmate and Picard's FixMateInformation on my samples and then ran MarkDuplicates. It no longer errors, but it's outputting identical BAM files for almost all of my samples. Some are replicates of a sample, but some are also knockouts of a control, so this definitely shouldn't be happening.

Does anyone have any idea why this might be happening, and/or what I can try to fix this? The files post-alignment look absolutely fine in IGV Browser, and are not identical prior to mate-fixing and mark duplicates.
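
One quick way to confirm whether output files really are byte-identical (rather than just similar in size or appearance) is to compare checksums. The following is a minimal sketch, not part of the original workflow; the function names and the `*.markdup.bam` file pattern are hypothetical placeholders.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Group paths by content digest; return only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical naming pattern; adjust to however the outputs are named.
    bam_files = sorted(Path(".").glob("*.markdup.bam"))
    for group in group_identical(bam_files):
        print("Identical content:", ", ".join(group))
```

Any group printed here contains files with exactly the same bytes, which would suggest looking at how the input and output paths are passed to each tool for each sample.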

modified 5.7 years ago by Devon Ryan94k • written 5.7 years ago by rishi.z.sinha10
5.7 years ago by
Freiburg, Germany
Devon Ryan94k wrote:

Firstly, you don't need or (typically) want to mark duplicates with RNAseq data if you plan to look at differential expression. Secondly, you'll find that Picard is often a bit excessive when it comes to standards conformance (e.g., the error you're getting is being raised for a SAM file that is actually correct). So, try using VALIDATION_STRINGENCY=LENIENT.
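
As a rough illustration of where that option goes, here is a small sketch that assembles a MarkDuplicates command line using Picard's legacy KEY=VALUE argument style. The jar location and the sample file names are assumptions for the example, and the helper function is hypothetical; only the command is built and printed, nothing is executed.

```python
import shlex

# Picard's accepted validation stringency levels.
VALID_STRINGENCIES = {"STRICT", "LENIENT", "SILENT"}


def mark_duplicates_cmd(in_bam, out_bam, metrics_file,
                        picard_jar="picard.jar", stringency="LENIENT"):
    """Build (but do not run) a Picard MarkDuplicates command as an argument list."""
    if stringency not in VALID_STRINGENCIES:
        raise ValueError(f"unknown validation stringency: {stringency}")
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        f"METRICS_FILE={metrics_file}",
        f"VALIDATION_STRINGENCY={stringency}",
    ]


# Hypothetical file names; each sample gets its own output and metrics path.
cmd = mark_duplicates_cmd("sample1.bam", "sample1.markdup.bam", "sample1.metrics.txt")
print(shlex.join(cmd))
```

With LENIENT, validation problems are reported as warnings instead of stopping the run, whereas SILENT suppresses them entirely.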


Oh ok, it did run successfully after that. Thanks!

Also, would you mind explaining why MarkDuplicates wouldn't be necessary/desired for DESeq? I'm still new to learning it, so sorry if that's a basic question...

written 5.7 years ago by rishi.z.sinha10

Any highly expressed gene will produce many reads that share the same coordinates purely by chance. These look like PCR duplicates but are false positives; marking and excluding them artificially deflates counts and compromises the results. The only common situations wherein marking duplicates is useful are SNP/variant calling and non-targeted bisulfite sequencing.
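
The saturation effect can be illustrated with a toy simulation. This is a deliberately simplified sketch: it assumes single-end reads whose start positions are sampled uniformly along one transcript, no PCR duplication at all, and a hypothetical transcript length of 1000 bases. Real duplicate marking on paired-end data considers both mate positions, which enlarges the space of distinct coordinates but does not remove the effect.

```python
import random


def unique_start_positions(n_fragments, gene_length, seed=0):
    """Count distinct start positions among n_fragments reads sampled
    uniformly from gene_length possible start coordinates."""
    rng = random.Random(seed)
    starts = [rng.randrange(gene_length) for _ in range(n_fragments)]
    return len(set(starts))


GENE_LENGTH = 1000  # hypothetical transcript length in bases

# Every simulated read is an independent fragment, so anything removed by
# coordinate-based deduplication here is a false-positive "duplicate".
for n_fragments in (50, 5000, 50000):
    kept = unique_start_positions(n_fragments, GENE_LENGTH)
    print(f"true fragments={n_fragments:>6}  kept after dedup={kept:>5}  "
          f"fraction kept={kept / n_fragments:.3f}")
```

The number of reads kept can never exceed the number of possible start positions, so the fraction kept falls as expression rises, and differences between highly expressed genes are compressed.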

written 5.7 years ago by Devon Ryan94k

+1 for recommending LENIENT over SILENT :)

written 4.1 years ago by John12k