picard MarkDuplicates READ_PAIRS_EXAMINED much lower than the number of read pairs successfully mapped by STAR
7 weeks ago by maria • 0

Hi,

I aligned my reads to the human genome using STAR. For multimappers, I kept the default of <10 loci.

I then chose to remove duplicate read pairs using picard, but the "READ_PAIRS_EXAMINED" values for all of my samples are really low. I do not understand why picard does not even consider such a large number of read pairs for deduplication. For example, for one sample I have a total of 131,114,907 pairs that are either uniquely mapped (121,708,194) or mapped to <10 loci (9,406,713), yet picard reports only 102,430,303 in "READ_PAIRS_EXAMINED".

This is my samtools flagstat output for the alignment file prior to deduplication, right after STAR alignment. Is there anything here that would make picard skip those reads? The documentation says it does not consider pairs in which both reads are not properly paired. For reads that map to multiple loci, it only considers the primary alignment, so reads mapping to multiple loci should still be counted. Thank you! :)

Metric Count (QC-passed + QC-failed)
in total (QC-passed reads + QC-failed reads) 590468720 + 0
primary 268864074 + 0
secondary 321604646 + 0
supplementary 0 + 0
duplicates 0 + 0
primary duplicates 0 + 0
mapped (100.00% : N/A) 590468720 + 0
primary mapped (100.00% : N/A) 268864074 + 0
paired in sequencing 268864074 + 0
read1 134430016 + 0
read2 134434058 + 0
properly paired (100.00% : N/A) 268859950 + 0
with itself and mate mapped 268859950 + 0
singletons (0.00% : N/A) 4124 + 0
with mate mapped to a different chr 0 + 0
with mate mapped to a different chr (mapQ>=5) 0 + 0
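For reference, the gap between flagstat and picard can be quantified with simple shell arithmetic. MarkDuplicates can at most examine one pair per two mate-mapped reads, so (assuming the 102,430,303 figure comes from the MarkDuplicates metrics file; the variable names below are just for illustration):

```shell
# From samtools flagstat: reads where both the read and its mate are mapped
BOTH_MAPPED_READS=268859950
# Each candidate pair contributes two reads
CANDIDATE_PAIRS=$((BOTH_MAPPED_READS / 2))
# READ_PAIRS_EXAMINED reported by MarkDuplicates
EXAMINED=102430303
echo "candidate pairs:    $CANDIDATE_PAIRS"
echo "pairs not examined: $((CANDIDATE_PAIRS - EXAMINED))"
```

This prints 134,429,975 candidate pairs and a shortfall of 31,999,672 pairs, which is the discrepancy being asked about.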

I then chose to remove duplicate read pairs using picard.

You should not remove duplicates from RNA-seq data (see --> Removing Duplicates From Rna-Seq Data ) unless you are planning to do SNP analysis.


Hi,

Thank you for your response. I have spent more time than I would like to admit deciding whether or not to remove duplicates, and there are quite a few reasons why we decided to do that. At this stage, I was looking for an answer to why, specifically, picard MarkDuplicates is choosing not to even consider such a high percentage of reads :).


there are quite a few reasons why we decided to do that

Then you would be doing so at your own peril, especially if you are planning to do differential expression analysis. Commonly used DE programs expect raw counts (derived from the entire dataset).

If you are looking to truly de-duplicate your data, you should consider using a method that depends only on sequence (and not on alignments). See --> Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files, which can also remove duplicates.
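For sequence-based deduplication of paired-end data, a minimal Clumpify invocation might look like the following (clumpify.sh ships with BBTools; the file names are placeholders):

```shell
# dedupe removes reads whose sequences are identical to another read's
clumpify.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
    out=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
    dedupe
```

Because this runs on the FASTQ files before alignment, the resulting counts are not affected by mapping filters the way alignment-based deduplication is.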
