Hi,
I aligned my reads to the human genome using STAR. For multimappers, I kept the default of <10 loci.
I then chose to remove duplicate read pairs using Picard MarkDuplicates, but the "READ_PAIRS_EXAMINED" values for all of my samples are surprisingly low. I do not understand how Picard can skip such a large number of read pairs entirely. For example, one sample has a total of 131,114,907 pairs that are either uniquely mapped (121,708,194) or mapped to <10 loci (9,406,713), yet Picard only reports 102,430,303 under "READ_PAIRS_EXAMINED".
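For reference, a minimal sketch of the pipeline described above (file names, thread count, and index path are placeholders; --outFilterMultimapNmax 10 is the STAR default being kept):

```
# Align paired-end reads with STAR, keeping the default multimapper limit
STAR --runThreadN 8 \
     --genomeDir hg38_STAR_index \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFilterMultimapNmax 10 \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix sample_

# Remove duplicate read pairs with Picard; the metrics file reports
# READ_PAIRS_EXAMINED among other statistics
java -jar picard.jar MarkDuplicates \
     I=sample_Aligned.sortedByCoord.out.bam \
     O=sample_dedup.bam \
     M=sample_dup_metrics.txt \
     REMOVE_DUPLICATES=true
```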
This is my samtools flagstat output for the alignment file prior to deduplication, right after STAR alignment. Is there anything here that would make Picard skip those reads? The documentation says it does not consider pairs unless both reads are properly paired, and for reads mapping to multiple loci it only considers the primary alignment, so multimappers should still be counted. (A samtools command to cross-check the primary-only counts is sketched after the table.) Thank you! :)
| Metric | Count |
| --- | --- |
| in total (QC-passed reads + QC-failed reads) | 590468720 + 0 |
| primary | 268864074 + 0 |
| secondary | 321604646 + 0 |
| supplementary | 0 + 0 |
| duplicates | 0 + 0 |
| primary duplicates | 0 + 0 |
| mapped (100.00% : N/A) | 590468720 + 0 |
| primary mapped (100.00% : N/A) | 268864074 + 0 |
| paired in sequencing | 268864074 + 0 |
| read1 | 134430016 + 0 |
| read2 | 134434058 + 0 |
| properly paired (100.00% : N/A) | 268859950 + 0 |
| with itself and mate mapped | 268859950 + 0 |
| singletons (0.00% : N/A) | 4124 + 0 |
| with mate mapped to a different chr | 0 + 0 |
| with mate mapped to a different chr (mapQ>=5) | 0 + 0 |
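As a cross-check (a minimal sketch; the BAM name is a placeholder): flagstat tallies primary and secondary records together, while Picard examines only primary alignments, so counting primary, properly paired reads directly should come close to 2 × READ_PAIRS_EXAMINED.

```
# Count reads that are properly paired (-f 2) and neither secondary nor
# supplementary (-F 0x900); divide by 2 to get read pairs
samtools view -c -f 2 -F 0x900 sample_Aligned.sortedByCoord.out.bam
```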
You should not remove duplicates from RNA-seq data (see --> Removing Duplicates From Rna-Seq Data) unless you are planning to do SNP analysis.
Hi,
Thank you for your response. I have spent more time than I would like to admit deciding whether or not to remove duplicates, and there are quite a few reasons why we decided to do it. At this stage, I am specifically looking for an answer to why Picard MarkDuplicates does not even consider such a high percentage of reads :).
Then you would be doing so at your own peril, especially if you are planning to do differential expression analysis. Commonly used DE programs expect raw counts (i.e., counts from the entire dataset).
If you are looking to truly deduplicate your data, you should consider a method that depends only on sequence (and not on alignments). See --> Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
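A minimal sketch of sequence-based deduplication with Clumpify from BBTools (file names are placeholders; dedupe=t enables duplicate removal):

```
# Deduplicate paired-end FASTQ files by sequence, before any alignment
clumpify.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
            out1=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
            dedupe=t
```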