Hi,
I aligned my reads to the human genome using STAR. For multimappers, I kept the default of <10 loci.
I then chose to remove duplicate read pairs using Picard MarkDuplicates, but the "READ_PAIRS_EXAMINED" values for all of my samples are surprisingly low. I do not understand how Picard can skip such a large number of read pairs entirely. For example, one sample has a total of 131,114,907 pairs that are either uniquely mapped (121,708,194) or mapped to <10 loci (9,406,713), yet Picard only reports 102,430,303 under "READ_PAIRS_EXAMINED".
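For reference, a minimal sketch of the pipeline described above (file names, thread count, and index path are placeholders; --outFilterMultimapNmax 10 is the STAR default being kept):

```
# Align paired-end reads with STAR, keeping the default multimapper limit
STAR --runThreadN 8 \
     --genomeDir hg38_STAR_index \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFilterMultimapNmax 10 \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix sample_

# Remove duplicate read pairs with Picard; the metrics file reports
# READ_PAIRS_EXAMINED among other statistics
java -jar picard.jar MarkDuplicates \
     I=sample_Aligned.sortedByCoord.out.bam \
     O=sample_dedup.bam \
     M=sample_dup_metrics.txt \
     REMOVE_DUPLICATES=true
```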
This is my samtools flagstat output for the alignment file prior to deduplication, right after STAR alignment. Is there anything here that would make Picard skip those reads? The documentation says it does not consider pairs unless both reads are properly paired, and for reads mapping to multiple loci it only considers the primary alignment, so multimappers should still be counted. (A samtools command to cross-check the primary-only counts is sketched after the table.) Thank you! :)
| Metric | Count |
| --- | --- |
| in total (QC-passed reads + QC-failed reads) | 590468720 + 0 |
| primary | 268864074 + 0 |
| secondary | 321604646 + 0 |
| supplementary | 0 + 0 |
| duplicates | 0 + 0 |
| primary duplicates | 0 + 0 |
| mapped (100.00% : N/A) | 590468720 + 0 |
| primary mapped (100.00% : N/A) | 268864074 + 0 |
| paired in sequencing | 268864074 + 0 |
| read1 | 134430016 + 0 |
| read2 | 134434058 + 0 |
| properly paired (100.00% : N/A) | 268859950 + 0 |
| with itself and mate mapped | 268859950 + 0 |
| singletons (0.00% : N/A) | 4124 + 0 |
| with mate mapped to a different chr | 0 + 0 |
| with mate mapped to a different chr (mapQ>=5) | 0 + 0 |
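As a cross-check (a minimal sketch; the BAM name is a placeholder): flagstat tallies primary and secondary records together, while Picard examines only primary alignments, so counting primary, properly paired reads directly should come close to 2 × READ_PAIRS_EXAMINED.

```
# Count reads that are properly paired (-f 2) and neither secondary nor
# supplementary (-F 0x900); divide by 2 to get read pairs
samtools view -c -f 2 -F 0x900 sample_Aligned.sortedByCoord.out.bam
```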
You should not remove duplicates from RNA-seq data (see --> Removing Duplicates From Rna-Seq Data) unless you are planning to do SNP analysis.
Hi,
Thank you for your response. I have spent more time than I would like to admit deciding whether or not to remove duplicates, and there are quite a few reasons why we decided to do it. At this stage, I am specifically looking for an answer to why Picard MarkDuplicates does not even consider such a high percentage of reads :).
Then you would be doing so at your own peril, especially if you are planning to do differential expression analysis. Commonly used DE programs expect raw counts (i.e., counts from the entire dataset).
If you are looking to truly deduplicate your data, you should consider a method that depends only on sequence (and not on alignments). See --> Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
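A minimal sketch of sequence-based deduplication with Clumpify from BBTools (file names are placeholders; dedupe=t enables duplicate removal):

```
# Deduplicate paired-end FASTQ files by sequence, before any alignment
clumpify.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
            out1=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
            dedupe=t
```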