featureCounts duplicate counting
1
0
Entering edit mode
7.5 years ago
lkmklsmn ▴ 950

Hi,

I am working on a set of RNAseq samples. I am using featureCounts from the subread package in order to count fragments (not reads) falling into genomic features. I am using the command line option --ignoreDup to exclude duplicate reads. The results seem somewhat strange...

Here is the summary file provided by featureCounts for two samples:

Status  sample1.bam
Assigned        12019290
Unassigned_Ambiguity    0
Unassigned_MultiMapping 7794471
Unassigned_NoFeatures   16908358
Unassigned_Unmapped     0
Unassigned_MappingQuality       0
Unassigned_FragementLength      0
Unassigned_Chimera      0
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_Duplicate    68111548

Status  sample2.bam
Assigned        48247506
Unassigned_Ambiguity    0
Unassigned_MultiMapping 15519394
Unassigned_NoFeatures   67192231
Unassigned_Unmapped     0
Unassigned_MappingQuality       0
Unassigned_FragementLength      0
Unassigned_Chimera      0
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_Duplicate    0


All samples were processed in the same way and same commands were used to run featureCounts. Why does one sample have 0 unassigned duplicates and the other 68111548? This difference seems so black and white that I am afraid there is an error somewhere. What exactly does it mean to have unassigned duplicates?

subread RNAseq Duplicates featureCounts • 5.5k views
1
Entering edit mode
7.5 years ago
dbpzdbpz ▴ 150

When you specified "--ignoreDup", featureCounts will test the FLAGS field in the SAM/BAM file to see if it is a duplicated mapping. In particular, the bit in the FLAGS being tested is 0x400 (1024). This bit is set by the aligner according to its own strategy.

Based on the summarization information you provided, it seems that the two BAM files were mapped by using different settings. Sample1.bam was generated by allowing duplicated mapping but Sample2.bam did not. You may use samtools to extract few records from both BAM files to compare their FLAGS fields (the second TAB-delimitaed field). Sample1.bam should have a lot of records where the FLAGS values are larger than 1024, but Sampl2.bam do not.

0
Entering edit mode

You were right. After using the MarkDuplicates function form Picard tools and rerunning featureCounts I was able to "recognize" duplicates in the fragment counting. However, I am not aware of using different settings for the alignment. But it is very weird that some of the samples were flagged and others werent.