I am trying to understand the effects of sorting on MarkDuplicates and how to interpret the numbers in the featureCounts log.
According to the MarkDuplicates documentation, it handles coordinated-sorted and query-sorted BAM inputs slightly differently:
The program can take either coordinate-sorted or query-sorted inputs, however the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.
Q1: Should the sorting in the MarkDuplicates step affect featureCounts? It sounds like query-sorted may remove some additional _unmapped_ mates and secondary/supplementary reads. According to the featureCounts documentation, it does not count any multi-mappings by default. And I would imagine quantification only works on mapped reads and not unmapped either, so I thought the featureCounts result should be the same no matter how MarkDuplicates was performed. But when I ran featureCounts on both types of MarkDuplicates sorting outputs, there are some slight difference in the counts files, albeit not a lot. Why is that?
Q2: I am having some trouble seeing where the numbers from the featureCounts logs come from. The BAM file below was generated by MarkDuplicates using coordinate-sorted mode.
samtools flagstat output for BAM input to featureCounts:
35483648 + 0 in total (QC-passed reads + QC-failed reads)
24895287 + 0 primary
10588361 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
20155191 + 0 mapped (56.80% : N/A)
9566830 + 0 primary mapped (38.43% : N/A)
24895287 + 0 paired in sequencing
12334176 + 0 read1
12561111 + 0 read2
9539764 + 0 properly paired (38.32% : N/A)
9539764 + 0 with itself and mate mapped
27066 + 0 singletons (0.11% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
MarkDuplicates log that generated the deduped BAM file summarized above:
LIBRARY: Unknown Library
UNPAIRED_READS_EXAMINED: 395123
READ_PAIRS_EXAMINED: 17983810
SECONDARY_OR_SUPPLEMENTARY_RDS: 10588361
UNMAPPED_READS: 15328457
UNPAIRED_READ_DUPLICATES: 368057
READ_PAIR_DUPLICATES: 13213928
READ_PAIR_OPTICAL_DUPLICATES: 0
PERCENT_DUPLICATION: 0.736906
ESTIMATED_LIBRARY_SIZE: 4893981
featureCounts log (relevant parts):
//========================== featureCounts setting ===========================\\
|| ||
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Strand specific : no ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : not counted ||
|| Both ends mapped : not required ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Load annotation file /home/chan/mice_microgravity/genomes/transcripts/ ... ||
|| Features : 663359 ||
|| Meta-features : 43432 ||
|| Chromosomes/contigs : 45 ||
|| ||
|| Process BAM file /home/chan/mice_microgravity/outputs_coordinate/picar ... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Read re-ordering is performed. ||
|| ||
|| Total fragments : 17999315 ||
|| Successfully assigned fragments : 2902259 (16.1%) ||
|| Running time : 1.30 minutes ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
featureCounts output:
Assigned 2902259
Unassigned_Ambiguity 132370
Unassigned_MultiMapping 5895062
Unassigned_NoFeatures 1602957
Unassigned_Unmapped 7466667
Unassigned_MappingQuality 0
Unassigned_FragmentLength 0
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0
In the featureCounts log, it says the total fragments is 17999315 (2902259 + 132370 + 5895062 + 1602957 + 7466667, of which 2902259 is assigned), but not sure where this number came from.
I think I see where the unmapped fragments (Unassigned_Unmapped
) is from. Since this is paired end data, 7466667 * 2 = 14933334 unmapped alignments = total - mapped - singleton - UNPAIRED_READ_DUPLICATES
= 35483648 - 20155191 - 27066 - 368057 (looking at the samtools flagstat and MarkDuplicates outputs)
But not sure about the remaining mapped fragments: 2902259 + 132370 + 5895062 + 1602957 = 10532648 * 2 = 21065296 mapped alignments. But looking at the samtools flagstat output, there's only 20155191 mapped alignments, or counting singletons, 20155191 + 27066 = 20182257 alignments, which is still less than the ones reported by featureCounts. Where is featureCounts getting its total fragment count from?