Understanding MarkDuplicates and featureCounts results
0
2
Entering edit mode
2.4 years ago

I am trying to understand the effects of sorting on MarkDuplicates and how to interpret the numbers in the featureCounts log.

According to the MarkDuplicates documentation, it handles coordinated-sorted and query-sorted BAM inputs slightly differently:

The program can take either coordinate-sorted or query-sorted inputs, however the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.

Q1: Should the sorting in the MarkDuplicates step affect featureCounts? It sounds like query-sorted may remove some additional _unmapped_ mates and secondary/supplementary reads. According to the featureCounts documentation, it does not count any multi-mappings by default. And I would imagine quantification only works on mapped reads and not unmapped either, so I thought the featureCounts result should be the same no matter how MarkDuplicates was performed. But when I ran featureCounts on both types of MarkDuplicates sorting outputs, there are some slight difference in the counts files, albeit not a lot. Why is that?

Q2: I am having some trouble seeing where the numbers from the featureCounts logs come from. The BAM file below was generated by MarkDuplicates using coordinate-sorted mode.

samtools flagstat output for BAM input to featureCounts:

35483648 + 0 in total (QC-passed reads + QC-failed reads)
24895287 + 0 primary
10588361 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
20155191 + 0 mapped (56.80% : N/A)
9566830 + 0 primary mapped (38.43% : N/A)
24895287 + 0 paired in sequencing
12334176 + 0 read1
12561111 + 0 read2
9539764 + 0 properly paired (38.32% : N/A)
9539764 + 0 with itself and mate mapped
27066 + 0 singletons (0.11% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

MarkDuplicates log that generated the deduped BAM file summarized above:

LIBRARY:            Unknown Library
UNPAIRED_READS_EXAMINED:    395123 
READ_PAIRS_EXAMINED:        17983810
SECONDARY_OR_SUPPLEMENTARY_RDS: 10588361
UNMAPPED_READS:         15328457
UNPAIRED_READ_DUPLICATES:   368057
READ_PAIR_DUPLICATES:       13213928
READ_PAIR_OPTICAL_DUPLICATES:   0
PERCENT_DUPLICATION:        0.736906
ESTIMATED_LIBRARY_SIZE:     4893981

featureCounts log (relevant parts):

//========================== featureCounts setting ===========================\\
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/chan/mice_microgravity/genomes/transcripts/ ... ||
||    Features : 663359                                                       ||
||    Meta-features : 43432                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file /home/chan/mice_microgravity/outputs_coordinate/picar ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 17999315                                              ||
||    Successfully assigned fragments : 2902259 (16.1%)                       ||
||    Running time : 1.30 minutes                                             ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

featureCounts output:

Assigned            2902259
Unassigned_Ambiguity        132370
Unassigned_MultiMapping     5895062
Unassigned_NoFeatures       1602957
Unassigned_Unmapped     7466667
Unassigned_MappingQuality   0
Unassigned_FragmentLength   0
Unassigned_Chimera      0
Unassigned_Secondary        0
Unassigned_Nonjunction      0
Unassigned_Duplicate        0

In the featureCounts log, it says the total fragments is 17999315 (2902259 + 132370 + 5895062 + 1602957 + 7466667, of which 2902259 is assigned), but not sure where this number came from.

I think I see where the unmapped fragments (Unassigned_Unmapped) is from. Since this is paired end data, 7466667 * 2 = 14933334 unmapped alignments = total - mapped - singleton - UNPAIRED_READ_DUPLICATES = 35483648 - 20155191 - 27066 - 368057 (looking at the samtools flagstat and MarkDuplicates outputs)

But not sure about the remaining mapped fragments: 2902259 + 132370 + 5895062 + 1602957 = 10532648 * 2 = 21065296 mapped alignments. But looking at the samtools flagstat output, there's only 20155191 mapped alignments, or counting singletons, 20155191 + 27066 = 20182257 alignments, which is still less than the ones reported by featureCounts. Where is featureCounts getting its total fragment count from?

samtools MarkDuplicates featureCounts • 805 views
ADD COMMENT

Login before adding your answer.

Traffic: 2536 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6