I'm using Qualimap to quantify the genomic origin of reads of several paired-end samples from SRA. I have downsampled those to be exactly 1M reads by doing head -n 4000000 file_{1,2}.fastq > subsampled_{1,2}.fastq
.
I'm running the samples through STAR and then I'm using Picard markduplicates to identify duplicate reads since the samples do not use UMIs. My issue is that altough the samples have exactly 1M reads, and the initial FastQC correctly reports that there are indeed exactly 1M reads, Qualimap consistently assigns more than 1M reads to genomic features, such as ~1.5M reads mapping to exons, and then a bit more to introns and intergenic regions. The STAR logs also report that there were exactly 1M reads to map. The reads assigned to genomic features by qualimap, when added up, do not add up to 2M, so I don't think Qualimap is quantifying fragments instead of reads, which could explain the discrepancy.
I'm running Qualimap version 2.2.2d using the following code:
qualimap \
rnaseq \
-bam input.markdup.sorted.bam \
-gtf annotation.gtf \
-p strand-specific-reverse \
-pe \
-outdir out/
I've made sure that the counting algorithm is uniquely-mapped-reads
, as is shown in this screenshot from the Qualimap report:
I've also made sure that the strandedness is correct, and that the annotation is just gencode v39 comprehensive annotation. I don't know what else could I be missing or what could explain these inconsistencies in the quantification. Any help is appreciated, thanks in advance!
supplementary reads + secondary alignments ? How can read-count from BAM file be greater than than from fastq file?
Thank you for your reply! So if I understand correctly, when the counting algorithm is
uniquely-mapped-reads
the secondary and chimeric alignments are also included? Do you know if there is any Qualimap option to make sure this does not happen? Thanks again!