Hi there, I'm wondering if anybody can shed some light on what is happening during the count table step with featureCounts. I am losing more than half of my reads, even though my mapping statistics from STAR seem fine.
My library is 75 bp paired-end, prepared with the Nugen Ovation Universal kit. The RNA is from rat. I downloaded the NCBI genome and built the STAR index. Here is my command to run STAR:
STAR --runThreadN 12 \
--genomeDir <path to...>/genomes/rn6/ncbi/star \
--readFilesIn ${R1} ${R2} \
--outFileNamePrefix starMapped/${job_name} \
--outSAMtype BAM Unsorted \
--seedSearchStartLmax 40 \
--outFilterScoreMinOverLread 0.5 \
--outFilterMatchNminOverLread 0.5
My mapping rate is 84-89%. A representative Log.final.out:
Uniquely mapped reads % | 85.59%
Average mapped length | 147.92
Number of splices: Total | 23011150
Number of splices: Annotated (sjdb) | 19835290
Number of splices: GT/AG | 22429313
Number of splices: GC/AG | 180851
Number of splices: AT/AC | 22581
Number of splices: Non-canonical | 378405
Mismatch rate per base, % | 0.28%
Deletion rate per base | 0.02%
Deletion average length | 1.98
Insertion rate per base | 0.01%
Insertion average length | 1.66
% of reads mapped to multiple loci | 10.03%
% of reads mapped to too many loci | 0.38%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 3.55%
% of reads unmapped: other | 0.45%
Next, I run featureCounts using the following command:
featureCounts -T 12 -p -t exon -g gene_id -a <path to...>/NCBI/Annotation/Genes/genes.gtf -o combined_counts.txt *.bam
My output from featureCounts looks like:
Successfully assigned fragments : 41071240 (44.6%)
And this is representative of one sample in the summary file:
Assigned 41243743
Unassigned_Ambiguity 259701
Unassigned_MultiMapping 30155153
Unassigned_NoFeatures 20857145
My question is, why am I losing so many reads at the step of making the count table? Why are multi-mappers ~10% with STAR and then ~30% with featureCounts?
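For reference, the split in the summary above can be tabulated as percentages of everything featureCounts tallied for that sample. This is only a minimal sketch: the helper name summarize_counts, the temporary file, and the sample.bam header are made up for illustration, and only the four counts come from the summary quoted above. Keep in mind that STAR reports its percentages per input read, whereas the featureCounts summary may count each reported alignment of a multi-mapping read separately, so the two multi-mapping figures do not necessarily share a denominator.

```shell
# Tabulate a featureCounts .summary file as percentages of all tallied entries.
# summarize_counts is a hypothetical helper name, not part of featureCounts.
summarize_counts() {
  awk 'NR > 1 { name[++n] = $1; count[n] = $2; total += $2 }  # skip the header row
       END {
         if (total == 0) exit 1
         for (i = 1; i <= n; i++)
           if (count[i] > 0)
             printf "%-24s %10d %5.1f%%\n", name[i], count[i], 100 * count[i] / total
       }' "$1"
}

# Example input: the four rows quoted above; "sample.bam" is a placeholder column name.
summary_file=$(mktemp)
printf '%s\t%s\n' \
  Status sample.bam \
  Assigned 41243743 \
  Unassigned_Ambiguity 259701 \
  Unassigned_MultiMapping 30155153 \
  Unassigned_NoFeatures 20857145 > "$summary_file"

summarize_counts "$summary_file"
```

On a real run, the same function could be pointed at combined_counts.txt.summary, which featureCounts writes next to the -o output; with several BAM files that file has one column per sample, and this sketch only reads the first.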
Although I can't give you hard numbers, it is not uncommon to see a substantial drop between the mapping rate and the feature-assignment rate. It depends on several factors, and someone may chime in with more suggestions, but how good is the Rattus norvegicus annotation? In general, I consider the human and mouse annotations to be of very high quality, with all other annotations being average at best. I am not familiar with the R. norvegicus annotation, though.
Then try with -s 2, because I don't think your library is unstranded. If your library were truly unstranded, an assigned rate of 1% would not be realistic: you would expect about half of the reads to map to each strand, so about half of the reads should have been assigned. This looks like a "reverse stranded" library incorrectly counted as "forward stranded".
I'm also wondering: was this total RNA-seq data, or polyA?
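One way to test the -s 2 suggestion above is to run featureCounts once per strandedness setting on a single BAM and compare the assigned rates. The following is a sketch under assumptions: strand_check_commands is a hypothetical helper, sample.bam and genes.gtf are placeholder file names, and the remaining options are simply copied from the command in the question.

```shell
# Print one featureCounts command per -s setting so the assigned rates can be compared.
# strand_check_commands is a hypothetical helper; the default file names are placeholders.
strand_check_commands() {
  bam=${1:-sample.bam}
  gtf=${2:-genes.gtf}
  for s in 0 1 2; do  # 0 = unstranded, 1 = stranded, 2 = reversely stranded
    echo "featureCounts -T 12 -p -s $s -t exon -g gene_id -a $gtf -o counts_s$s.txt $bam"
  done
}

# Show the three commands.
strand_check_commands

# Execute them only if featureCounts is installed and the placeholder inputs exist.
if command -v featureCounts >/dev/null 2>&1 && [ -f sample.bam ] && [ -f genes.gtf ]; then
  strand_check_commands | sh
fi
```

The Assigned rows in counts_s0.txt.summary, counts_s1.txt.summary and counts_s2.txt.summary can then be compared. If one stranded setting assigns far more than the other, that setting is probably the library's orientation.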