Hi everyone,
I’m performing a standard RNA-seq analysis on mouse (Mus musculus) and encountered an issue at the counting step using featureCounts. Below is my workflow:
- Downloaded the mouse reference genome from NCBI (Assembly: GCF_000001635.27/GRCm39) and built the index.
- Performed quality control on raw sequencing data using Trimmomatic.
- Aligned the trimmed reads to the reference genome with STAR, producing sorted BAM files.
- Counted reads using featureCounts to obtain raw count matrix:
featureCounts -T 25 -s 2 \
-a mmu_hg38/ncbi_dataset/data/GCF_000001635.27/GCF_000001635.27_GRCm39_genomic.gtf \
-o BD-1-1_featurecounts.txt \
BD-1-1_counts_Aligned.sortedByCoord.out.bam
Here is the summary file from the result of above code:
Status BD-1-1_counts_Aligned.sortedByCoord.out.bam
Assigned 25698967
Unassigned_Unmapped 2131357
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 8512525
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 27869011
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 137274
As you can see, a large number of reads (~27.9 million) are labeled as Unassigned_NoFeatures, and only ~25.7 million reads are “Assigned”. Considering the total alignments reported (~64.3 million), the assigned proportion (~40%) is relatively low.
Questions:
- Is this assigned/unassigned read distribution reasonable for a mouse RNA-seq experiment?
- Could the
-s 2
option (reverse-stranded) be causing or contributing to this high proportion of “NoFeatures” reads? - What other common reasons exist for such a high “Unassigned_NoFeatures” count (for example, annotation mismatch, genome build/version inconsistency, un–annotated transcripts, etc.)?
Do you have suggestions for next diagnostic steps to identify the root cause?
Thank you in advance for any advice or insights!
It's common to have some fraction of the reads to be unassigned but the amount (or ratio) you observe is too high I feel.
Can you tell a bit more on the RNA input data? was this for instance a rRNA depleted sample (or was it 'whole RNA' sample) ?
and yes all those reasons you list there can be the cause indeed, among even more other reasons ;)
Thank you very much for your thoughtful and detailed reply.
I successfully solved this problem by using the option
-s 0
but not-s 2
. The result of using option-s 0
is showing below:As we could see here, the result is much more suitable compared with the result of option
-s 2
.I guess this is because
-s 2
specifies the reverse strand of the reference gene, which caused a large number of reads to fail to match.Thanks again for your kind help!
While that is great do you know for sure if the library you are working with was made in an "unstranded" manner i.e. not preserving strand information. Knowing details about the experimental setup is important since that guides the options used for analysis.