Still relatively new to RNA-seq data. I have taken a single BAM file (filtered to proper pairs) for human paired-end RNA-seq data. Below is the output for STAR alignment.
Number of input reads | 172222031
Average input read length | 245
UNIQUE READS:
Uniquely mapped reads number | 123631446
Uniquely mapped reads % | 71.79%
Average mapped length | 253.43
Number of splices: Total | 56672165
Number of splices: Annotated (sjdb) | 55335338
Number of splices: GT/AG | 55607873
Number of splices: GC/AG | 478746
Number of splices: AT/AC | 52251
Number of splices: Non-canonical | 533295
Mismatch rate per base, % | 0.51%
Deletion rate per base | 0.02%
Deletion average length | 1.75
Insertion rate per base | 0.02%
Insertion average length | 1.38
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 38635060
% of reads mapped to multiple loci | 22.43%
Number of reads mapped to too many loci | 523608
% of reads mapped to too many loci | 0.30%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 8598098
% of reads unmapped: too short | 4.99%
Number of reads unmapped: other | 833819
% of reads unmapped: other | 0.48%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
So I have 71.79% of the reads being uniquely mapped to the reference (not horrendous). However, when I look at the results of featureCounts, I have the following output.
|| Process BAM file Aligned.sortedByCoord.out.properPairs.bam... ||
|| Strand specific : reversely stranded ||
|| Paired-end reads are included. ||
|| Total alignments : 401232754 ||
|| Successfully assigned alignments : 67882178 (16.9%) ||
|| Running time : 15.64 minutes ||
Below is the more detailed output file from featureCounts.
Assigned 67882178
Unassigned_Unmapped 0
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 277601308
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 50284182
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 5465086
Clearly, featureCounts is dividing the number of assigned alignments to the total alignments, including the multi-mapped reads. Should I instead take the number of alignments assigned divided by the total number of uniquely mapped reads? This would instead give me 55% of alignments being assigned. Is there a way to change the percentage assigned reported in featureCounts command? Also, is 55% assigned to exons reasonable for human RNA-seq data?
My STAR command is the following:
STAR --runThreadN 16 --readFilesCommand zcat --outFilterMultimapNmax 20 --genomeDir 'star_genome_index' --outSAMtype BAM SortedByCoordinate --outFileNamePrefix 'BAM/' --readFilesIn 'R1.fastq.gz' 'R2.fastq.gz'
My featureCounts command is the following:
featureCounts -p -t exon -g gene_id -s 2 -T 16 -a gencode.v38.annotation.gtf -o counts.txt Aligned.sortedByCoord.out.properPairs.bam
I did use the infer_experiment.py
to check the strandedness of the library and the majority of the reads could not be inferred. I have tried both reverse stranded and un-stranded in the featureCounts command and get almost identical % assigned.