Question

featureCounts percentage of successful assigned alignments lower than expected

0

Entering edit mode

2.4 years ago

gt ▴ 30

Still relatively new to RNA-seq data. I have taken a single BAM file (filtered to proper pairs) for human paired-end RNA-seq data. Below is the output for STAR alignment.

                      Number of input reads |       172222031
                  Average input read length |       245
                                UNIQUE READS:
               Uniquely mapped reads number |       123631446
                    Uniquely mapped reads % |       71.79%
                      Average mapped length |       253.43
                   Number of splices: Total |       56672165
        Number of splices: Annotated (sjdb) |       55335338
                   Number of splices: GT/AG |       55607873
                   Number of splices: GC/AG |       478746
                   Number of splices: AT/AC |       52251
           Number of splices: Non-canonical |       533295
                  Mismatch rate per base, % |       0.51%
                     Deletion rate per base |       0.02%
                    Deletion average length |       1.75
                    Insertion rate per base |       0.02%
                   Insertion average length |       1.38
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       38635060
         % of reads mapped to multiple loci |       22.43%
    Number of reads mapped to too many loci |       523608
         % of reads mapped to too many loci |       0.30%
                              UNMAPPED READS:
Number of reads unmapped: too many mismatches |       0
    % of reads unmapped: too many mismatches |       0.00%
        Number of reads unmapped: too short |       8598098
             % of reads unmapped: too short |       4.99%
            Number of reads unmapped: other |       833819
                 % of reads unmapped: other |       0.48%
                              CHIMERIC READS:
                   Number of chimeric reads |       0
                        % of chimeric reads |       0.00%

So I have 71.79% of the reads being uniquely mapped to the reference (not horrendous). However, when I look at the results of featureCounts, I have the following output.

|| Process BAM file Aligned.sortedByCoord.out.properPairs.bam...              ||
||    Strand specific : reversely stranded                                    ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 401232754                                            ||
||    Successfully assigned alignments : 67882178 (16.9%)                     ||
||    Running time : 15.64 minutes                                            ||

Below is the more detailed output file from featureCounts.

Assigned        67882178
Unassigned_Unmapped     0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 277601308
Unassigned_Secondary    0
Unassigned_NonSplit     0
Unassigned_NoFeatures   50284182
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    5465086

Clearly, featureCounts is dividing the number of assigned alignments to the total alignments, including the multi-mapped reads. Should I instead take the number of alignments assigned divided by the total number of uniquely mapped reads? This would instead give me 55% of alignments being assigned. Is there a way to change the percentage assigned reported in featureCounts command? Also, is 55% assigned to exons reasonable for human RNA-seq data?

My STAR command is the following:

STAR --runThreadN 16 --readFilesCommand zcat --outFilterMultimapNmax 20 --genomeDir 'star_genome_index' --outSAMtype BAM SortedByCoordinate --outFileNamePrefix 'BAM/' --readFilesIn 'R1.fastq.gz' 'R2.fastq.gz'

My featureCounts command is the following:

featureCounts -p -t exon -g gene_id -s 2 -T 16 -a gencode.v38.annotation.gtf -o counts.txt Aligned.sortedByCoord.out.properPairs.bam

I did use the infer_experiment.py to check the strandedness of the library and the majority of the reads could not be inferred. I have tried both reverse stranded and un-stranded in the featureCounts command and get almost identical % assigned.

RNA-seq STAR featureCounts • 1.8k views

ADD COMMENT • link 2.4 years ago by gt ▴ 30