featureCounts: Low percentage of assigned fragments
0
5
Entering edit mode
6.6 years ago

Hi,

I have Illumina paired-end RNA-Seq data (prepared with the TruSeq stranded kit) for human tissue biopsies.

After QC and alignment to the ENSEMBL genome and gtf (GRCh38 rel 84 from ensembl.org) using STAR (alignment perc. of 75% - 90%), I use featureCounts (in R) to count genes.

I get very different rates of successfully assigned fragments, ranging from 20% to about 60%, with the majority about 45%. Is this normal? Or could something be wrong with (some of) my libraries? Do I need to account for the different assignment proportions?

Many thanks!

RNA-Seq rna-seq • 17k views
1
Entering edit mode

Take a look in a genome browser, try to figure out if the reads fall along non coding regions, introns, or perhaps you can track peaks in unexpected regions.

0
Entering edit mode

Are you using the correct "stranded" option for TruSeq which should be "reverse stranded"?

0
Entering edit mode

Yes, I use strandSpecific=-1. With forward stranded, I got less than 5% in all samples.

0
Entering edit mode

Did you use -s 2 with featureCounts?

0
Entering edit mode

Yes, or -1 (which seems to be the same thing).

0
Entering edit mode

You may have a lot of multi-mapped reads. Those would not be assigned a gene. What is you alignment rate for unique alignments only?

0
Entering edit mode

I have 75% - 90% of uniquely mapped reads. % of reads mapped to multiple loci ranges from 15% to about 5% - could this be the reason for the observed low assignment rate? Would you suggest to (i) accept this and go on, or (ii) try to count the multiple loci reads, too (which is possible in featurecounts, if I am not mistaken).

0
Entering edit mode

That number of uniquely mapped reads is fine, nothing to worry about. Most likely obvious, but you did align to hg38 and count for the same genome? Could you share the featureCounts command/code you used?

0
Entering edit mode

Sure:

counts <- featureCounts(files=files, annot.ext="/data/STARgenomes/ensembl/Homo_sapiens.GRCh38.84.gtf", isPairedEnd=TRUE, isGTFAnnotationFile=TRUE, strandSpecific=2, nthreads=12, useMetaFeatures=TRUE)


I used the gft I gave to STAR when generating its genome -

STAR --runThreadN 12 --runMode genomeGenerate --genomeDir /data/STARgenomes/ensembl/ --genomeFastaFiles /data/STARgenomes/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /data/STARgenomes/ensembl/Homo_sapiens.GRCh38.84.gtf

0
Entering edit mode

I also get this warning, but that should be rather normal because of the multi-mapping? I don´t have STAR sort my SAMs - is that ok?

 WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 26620478                                              ||
||    Successfully assigned fragments : 12283289 (46.1%)                      ||
||    Running time : 0.69 minutes

0
Entering edit mode

Seems nothing to worry about. Could you tell me, for example for this sample, how many reads you had (before mapping!). It is possible that some reads map each many many times, and as such inflate the 26620478, and lower the percentage. In fact, you want to compare the 'Successfully assigned fragments' to the 'uniquely mapped' reads.

0
Entering edit mode

I feel you may be right, it may all be full of multi-mapping reads.

Different from the defaults, I use several options the STAR manual presents as ENCODE standard options (and I therefore found them possibly useful)

--outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 20 --alignIntronMax 10000 --alignMatesGapMax 1000000

Possibly, those will add to further inflation?

See below for the logfile for the sample I pasted before (and many thanks for you efforts, btw!)

                             Started job on |   Jul 20 06:14:44
Started mapping on |   Jul 20 06:14:45
Finished on |   Jul 20 06:19:14
Mapping speed, Million of reads per hour |   264.22

Number of input reads |   19743421
Average input read length |   246
Uniquely mapped reads number |   16959018
Uniquely mapped reads % |   85.90%
Average mapped length |   245.42
Number of splices: Total |   11930188
Number of splices: Annotated (sjdb) |   11861337
Number of splices: GT/AG |   11858385
Number of splices: GC/AG |   61520
Number of splices: AT/AC |   6388
Number of splices: Non-canonical |   3895
Mismatch rate per base, % |   0.21%
Deletion rate per base |   0.01%
Deletion average length |   1.73
Insertion rate per base |   0.01%
Insertion average length |   1.55
Number of reads mapped to multiple loci |   1908781
% of reads mapped to multiple loci |   9.67%
Number of reads mapped to too many loci |   9447
% of reads mapped to too many loci |   0.05%
% of reads unmapped: too many mismatches |   0.00%
% of reads unmapped: too short |   4.32%
% of reads unmapped: other |   0.06%
Number of chimeric reads |   0
% of chimeric reads |   0.00%

0
Entering edit mode

Based on this, 12283289 were assigned out of 16959018 uniquely mapping reads. That's already better than the fraction you had earlier :)

0
Entering edit mode

Many thanks for your insightful comments. I did a little reading on multi-mapping reads. From what I gather, I would rather avoid to count all of them in an RNA Seq diff exp analysis, would you agree?

1
Entering edit mode

I guess most software by default ignores multimapping reads so you don't have to worry about then. But you could argue that ignoring those reads is a loss of information. You cannot just combine the reads with the uniquely mapped, but I find the method described in this paper an interesting approach.

0
Entering edit mode

You could choose to map them to a random best site (that way you would not lose that information). ambig=random is what BBMap uses for this purpose.