Question: featureCounts: Low percentage of assigned fragments
5
gravatar for aggregatibacter
3.3 years ago by
Bonn, Germany
aggregatibacter140 wrote:

Hi,

I have Illumina paired-end RNA-Seq data (prepared with the TruSeq stranded kit) for human tissue biopsies.

After QC and alignment to the ENSEMBL genome and gtf (GRCh38 rel 84 from ensembl.org) using STAR (alignment perc. of 75% - 90%), I use featureCounts (in R) to count genes.

I get very different rates of successfully assigned fragments, ranging from 20% to about 60%, with the majority about 45%. Is this normal? Or could something be wrong with (some of) my libraries? Do I need to account for the different assignment proportions?

Many thanks!

rna-seq • 6.5k views
ADD COMMENTlink written 3.3 years ago by aggregatibacter140
1

Take a look in a genome browser, try to figure out if the reads fall along non coding regions, introns, or perhaps you can track peaks in unexpected regions.

ADD REPLYlink written 3.3 years ago by Asaf6.4k

Are you using the correct "stranded" option for TruSeq which should be "reverse stranded"?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by genomax74k

Yes, I use strandSpecific=-1. With forward stranded, I got less than 5% in all samples.

ADD REPLYlink written 3.3 years ago by aggregatibacter140

Did you use -s 2 with featureCounts?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by genomax74k

Yes, or -1 (which seems to be the same thing).

ADD REPLYlink written 3.3 years ago by aggregatibacter140

You may have a lot of multi-mapped reads. Those would not be assigned a gene. What is you alignment rate for unique alignments only?

ADD REPLYlink written 3.3 years ago by igor8.8k

I have 75% - 90% of uniquely mapped reads. % of reads mapped to multiple loci ranges from 15% to about 5% - could this be the reason for the observed low assignment rate? Would you suggest to (i) accept this and go on, or (ii) try to count the multiple loci reads, too (which is possible in featurecounts, if I am not mistaken).

ADD REPLYlink written 3.3 years ago by aggregatibacter140

That number of uniquely mapped reads is fine, nothing to worry about. Most likely obvious, but you did align to hg38 and count for the same genome? Could you share the featureCounts command/code you used?

ADD REPLYlink written 3.3 years ago by WouterDeCoster42k

Sure:

counts <- featureCounts(files=files, annot.ext="/data/STARgenomes/ensembl/Homo_sapiens.GRCh38.84.gtf", isPairedEnd=TRUE, isGTFAnnotationFile=TRUE, strandSpecific=2, nthreads=12, useMetaFeatures=TRUE)

I used the gft I gave to STAR when generating its genome -

STAR --runThreadN 12 --runMode genomeGenerate --genomeDir /data/STARgenomes/ensembl/ --genomeFastaFiles /data/STARgenomes/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /data/STARgenomes/ensembl/Homo_sapiens.GRCh38.84.gtf
ADD REPLYlink modified 3.3 years ago by genomax74k • written 3.3 years ago by aggregatibacter140

I also get this warning, but that should be rather normal because of the multi-mapping? I don´t have STAR sort my SAMs - is that ok?

 WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 26620478                                              ||
||    Successfully assigned fragments : 12283289 (46.1%)                      ||
||    Running time : 0.69 minutes
ADD REPLYlink modified 3.3 years ago by genomax74k • written 3.3 years ago by aggregatibacter140

Seems nothing to worry about. Could you tell me, for example for this sample, how many reads you had (before mapping!). It is possible that some reads map each many many times, and as such inflate the 26620478, and lower the percentage. In fact, you want to compare the 'Successfully assigned fragments' to the 'uniquely mapped' reads.

ADD REPLYlink written 3.3 years ago by WouterDeCoster42k

I feel you may be right, it may all be full of multi-mapping reads.

Different from the defaults, I use several options the STAR manual presents as ENCODE standard options (and I therefore found them possibly useful)

--outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 20 --alignIntronMax 10000 --alignMatesGapMax 1000000

Possibly, those will add to further inflation?

See below for the logfile for the sample I pasted before (and many thanks for you efforts, btw!)

                             Started job on |   Jul 20 06:14:44
                         Started mapping on |   Jul 20 06:14:45
                                Finished on |   Jul 20 06:19:14
   Mapping speed, Million of reads per hour |   264.22

                      Number of input reads |   19743421
                  Average input read length |   246
                                UNIQUE READS:
               Uniquely mapped reads number |   16959018
                    Uniquely mapped reads % |   85.90%
                      Average mapped length |   245.42
                   Number of splices: Total |   11930188
        Number of splices: Annotated (sjdb) |   11861337
                   Number of splices: GT/AG |   11858385
                   Number of splices: GC/AG |   61520
                   Number of splices: AT/AC |   6388
           Number of splices: Non-canonical |   3895
                  Mismatch rate per base, % |   0.21%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.73
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.55
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1908781
         % of reads mapped to multiple loci |   9.67%
    Number of reads mapped to too many loci |   9447
         % of reads mapped to too many loci |   0.05%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   4.32%
                 % of reads unmapped: other |   0.06%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%
ADD REPLYlink written 3.3 years ago by aggregatibacter140

Based on this, 12283289 were assigned out of 16959018 uniquely mapping reads. That's already better than the fraction you had earlier :)

ADD REPLYlink written 3.3 years ago by WouterDeCoster42k

Many thanks for your insightful comments. I did a little reading on multi-mapping reads. From what I gather, I would rather avoid to count all of them in an RNA Seq diff exp analysis, would you agree?

ADD REPLYlink written 3.3 years ago by aggregatibacter140
1

I guess most software by default ignores multimapping reads so you don't have to worry about then. But you could argue that ignoring those reads is a loss of information. You cannot just combine the reads with the uniquely mapped, but I find the method described in this paper an interesting approach.

ADD REPLYlink written 3.3 years ago by WouterDeCoster42k

You could choose to map them to a random best site (that way you would not lose that information). ambig=random is what BBMap uses for this purpose.

ADD REPLYlink written 3.3 years ago by genomax74k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1700 users visited in the last hour