Question

featureCounts: Low % assigned fragments and discrepancy between STAR alignment multi mapped reads

1

Entering edit mode

4.0 years ago

david.gonzalez ▴ 10

Hi all,

I'm an MD/PhD student with a background in stem cell/development biology with abslutely no background in computational biology but am earnestly trying to develop a skillset in bulk and single cell RNAseq data analysis to complement my wetlab skillset. Please excuse my very limited knowledge and experience in this space, I'm hoping some of you may be able to answer a few questions I have about some ongoing bulk RNAseq analysis. Apologies for the long post, but I tried to be thorough and provide as much information as I could in advance -- feel free to skip to the bottom for the questions themselves which are italicized.

I recently submitted a large number of samples of sorted PSC-derived cardiomyocytes that have perturbed with a few small molecules to the genomics core for library prep (Ribozero) and paired end sequencing on a NovaSeq6000. I downloaded the FASTQ files, trimmed the reads, and aligned them using STAR using the hg38 ref genome and the associated hg38.12 gtf (generated with STAR). I used the following code for my alignment where the $varname refer to each sample name:

STAR --runThreadN 10 --genomeDir $genomeDir --sjdbGTFfile $GTF --sjdbOverhang 100 --readFilesIn $FASTQ/${varname}_R1_001_val_1.fq $FASTQ/${varname}_R2_001_val_2.fq --twopassMode Basic --outSAMtype BAM SortedByCoordinate Unsorted --quantMode TranscriptomeSAM GeneCounts

The alignment for the samples seemed to have gone well with most samples having about 85% of the reads mapping uniquely. Summary from the Log.final.out file for a representative sample is here: enter image description here

The report shows good alignment with a relatively low # of reads mapping to multi-loci (~10% for most samples)

I then proceeded to count a small subset of those samples using featureCounts using the following code:

featureCounts -T 5 --ignoreDup -g 'gene_name' --extraAttributes 'gene_id' -F GTF -t 'transcript' -s 0 -p -O -a $GTF -o $OUTF/bam_pooled_gene_genename.txt $LIB1 $LIB2 $LIB3 $LIB4 $LIB5 $LIB6 $LIB7 $LIB8

Where each of the $LIB variables refers to an individual sample's Aligned.out.bam file generated from STAR.

However in the error file from the featureCounts job I saw the following:

enter image description here

Despite the uniquely mapping reads being around 85%, only about 50-60% of the alignments appear to be assigned and counted is my take-away from this?

Of note, when i ran featureCounts with s -2 it only successfully assigned ~10% of the alignments. I'm hoping theres nothing incorrect with running the job on either strand as above?

After doing some googling I found a similar thread from someone who saw something similar and someone helpfully pointed out that there may be some multi-mapping reads. My txt summary file from featureCounts does indeed show that theres a lot of multi-mapping reads: enter image description here

This is confusing to me, as it seems that a large percentage of the alignments are being detected through featureCounts as multi-mapping (~35-40% of all reads) despite the original STAR output reporting only 10% multi-mapping reads?? Where does this discrepancy come from?

After seeing a few other posts that suggested that MMR might be due to ribosomal RNA I wondered if there as ribosome contamination in my samples. After checking the FASTQ QC files I did notice that while the per base sequence quality was very good, there did appear to be high GC content, sequence duplication, and a number of overepresented sequences in some of the samples. I BLASTed these overrepresented sequences and found that indeed they do match to ribosomal RNA which supports that this may be contributing to the multi-mapping issue.

Given the above, I suppose my questions are as follows:

1) Why is there a discrepancy in the STAR and featureCounts quantification of the multi-mapping reads?

2) Can I be confident that indeed the multi-mapping reads is indeed due to the ribosomal RNA contamination, or did I possibly make a mistake during my featureCounts step? I'm very new to this so my own stupidity is always my #1 suspect for any errors that arise.... I'm assuming these MMR should not be counted for downstream analysis, but I'm concerned that the discrepancy in STAR and featureCounts means that I may be leaving behind additional reads in there and that I may be biasing my DXP analysis unknowingly.

3) If all the above is correct and simply is the way it is, am I ok to proceed with my analysis with the data in its current state? If I've lost significant read depth to ribosomal RNA not being cleared out sufficiently thats a shame, but I still have between 20k and 40k M reads for each of my samples which should hopefully be enough for my purposes of understanding transcriptome wide differences across my samples.

Any help you all could provide or any additional input/feedback you have on the above would be enormously helpful. I'm the first grad student in the lab to really try to analyze their own computational data so I don't have a ton of direct mentorship in this so all of the posts on Biostars and elsewhere on the internet have been extremely helpful.

Thanks in advance all

RNA-Seq featureCounts STAR subread next-gen • 5.0k views

ADD COMMENT • link 4.0 years ago by david.gonzalez ▴ 10

0

Entering edit mode

check the output of -M --fraction argument for one of your sampe, what the difference?

one read could reported up to ten alignments with default parameters in STAR, and report the number of multiple reads, but featureCounts would treat each alignments as one count. A quick check for this, check the Unassigned_Multimapping reads from featureCounts report with STAR output, (eg: ND1-VCTRL-D14.aligned.out.bam, what are the numbers?)

multimapping could be from repeat regions, more likely than from rRNA. I always remove the multimapping reads before quantification, you could check NH:i:1 tag, and a easier way samtools view -q 255 -Sub in.bam > unique.bam to filter the reads with MAQ 255 (the unique reads)
you could check the rRNA percentage by mapping reads directly to the rRNA (pre-rRNA sequences). It could be 20%-40% for some of the RNA-seq data I experienced.

4(extra). If your data are not from a strand-specific library. you have to use -s 0. (you can check it from the protocol), or if it is, make sure the read1 or read2 represent the sense strand of mRNA. (like, dUTP, it is the read2, you can use -s 2)

so, it would be: 1. remove rRNA reads (if you care about the proportion). 2. save only unique reads. 3. featureCount, suggest on gene level; -t gene, -O --fraction

ADD REPLY • link 4.0 years ago by wm ▴ 550

0

Entering edit mode

Hi I'm so sorry that I missed this reply I don't know how! Thanks so much for your detailed responses. I had spoken to another colleague who suggested similar solutions to yours.

I wasn't aware earlier that featureCounts and STAR counted "reads' differently in the way you described which explains the discrepancy in the STAR and featureCount summary for MMR. Sure enough, when I looked more carefully at more of the overrepresented sequences in my FASTQC files there are less rRNA sequences and quite a bit of noncoding portions of the genome that it maps to. I suspect as you did that there are a number of reads that map to many places in the genome and thus featureCount is expanding the count for each read significantly.

I did indeed re-run the featureCounts job including the MMR and then filtered out any read with a mapq score lower than 10 (-M, --Q 10) and unsurprisingly all of the MMR were re-binned into the "Unmapped_Mapping Quality" portion of the summary file. When including all of the MMR in the analysis, the featureCounts job showed about 85% of the alignments being counted similar to STAR output which again supports your suggestion that featureCount was just counting a larger number of possible alignments for any given read.

The trouble I'm currently having is that I checked with the core and the libraries were prepared with a TruStanded library prep kit, so if I understand correctly I should have set -s to 2 however I get much better alignments counted on the opposite strand (65% excluding the MMR with -s 1) vs 10% of the alignments being counted with the -s to 2 which translates to about 20-40 million reads on the forward strand versus 3-5 million on the reverse strand. I took a look at both count matrices and at a couple genes that I expect to be expressed highly and the forward scenario seems to better represent some previous data.

Someone suggested using RSeQC and infer_experiment.py to check the strand information which I plan to to do. I don't know if you have any other suggestions that may be better? I'm assuming I can't look at which strand the reads are aligning to in IGV viewer right?

ADD REPLY • link 4.0 years ago by david.gonzalez ▴ 10

0

Entering edit mode

Yes, infer_experiment.py is a easy-to-run program, to check out the library.
As you mentioned, you get 65% properly assigned reads with (-s 1) and 10% with (-s 2). so, there are >20% reads could not map to the features from the gtf file, it is deserved a closer check of your data.
- make sure you used a correct version of gtf file, like from ensembl link (match to the genome version you used; here is the latest version on Ensembl)
- -s 1 indicate the library is stranded, (what are the infer_experiment.py output? (like: 1++, 1--; 2+-, 2-+ ?)
- for those ~20% pct reads mapped to unannotated regions, it could be from some new features, lncRNAs, or others, you can find it out if you care about it.

ps: The data from stranded RNAseq library I got, (>95%）map to sense-strand of mRNA.

ADD REPLY • link 4.0 years ago by wm ▴ 550