Question: What is the acceptable % of reads that map to no feature (in RNASeq)?
2
gravatar for noushin.farnoud
4.7 years ago by
United States
noushin.farnoud100 wrote:

Hi,

I have RNASeq data from mouse and am trying to (1) align it using STAR, and (2) estimate gene counts using HTSeq. I am using the .fa and .gtf files from Ensembl that correspond to mm9 (Mus_musculus.NCBIM37.67.dna.toplevel.fa, Mus_musculus.NCBIM37.67.gtf).

My question is what is acceptable percentage of reads that show "no features" and "not unique" as the result of htseq-count function?

Here is the parameters I used for htseq:

​samtools view -h $sorted.bam | python2.7 htseq-count -m intersection-nonempty -f sam -r name -s no -a 0 -i gene_id -o samout.out - Mus_musculus.NCBIM37.67.gtf > result.count

For example, one of my RNASeq samples has 36,269,180 uniquely mapped genes (as assessed by STAR), but following HTseq-count 6,065,105 (16%) of the reads show "__no_feature" and 19,474,101 (54%) show "__alignment_not_unique". This means only 30% of original reads are mapped to the genes; Isn't this too low? 

Thank You,
Farnoud

 

rna-seq star htseq • 8.1k views
ADD COMMENTlink modified 4.7 years ago by rahnaman0 • written 4.7 years ago by noushin.farnoud100
1

I've found that the fraction of intergenic reads varies considerably with the kit used for RNA-seq library prep. Do you know which kit(s) were used to prepare your libraries?

You might also want to consider running RNA-SeQC (Broad Institute), which will calculate the fraction of reads that are exonic, intronic, intragenic etc., and also the fraction of rRNA reads.

ADD REPLYlink written 4.6 years ago by Chris-OHRI10

Thanks a lot Chris for your reply. It is very useful! I will figure out the kit and let you know. 

ADD REPLYlink written 4.6 years ago by noushin.farnoud100

54% does seems like a lot. Mine was around 15~23% for not unique and 6~13% for no feature. And the command I used for htseq is much simpler:

$sam view $sampleName.sorted.bam | $htseq -s $lib - $gtf > $sampleName.counts
ADD REPLYlink written 4.7 years ago by Sam2.4k

Thanks Sam. I too think these numbers are a lot!; But couldn't find a reference to compare these to.

Most of the parameters above are just default ones (except -m=non-empty, -s=no and -a=0, which I modified to lower the stringency of htseq-count). 
As for ambiguous calls, I don't think there is much I can do as it depends on how the reads are located, but the 16% no-feature (i.e. mapping to no exons/genes) is what worries me the most. I have checked several times that .gtf and .fa files are both the same Ensembl built, and don't know where else this problem may come from?

ADD REPLYlink written 4.7 years ago by noushin.farnoud100

I wonder if this is due to rRNA or tRNA. Those are often not annotated in the genome, so if you have a sample with a really high amount of them then I could see this happening. That would indicate to me that something went wrong during library prep. You might look at where some of these are mapping along side a repeatmasker track. That'd help resolve whether this is the case. In either case, I suspect that you'll end up excluding this sample from further analysis.

ADD REPLYlink written 4.7 years ago by Devon Ryan91k

but 36 million reads mapped on mouse genome uniquely, thats good number for downstream analysis

ADD REPLYlink written 4.7 years ago by Manvendra Singh2.1k

The problem is, even with a lot of uniquely mapped read, the high amount of non-unique reads might still suggest a possibility of having technical errors in library prep or other up-stream process, and you won't be certain whether if they will or will not affect the down-stream analysis. Maybe those uniquely aligned reads have special properties? 

I guess for now, the best way to do is to follow the instruction provided by #Devon Ryan and hopefully understand whether if it is the rRNA / tRNA problem....

ADD REPLYlink written 4.7 years ago by Sam2.4k

Agreed. I think that if the weird metrics are due to something like rRNA or some other species then perhaps the sample is salvageable. If, however, it's due to background DNA contamination, then you're unlikely to get anything useful from the samples, at least if the goal is standard differential expression. The reason for this is that the normal (edgeR/DESeq2) sample normalization methods aren't designed to compensate for this (they basically perform a scaling with a weight, but there's also an offset here).

ADD REPLYlink written 4.7 years ago by Devon Ryan91k

Thank you all for your feedback and specially Devon for your insightful recommendation. Thinking about what/how these numbers come from, I came up with another ambiguity: I am using the output of STAR that only has "unique reads" (double checked the number of reads in STAR Aligned.sam output is the same as the reported number of unique reads in Log.final.out). Then, how it is possible that 54% of the reads that are aligned "uniquely" turn out as "aligned not unique" by HTSeq? Does that mean that the Ensembl gene model I am using for HTSeq-count (Mus_musculus.NCBIM37.67.gtf) has some issues and need to be curated before using it?

ADD REPLYlink written 4.7 years ago by noushin.farnoud100

It depends on what STAR and htseq-count are describing as unique. Do a samtools view -c -q 5 foo.bam and see if the result from that matches what STAR is telling you. That should also match the results of samtools view foo.bam | grep -cw "NH:i:1". HTSeq-count is basing things off of the NH aux tag.

ADD REPLYlink written 4.7 years ago by Devon Ryan91k

Devon, I think I am not estimating the numbers correctly:

According to STAR final.log, there are 41,539,731 total reads in this RNASeq paired-end experiment, from which 34,373,992 are uniquely mapped to mm9.

Since, my data is paired-end, I modified the command you advised above and did:

​samtools view -S -F 4 -f 2 star.out.sam | grep -cw NH:i:1

which gives me 66,183,904. I assume this should be the total number of properly mapped reads that both ends map to only to 1 region, right? So dividing this by 2 should give me the number of unique properly mapped paired reads (33,091,95). However the latter is not:

(1) equvalent to what STAR reported as "uniquely mapped" reads (34,373,992); nor

(2) what HTSeq numbers suggest:  

19,474,101 (htseq.count "alignmen_not unique")- 41,539,731 (total number of reads based on STAR) = 22,065,630

There must be something I am doing wrong here, but don't know exactly what it is! Appreciate your help. 
Thanks,
Noushin

PS. I also estimated the samtools command without filtering -f 2 flag, but it still was less than the number that STAR reported.

ADD REPLYlink written 4.6 years ago by noushin.farnoud100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2299 users visited in the last hour