Question

What is the acceptable % of reads that map to no feature (in RNASeq)?

2

Entering edit mode

9.7 years ago

noushin.farnoud ▴ 130

Hi,

I have RNASeq data from mouse and am trying to (1) align it using STAR, and (2) estimate gene counts using HTSeq. I am using the .fa and .gtf files from Ensembl that correspond to mm9 (Mus_musculus.NCBIM37.67.dna.toplevel.fa, Mus_musculus.NCBIM37.67.gtf).

My question is what is acceptable percentage of reads that show "no features" and "not unique" as the result of htseq-count function?

Here is the parameters I used for htseq:

samtools view \
  -h $sorted.bam | \
  python2.7 htseq-count \
    -m intersection-nonempty \
    -f sam \
    -r name \
    -s no \
    -a 0 \
    -i gene_id \
    -o samout.out \
    - \
    Mus_musculus.NCBIM37.67.gtf > result.count

For example, one of my RNASeq samples has 36,269,180 uniquely mapped genes (as assessed by STAR), but following HTseq-count 6,065,105 (16%) of the reads show __no_feature and 19,474,101 (54%) show __alignment_not_unique. This means only 30% of original reads are mapped to the genes; Isn't this too low?

Thank You,
Farnoud

htseq RNA-Seq STAR • 14k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by noushin.farnoud ▴ 130

1

Entering edit mode

I've found that the fraction of intergenic reads varies considerably with the kit used for RNA-seq library prep. Do you know which kit(s) were used to prepare your libraries?

You might also want to consider running RNA-SeQC (Broad Institute), which will calculate the fraction of reads that are exonic, intronic, intragenic etc., and also the fraction of rRNA reads.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by CJP-OHRI ▴ 30

0

Entering edit mode

Thanks a lot Chris for your reply. It is very useful! I will figure out the kit and let you know.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by noushin.farnoud ▴ 130

0

Entering edit mode

54% does seems like a lot. Mine was around 15~23% for not unique and 6~13% for no feature. And the command I used for htseq is much simpler:

$sam view $sampleName.sorted.bam | $htseq -s $lib - $gtf > $sampleName.counts

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sam ★ 4.8k

0

Entering edit mode

Thanks Sam. I too think these numbers are a lot!; But couldn't find a reference to compare these to.

Most of the parameters above are just default ones (except -m=non-empty, -s=no and -a=0, which I modified to lower the stringency of htseq-count).

As for ambiguous calls, I don't think there is much I can do as it depends on how the reads are located, but the 16% no-feature (i.e. mapping to no exons/genes) is what worries me the most. I have checked several times that .gtf and .fa files are both the same Ensembl built, and don't know where else this problem may come from?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by noushin.farnoud ▴ 130

0

Entering edit mode

I wonder if this is due to rRNA or tRNA. Those are often not annotated in the genome, so if you have a sample with a really high amount of them then I could see this happening. That would indicate to me that something went wrong during library prep. You might look at where some of these are mapping along side a repeatmasker track. That'd help resolve whether this is the case. In either case, I suspect that you'll end up excluding this sample from further analysis.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

but 36 million reads mapped on mouse genome uniquely, thats good number for downstream analysis

ADD REPLY • link 9.7 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

The problem is, even with a lot of uniquely mapped read, the high amount of non-unique reads might still suggest a possibility of having technical errors in library prep or other up-stream process, and you won't be certain whether if they will or will not affect the down-stream analysis. Maybe those uniquely aligned reads have special properties?

I guess for now, the best way to do is to follow the instruction provided by #Devon Ryan and hopefully understand whether if it is the rRNA / tRNA problem....

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sam ★ 4.8k

0

Entering edit mode

Agreed. I think that if the weird metrics are due to something like rRNA or some other species then perhaps the sample is salvageable. If, however, it's due to background DNA contamination, then you're unlikely to get anything useful from the samples, at least if the goal is standard differential expression. The reason for this is that the normal (edgeR/DESeq2) sample normalization methods aren't designed to compensate for this (they basically perform a scaling with a weight, but there's also an offset here).

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you all for your feedback and specially Devon for your insightful recommendation. Thinking about what/how these numbers come from, I came up with another ambiguity: I am using the output of STAR that only has "unique reads" (double checked the number of reads in STAR Aligned.sam output is the same as the reported number of unique reads in Log.final.out). Then, how it is possible that 54% of the reads that are aligned "uniquely" turn out as "aligned not unique" by HTSeq? Does that mean that the Ensembl gene model I am using for HTSeq-count (Mus_musculus.NCBIM37.67.gtf) has some issues and need to be curated before using it?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by noushin.farnoud ▴ 130

0

Entering edit mode

It depends on what STAR and htseq-count are describing as unique. Do a samtools view -c -q 5 foo.bam and see if the result from that matches what STAR is telling you. That should also match the results of samtools view foo.bam | grep -cw "NH:i:1". HTSeq-count is basing things off of the NH aux tag.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Devon, I think I am not estimating the numbers correctly:

According to STAR final.log, there are 41,539,731 total reads in this RNASeq paired-end experiment, from which 34,373,992 are uniquely mapped to mm9.

Since, my data is paired-end, I modified the command you advised above and did:

samtools view -S -F 4 -f 2 star.out.sam | grep -cw NH:i:1

which gives me 66,183,904. I assume this should be the total number of properly mapped reads that both ends map to only to 1 region, right? So dividing this by 2 should give me the number of unique properly mapped paired reads (33,091,95). However the latter is not:

(1) equivalent to what STAR reported as "uniquely mapped" reads (34,373,992); nor

(2) what HTSeq numbers suggest:

19,474,101 (htseq.count alignment_not unique) - 41,539,731 (total number of reads based on STAR) = 22,065,630

There must be something I am doing wrong here, but don't know exactly what it is! Appreciate your help.

Thanks,
Noushin

PS. I also estimated the samtools command without filtering -f 2 flag, but it still was less than the number that STAR reported.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by noushin.farnoud ▴ 130