htseq-count for RNA Seq
1
2
Entering edit mode
9.4 years ago
Ron ★ 1.2k

Hi all,

I am using htseq-count for getting the read count of my genes from RNASeq data.While I have got the read counts but I am getting the warning below:

Warning: 35062120 reads with missing mate encountered.
35138401 SAM alignment pairs processed.

I am guessing this could be because of how I sorted my sam file.I did not specify any parameters while sorting.This was the command I used(gnu parallel). {1} is used to specify my input files.

samtools sort - -m 40G {1/.}.sorted

My question is do I have to sort by any specific parameter such as -n ?For example do I need to rerun sorting and then produce results of htseq-count.

Thanks in advance

Ron

next-gen htseq-count RNA-Seq • 9.8k views
ADD COMMENT
1
Entering edit mode

htseq-count requires it sorted but it can be either name or alignment position. You can ignore the warnings. Use -q to suppress warnings.

ADD REPLY
0
Entering edit mode

So,can i just ignore the warning?

ADD REPLY
0
Entering edit mode

To be honest, I have used sorted as well as unsorted files in htseq-count and it gave me the same warnings. But either way it works fine. You can just ignore the warnings.

ADD REPLY
0
Entering edit mode

Okay.What is the default sort if you dont specify any parameter?Is it by name or position

ADD REPLY
0
Entering edit mode

name. Also read htseq-count for pair-end RNA-seq post.

ADD REPLY
1
Entering edit mode

I think the default sorting is by position not by name

ADD REPLY
1
Entering edit mode

Actually you are right. Samtools manual says it is position, and htseq-count says it is by name. It's confusing.

From htseq-count manual:

For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.

If name is indicated, htseq-count expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment have not yet been seen are kept in a buffer in memory until the mate is found. While, strictly speaking, the latter will also work with unsorted data, sorting ensures that most alignment mates appear close to each other in the data and hence the buffer is much less likely to overflow.

From samtools specification:

Sort alignments by leftmost coordinates

ADD REPLY
0
Entering edit mode

What version of htseq are you using? it's true that it can work also with coordinate sorted bam files, but only the latest versions have that option.

My suggestion is that you try to sort it by name and see if you get the same warnings. I wouldn't be confident in simply ignoring them

ADD REPLY
0
Entering edit mode

I am using HTSeq-0.6.1,its the latest version.What option can I specify for coordinated sorted bam files?

ADD REPLY
7
Entering edit mode
9.4 years ago

First check how many reads have proper mates. May be using samtools flagstat

If you find many reads with missing mates, like shown in htseq warning, there are no issues with htseq. If you find most of the reads have their mates aligned, then

Sort the samfile either by position ( default ) or name ( -n ). Then use latest version of htseq and set the -r option to eithe name ( -r name [default] ) or position ( -r pos ) based on how you have sorted the bam file. Make sure you have used the strandedness (-s) option appropriately.

ADD COMMENT

Login before adding your answer.

Traffic: 2143 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6