Question

htseq-count for pair-end RNA-seq

0

Entering edit mode

10.0 years ago

Ming Tommy Tang ★ 3.9k

Hi there,

I used htseq-count to count the pair-end RNA-seq data.

at the end, there are warnings like

Warning: 8612298 reads with missing mate encountered.

my question is how htseq-count deals with the pairs with only one end existing in the sam file?

does htseq-count simply discard those counts?

Thanks,
Ming

RNA-Seq • 13k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 10.0 years ago by Ming Tommy Tang ★ 3.9k

0

Entering edit mode

are you sure that these reads really don't have a mate in the sam file? If you didn't sort the sam file by read name but you told htseq it is, htseq thinks there are lot of missing mates

ADD REPLY • link 10.0 years ago by Irsan ★ 7.8k

0

Entering edit mode

I sorted it with samtools sort -n by name. any ideas? Thank you.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Ming Tommy Tang ★ 3.9k

score 1 · Answer 1 · 2014-05-14

1

Entering edit mode

10.0 years ago

Ido Tamir 5.2k

HtSeq count counts them as 1

ADD COMMENT • link 10.0 years ago by Ido Tamir 5.2k

0

Entering edit mode

I know HTseq count one pair of reads as 1, but how about one pair missing the other end? Does HTseq discard it or still count it as 1? Thank you.

ADD REPLY • link 10.0 years ago by Ming Tommy Tang ★ 3.9k

2

Entering edit mode

This is what I meant. My investigations made me believe that each read of a read pair where only one is mapped was counted as 1.

You could check by randomizing your bam file or simply taking the position sorted one. I bet, you will have 2x read count per gene. Please try it and tell me the result. - But I think the last version does not need a read name sorted one anymore, so try it with an old version.

ADD REPLY • link 10.0 years ago by Ido Tamir 5.2k

0

Entering edit mode

what do you mean by randomizing the bam file? It is now sorted by name.

ADD REPLY • link 10.0 years ago by Ming Tommy Tang ★ 3.9k

1

Entering edit mode

I mean, you can simply test the question how htseq-count counts paired end reads where only one was mapped by sorting this file by position or by randomly sorting it, then doing htseq-count and compare name sorted vs randomly sorted counts. randomly sorted counts should be 2x name sorted counts (in old versions of htseq-count). Or you filter the name sorted by read number (take only 1. read) and then count. Counts should be the same. This should work with all versions of htseq-count and is the better one. Or maybe not, because the "mate aligned flag" is set in those reads.

ADD REPLY • link 10.0 years ago by Ido Tamir 5.2k

Ram · Answer 2 · 2015-02-11

1

Entering edit mode

9.2 years ago

raphael.poujol ▴ 10

According to htseq documentation : http://www-huber.embl.de/users/anders/HTSeq/doc/counting.html

The fact that the records describe the same fragment can be seen from the fact that they have the same read name

So Tophat for instance is ouputing read paris with different names. (Adding 1 or 2 at the end of the name)

So simply do that:

(samtools view -H in.bam; in.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^   ]*//') | samtools view -bSh - > in.samereadname.bam

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 9.2 years ago by raphael.poujol ▴ 10

1

Entering edit mode

just a small correction for people passing by, the line should be:

(samtools view -H accepted_hits.bam; samtools view accepted_hits.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^   ]*//') | samtools view -bSh -> in.samereadname.bam

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Martombo ★ 3.1k