Question

Tool:RNA-Seq paired-end sequencing tophat outputted SAM file: single line or double lines?

0

Entering edit mode

8.6 years ago

lliu.hsph • 0

Hi I've been playing around a set of RNA-Seq data for several days and the short-term goal is to infer intron retention from this RNA-Seq data. But what confused me during the past few days is that the bam file (or sam file) outputted from tophat mapping contain only one line per paired-end mapping. The basic format is as follows:

GWZHISEQ01:207:C2337ACXX:3:1102:14008:23774     385     chr1    10001   0       101M    chrX    155260214       0       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC   CCCFFFFFHHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJIIJJJIIJJJHIGGGGIFCGHEHFFFFFFEBDC?CABBB?CD<<ACD<?BD?CDDD@ACBB   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YT:Z:UU NH:i:20 CC:Z:chr5       CP:i:10085      XS:A:+  HI:i:0

So basically my understanding is that this read is a combo between two ends, one mapped to chr1 and the other mapped to chrX. You might wonder why they mapped to different chromosomes. That's just a side issue of setting the parameter r in tophat. Let's not worry about it for now.

To my surprise (or because of my lack of knowledge), this data only contain this single line for this read instead of double lines for a pair of reads. I discussed with my friend and he told me this was the standard tophat output. Now I'm confused because I need to use HTSeq in python to get the coverage of each intron from this dataset. However, HTSeq doesn't recognize this sort of paired-end format and it only recognize paired-end sam format if each pair of reads have double lines in the same file with the same read name. Is there any way to get around this issue? Or am I missing some important points here? I did a lot of google searches but I didn't find any useful answer to this specific question. Maybe it's just me not getting the point?

Thanks in advance!

RNA-Seq SAM Intron-Retention paired-end-sequencing • 2.1k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.6 years ago by lliu.hsph • 0

score 0 · Answer 1 · 2015-10-06

0

Entering edit mode

8.6 years ago

Devon Ryan 104k

That alignment shows just one read. For paired-end reads, the alignment information for the mate is partly included, but the mate none-the-less has its own entry. If the mate's alignment isn't there then either you filtered the file or found a bug in tophat2 (or just haven't looked hard enough).

HTSeq will be fine with tophat's output unless you changed it.

ADD COMMENT • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

I looked at the tophat command my collaborator used to generate the data and there's nothing weird about it... I also sorted all the bam by read name but found nothing in pair except multiple alignment cases... I'm really confused here

ADD REPLY • link 8.6 years ago by lliu.hsph • 0