Hi I've been playing around a set of RNA-Seq data for several days and the short-term goal is to infer intron retention from this RNA-Seq data. But what confused me during the past few days is that the bam file (or sam file) outputted from tophat mapping contain only one line per paired-end mapping. The basic format is as follows:
GWZHISEQ01:207:C2337ACXX:3:1102:14008:23774 385 chr1 10001 0 101M chrX 155260214 0 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC CCCFFFFFHHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJIIJJJIIJJJHIGGGGIFCGHEHFFFFFFEBDC?CABBB?CD<<ACD<?BD?CDDD@ACBB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:20 CC:Z:chr5 CP:i:10085 XS:A:+ HI:i:0
So basically my understanding is that this read is a combo between two ends, one mapped to chr1 and the other mapped to chrX. You might wonder why they mapped to different chromosomes. That's just a side issue of setting the parameter r in tophat. Let's not worry about it for now.
To my surprise (or because of my lack of knowledge), this data only contain this single line for this read instead of double lines for a pair of reads. I discussed with my friend and he told me this was the standard tophat output. Now I'm confused because I need to use HTSeq in python to get the coverage of each intron from this dataset. However, HTSeq doesn't recognize this sort of paired-end format and it only recognize paired-end sam format if each pair of reads have double lines in the same file with the same read name. Is there any way to get around this issue? Or am I missing some important points here? I did a lot of google searches but I didn't find any useful answer to this specific question. Maybe it's just me not getting the point?
Thanks in advance!
I looked at the tophat command my collaborator used to generate the data and there's nothing weird about it... I also sorted all the bam by read name but found nothing in pair except multiple alignment cases... I'm really confused here