Tool: RNA-Seq paired-end sequencing tophat outputted SAM file: single line or double lines?
lliu.hsph (United States) wrote, 3.5 years ago:

Hi, I've been working with a set of RNA-Seq data for several days; the short-term goal is to infer intron retention from it. What has confused me over the past few days is that the BAM (or SAM) file output by the tophat mapping contains only one line per paired-end mapping. A typical record looks like this:

GWZHISEQ01:207:C2337ACXX:3:1102:14008:23774  385  chr1  10001  0  101M  chrX  155260214  0  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC  CCCFFFFFHHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJIIJJJIIJJJHIGGGGIFCGHEHFFFFFFEBDC?CABBB?CD<<ACD<?BD?CDDD@ACBB  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101  YT:Z:UU  NH:i:20  CC:Z:chr5  CP:i:10085  XS:A:+  HI:i:0

 

My understanding is that this record combines the two ends of a pair, one mapped to chr1 and the other mapped to chrX. You might wonder why they mapped to different chromosomes; that's just a side issue of how the r parameter was set in tophat, so let's not worry about it for now.

To my surprise (or because of my lack of knowledge), the data contain only this single line for the read instead of two lines, one for each read of the pair. I discussed this with a friend, who told me it was the standard tophat output. Now I'm confused, because I need to use HTSeq in Python to get the coverage of each intron from this dataset. However, HTSeq doesn't recognize this sort of paired-end format; it only recognizes paired-end SAM input when each pair of reads appears as two lines in the same file with the same read name. Is there any way to get around this issue? Or am I missing some important point here? I did a lot of Google searches but didn't find a useful answer to this specific question. Maybe I'm just not getting the point?

Thanks in advance!

Devon Ryan (Freiburg, Germany) wrote, 3.5 years ago:

That alignment shows just one read. For paired-end reads, part of the mate's alignment information is included in each record, but the mate nonetheless has its own entry. If the mate's alignment isn't there, then either you filtered the file, or you found a bug in tophat2, or you just haven't looked hard enough.

HTSeq will be fine with tophat's output unless you changed it.
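
To illustrate why the record in the question describes a single read rather than a merged pair, here is a minimal sketch that decodes its FLAG and mate fields. The bit meanings follow the SAM specification; the helper names `decode_flag` and `describe_record` are made up for this example, and only the first nine columns of the record are used.

```python
# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def describe_record(line):
    """Summarise the read and mate columns of one tab-separated SAM line."""
    fields = line.rstrip("\n").split("\t")
    rname, pos = fields[2], int(fields[3])
    rnext, pnext = fields[6], int(fields[7])
    if rnext == "=":  # "=" means the mate is on the same reference as the read
        rnext = rname
    return {
        "name": fields[0],
        "flags": decode_flag(int(fields[1])),
        "aligned_to": (rname, pos),  # where THIS read aligned
        "mate": (rnext, pnext),      # where the mate's own record should be
    }


# First nine columns of the record from the question (SEQ, QUAL and tags omitted).
RECORD = "GWZHISEQ01:207:C2337ACXX:3:1102:14008:23774\t385\tchr1\t10001\t0\t101M\tchrX\t155260214\t0"

summary = describe_record(RECORD)
print(summary["flags"])
print(summary["aligned_to"], summary["mate"])
```

FLAG 385 is 1 + 128 + 256, so the line is a secondary alignment of the second read of a pair. The chrX and 155260214 columns only point to where the mate aligned; they are not the mate's alignment itself, which should appear under the same read name with the first-in-pair bit set.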


I looked at the tophat command my collaborator used to generate the data, and there's nothing unusual about it. I also sorted all the BAM files by read name but found no pairs, only multiple-alignment cases. I'm really confused here.

Reply written 3.5 years ago by lliu.hsph
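
One way to check whether mates are really absent, rather than merely hard to spot among multiple alignments, is to count first-in-pair (0x40) and second-in-pair (0x80) records per read name. The sketch below works on SAM text lines; the function names and the toy records are hypothetical, and it assumes a tab-separated SAM stream such as the text form of the BAM file.

```python
from collections import defaultdict


def mate_counts(sam_lines):
    """Map each read name to [first-in-pair count, second-in-pair count]."""
    counts = defaultdict(lambda: [0, 0])
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:      # not a complete alignment record
            continue
        name, flag = fields[0], int(fields[1])
        if flag & 0x40:
            counts[name][0] += 1
        elif flag & 0x80:
            counts[name][1] += 1
    return dict(counts)


def reads_missing_a_mate(sam_lines):
    """Return read names for which only one end of the pair has any record."""
    return sorted(
        name
        for name, (first, second) in mate_counts(sam_lines).items()
        if first == 0 or second == 0
    )


# Toy records (made up for illustration): readA has both ends, readB only mate 2.
toy = [
    "@HD\tVN:1.0\tSO:queryname",
    "readA\t65\tchr1\t100\t50\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII",
    "readA\t129\tchr1\t300\t50\t10M\t=\t100\t-210\tACGTACGTAC\tIIIIIIIIII",
    "readB\t385\tchr1\t10001\t0\t10M\tchrX\t155260214\t0\tTAACCCTAAC\tIIIIIIIIII",
]

print(mate_counts(toy))
print(reads_missing_a_mate(toy))
```

If nearly every read name shows up on only one side, that would point towards the file having been filtered after mapping, which is one of the possibilities raised in the answer above.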