Question

Getting Pe-Read Strand Info From A Sam File Made With Tophat And Stranded Library?

0

Entering edit mode

10.7 years ago

JacobS ▴ 980

I am having difficulty navigating all of the considerations when thinking about strand-specific sequencing, and how to determine which actual strand a read comes from.

Here is the setup: I have paired-end Illumina sequence generated with a strand-specific library. It was put through Tophat using the first-strand option, and now I have a SAM file. While browsing the SAM file, I see my read pairs and am interpreting their flag stat, but my mind is going in circles trying to determine if the pair mapped to the actual (+) or (-) strand, which I need to know for distinguishing counts in cases of overlapping genes. I realize that reads spanning junctions have the XS field, but I want to find strandedness from just the flag stat.

Consider 3 pairs of reads that are properly paired in mapping:

READ_ID_A       99      Chr1    155388  50      76M     =       155531  219     TTCAGCTTCTTTGAATCTCTTGACGTTGTGTAGAAGCCATTTGTATGATTCATCTTTTCGGTCTTGACACGGATCG
READ_ID_A       147     Chr1    155531  50      76M     =       155388  -219    CACACGACACCGTTTCGTCTAGCTTCGGCAAGTGAAGCAGAAACGTGAGGACGTTGGCATTTGATGCATAGAAAAT


READ_ID_B       99      Chr1    180537  50      76M     =       180672  211     TGCGCTTGTGGTTGATCTTTCTTCTCTCCTTCCTTCTTATCGCCACCTTCTTTCTTCTCTTCTTCCTTCTTCGGTG
READ_ID_B       147     Chr1    180672  50      76M     =       180537  -211    CCACCACCTTCCTTCTTCGGCTCCTCCTTCTTCTCCTTTTCCGGCTCTTTCGCAGGTCCCACTAGTACGATATCCG


READ_ID_C       163     Chr1    285419  50      75M     =       285474  130     GGCTACTATGGCCCTGTCTCTGGCTGAAGCTAGGCTTCAGGTTATAGTGGAATCACTTGAAGCTGGTGCAGGGAA
READ_ID_C       83      Chr1    285474  50      75M     =       285419  -130    CTTGAAGCTGGTGCAGGGAATGATATTCCACATGTGTCTGAAGAGACTGAGGAAACAATAGATGTTAATGATAAA

Looking at just the first two reads in READ_ID_A, we can decipher the flag stat to find out that the first read is R1, and is from the (-) strand. Likewise, the second read is R2, and is from the (+) strand.

For READ_ID_B, the first read is R1 and (-), and the second is R2 and (+).

However, for READ_ID_C, the first read is R2 (-) and the second is R1 (+).

So, from this information, how can I tell which of these pairs of reads should be assigned to the (+) or (-) strand of the reference genome? I was thinking perhaps I should just check the strandedness of R1, and use that. So, if R1 turns out to be (+), then I will assign the R1/R2 pair to the (+) of the reference. Is this wrong?

EDIT: Perhaps I interpreted the flag stat incorrectly? Does anyone reach the same conclusion as me?

sam tophat strand • 3.4k views

ADD COMMENT • link updated 10.7 years ago by Devon Ryan 104k • written 10.7 years ago by JacobS ▴ 980

score 0 · Answer 1 · 2013-08-14

You're misreading the flags a bit. READ_ID_A R1 is on the + strand and R2 is on the - strand (this is the same for READ_ID_B). READ_ID_C is the opposite of this. To determine which original strand this corresponds to, you have to know how the library was made. Frankly, the easiest way to determine these sorts of things (if you don't know the details of the library prep) is to open the sorted BAM file in IGV and just color reads by the strand of read1. If you spot check a few genes and they and read1 are on the same strand, then you know you don't have to reverse everything in your analysis.