Question: How Are Paired-End Read Insert Sizes Inferred For Reporting In A SAM File?
5.8 years ago by
JacobS890
Cleveland, Ohio
JacobS890 wrote:

It seems like this should be obvious, but after carefully looking at some .SAM lines, I'm having a difficult time getting the numbers to crunch.

For example, here are a few lines from my SAM file:

READ_ID_A:99419       99      Chr1    45474   50      76M     =       45556   244     GTCTTTGCAGCAAAAGCAGAACAGTTGGTTTACGACTCACTCTTCTCGATACCTTCTCTGACGATGATTCTGCGAC
READ_ID_A:99419       147     Chr1    45556   50      4M86N72M        =       45474   -244    ATTGTGTTCCATTGAATGATAAAGCCGCATCACGTTCTTCACCGCTTGTAAAAGAAAGAAAGGCAAAGACTCTGTT

READ_ID_B:27674       99      Chr1    155388  50      76M     =       155531  219     TTCAGCTTCTTTGAATCTCTTGACGTTGTGTAGAAGCCATTTGTATGATTCATCTTTTCGGTCTTGACACGGATCG
READ_ID_B:27674       147     Chr1    155531  50      76M     =       155388  -219    CACACGACACCGTTTCGTCTAGCTTCGGCAAGTGAAGCAGAAACGTGAGGACGTTGGCATTTGATGCATAGAAAAT

READ_ID_C:17835       99      Chr1    180537  50      76M     =       180672  211     TGCGCTTGTGGTTGATCTTTCTTCTCTCCTTCCTTCTTATCGCCACCTTCTTTCTTCTCTTCTTCCTTCTTCGGTG
READ_ID_C:17835       147     Chr1    180672  50      76M     =       180537  -211    CCACCACCTTCCTTCTTCGGCTCCTCCTTCTTCTCCTTTTCCGGCTCTTTCGCAGGTCCCACTAGTACGATATCCG

Now, according to the SAM format specification, fields 4 and 8 are the leftmost starting points of the R1 and R2 reads, respectively. So, if R1 for READ_ID_A starts at 45474, and the read is 76bp long, the end point should be 45474 + 76 = 45550. Then, the R2 leftmost starting point is 45556, which is only 6 bases away from the R1 ending point! It seems to me that the insert would be 6 bases (!!), but field 9 specifies this insert as 244.

I'm sure there is some fundamental error in my logic here, so I'm hoping someone can point it out for me. Thanks!

EDIT: Any ideas? All suggestions/comments appreciated!

ngs mapping • 8.9k views
modified 5.8 years ago by Ashutosh Pandey (11k) • written 5.8 years ago by JacobS (890)
5.8 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

From the SAM format specification: the value in the ninth column is TLEN, which is comparable to the insert length but not exactly the same. See the definition below.

TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.

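In your case the leftmost position is 45474. The rightmost position is not 45556, because 45556 is only where the second read starts. Its CIGAR string, 4M86N72M, covers 4 + 86 + 72 = 162 reference bases (both M and N consume the reference), so the alignment ends at 45556 + 162 = 45718, which is one past the last aligned base (45717).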

The difference between the leftmost and rightmost positions is then 45718 - 45474 = 244.
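The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular aligner computes TLEN; the function names `reference_span` and `template_length` are made up for this example, and it assumes both mates map to the same reference.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")

def reference_span(cigar):
    """Number of reference bases covered by an alignment's CIGAR string."""
    return sum(int(length)
               for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in REF_CONSUMING)

def template_length(pos1, cigar1, pos2, cigar2):
    """Unsigned TLEN: bases from the leftmost to the rightmost mapped base."""
    start = min(pos1, pos2)
    # pos + span is one past the last aligned base (an exclusive end).
    end = max(pos1 + reference_span(cigar1), pos2 + reference_span(cigar2))
    return end - start

# READ_ID_A from the question: POS/CIGAR of each mate.
print(template_length(45474, "76M", 45556, "4M86N72M"))  # 244
```

The same function gives 219 and 211 for READ_ID_B and READ_ID_C, matching column 9 in the lines from the question.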

Hope that makes it clear.

written 5.8 years ago by Ashutosh Pandey (11k)

IMHO one of the shortcomings of the SAM format is that it only reports the leftmost coordinate, so one has to parse the CIGAR string to work out the rightmost end, which makes column-oriented tools unusable for this.

written 5.8 years ago by Istvan Albert (80k)

@ashutoshmits Thanks for the great explanation, that certainly makes more sense now. So, you compared the leftmost read coordinate (start of R1) with the rightmost read coordinate (end of R2) for a total of 244 bases. Since R1 and R2 each have 76 bp, could I then be sure the insert size was 244 - 76 - 76 = 92 bases?

written 5.8 years ago by JacobS (890)

Yes. Tools like TopHat ask you to specify the inner distance between the mates. It will be ~100 bp in your case. Tools like BWA estimate the insert size from the first few thousand alignments of the mate-pairs/paired-ends.
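As a rough sketch of the subtraction discussed here (the function names are hypothetical, and read lengths are assumed to be 76 as in the question):

```python
def inner_distance(tlen, read1_len, read2_len):
    """Template length minus both read lengths, as in 244 - 76 - 76."""
    return abs(tlen) - read1_len - read2_len

def genomic_gap(left_pos, left_span, right_pos):
    """Reference bases between the end of the left mate and the start of the right mate."""
    return right_pos - (left_pos + left_span)

# TLEN values taken from column 9 of the three pairs in the question.
for name, tlen in [("READ_ID_A", 244), ("READ_ID_B", 219), ("READ_ID_C", 211)]:
    print(name, inner_distance(tlen, 76, 76))

# Gap between the two mates of READ_ID_A on the reference.
print(genomic_gap(45474, 76, 45556))
```

One caveat for spliced reads: for READ_ID_A the 92 bases include the 86 reference bases skipped by the N operation in 4M86N72M, so the gap between the two mates on the reference is only 6 bases, which is the number the original question arrived at.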

written 5.8 years ago by Ashutosh Pandey (11k)

Thanks for all of the help, I am much more familiar with SAM now!

written 5.8 years ago by JacobS (890)

Thanks for the explanation. I was trying to understand the answers in this post: C: Bowtie2 classification of discordantly mapped pairs. Your post helped with that.

written 3.6 years ago by cpad0112 (11k)