It seems like this should be obvious, but after carefully looking at some SAM lines, I'm having a difficult time getting the numbers to add up.
For example, here are a few lines from my SAM file:
READ_ID_A:99419       99      Chr1    45474   50      76M     =       45556   244     GTCTTTGCAGCAAAAGCAGAACAGTTGGTTTACGACTCACTCTTCTCGATACCTTCTCTGACGATGATTCTGCGAC
READ_ID_A:99419       147     Chr1    45556   50      4M86N72M        =       45474   -244    ATTGTGTTCCATTGAATGATAAAGCCGCATCACGTTCTTCACCGCTTGTAAAAGAAAGAAAGGCAAAGACTCTGTT
READ_ID_B:27674       99      Chr1    155388  50      76M     =       155531  219     TTCAGCTTCTTTGAATCTCTTGACGTTGTGTAGAAGCCATTTGTATGATTCATCTTTTCGGTCTTGACACGGATCG
READ_ID_B:27674       147     Chr1    155531  50      76M     =       155388  -219    CACACGACACCGTTTCGTCTAGCTTCGGCAAGTGAAGCAGAAACGTGAGGACGTTGGCATTTGATGCATAGAAAAT
READ_ID_C:17835       99      Chr1    180537  50      76M     =       180672  211     TGCGCTTGTGGTTGATCTTTCTTCTCTCCTTCCTTCTTATCGCCACCTTCTTTCTTCTCTTCTTCCTTCTTCGGTG
READ_ID_C:17835       147     Chr1    180672  50      76M     =       180537  -211    CCACCACCTTCCTTCTTCGGCTCCTCCTTCTTCTCCTTTTCCGGCTCTTTCGCAGGTCCCACTAGTACGATATCCG
Now, according to the SAM format specification, fields 4 (POS) and 8 (PNEXT) are the leftmost mapping positions of a read and its mate, so on the R1 line they are the leftmost starting points of R1 and R2, respectively. So, if R1 for READ_ID_A starts at 45474 and the read is 76 bp long, it should cover positions 45474 through 45549, and the first base after it is 45474 + 76 = 45550. The R2 leftmost starting point is 45556, which is only 6 bases past that! It seems to me that the insert would be 6 bases (!!), but field 9 (TLEN) specifies this insert as 244.
I'm sure there is some fundamental error in my logic here, so I'm hoping someone can point it out for me. Thanks!
EDIT: Any ideas? All suggestions/comments appreciated!
IMHO one of the shortcomings of the SAM format is that it only reports the leftmost coordinate, so one has to do all that cumbersome little CIGAR parsing to figure out the rightmost end, thus rendering column-oriented tools unusable.
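For anyone who wants to see that parsing spelled out, here is a minimal Python sketch (not taken from any particular tool; the function names `reference_end` and `template_length` are made up for illustration). It assumes 1-based POS values and treats the CIGAR operations M, D, N, = and X as the ones that advance along the reference, as the SAM specification describes. Applied to the READ_ID_A pair from the question, it reproduces the 244 in field 9.

```python
import re

# CIGAR operations that consume reference bases (M, D, N, =, X).
REF_CONSUMING = set("MDN=X")

def reference_end(pos, cigar):
    """Rightmost reference coordinate (1-based, inclusive) of an alignment."""
    span = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )
    return pos + span - 1

def template_length(left_pos, right_pos, right_cigar):
    """Distance from the leftmost base of the left mate to the
    rightmost base of the right mate, both ends included."""
    return reference_end(right_pos, right_cigar) - left_pos + 1

# READ_ID_A from the question: R1 at 45474 (76M), R2 at 45556 (4M86N72M)
r1_end = reference_end(45474, "76M")
r2_end = reference_end(45556, "4M86N72M")
tlen = template_length(45474, 45556, "4M86N72M")
print(r1_end, r2_end, tlen)  # 45549 45717 244
```

Note that R2 spans 4 + 86 + 72 = 162 reference bases because of the 86N skip, which is why its rightmost end is so far from its start.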
@ashutoshmits Thanks for the great explanation, that certainly makes more sense now. So, you compared the leftmost read coordinate (start of R1) with the rightmost read coordinate (end of R2) for a total of 244 bases. Since R1 and R2 each have 76 bp, could I then be sure the insert size was 244 - 76 - 76 = 92 bases?
Yes. Tools like TopHat ask you to specify the inner distance between mate pairs. It will be ~100 bp in your case. Tools like BWA estimate the insert size from the first few thousand alignments of the mate pairs/paired ends.
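As a rough check of that arithmetic, the sketch below applies TLEN minus both read lengths to the three pairs in the question. The helper names (`skipped_bases`, `inner_distance`) are hypothetical, and the second number it prints rests on an assumption: that an N operation inside a read marks skipped reference (for RNA-seq, probably an intron) that is not part of the sequenced fragment. Under that assumption the skipped bases are subtracted as well.

```python
import re

def skipped_bases(cigar):
    """Total length of N (reference skip) operations in a CIGAR string."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op == "N"
    )

def inner_distance(tlen, len1, len2):
    """TLEN minus the two read lengths, as in the comment above."""
    return abs(tlen) - len1 - len2

# (TLEN, R1 length, R2 length, R1 CIGAR, R2 CIGAR) from the question
pairs = {
    "READ_ID_A": (244, 76, 76, "76M", "4M86N72M"),
    "READ_ID_B": (219, 76, 76, "76M", "76M"),
    "READ_ID_C": (211, 76, 76, "76M", "76M"),
}

for name, (tlen, len1, len2, cigar1, cigar2) in pairs.items():
    genomic = inner_distance(tlen, len1, len2)
    # Assumption: bases skipped inside a read do not belong to the fragment.
    without_skips = genomic - skipped_bases(cigar1) - skipped_bases(cigar2)
    print(name, genomic, without_skips)
# READ_ID_A 92 6
# READ_ID_B 67 67
# READ_ID_C 59 59
```

For READ_ID_A the 92 is made up of the 6-base gap between the mates plus the 86 skipped bases inside R2, so the original poster's "6 bases" was the gap on the genome. The other two pairs have no skips, and both calculations agree.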
Thanks for all of the help, I am much more familiar with SAM now!
Thanks for the explanation. I was trying to understand the answers from this post: C: Bowtie2 classification of discordantly mapped pairs. Your post helped with that.
Thank you so much, I've been looking for this answer for a long time.