Hello all,
I just wanted to verify that I got this right: in a sequencing experiment, when I have the SAM file, I get these kinds of lines:
read1    16      reference  7695    255     36M15D69M       *       0       0        GATAGCATTGGGAGATATACCTAATGCTAGATGACGGGGTGAACATTAGTGGGTGCAGCGCACAAGCATGGCACATGTATACATATGTAACTAACCTGCACAATG       HHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGFHHHHHHHHHHHGGGHHHGGGGGHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGCFFFFFFCCCCC       NH:i:1  HI:i:1  AS:i:65 nM:i:3  NM:i:18 MD:Z:36^TCCATACTGAGAATC0A0T2T64 jM:B:c,-1       jI:B:i,-1
  read2    16      reference  7695    255     35M33S  *       0       0       GATAGCATTGGGAGATATACCTAATGCTAGATGACACGAGTAACATTAGTGGGTGCAGCGCACAAGCA    HHHHHHHHHGBHHHHHGD5FHDEFGHHHGHHFFGFCEHDGHHEHHHGGGGFGGGGGBDA5C4FA>>3>    NH:i:1  HI:i:1  AS:i:34 nM:i:0  NM:i:0  MD:Z:35 jM:B:c,-1       jI:B:i,-1
   read3    16      reference  7751    255     41S39M  *       0       0       GATAGCATTGGGAGGTATACCTAATGCTAGATGACCTTACGAACATTAGTGGGTGCAGCGCACAAGCATGGCACATGTAT        HHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHGHGHHGGHHHHHHHHGGGHHHGGGGGGGGGGGGGGFFFFFFFBBBBB        NH:i:1  HI:i:1  AS:i:38 nM:i:0  NM:i:0  MD:Z:39 jM:B:c,-1       jI:B:i,-1
I wanted to clarify two things: (a) The left-most position of my alignment against the reference is the 3rd column, correct? (b) If I want to know the right-most position, i.e. where the alignment ended, can I just add the numbers in my CIGAR strings? So, for example, for read1 it would be 36+15+69, for read2 35+33 and for read3 41+39? To me it makes sense because for read2 and read3 the numbers are actually equal to the read length (68 and 80 respectively), while for read1 the numbers in the CIGAR string add up to 120 and the read is 105 bases long, but I know I have a 15-base deletion there so it is fine.
I would be grateful if someone can tell me if I am doing things correctly.
Many thanks!
Many thanks Devon (yes I meant to write column 4 :) )
I also saw this one:
and the read is
So here, is the total length 35+9 or 35+1349+9? I am a bit confused.
Yes, it's a spliced read.
So it is 35 + 9, the 1349 is ignored, right? Or?
1349 is not ignored. If it helps, have a look at the file in IGV.
OK, then I only exclude I, S and H as you initially wrote. So in this example the alignment covered 35 bases at some point and then 9 more after the 1349 bases that are skipped. But then is it correct for me to say that the alignment starts e.g. at position 7751 and finishes at position 7751 + 35 + 1349 + 9?
7751 + 35 + 1349 + 9 - 1, since otherwise you're double counting a base.
Many thanks for your help!
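For anyone landing on this thread later, here is a minimal Python sketch of the calculation discussed above. It assumes the usual SAM convention that only the M, D, N, = and X operations consume reference bases, so I, S, H and P are left out of the sum. The function name `reference_end` is just illustrative, and the CIGAR `35M1349N9M` in the last call is a guess at the spliced read from the comments (the actual line is not shown in the thread), so treat it as a hypothetical input.

```python
import re

# CIGAR operations that consume reference bases (SAM convention).
# I, S, H and P do not move along the reference, so they are skipped.
REF_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_end(pos, cigar):
    """Return the 1-based, inclusive right-most reference position.

    pos   -- left-most mapping position (column 4 of the SAM line)
    cigar -- CIGAR string (column 6 of the SAM line)
    """
    span = sum(int(length)
               for length, op in CIGAR_OP.findall(cigar)
               if op in REF_CONSUMING)
    # Subtract 1 because the base at `pos` is already part of the span.
    return pos + span - 1


print(reference_end(7695, "36M15D69M"))   # read1: 7695 + 120 - 1 = 7814
print(reference_end(7695, "35M33S"))      # read2: soft clip ignored, 7729
print(reference_end(7751, "41S39M"))      # read3: soft clip ignored, 7789
print(reference_end(7751, "35M1349N9M"))  # hypothetical spliced read, 9143
```

Note that for read2 and read3 the soft-clipped bases count toward the read length but not toward the end position, which is why simply adding all the CIGAR numbers overestimates where the alignment ends.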