SAM file CIGAR string clarification
6.3 years ago
bioplanet ▴ 60

Hello all,

I just wanted to verify that I got it correct: in a sequencing experiment, when I have the SAM file, I get these kinds of lines:

read1    16      reference  7695    255     36M15D69M       *       0       0        GATAGCATTGGGAGATATACCTAATGCTAGATGACGGGGTGAACATTAGTGGGTGCAGCGCACAAGCATGGCACATGTATACATATGTAACTAACCTGCACAATG       HHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGFHHHHHHHHHHHGGGHHHGGGGGHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGCFFFFFFCCCCC       NH:i:1  HI:i:1  AS:i:65 nM:i:3  NM:i:18 MD:Z:36^TCCATACTGAGAATC0A0T2T64 jM:B:c,-1       jI:B:i,-1
  read2    16      reference  7695    255     35M33S  *       0       0       GATAGCATTGGGAGATATACCTAATGCTAGATGACACGAGTAACATTAGTGGGTGCAGCGCACAAGCA    HHHHHHHHHGBHHHHHGD5FHDEFGHHHGHHFFGFCEHDGHHEHHHGGGGFGGGGGBDA5C4FA>>3>    NH:i:1  HI:i:1  AS:i:34 nM:i:0  NM:i:0  MD:Z:35 jM:B:c,-1       jI:B:i,-1
   read3    16      reference  7751    255     41S39M  *       0       0       GATAGCATTGGGAGGTATACCTAATGCTAGATGACCTTACGAACATTAGTGGGTGCAGCGCACAAGCATGGCACATGTAT        HHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHGHGHHGGHHHHHHHHGGGHHHGGGGGGGGGGGGGGFFFFFFFBBBBB        NH:i:1  HI:i:1  AS:i:38 nM:i:0  NM:i:0  MD:Z:39 jM:B:c,-1       jI:B:i,-1

I wanted to clarify 2 things: (a) the left-most position of my alignment against the reference is the 3rd column, correct? (b) If I want to know the right-most position, i.e. where the alignment ended, can I just add the numbers in my CIGAR strings? So, for example, for read1 it would be 36+15+69, read2 35+33 and read3 41+39? To me it makes sense because for read2 and read3 the numbers are actually equal to the read length (68 and 80 respectively), while, for read1, the numbers in the CIGAR string add up to 105, the read is 120, but I know I have a deletion there so it is fine.

I would be grateful if someone can tell me if I am doing things correctly.

Many thanks!

RNA-Seq
6.3 years ago
  1. The left-most (well, 5'-most on the + strand) position is column 4 (POS).
  2. Sometimes yes, but it depends on the CIGAR operation. Only operations that consume the reference (M, D, N, = and X) count towards the end position; I, S, and H operations shouldn't be counted. So for read1 it'd be 7695 + 36 + 15 + 69 - 1, for read2 it'd be 7695 + 35 - 1 and for read3 it'd be 7751 + 39 - 1. BTW, for read1 your read length is 105, but the region covered by it is 120 bases (due to a 15 base deletion).
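
If it helps, here is a minimal Python sketch of that rule. The function name `reference_end` is just something made up for illustration (it is not from any library), and it assumes POS is the 1-based value from column 4 and that the CIGAR string is well formed.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
# I, S, H and P do not move along the reference, so they are left out.
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_end(pos, cigar):
    """Return the 1-based, inclusive right-most reference position.

    `pos` is the POS field (column 4); `cigar` is the CIGAR field (column 6).
    Hypothetical helper for illustration only.
    """
    span = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    # Subtract 1 because the base at `pos` is already part of the span.
    return pos + span - 1


print(reference_end(7695, "36M15D69M"))  # read1: 7814
print(reference_end(7695, "35M33S"))     # read2: 7729
print(reference_end(7751, "41S39M"))     # read3: 7789
```

Note that the soft-clipped bases of read2 and read3 are still present in the read sequence; they simply do not contribute to the reference coordinates.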

Many thanks Devon (yes I meant to write column 4 :) )

I also saw this one:

35M1349N9M

and the read is

GATAGCATTGGGAGATATACCTAATGCTAGATGACAACAGGAAC

So here, is the total length 35+9 or 35+1349+9? I am a bit confused.


Yes, it's a spliced read: the N marks a stretch of the reference that the read skips over (typically an intron).


So it is 35 + 9, the 1349 is ignored, right? Or?


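1349 is not ignored. The read itself is 35 + 9 = 44 bases long, but the 1349 skipped bases still count towards the region the alignment spans on the reference. If it helps, have a look at the file in IGV.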
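
To make the two lengths concrete, here is a small Python sketch that computes both from a CIGAR string. The helper name `cigar_lengths` is hypothetical, and the sets of operations follow the SAM specification's "consumes query" and "consumes reference" columns.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume bases of the read (SEQ)
REF_OPS = set("MDN=X")    # operations that consume bases of the reference


def cigar_lengths(cigar):
    """Return (read length, reference span) implied by a CIGAR string.

    Hypothetical helper for illustration only.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    query_len = sum(n for n, op in ops if op in QUERY_OPS)
    ref_span = sum(n for n, op in ops if op in REF_OPS)
    return query_len, ref_span


read = "GATAGCATTGGGAGATATACCTAATGCTAGATGACAACAGGAAC"
query_len, ref_span = cigar_lengths("35M1349N9M")
print(len(read), query_len, ref_span)  # 44 44 1393
```

So N behaves like D for coordinate purposes: it adds nothing to the read length but does extend the span on the reference.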


OK, so I only exclude I, S and H, as you initially wrote. In this example, then, the alignment covers 35 bases, skips 1349 bases, and then covers 9 more. So is it correct to say that the alignment starts at, e.g., position 7751 and finishes at position 7751 + 35 + 1349 + 9?


7751 + 35 + 1349 + 9 - 1, since otherwise you're double counting a base.
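
For a spliced read it can also be useful to see the two aligned pieces separately. Below is a rough Python sketch that splits an alignment at each N; the function name `aligned_blocks` is made up, the start position 7751 is just the example value used above, and deletions (D) are deliberately kept inside a block rather than splitting it.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Return 1-based, inclusive (start, end) reference intervals, split at N.

    Hypothetical helper for illustration only. M, D, = and X extend the
    current block; N closes it and jumps ahead; I, S, H and P do not move
    along the reference at all.
    """
    blocks = []
    start = cur = pos
    for n, op in CIGAR_RE.findall(cigar):
        n = int(n)
        if op in "MD=X":
            cur += n
        elif op == "N":
            if cur > start:
                blocks.append((start, cur - 1))
            cur += n
            start = cur
    if cur > start:
        blocks.append((start, cur - 1))
    return blocks


# Assumed start position 7751, taken from the example in this thread.
print(aligned_blocks(7751, "35M1349N9M"))  # [(7751, 7785), (9135, 9143)]
```

The end of the last block, 9143, is the same value as 7751 + 35 + 1349 + 9 - 1.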


Many thanks for your help!
