Question: Sam And Reverse Complements
2
gravatar for ilonpilaaja
7.4 years ago by
ilonpilaaja50
Helsinki, Finland
ilonpilaaja50 wrote:

Assume a read aligner (such as e.g. BWA) is fed with a pattern P = GACT. Now, the text that BWA has preprocssed is NNNAGTCNNN. The aligner would not find the pattern, and then would fall back to reverse-complementing it to AGTC and searching again. The question is, what it would report in the SAM-file? Would it output POS = 3, CIGAR=4M and FLAG with a bit 0x10 set? EDIT: To make the question complete and unambiguous: what if the text indexed is T = NNNAGTTNNN, so that actually we have to align the reverse complement with error? Is it so that CIGAR = 3M1S?

sam • 5.4k views
ADD COMMENTlink modified 7.4 years ago by Vikas Bansal2.3k • written 7.4 years ago by ilonpilaaja50
5
gravatar for Vikas Bansal
7.4 years ago by
Vikas Bansal2.3k
Berlin, Germany
Vikas Bansal2.3k wrote:

The sam output would be POS=4 and cigar=4m. I think your example should give-

P    16     chrname     4       Mappingquality      4M   .    .    .    AGCT       .   .   .   .   .

So the first column is your query name, 16 will tell you that read mapped to reverse strand (0 is for forward strand) , your reference sequence name (eg chromosome), position, mapping quality, cigar (4 alignments) , other 3 columns for mate reference, position and Tlen and your query sequence mentioned here will not be GACT , it will be the reverse compliment of it i:e AGCT . All fields for sam are described here.

EDIT: I am not getting your question. But I tried.

Ok, so I just ran BWA (default parameters) with genome file-

>vik 
NNNNNAGTTNNNNN

Reads -

>p 
GACT
>p2
AGTC

SAM out put :

@SQ     SN:vik  LN:14
p       16      vik     6       25      4M      *       0       0       AGTC    *       XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:3T0
p2      0       vik     6       25      4M      *       0       0       AGTC    *       XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:3T0

But why would you expect CIGAR = 3M1S . I think S is for soft clipping, like if you have aaaaaGCATCATG read, then some aligners will ignore them and CIGAR will start with 5S...

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Vikas Bansal2.3k

Although this is an Q&A place rather than a forum, still I've updated my question to be more comprehensive and complete. Thank you.

ADD REPLYlink written 7.4 years ago by ilonpilaaja50

I tried. See my edit.

ADD REPLYlink written 7.4 years ago by Vikas Bansal2.3k

Ah, something has short-circuited in my mind and I was assuming S is for "substitution". Now fixed.

ADD REPLYlink written 7.4 years ago by ilonpilaaja50
2
gravatar for ilonpilaaja
7.4 years ago by
ilonpilaaja50
Helsinki, Finland
ilonpilaaja50 wrote:

[Inspired by Vik's reply above] You may find these examples useful: Reference:

>vik
NNNNNNGTCCATGCAATTTTAAGACTTGAACCTGTGATCTGAAACCCCCTTGACTGATTACAGTCAGTNNNNN`

Query file:

>p
CATTACA
>q
TTAATC
>r
TGCTAATC
>s
GCATTACA
>t
CAATTTTTAAGACTTGAA
>u
GCAATTTAGACTTGAACCTGTGATCTG
>v
TCAATTTAGACTTGAACCTGTGATCTG
>w
AAGTCAAAATTGC

and run script ./bwa index ref.fa\\ ./bwa aln -n 100 -O 1 -E 1 ref.fa reads1.fq.gz > suff.sai\\ ./bwa samse ref.fa suff.sai reads1.fq.gz > aln.sam The output is (of which only the POS, SEQ, FLAG and CIGAR fields are of interest )

@SQ SN:vik  LN:73
@PG ID:bwa  PN:bwa  VN:0.6.1-r104
p   0   vik 56  20  7M  *   0   0   CATTACA *   XT:A:U  NM:i:1  X0:i:1  X1:i:2  XM:i:1  XO:i:0  XG:i:0  MD:Z:0G6    XA:Z:vik,-12,7M,2;vik,-33,7M,2;
q   16  vik 56  14  6M  *   0   0   GATTAA  *   XT:A:U  NM:i:1  X0:i:1  X1:i:8  XM:i:1  XO:i:0  XG:i:0  MD:Z:5C0
r   0   vik 12  0   8M  *   0   0   TGCTAATC    *   XT:A:R  NM:i:3  X0:i:4  X1:i:14 XM:i:3  XO:i:0  XG:i:0  MD:Z:3A1T1T0
s   0   vik 55  16  8M  *   0   0   GCATTACA    *   XT:A:U  NM:i:2  X0:i:1  X1:i:5  XM:i:2  XO:i:0  XG:i:0  MD:Z:0T0G6
t   0   vik 14  37  3M1I14M *   0   0   CAATTTTTAAGACTTGAA  *   XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:0  XO:i:1  XG:i:1  MD:Z:17
u   0   vik 13  37  7M2D20M *   0   0   GCAATTTAGACTTGAACCTGTGATCTG *   XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:0  XO:i:1  XG:i:2  MD:Z:7^TA20
v   0   vik 13  37  7M2D20M *   0   0   TCAATTTAGACTTGAACCTGTGATCTG *   XT:A:U  NM:i:3  X0:i:1  X1:i:0  XM:i:1  XO:i:1  XG:i:2  MD:Z:0G6^TA20
w   16  vik 13  37  8M2D5M  *   0   0   GCAATTTTGACTT   *   XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:0  XO:i:1  XG:i:2  MD:Z:8^AA5

To sum up: for reverse patterns only the FLAG field is affected. In the SEQ fields one reports the actual pattern (which is the pattern queried for for the "direct" case, and the pattern's reverse complement (rc) for the rc case)

ADD COMMENTlink modified 7.4 years ago by Istvan Albert ♦♦ 81k • written 7.4 years ago by ilonpilaaja50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 895 users visited in the last hour