Newbie CIGAR question
2
0
Entering edit mode
7.6 years ago
michael • 0

My interest is genetic genealogy, so I'm not well-versed in sequencing. I also like to reinvent the wheel, so to speak, so I'm dealing with the readable SAM file to better conceptualize the process. My interest is variant calling Y reads and have downloaded several BAMs from FTDNA. I've written several perl scripts that look for known SNPs and filter out potential SNPs that are found inside STRs and palindromes. The number of potential variants left over is far too large. I've decided I'm not interpreting the CIGAR correctly.

FTDNA SAM uses = and X. Easy enough. Each one occurring sequentially increments the position. I'm assuming that a deletion increments the counter so that the next position is properly aligned and that an insertion is not counted. What about soft clips? Are they ignored or does the position increment? I'm guessing the former. If so, I'd think something like this would happen, start reads at position 1:

positions: 123456789
    CIGAR: SSD=II=XS
  New_pos: 001222344
  Alleles: ATTTGACTA

That would realign the single mismatch to position 4.

Right or wrong, I get some screwy stuff. For example, at position 2651356 I have 8 clean reads: AAAAAAAA, but at position 2651367 I'm getting these reads: AAAAAAATT. That sure looks like poor alignment to me.

Any help would be appreciated.

Thanks, Michael

SNP alignment sequence • 1.7k views
ADD COMMENT
1
Entering edit mode

Why exactly are you reinventing the wheel (samtools mpileup will give you a format that's trivial to parse)? Why aren't you just using one of the many many many many variant callers?

ADD REPLY
1
Entering edit mode
7.6 years ago

It is not entirely clear what you mean in your example, in fact it does not look correct. You have soft clips followed by a deletion:

SSD

But a soft clip is a deletion (at the beginning or end) and has a lower penalty than a deletion. There is no reason for the aligner to stop clipping just to insert a deletion.

Also the word poor alignment is not clear either, the alignment may still be optimal but perhaps it cannot indicate what you want it to, but that is may not be an alignment error but just a limitation of what it can do. See this also:

http://read.biostarhandbook.com/align/misleading-alignments.html

Finally the position is always computed from a the first aligned base, hence the soft/hard clipped sections are ignored. The sum of operations adds up to the alignment lenght.

ADD COMMENT
0
Entering edit mode

Thanks, Istvan. I'm not surprised it's a poor example. I just made something up to try to illustrate the point. You've answered the question regarding soft clips.

Here's a short piece of output. I'm looking only for mismatches, where the CIGAR indicates an X -- all ChrY:

calc'd_pos,  allele, flags, mapq, qual, start_pos
6350287 C 163 3 ' 6350225
6350344 G 163 5 F 6350276
6350353 A 163 7 F 6350341
6350438 A 163 7 0 6350341
6350418 C 147 7 ' 6350344
6350422 G 147 7 F 6350344
6350485 A 83 13 F 6350370
6350410 C 147 13 0 6350377

It seems straight forward. But the indels were screwing up the calculated positions. My last look at a one position finally yielded consistent results with this:

  foreach $base (@bases)
  {
    if ($cigararray[$x] eq 'I') {$pos++;}
    if ($cigararray[$x] eq 'D') {$pos--;}
    .
    .
    .

Previously, the only inconsistency were in reads that had a deletion indicated in the CIGAR. I haven't looked yet to see if the inserts are correct.

That's all I'm really trying to accomplish right now -- assign the proper positions to the bases. Before I realized I had a problem with my data, I'd written scripts that found all the known SNPs and filtered out variants that sat in the middle of STRs and palindromes. I haven't really looked yet at mapq and qual, but have started to throw out stuff based on a couple of flags.

Thanks, Michael

ADD REPLY
1
Entering edit mode
7.6 years ago
d-cameron ★ 2.9k

Each one occurring sequentially increments the position. I'm assuming that a deletion increments the counter so that the next position is properly aligned and that an insertion is not counted. What about soft clips? Are they ignored or does the position increment?

Page 5 of the SAM specifications defines the CIGAR operations. In terms of traversing an alignment, some operators increment the read base position, some the reference base position, some both, and some neither.

Right or wrong, I get some screwy stuff. For example, at position 2651356 I have 8 clean reads: AAAAAAAA, but at position 2651367 I'm getting these reads: AAAAAAATT. That sure looks like poor alignment to me.

Are you saying that at 2651356 your reads indicate homozygous A, and a few bases further along at 2651367 your reads indicate that position is heterozygous A/T, or that you have reads starting at those positions looking like

ref: AAAAAAAAAANAAAAAATT 
     AAAAAAAAAA          (starting at 2651356)
                AAAAAATT (starting at 2651367)

There are a number of BAM visualisation tools (eg IGV) available which you can use to get a feel of the type and size of the data you are dealing with. Visual inspect of alignments can help considerably in determining whether an alignment is poor or not and why.

My interest is variant calling Y reads ... I also like to reinvent the wheel ... I've written several perl scripts that look for known SNPs and filter out potential SNPs that are found inside STRs and palindromes. The number of potential variants left over is far too large. I've decided I'm not interpreting the CIGAR correctly.

If your interest is in developing a variant caller then I strongly urge to you review the considerable body of literature surrounding variant calling. The number of potential variants is indeed huge and people have spent considerable effort developing and refining variant callers. If your interest is in the downstream analysis of variants then I urge you to use an existing variant caller and concentrate on the downstream analysis of interest.

ADD COMMENT
0
Entering edit mode

Thanks for your response.

I'm dealing only with the Y chromosome. The values at each pass should be the same. I know that some of what I've looked at have been fixed due to a better algorithm,

Yes, most do not understand my learning style, but it is what it is. :)

-Michael

ADD REPLY

Login before adding your answer.

Traffic: 1287 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6