My interest is genetic genealogy, so I'm not well-versed in sequencing. I also like to reinvent the wheel, so to speak, so I'm dealing with the readable SAM file to better conceptualize the process. My interest is variant calling Y reads and have downloaded several BAMs from FTDNA. I've written several perl scripts that look for known SNPs and filter out potential SNPs that are found inside STRs and palindromes. The number of potential variants left over is far too large. I've decided I'm not interpreting the CIGAR correctly.
FTDNA SAM uses = and X. Easy enough. Each one occurring sequentially increments the position. I'm assuming that a deletion increments the counter so that the next position is properly aligned and that an insertion is not counted. What about soft clips? Are they ignored or does the position increment? I'm guessing the former. If so, I'd think something like this would happen, start reads at position 1:
positions: 123456789
CIGAR: SSD=II=XS
New_pos: 001222344
Alleles: ATTTGACTA
That would realign the single mismatch to position 4.
Right or wrong, I get some screwy stuff. For example, at position 2651356 I have 8 clean reads: AAAAAAAA, but at position 2651367 I'm getting these reads: AAAAAAATT. That sure looks like poor alignment to me.
Any help would be appreciated.
Thanks, Michael
Why exactly are you reinventing the wheel (
samtools mpileup
will give you a format that's trivial to parse)? Why aren't you just using one of the many many many many variant callers?