Hello all,
I'm having a bit of difficulty wrapping my head around a task involving extracting the total number of SNPs from an alignment via creating a string parser/grep command which would be able to extract only the SNPs and ignoring indels.
I am currently using a python script utilising pysam in order to extract mapped reads from an alignment:
# Import BAM file
bamfile = pysam.AlignmentFile(bamName, "rb")
for read in bamfile.fetch():
if read.is_unmapped:
pass
else:
mdstring = read.get_tag("MD")
I am writing this MD string into a txt file and attempting to grep and sum only the SNPs and disregard any indels, however, I am unsure as to how to go about this.
As an example, here is an MD string from one of the reads:
9^C14T8^C5^C23^C5C5^C5^C5C11^C7C5^TA1C5^C16A6^C10^A11A5^A0A3A5A5A19^C17C18^AC30C9^A5A5A5A6A11A13C10C3^A1C4A6C5^C5C3A5T0A5T5T2C11C5^C41^G20^C45^ACC4^A1A5^GC0A26G40^T0T2A0A46^G1A1^GCGC2A15G1^C18G23G2^G10G4^G7^G6G8G13A1A0T2T6^T2G2T7T16G10^C0C3G21C4T7T19^AG8A3C2C1G0C0G10^G19^G17A14A10^GG5^GGG12G20^G0C10^C2G1G6G1^C0G0C0C4A0C4^A12^A1^G1T3G0C7^A1A1^G3A7A8^A1A0A12C1^CC6^TT2A18C15^TG92^C0A21A40G16^T3G13A33^TT1^CT7^TAC50A26^AA5A22G34^C32^CA1A10^TT51^T25^T72^C24^A1A36C4C45^TG30C0G42G49G0C27^A11^GA109^G5G6^AA25G31^A5^C3^C7^C19^T0G17^A1^C19G22A0G15G2^TT24A74^TG12A33A0G20A1G2^G20^CA3G4A0G10^C62^A55^GA33^C14^G3C2G1G10^C7G2^TT11^CA10A4^C21^CTG25^C11^C20^C6^A56^C4^GCC16^G10^A1^AA15T0G2A1C51^GGA4G10^A3^T1A17^C149C15C24G5^G54^T8^C41^T50^CA1C0A6^G1C75A0G24^C16^C34^AG169C41A22^CC39^C9G0G1G9T22T107A1T0T1^A10A10^A0C2C1A16^AA1G1^CCA20^T9^C119^CTT3^C22^G20^C6A16A57G54^A28^AGA39G0C3T0T0G1A8^CCA73^A1C3G18G0G3^A3T0G1C13C10^C1T0C0C1C2^C31^AG4^G22^A10C33G0A8^G33^C7^CC28A15^G6A0A0G11^G10G1^C0C2^G5T6^G0C17T2^A19^C45^G34^C10^C9A2^C53^C35T37^GC23A2^GG12^AG50^G3A112^C65^C39G0A11^CA16^AGGA19G5C3^G10^G1C1G25^GG14^G0C4^A6^GG37^GG35^A17A2T0T0G70^C7^T0T18^G7A11G3A18^CC1T2^A8G18^A36^TAC4T2A14^G19A1G0A7^T1^C3^G4^G2C20^T46^TG43^A23^CT38T38C36A41A0T0G3^AA41^GA27
I realise, as per the samtools documentation, all strings containing a hat represent a deletion ([0-9]+[A-Z]\^[A-Z]+[0-9]+)
However, would it be possible to extract only the SNPs from this MD string? If so, how would I go about this? I seem to be grepping deletions as well as SNPs no matter how I implement this.
Thanks in advance!