Extracting Number of SNPs via parsing MD tags
0
0
Entering edit mode
2.4 years ago

Hello all,

I'm having a bit of difficulty wrapping my head around a task involving extracting the total number of SNPs from an alignment via creating a string parser/grep command which would be able to extract only the SNPs and ignoring indels.

I am currently using a python script utilising pysam in order to extract mapped reads from an alignment:

# Import BAM file
bamfile = pysam.AlignmentFile(bamName, "rb")

for read in bamfile.fetch():
if read.is_unmapped:
        pass
    else:
        mdstring = read.get_tag("MD")

I am writing this MD string into a txt file and attempting to grep and sum only the SNPs and disregard any indels, however, I am unsure as to how to go about this.

As an example, here is an MD string from one of the reads:

9^C14T8^C5^C23^C5C5^C5^C5C11^C7C5^TA1C5^C16A6^C10^A11A5^A0A3A5A5A19^C17C18^AC30C9^A5A5A5A6A11A13C10C3^A1C4A6C5^C5C3A5T0A5T5T2C11C5^C41^G20^C45^ACC4^A1A5^GC0A26G40^T0T2A0A46^G1A1^GCGC2A15G1^C18G23G2^G10G4^G7^G6G8G13A1A0T2T6^T2G2T7T16G10^C0C3G21C4T7T19^AG8A3C2C1G0C0G10^G19^G17A14A10^GG5^GGG12G20^G0C10^C2G1G6G1^C0G0C0C4A0C4^A12^A1^G1T3G0C7^A1A1^G3A7A8^A1A0A12C1^CC6^TT2A18C15^TG92^C0A21A40G16^T3G13A33^TT1^CT7^TAC50A26^AA5A22G34^C32^CA1A10^TT51^T25^T72^C24^A1A36C4C45^TG30C0G42G49G0C27^A11^GA109^G5G6^AA25G31^A5^C3^C7^C19^T0G17^A1^C19G22A0G15G2^TT24A74^TG12A33A0G20A1G2^G20^CA3G4A0G10^C62^A55^GA33^C14^G3C2G1G10^C7G2^TT11^CA10A4^C21^CTG25^C11^C20^C6^A56^C4^GCC16^G10^A1^AA15T0G2A1C51^GGA4G10^A3^T1A17^C149C15C24G5^G54^T8^C41^T50^CA1C0A6^G1C75A0G24^C16^C34^AG169C41A22^CC39^C9G0G1G9T22T107A1T0T1^A10A10^A0C2C1A16^AA1G1^CCA20^T9^C119^CTT3^C22^G20^C6A16A57G54^A28^AGA39G0C3T0T0G1A8^CCA73^A1C3G18G0G3^A3T0G1C13C10^C1T0C0C1C2^C31^AG4^G22^A10C33G0A8^G33^C7^CC28A15^G6A0A0G11^G10G1^C0C2^G5T6^G0C17T2^A19^C45^G34^C10^C9A2^C53^C35T37^GC23A2^GG12^AG50^G3A112^C65^C39G0A11^CA16^AGGA19G5C3^G10^G1C1G25^GG14^G0C4^A6^GG37^GG35^A17A2T0T0G70^C7^T0T18^G7A11G3A18^CC1T2^A8G18^A36^TAC4T2A14^G19A1G0A7^T1^C3^G4^G2C20^T46^TG43^A23^CT38T38C36A41A0T0G3^AA41^GA27

I realise, as per the samtools documentation, all strings containing a hat represent a deletion ([0-9]+[A-Z]\^[A-Z]+[0-9]+)

However, would it be possible to extract only the SNPs from this MD string? If so, how would I go about this? I seem to be grepping deletions as well as SNPs no matter how I implement this.

Thanks in advance!

pysam samtools MD bam SNP • 703 views
ADD COMMENT

Login before adding your answer.

Traffic: 2542 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6