Parsing Base String From Samtools Mpileup
2
1
Entering edit mode
10.8 years ago
Noushin N ▴ 600

Hello everyone,

I am trying to construct B-allele frequency plots from NGS data; and for that I am using samtools mpileup output at positions of interest.

Unfortunately, I am not able to establish a one-to-one relationship between the bases in base string column [column 4], and the base quality string [column 5] for lines with indels.

Here is an example from samtools sourceforge page:

seq2 156 A 11  .$......+2AG.+2AG.+2AGGG    <975;:<<<<<

According to columns 4 and 6, there are 11 reads aligned to this position. But when trying to parse the base string column, I get

  1. 9 reads aligned to reference base (A)
  2. 3 reads with an insertion of AG
  3. 2 reads with base G

Can anyone please explain what is going on?

Thank you in advance!

samtools mpileup • 7.2k views
ADD COMMENT
2
Entering edit mode

The only possible explanation I can think of is the fact that ".+2AG" accounts for an insertion of "AG" bases, as opposed to "+2AG". This way, the numbers match up.

.|.|.|.|.|.|.+2AG|.+2AG|.+2AG|G|G

ADD REPLY
1
Entering edit mode

I think you are correct. I've always interpreted insertions as something that's between the current base and the next base. So that means there is still a match to the current base.

ADD REPLY
0
Entering edit mode

Thank you for the information and the pointer to your script. I have also come across the following example in one of my mpileups:

chr3    128988392       G       110
c$CCCCCC.CCcccC,Cc.,C,C...c.C.C...cccCcCcCC..cC..,...CCCCCcccC.CCC.C,CCCC,CCcCccC.cC,C,c.CccCcC,CCc.CCcCC,+1ccc+1c.c^],

Here, there is an instance of ",+1c" and one of "+1c". What is your interpretation for each of them?

Thanks again

ADD REPLY
1
Entering edit mode

",+1c" is a match on the opposite strand and a single insertion of c between the current base and next base. "+1c" should probably be "c+1c" meaning there is a read with a mismatch c in the current position and a single insertion of c between the current base and next base.

ADD REPLY
2
Entering edit mode

I wrote a script for this a while back. I am not sure if it works 100% correctly though. You should definitely spot check it: http://blog.nextgenetics.net/?e=56

ADD REPLY
0
Entering edit mode
10.1 years ago
Noushin N ▴ 600

Since no completely correct answer was posted to this question, I would like to point out that Damian's script looks pretty functional. After spot checking, it looks to me that it does not catch the ',' or '.' bases preceding an insertion/deletion. Thus, it will overestimate the number of reference bases at such sites.

I wrote a quick parser, which corrects the issue above and is accessible at: https://github.com/noushin6/NGSTools/blob/master/baseParser.py

I have spot checked it quite a bit; but feel free to do so yourself and use at your own risk.

ADD COMMENT
0
Entering edit mode

@ Noushin N Thanks a lot

ADD REPLY
0
Entering edit mode
7.8 years ago
kartong88 • 0

The "+2AGGG" in the last part of the pileup string should be interpreted as "+2AG", "G", "G". The "+2" indicates that only the following two nucleotides are part of the insertion. The last 2 "G" characters indicate two variant "G" nucleotides and are not part of the insertion string. If you view it this way, everything matches up.

Edit: Noticed this has been addressed in the comments section by author of original post

ADD COMMENT

Login before adding your answer.

Traffic: 1845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6