**500**wrote:

Hello everyone,

I am trying to construct B-allele frequency plots from NGS data; and for that I am using samtools mpileup output at positions of interest.

Unfortunately, I am not able to establish a one-to-one relationship between the bases in base string column [column 4], and the base quality string [column 5] for lines with indels.

Here is an example from samtools sourceforge page:

```
seq2 156 A 11 .$......+2AG.+2AG.+2AGGG <975;:<<<<<
```

According to columns 4 and 6, there are 11 reads aligned to this position. But when trying to parse the base string column, I get

- 9 reads aligned to reference base (A)
- 3 reads with an insertion of AG
- 2 reads with base G

Can anyone please explain what is going on?

Thank you in advance!

The only possible explanation I can think of is the fact that ".+2AG" accounts for an insertion of "AG" bases, as opposed to "+2AG". This way, the numbers match up.

`.|.|.|.|.|.|.+2AG|.+2AG|.+2AG|G|G`

500I think you are correct. I've always interpreted insertions as something that's between the current base and the next base. So that means there is still a match to the current base.

14kThank you for the information and the pointer to your script. I have also come across the following example in one of my mpileups:

Here, there is an instance of ",+1c" and one of "+1c". What is your interpretation for each of them?

Thanks again

500",+1c" is a match on the opposite strand and a single insertion of c between the current base and next base. "+1c" should probably be "c+1c" meaning there is a read with a mismatch c in the current position and a single insertion of c between the current base and next base.

14kI wrote a script for this a while back. I am not sure if it works 100% correctly though. You should definitely spot check it: http://blog.nextgenetics.net/?e=56

14k