Question: Calculate The Frequency Of Nucleotides At Each Position In An Mpileup File
8
gravatar for komal.rathi
5.8 years ago by
komal.rathi3.5k
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.5k wrote:

Hi everyone,

I have a number of Human Heart RNASeq samples. I have generated a mpileup file on each of them using a bed file containing a list of positions for which I want the mpileup to be made. I have included the -B option to turn off the BAQ computation in mpileup.

Single sample mpileup:

samtools mpileup -uf hg19.fa -B -l snps_chr17.bed sample1.bam > sample1.mpileup

Multiple sample mpileup (all samples):

samtools mpileup -uf hg19.fa -B -l snps_chr17.bed -b bam_files_list.bam > all_samples.mpileup

Now, I want to output the frequency of nucleotides A, T, G and C at each of those positions. There is a perl script pileup2baseindel which generates exactly the output that I want, only that it is incorrect (inconsistent with what is seen in IGV). This program takes a single/multi-sample mpileup file and produces the following output. But like I said, the output is incorrect.

I have searched a lot but can't seem to find any program that does the same. Does anyone know any tool/script that would give me such an output with/without taking the mpileup file as an input. Suggestions on alternatives/options are welcomed.

[UPDATE] There was a bug in the perl script and I have updated the code, it works fine now. I have contacted the author and as soon as the author responds, I will send it to him so that he can update it. I would still like to get suggestions and try out other programs.

[UPDATE] Thanks everyone who helped me with this problem. Each of your suggestions worked but I accepted Chris Miller's answer recommending the use of bam-readcount because of its easy of use and the elaborate output that it generates. It directly takes, as input, a bam file as well as a list of positions in bed format for which you want the nucleotide frequency. So you don't need to create an intermediate mpileup file - as in pileup2baseindel or run the program against positions that you are not interested in - as in pysamstats.

Thanks!

mpileup • 21k views
ADD COMMENTlink modified 4.9 years ago by inesdesantiago160 • written 5.8 years ago by komal.rathi3.5k
4
gravatar for Chris Miller
5.8 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

Another option is just to run bam-readcount, giving it the bams and specified positions. It's more straightforward than creating an mpileup and parsing that ugly string.

ADD COMMENTlink modified 3 months ago by RamRS25k • written 5.8 years ago by Chris Miller21k

Thanks! I'll give it a try and let you know!

ADD REPLYlink written 5.8 years ago by komal.rathi3.5k

I'm using Bam-readcount, and I notice that the Depth calculated for each position does not correspond at all to the coverage that I observe on IGV. It is much lower. Could someone tell me where does the error? thank you

ADD REPLYlink written 3.9 years ago by saad.chadi0

Hi Saad.chadi, did you resolve the problem? I'm trying to run Bam-readcount, but i have the same problem. If you have any solution, please let me know. Thank in you in advance for your time. Regards

ADD REPLYlink written 3.5 years ago by gmarocena0

I was able to run bam-readcount, but in addition to counts of A, C, G, T, N, got lines like these with additional count at the end. What do these fields mean? Thanks!

chr6    1135939 T   1311    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:536:59.97:37.46:59.97:268:268:0.44:0.01:50.05:268:0.51:115.43:0.34    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  T:775:60.00:37.71:60.00:387:388:0.43:0.00:1.53:387:0.56:106.57:0.38 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +A:2:60.00:0.00:60.00:1:1:0.36:0.01:0.00:1:0.79:72.00:0.49
chr2    239563579   G   1528    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:685:60.00:37.66:60.00:303:382:0.47:0.02:75.21:303:0.75:104.37:0.52    G:841:60.00:37.77:60.00:386:455:0.46:0.00:1.38:386:0.75:100.58:0.51 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  -G:2:60.00:0.00:60.00:1:1:0.38:0.04:62.00:1:0.79:80.00:0.49
chr12   6945914 C   2365    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:1135:60.00:37.51:60.00:733:402:0.51:0.00:1.75:733:0.66:111.12:0.58    G:1228:60.00:37.55:60.00:786:442:0.56:0.01:38.74:786:0.65:107.84:0.57   T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  -C:2:60.00:0.00:60.00:1:1:0.41:0.02:0.00:1:0.17:54.00:0.48
ADD REPLYlink modified 3 months ago by RamRS25k • written 3.3 years ago by weichen0

They are insertions and deletions, IIRC. Easy enough to pull up your reads in IGV and verify that.

ADD REPLYlink written 3.3 years ago by Chris Miller21k
4
gravatar for favero.francesco
5.8 years ago by
favero.francesco120 wrote:

I wrote a tool to do exactly this; the function is called pileup2acgt, is written in python2.7 (compatible with the much faster pypy-2*). If you want to give it a try it works with no dependencies a part python2.7 core libraries.

The tool is part of Sequenza, a program to detect copy number and variants from tumor DNA-seq. The script, sequenza-utils.py, is downloadable from here, or from CRAN, at the sequenza page and following the instruction on the manual.

The general usage is available on an (already outdated) wiki.

I usually generate the pileup on the fly with samtools, this also allows to select a specific region of the BAM/SAM using samtools view (here with a slightly unnecessary samtools pipe querying TP53 site):

samtools view -f 2 -u my.bam chr17:7570000–7590000 | samtools rmdup - - | samools mpileup -f hg19.fa -q 20 -  |  sequenza-utils.py pileup2acgt - | gzip > result.txt.gz

a quick usage tip, for your purpose:

cat s.mpileup
1    95369073    G    254    C$.$.$c$,$,$CCccccCCC.CCC..,cccccCc,CCC,c,..,c.c.cCCc.$,,c.Cc,Cc,CCc,ccC.C,c,c,,ccCCcCCC,,cc,CCc,c.Ccc,,CC,C,c,,CCCCc.c,,.c,,,cC.Cc,cc,cc.c,.C,ccc,C,C,,cCCCcc.,,,c,cc,cccc,cccCcc,,cc,cc,,,cCcccccc.ccc,ccccc.,c,cc,,,cCccc.ccc,c,c,,c.ccCc,,,c,ccc,c,cccc,,,cc,,c,c    BA?GHGB-*,**8:9<><=CBG88888>8H=8@H8HD<H8A8B8<;8#HH8B?8H;8A:=8H88:>9H8F9HH88?8999?HH98H?=<H;F<::GH:=H<H8HH:?<:8B8HHB8HCH8>H88H8>G:8F8H@8H888H9H8=H8;:888EHHG:H:8F9:88H:88888HG:8H88HHH98<:;8;8H88<H;;8;8HH8F9:DHE88889B>68H<E<HG;E:88:HFH8E?>8H;H><9;FHH:>?@:9- 

python2.7 sequenza-utils.py pileup2acgt s.mpileup
chr    n_base    ref_base    read.depth    A    C    G    T    strand
1    95369073    G    254    0    152    95    0    0:47:22:0

It also check the quality of the bases, for example, if you want to lower the default threshold of quality (20):

python sequenza/exec/sequenza-utils.py pileup2acgt -q 10 s.mpileup
chr    n_base    ref_base    read.depth    A    C    G    T    strand
1    95369073    G    254    0    155    95    0    0:48:22:0

or anyway

python sequenza/exec/sequenza-utils.py pileup2acgt -h

will describe the available optional arguments.

It is reasonably fast as it is, it is also implemented with the multiprocessing library, so if it's needed it could use more processes to speed up a little bit. Used with pypy is faster than everything else that aims for a similar output, as far as I know.

the strand field, counts the reads on the forward strand for the specific base (A:C:G:T, respectively.)

The read depth is left unchanged from the mpileup, while the sum of ACGT is the number of reads above the quality filter (so it's <=read.depth).

It also handle lines with indels (although it passes on them without store them anywhere), it accepts .gz files or input from STDIN (using "-" as file name).

I actually use it to roughty call variants, and it's consistent with IGV.

Hope you find it useful.

ADD COMMENTlink modified 3 months ago by RamRS25k • written 5.8 years ago by favero.francesco120

Hi favero!

I calculated each base as your methods as follwing:

samtools view -Duf my.bam | samtools rmdup - - | samools mpileup -f genome.fa -q 20 -  |  sequenza-utils.py pileup2acgt - | gzip > result.txt.gz

But I got some odd results and couldn't figure out why?

n_base    ref_base    read.depth    A    C    G    T    strand
16768     A           257           257  0    0    0    1:0:0:0
16769     C           253           0    252  0    0    0:1:0:0
16770     G           231           0    1    230  0    0:0:0:0
16771     G           219           0    0    219  0    0:0:0:0
16772     G           210           0    0    209  0    0:0:0:0
16773     A           179           76   0    0    0    0:0:0:0
16774     A           97            50   0    0    0    0:0:0:0
16775     C           57            0    38   0    0    0:0:0:0

as the results showed, the sum of the ACGT count on each base is not equal to the read.depth and the strand is 0:0:0:0. e.g. pos 16775, the sum of ACGT is 38, is less than the read.depth=57 and the strand is also 0:0:0:0

can you figure out why? thanks

ADD REPLYlink modified 3 months ago by RamRS25k • written 5.7 years ago by Lilian0

Hi Lilian, this is awkward, I've just seen your message, almost one year after.

The depth and the sum of the bases differs because the depth is the read depth measured by samtools mpileup, while the bases recorded by the python script are only those passing the default quality score (which is adjustable using the -q argument).

My apologies for the late reply.

ADD REPLYlink modified 3 months ago by RamRS25k • written 4.8 years ago by favero10

Hi Favero, I'm having the same problem than Lilian. I modified the -q score, but it doesn't work. What do you think about it? Thank you in advance for your time. Regards

ADD REPLYlink written 3.5 years ago by gmarocena0

Hi Lilian, I'm having the same problem. I modified the -q score, but it doesn't work. How do you resolve this? Thank you in advance for your help. Regards

ADD REPLYlink written 3.5 years ago by gmarocena0

You can check with IGV if in the specific positions are present indels or other things that can justify the mismatch from depth and nucleotide counts.

ADD REPLYlink written 3.4 years ago by favero.francesco120
2
gravatar for Ashutosh Pandey
5.8 years ago by
Philadelphia
Ashutosh Pandey12k wrote:


import sys,re,fileinput

Argument = []
Argument = sys.argv[1:] 

if (len(Argument)) < 3:
        print "Usage: Input_pileup_file Column_with_information_about_read_bases(usually_column_9_in_pileup) Output_file" 
        sys.exit()

File_Pileup = Argument[0]
index = int(Argument[1])-1
output = open(str(Argument[2]),"w")


nucleotides = ["A","T","C","G","a","t","c","g"]
complement = {'a':'T','g':'C','t':'A','c':'G'}

output.write("Chromosome\tCoordinate\tReference_base")

Frequency_bases = {"A":0,"T":0,"C":0,"G":0}

for base in sorted(Frequency_bases.keys()):
        output.write("\t"+str(base))

output.write("\n")

for line in fileinput.input([File_Pileup]):
        rowlist = []
        rowlist = (line.rstrip("\n")).split('\t')
        rowlist[2] = rowlist[2].upper()

        Frequency = {"A":0,"T":0,"C":0,"G":0}

        if line.startswith("#"):
                continue
        else:
                output.write(str(rowlist[0])+"\t"+str(rowlist[1])+"\t"+str(rowlist[2]))
                for i in rowlist[index]:
                        if i == ".":
                                Frequency[rowlist[2]] = Frequency[rowlist[2]] + 1
                                continue
                        if i == ",":
                                Frequency[rowlist[2]] = Frequency[rowlist[2]] + 1
                                continue

                        if i in nucleotides:
                                if i.isupper():
                                        Frequency[i] = Frequency[i]+1

                                else:
                                        Frequency[complement[i]] = Frequency[complement[i]] + 1

        for base in sorted(Frequency.keys()):
                output.write("\t"+str(Frequency[base]))

        output.write("\n")

output.close()

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Ashutosh Pandey12k

The A,T,C,G counts are based on forward strand of the reference genome. In other words, if a read that was aligned on the reverse strand has base "T" for a position, then the frequency of "A" will be incremented as the reverse compliment of that read will render "A" to the forward strand.

ADD REPLYlink written 5.8 years ago by Ashutosh Pandey12k

I think this script will not handle correctly insertions and deletions. E.g. in mpileup line

seq3 200 A 20 ,,,,,..,.-4CACC.-4CACC....,.,,.^~. ==<<<<<<<<<<<::<;2<<

CACC will be counted as bases not as deletion.

Also, the character following ^ in the pileup should be skipped as it reports the mapping quality (which could be A, C, G, T).

ADD REPLYlink modified 3 months ago by RamRS25k • written 5.8 years ago by dariober10k

Thanks! Looks good! I will give this a try too! Did you just write this code?

ADD REPLYlink written 5.8 years ago by komal.rathi3.5k
1

Yup, just wrote it. Please read my other comment to make sure that the code is behaving the way you want.

ADD REPLYlink written 5.8 years ago by Ashutosh Pandey12k

Yes I read it, although its not exactly what I want I can edit it in case I use it. I am currently trying to figure out pysamstats as suggested by dariober and bam-readcount as suggested by Chris Miller. Both the tools take bam file directly without creating an mpileup file.

ADD REPLYlink modified 3 months ago by RamRS25k • written 5.8 years ago by komal.rathi3.5k

Use Rsamtools pileup.

ADD REPLYlink modified 3 months ago by RamRS25k • written 4.9 years ago by inesdesantiago160
1
gravatar for dariober
5.8 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Hi- I used pysamstats briefly and it seems pretty handy.

With this command:

pysamstats \
    --fasta ref.fa \
    --type variation \
    aln.bam

You get a table with the following columns (self explanatory, _pp stands for properly paired):

chrom 
pos
ref
reads_all
reads_pp
matches
matches_pp
mismatches
mismatches_pp
deletions 
deletions_pp
insertions
insertions_pp
A
A_pp
C
C_pp
T
T_pp
G
G_pp
N
N_pp
ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by dariober10k

That looks promising! I'll give it a try and let you know. Thanks!

ADD REPLYlink written 5.8 years ago by komal.rathi3.5k

dariober: I used pysamstats and works very well. But even when using just a single chromosome using --chromosome, --start and --end options, it takes a long time. So I subset my bam file using a bed file which contains my positions of interest:

samtools view -hb sample.bam -L snp_pos.bed > sample_subset.bam

I have no clue why, but pysamstats is giving me an error:

File "/usr/local/bin/pysamstats", line 123, in <module>
window_offset=options.window_offset)
File "pysamstats.pyx", line 842, in pysamstats.write_variation (pysamstats.c:15103)
write_stats(stat_variation, fieldnames, *args, **kwargs)
File "pysamstats.pyx", line 3329, in pysamstats.write_stats (pysamstats.c:44847)
for rec in recs:
File "pysamstats.pyx", line 174, in stat_pileup_withref_default (pysamstats.c:6983)
it = samfile.pileup(reference=chrom, start=start, end=end, truncate=truncate)
File "csamtools.pyx", line 1254, in pysam.csamtools.Samfile.pileup (pysam/csamtools.c:14071)
ValueError: no index available for pileup

When I use the whole bam file/whole bam file with --chromosome, --start and --end it works fine. But when I subset my bam file and use it, it shows me an error. Have you ever dealt with such an error?

ADD REPLYlink modified 3 months ago by RamRS25k • written 5.8 years ago by komal.rathi3.5k
2

As I said, I used pysamstats only once or twice, so I'm just guessing... Looking at the last line of the error ValueError: no index available for pileup I suspect sample_subset.bam should be indexed with samtools index sample_subset.bam.

ADD REPLYlink written 5.8 years ago by dariober10k
1
gravatar for inesdesantiago
4.9 years ago by
United Kingdom
inesdesantiago160 wrote:

You can use pileup and this function in R (with Rsamtools):

pileupFreq <- function(pileupres) {
    nucleotides <- levels(pileupres$nucleotide)
    res <- split(pileupres, pileupres$seqnames)
    res <- lapply(res, function (x) {split(x, x$pos)})
    res <- lapply(res, function (positionsplit) {
        nuctab <- lapply(positionsplit, function(each) {
                        chr = as.character(unique(each$seqnames))
                        pos = as.character(unique(each$pos))
                        tablecounts <- sapply(nucleotides, function (n) {sum(each$count[each$nucleotide == n])})
                        c(chr,pos, tablecounts)
                    })
        nuctab <- data.frame(do.call("rbind", nuctab),stringsAsFactors=F)
        rownames(nuctab) <- NULL
        nuctab
    })
    res <- data.frame(do.call("rbind", res),stringsAsFactors=F)
    rownames(res) <- NULL
    colnames(res) <- c("seqnames","start",levels(pileupres$nucleotide))
    res[3:ncol(res)] <- apply(res[3:ncol(res)], 2, as.numeric)
    res
}

Example:

library(VariantAnnotation)
library(Rsamtools)
vcf <- readVcf("vcfExample_chr17.vcf", "hg19")
bf <- BamFile("Pol2ChIP_chr17.bam")
res <- pileup(bf, scanBamParam=ScanBamParam(which=rowData(vcf))
pileupFreq(res)

 

Also, read more about it here:https://seqqc.wordpress.com/2015/03/10/calculate-nucelotide-frequency-with-rsamtools-pileup/

 

ADD COMMENTlink written 4.9 years ago by inesdesantiago160

inesdesantiago Can you get the pileup string, reference base, insertions and deletions at a particular position using Rsamtools?

ADD REPLYlink modified 8 weeks ago by RamRS25k • written 4.2 years ago by komal.rathi3.5k
0
gravatar for Chris Miller
5.8 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

If the output is inconsistent with what is seen in IGV, be sure that you're taking into account any mpileup params that might cause it to exclude reads (-Q is a likely candidate, which excludes bases with mapping qualities less than 13 by default)

ADD COMMENTlink written 5.8 years ago by Chris Miller21k

I forgot to mention that I used -B to turn off the BAQ computation when using samtools mpileup. I have edited my code to reflect that.

ADD REPLYlink modified 3 months ago by RamRS25k • written 5.8 years ago by komal.rathi3.5k

-Q is still set to 13 by default. Are there a bunch of low-quality bases that show up in IGV, but wouldn't be considered by samtools?

ADD REPLYlink modified 3 months ago by RamRS25k • written 5.8 years ago by Chris Miller21k

I haven't seen any low quality bases showing up in IGV when I was checking it. Although the problem is not only in the counts, it is also in the nucleotide itself. I went through the output of the perl script thorougly, there were positions that were consistent with IGV but there were also positions where sometimes the counts did not agree and sometimes even the nucleotide was different. For e.g. if IGV showed 5 Gs and 2 Ts, my output would show 5 As and 2 Cs and it wasn't even a pattern, it was just random.

I also tried it again by using the -B (turn off BAQ computation) and -Q 0 (skip bases with baseQ/BAQ smaller than 0) but nothing changed. So it has something to do with the perl script.

ADD REPLYlink modified 3 months ago by RamRS25k • written 5.8 years ago by komal.rathi3.5k
0
gravatar for Ashutosh Pandey
5.8 years ago by
Philadelphia
Ashutosh Pandey12k wrote:

I am not sure if this is the case with you but the default settings for mpileup is to filter reads with bitwise flag 0X704. So for pileup generation the following reads will not considered at all from the bam files:

  1. 0x0400 (aka 1024 or "d") duplicate
  2. 0x0200 (aka 512 or "f") failed QC
  3. 0x0100 (aka 256 or "s") non primary alignment
  4. 0x0004 (aka 4 or "u") unmapped

May be IGV doesnt filter out some of these flags. As a result, you may be seeing extra alignments in IGV but not in your count file.

ADD COMMENTlink modified 3 months ago by RamRS25k • written 5.8 years ago by Ashutosh Pandey12k

Actually there was an error in the perl script. I can't reach the author but I have updated the code and it works fine. But I am still looking for alternatives because there are very few tools out there for this purpose.

ADD REPLYlink written 5.8 years ago by komal.rathi3.5k
0
gravatar for liuyutingbio
5.1 years ago by
liuyutingbio0 wrote:

Could you sent me the  pileup2baseindel script that you changed? or tell me how can I find the bug?

ADD COMMENTlink written 5.1 years ago by liuyutingbio0

Here it is:

https://www.dropbox.com/s/a4qkmsb1i59xzrg/pileup2baseindel.pl?dl=0

ADD REPLYlink written 5.1 years ago by komal.rathi3.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1128 users visited in the last hour