Calculate The Frequency Of Nucleotides At Each Position In An Mpileup File
8
9
Entering edit mode
7.9 years ago
komal.rathi ★ 3.9k

Hi everyone,

I have a number of Human Heart RNASeq samples. I have generated a mpileup file on each of them using a bed file containing a list of positions for which I want the mpileup to be made. I have included the -B option to turn off the BAQ computation in mpileup.

Single sample mpileup:

samtools mpileup -uf hg19.fa -B -l snps_chr17.bed sample1.bam > sample1.mpileup


Multiple sample mpileup (all samples):

samtools mpileup -uf hg19.fa -B -l snps_chr17.bed -b bam_files_list.bam > all_samples.mpileup


Now, I want to output the frequency of nucleotides A, T, G and C at each of those positions. There is a perl script pileup2baseindel which generates exactly the output that I want, only that it is incorrect (inconsistent with what is seen in IGV). This program takes a single/multi-sample mpileup file and produces the following output. But like I said, the output is incorrect.

I have searched a lot but can't seem to find any program that does the same. Does anyone know any tool/script that would give me such an output with/without taking the mpileup file as an input. Suggestions on alternatives/options are welcomed.

[UPDATE] There was a bug in the perl script and I have updated the code, it works fine now. I have contacted the author and as soon as the author responds, I will send it to him so that he can update it. I would still like to get suggestions and try out other programs.

[UPDATE] Thanks everyone who helped me with this problem. Each of your suggestions worked but I accepted Chris Miller's answer recommending the use of bam-readcount because of its easy of use and the elaborate output that it generates. It directly takes, as input, a bam file as well as a list of positions in bed format for which you want the nucleotide frequency. So you don't need to create an intermediate mpileup file - as in pileup2baseindel or run the program against positions that you are not interested in - as in pysamstats.

Thanks!

mpileup • 26k views
4
Entering edit mode
7.9 years ago

Another option is just to run bam-readcount, giving it the bams and specified positions. It's more straightforward than creating an mpileup and parsing that ugly string.

0
Entering edit mode

Thanks! I'll give it a try and let you know!

0
Entering edit mode

I'm using Bam-readcount, and I notice that the Depth calculated for each position does not correspond at all to the coverage that I observe on IGV. It is much lower. Could someone tell me where does the error? thank you

0
Entering edit mode

Hi Saad.chadi, did you resolve the problem? I'm trying to run Bam-readcount, but i have the same problem. If you have any solution, please let me know. Thank in you in advance for your time. Regards

0
Entering edit mode

I was able to run bam-readcount, but in addition to counts of A, C, G, T, N, got lines like these with additional count at the end. What do these fields mean? Thanks!

chr6    1135939 T   1311    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:536:59.97:37.46:59.97:268:268:0.44:0.01:50.05:268:0.51:115.43:0.34    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  T:775:60.00:37.71:60.00:387:388:0.43:0.00:1.53:387:0.56:106.57:0.38 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +A:2:60.00:0.00:60.00:1:1:0.36:0.01:0.00:1:0.79:72.00:0.49
chr2    239563579   G   1528    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:685:60.00:37.66:60.00:303:382:0.47:0.02:75.21:303:0.75:104.37:0.52    G:841:60.00:37.77:60.00:386:455:0.46:0.00:1.38:386:0.75:100.58:0.51 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  -G:2:60.00:0.00:60.00:1:1:0.38:0.04:62.00:1:0.79:80.00:0.49
chr12   6945914 C   2365    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:1135:60.00:37.51:60.00:733:402:0.51:0.00:1.75:733:0.66:111.12:0.58    G:1228:60.00:37.55:60.00:786:442:0.56:0.01:38.74:786:0.65:107.84:0.57   T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  -C:2:60.00:0.00:60.00:1:1:0.41:0.02:0.00:1:0.17:54.00:0.48

0
Entering edit mode

They are insertions and deletions, IIRC. Easy enough to pull up your reads in IGV and verify that.

4
Entering edit mode
7.8 years ago

I wrote a tool to do exactly this; the function is called pileup2acgt, is written in python2.7 (compatible with the much faster pypy-2*). If you want to give it a try it works with no dependencies a part python2.7 core libraries.

The tool is part of Sequenza, a program to detect copy number and variants from tumor DNA-seq. The script, sequenza-utils.py, is downloadable from here, or from CRAN, at the sequenza page and following the instruction on the manual.

The general usage is available on an (already outdated) wiki.

I usually generate the pileup on the fly with samtools, this also allows to select a specific region of the BAM/SAM using samtools view (here with a slightly unnecessary samtools pipe querying TP53 site):

samtools view -f 2 -u my.bam chr17:7570000–7590000 | samtools rmdup - - | samools mpileup -f hg19.fa -q 20 -  |  sequenza-utils.py pileup2acgt - | gzip > result.txt.gz


a quick usage tip, for your purpose:

cat s.mpileup
1    95369073    G    254    C$.$.$c$,$,$CCccccCCC.CCC..,cccccCc,CCC,c,..,c.c.cCCc.,,c.Cc,Cc,CCc,ccC.C,c,c,,ccCCcCCC,,cc,CCc,c.Ccc,,CC,C,c,,CCCCc.c,,.c,,,cC.Cc,cc,cc.c,.C,ccc,C,C,,cCCCcc.,,,c,cc,cccc,cccCcc,,cc,cc,,,cCcccccc.ccc,ccccc.,c,cc,,,cCccc.ccc,c,c,,c.ccCc,,,c,ccc,c,cccc,,,cc,,c,c BA?GHGB-*,**8:9<><=CBG88888>8H=8@H8HD<H8A8B8<;8#HH8B?8H;8A:=8H88:>9H8F9HH88?8999?HH98H?=<H;F<::GH:=H<H8HH:?<:8B8HHB8HCH8>H88H8>G:8F8H@8H888H9H8=H8;:888EHHG:H:8F9:88H:88888HG:8H88HHH98<:;8;8H88<H;;8;8HH8F9:DHE88889B>68H<E<HG;E:88:HFH8E?>8H;H><9;FHH:>?@:9- python2.7 sequenza-utils.py pileup2acgt s.mpileup chr n_base ref_base read.depth A C G T strand 1 95369073 G 254 0 152 95 0 0:47:22:0  It also check the quality of the bases, for example, if you want to lower the default threshold of quality (20): python sequenza/exec/sequenza-utils.py pileup2acgt -q 10 s.mpileup chr n_base ref_base read.depth A C G T strand 1 95369073 G 254 0 155 95 0 0:48:22:0  or anyway python sequenza/exec/sequenza-utils.py pileup2acgt -h  will describe the available optional arguments. It is reasonably fast as it is, it is also implemented with the multiprocessing library, so if it's needed it could use more processes to speed up a little bit. Used with pypy is faster than everything else that aims for a similar output, as far as I know. the strand field, counts the reads on the forward strand for the specific base (A:C:G:T, respectively.) The read depth is left unchanged from the mpileup, while the sum of ACGT is the number of reads above the quality filter (so it's <=read.depth). It also handle lines with indels (although it passes on them without store them anywhere), it accepts .gz files or input from STDIN (using "-" as file name). I actually use it to roughty call variants, and it's consistent with IGV. Hope you find it useful. ADD COMMENT 0 Entering edit mode Hi favero! I calculated each base as your methods as follwing: samtools view -Duf my.bam | samtools rmdup - - | samools mpileup -f genome.fa -q 20 - | sequenza-utils.py pileup2acgt - | gzip > result.txt.gz  But I got some odd results and couldn't figure out why? n_base ref_base read.depth A C G T strand 16768 A 257 257 0 0 0 1:0:0:0 16769 C 253 0 252 0 0 0:1:0:0 16770 G 231 0 1 230 0 0:0:0:0 16771 G 219 0 0 219 0 0:0:0:0 16772 G 210 0 0 209 0 0:0:0:0 16773 A 179 76 0 0 0 0:0:0:0 16774 A 97 50 0 0 0 0:0:0:0 16775 C 57 0 38 0 0 0:0:0:0  as the results showed, the sum of the ACGT count on each base is not equal to the read.depth and the strand is 0:0:0:0. e.g. pos 16775, the sum of ACGT is 38, is less than the read.depth=57 and the strand is also 0:0:0:0 can you figure out why? thanks ADD REPLY 0 Entering edit mode Hi Lilian, this is awkward, I've just seen your message, almost one year after. The depth and the sum of the bases differs because the depth is the read depth measured by samtools mpileup, while the bases recorded by the python script are only those passing the default quality score (which is adjustable using the -q argument). My apologies for the late reply. ADD REPLY 0 Entering edit mode Hi Favero, I'm having the same problem than Lilian. I modified the -q score, but it doesn't work. What do you think about it? Thank you in advance for your time. Regards ADD REPLY 0 Entering edit mode Hi Lilian, I'm having the same problem. I modified the -q score, but it doesn't work. How do you resolve this? Thank you in advance for your help. Regards ADD REPLY 0 Entering edit mode You can check with IGV if in the specific positions are present indels or other things that can justify the mismatch from depth and nucleotide counts. ADD REPLY 2 Entering edit mode 7.9 years ago Hi- I used pysamstats briefly and it seems pretty handy. With this command: pysamstats \ --fasta ref.fa \ --type variation \ aln.bam  You get a table with the following columns (self explanatory, _pp stands for properly paired): chrom pos ref reads_all reads_pp matches matches_pp mismatches mismatches_pp deletions deletions_pp insertions insertions_pp A A_pp C C_pp T T_pp G G_pp N N_pp  ADD COMMENT 0 Entering edit mode That looks promising! I'll give it a try and let you know. Thanks! ADD REPLY 0 Entering edit mode dariober: I used pysamstats and works very well. But even when using just a single chromosome using --chromosome, --start and --end options, it takes a long time. So I subset my bam file using a bed file which contains my positions of interest: samtools view -hb sample.bam -L snp_pos.bed > sample_subset.bam  I have no clue why, but pysamstats is giving me an error: File "/usr/local/bin/pysamstats", line 123, in <module> window_offset=options.window_offset) File "pysamstats.pyx", line 842, in pysamstats.write_variation (pysamstats.c:15103) write_stats(stat_variation, fieldnames, *args, **kwargs) File "pysamstats.pyx", line 3329, in pysamstats.write_stats (pysamstats.c:44847) for rec in recs: File "pysamstats.pyx", line 174, in stat_pileup_withref_default (pysamstats.c:6983) it = samfile.pileup(reference=chrom, start=start, end=end, truncate=truncate) File "csamtools.pyx", line 1254, in pysam.csamtools.Samfile.pileup (pysam/csamtools.c:14071) ValueError: no index available for pileup  When I use the whole bam file/whole bam file with --chromosome, --start and --end it works fine. But when I subset my bam file and use it, it shows me an error. Have you ever dealt with such an error? ADD REPLY 2 Entering edit mode As I said, I used pysamstats only once or twice, so I'm just guessing... Looking at the last line of the error ValueError: no index available for pileup I suspect sample_subset.bam should be indexed with samtools index sample_subset.bam. ADD REPLY 2 Entering edit mode 7.9 years ago  import sys,re,fileinput Argument = [] Argument = sys.argv[1:] if (len(Argument)) < 3: print "Usage: Input_pileup_file Column_with_information_about_read_bases(usually_column_9_in_pileup) Output_file" sys.exit() File_Pileup = Argument[0] index = int(Argument[1])-1 output = open(str(Argument[2]),"w") nucleotides = ["A","T","C","G","a","t","c","g"] complement = {'a':'T','g':'C','t':'A','c':'G'} output.write("Chromosome\tCoordinate\tReference_base") Frequency_bases = {"A":0,"T":0,"C":0,"G":0} for base in sorted(Frequency_bases.keys()): output.write("\t"+str(base)) output.write("\n") for line in fileinput.input([File_Pileup]): rowlist = [] rowlist = (line.rstrip("\n")).split('\t') rowlist[2] = rowlist[2].upper() Frequency = {"A":0,"T":0,"C":0,"G":0} if line.startswith("#"): continue else: output.write(str(rowlist[0])+"\t"+str(rowlist[1])+"\t"+str(rowlist[2])) for i in rowlist[index]: if i == ".": Frequency[rowlist[2]] = Frequency[rowlist[2]] + 1 continue if i == ",": Frequency[rowlist[2]] = Frequency[rowlist[2]] + 1 continue if i in nucleotides: if i.isupper(): Frequency[i] = Frequency[i]+1 else: Frequency[complement[i]] = Frequency[complement[i]] + 1 for base in sorted(Frequency.keys()): output.write("\t"+str(Frequency[base])) output.write("\n") output.close()  ADD COMMENT 0 Entering edit mode The A,T,C,G counts are based on forward strand of the reference genome. In other words, if a read that was aligned on the reverse strand has base "T" for a position, then the frequency of "A" will be incremented as the reverse compliment of that read will render "A" to the forward strand. ADD REPLY 0 Entering edit mode I think this script will not handle correctly insertions and deletions. E.g. in mpileup line seq3 200 A 20 ,,,,,..,.-4CACC.-4CACC....,.,,.^~. ==<<<<<<<<<<<::<;2<<  CACC will be counted as bases not as deletion. Also, the character following ^ in the pileup should be skipped as it reports the mapping quality (which could be A, C, G, T). ADD REPLY 0 Entering edit mode Thanks! Looks good! I will give this a try too! Did you just write this code? ADD REPLY 1 Entering edit mode Yup, just wrote it. Please read my other comment to make sure that the code is behaving the way you want. ADD REPLY 0 Entering edit mode Yes I read it, although its not exactly what I want I can edit it in case I use it. I am currently trying to figure out pysamstats as suggested by dariober and bam-readcount as suggested by Chris Miller. Both the tools take bam file directly without creating an mpileup file. ADD REPLY 0 Entering edit mode ADD REPLY 1 Entering edit mode 6.9 years ago You can use pileup and this function in R (with Rsamtools): pileupFreq <- function(pileupres) { nucleotides <- levels(pileupresnucleotide)
res <- split(pileupres, pileupres$seqnames) res <- lapply(res, function (x) {split(x, x$pos)})
res <- lapply(res, function (positionsplit) {
nuctab <- lapply(positionsplit, function(each) {
chr = as.character(unique(each$seqnames)) pos = as.character(unique(each$pos))
tablecounts <- sapply(nucleotides, function (n) {sum(each$count[each$nucleotide == n])})
c(chr,pos, tablecounts)
})
nuctab <- data.frame(do.call("rbind", nuctab),stringsAsFactors=F)
rownames(nuctab) <- NULL
nuctab
})
res <- data.frame(do.call("rbind", res),stringsAsFactors=F)
rownames(res) <- NULL
colnames(res) <- c("seqnames","start",levels(pileupres\$nucleotide))
res[3:ncol(res)] <- apply(res[3:ncol(res)], 2, as.numeric)
res
}

Example:

library(VariantAnnotation)
library(Rsamtools)
bf <- BamFile("Pol2ChIP_chr17.bam")
res <- pileup(bf, scanBamParam=ScanBamParam(which=rowData(vcf))
pileupFreq(res)

0
Entering edit mode

inesdesantiago Can you get the pileup string, reference base, insertions and deletions at a particular position using Rsamtools?

0
Entering edit mode
7.9 years ago

If the output is inconsistent with what is seen in IGV, be sure that you're taking into account any mpileup params that might cause it to exclude reads (-Q is a likely candidate, which excludes bases with mapping qualities less than 13 by default)

0
Entering edit mode

I forgot to mention that I used -B to turn off the BAQ computation when using samtools mpileup. I have edited my code to reflect that.

0
Entering edit mode

-Q is still set to 13 by default. Are there a bunch of low-quality bases that show up in IGV, but wouldn't be considered by samtools?

0
Entering edit mode

I haven't seen any low quality bases showing up in IGV when I was checking it. Although the problem is not only in the counts, it is also in the nucleotide itself. I went through the output of the perl script thorougly, there were positions that were consistent with IGV but there were also positions where sometimes the counts did not agree and sometimes even the nucleotide was different. For e.g. if IGV showed 5 Gs and 2 Ts, my output would show 5 As and 2 Cs and it wasn't even a pattern, it was just random.

I also tried it again by using the -B (turn off BAQ computation) and -Q 0 (skip bases with baseQ/BAQ smaller than 0) but nothing changed. So it has something to do with the perl script.

0
Entering edit mode
7.9 years ago

I am not sure if this is the case with you but the default settings for mpileup is to filter reads with bitwise flag 0X704. So for pileup generation the following reads will not considered at all from the bam files:

1. 0x0400 (aka 1024 or "d") duplicate
2. 0x0200 (aka 512 or "f") failed QC
3. 0x0100 (aka 256 or "s") non primary alignment
4. 0x0004 (aka 4 or "u") unmapped

May be IGV doesnt filter out some of these flags. As a result, you may be seeing extra alignments in IGV but not in your count file.

0
Entering edit mode

Actually there was an error in the perl script. I can't reach the author but I have updated the code and it works fine. But I am still looking for alternatives because there are very few tools out there for this purpose.

0
Entering edit mode
7.1 years ago

Could you sent me the  pileup2baseindel script that you changed? or tell me how can I find the bug?

0
Entering edit mode
0
Entering edit mode

Hello Komal, could you send the pileup2baseindel script that you changed again? the link posted here doesn't work. appreciate your help!