Coverage In Bam File - Bases And Overall Count
2
6
Entering edit mode
9.6 years ago

This might be a bit duplicate question, but previous posts did not help me.

I have a reference sequence, I map reads on it and get .bam file. Now I would like to get number of reads overlapping every single base in my reference sequence (=coverage per base).

Secondly, I would like to get the number of bases at these positions, like:

pos A C G T
1 A:2 C:3 G:0 T:1
2 A:2 C:3 G:0 T:1
3 A:0 C:0 G:0 T:7
4 ..
..


Thanks a lot!

bam coverage • 17k views
0
Entering edit mode

what did you try ?

0
Entering edit mode

I have found that samtools mpileup -f ref.fasta aln_sorted.bam does the first part for me.

1
Entering edit mode

check my answer here, I think this is what you are looking for.

0
Entering edit mode

I have had problem installing Bio::DB::Sam, so let me write my solution here in case someone is interested; First, recompile samtools with CSFLAG -fPIC, as suggested here: http://cpansearch.perl.org/src/LDS/Bio-SamTools-1.36/README This way you get those libbam.a and bam.h files. Then use just perl -MCPAN -e 'install Bio::DB::Sam' and as a path write way to newly compiled samtools. Enjoy library :-)

0
Entering edit mode

to do the first part, samtools depth is a direct way to get each position's coverage. much faster than processing the output of samtools mpileup. to do the second part, I would also go for samtools mpileup as mentioned here. if both things are needed I would then try to get all the information from a single samtools mpileup output parsing.

6
Entering edit mode
9.6 years ago
Farhat ★ 2.9k

Both the coverage per base and the number of a particular base at each position can be obtained using samtools mpileup in.bam. This results in a file something like this:

Contig18263     110     N       1       T       e
Contig18263     111     N       1       C       c
Contig18263     112     N       1       T       e
Contig18263     113     N       1       T       d
Contig18263     114     N       1       T$f Contig18263 132 N 3 ^$T^*T^$T gfh Contig18263 133 N 5 TTT^$T^\$T       hhhhf
Contig18263     134     N       7       AAAAA^!A^!A     hhhhfeg
Contig18263     135     N       7       TTTTTTT hhhhfgf
Contig18263     136     N       7       GGGGGGG hhhhfgg
Contig18263     137     N       7       TTTTTTT hghhfhh


The 4th column gives you the per base coverage. Processing the 5th column can give you the number of bases at each position with something like (in Python) Not the best code but you get the idea:

for line in in.pileup:
linsplit=line.split()
print linsplit[0], linsplit[1], "A:"+str(linsplit[4].count('A')+linsplit[4].count('a')), "C:"+str(linsplit[4].count('C')+linsplit[4].count('c')), "G:"+str(linsplit[4].count('G')+linsplit[4].count('g')), "T:"+str(linsplit[4].count('T')+linsplit[4].count('t'))

0
Entering edit mode

0
Entering edit mode

One more question:) Now I use samtools mpileup -f reference.fasta -r contig aln.bam >mpileup command. However, results not always start from 1st base (which is desired behaviour in my case). Please, how could I fix this? Right now, I do not see any logic in where it starts reporting.

1
Entering edit mode

It will start reporting from the position where there are alignments. If there are no reads aligned to the first position, it will not report them.

0
Entering edit mode

Well, that what I would expect, but unfortunately there are reads on those first positions (I can see them in .bam file). I guess I will try to contact authors of samtools.

1
Entering edit mode
6.1 years ago
Srihari ▴ 30

I find bam-readcount to be very useful here - it directly works with the bam file (no need for mpileup). Use it like so:

bam-readcount -f reference.fa file.bam


What is also nice is that bam-readcount can take in a file with only the positions/regions you're interested in (eg -l file.bed) and spit out the split of bases/insertions/deletions at each position.