Question: How To Get Number Of Reads In Bam File Efficiently In Python?
User 9996 wrote, 9.2 years ago:

How can I get the total number of reads in a BAM file, using pysam (or something else Pythonic)? I want to avoid iterating through the file and counting each read, since that takes a very long time and is inefficient.

Thank you.


I am not sure about doing this in Python, but I guess you can get the number of reads in a BAM file using FastQC.

(Arindam Ghosh, 12 months ago)
Jts wrote, 9.2 years ago:

I don't think you can get the number of reads from pysam without iterating through the file. See this post to the samtools mailing list about doing the same with the Java API.

I would make an external call to samtools flagstat as suggested.
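For instance, a rough sketch of such an external call from Python (assuming samtools is on the PATH; note that flagstat's total counts alignment records, and the exact output format varies a little between samtools versions):

    import subprocess

    # flagstat's first line typically looks like "123456 + 0 in total ...";
    # grab the first field as the total record count
    out = subprocess.check_output(['samtools', 'flagstat', 'file.bam']).decode()
    total = int(out.splitlines()[0].split()[0])
    print(total)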


Perhaps this is obvious, but if you're going to be getting these read counts many times, it makes sense to save the counts so that you don't have to iterate through multiple times. This can be as simple as counting the reads once and dumping those counts to a file in the same directory as the BAM files.

(Chris Miller, 9.2 years ago)
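A minimal sketch of that caching idea (the .count sidecar filename is made up; the first call pays the full iteration cost once, and later calls just read the saved number):

    import os
    import pysam

    def cached_read_count(bam_path):
        cache = bam_path + '.count'  # hypothetical sidecar file next to the BAM
        if os.path.exists(cache):
            with open(cache) as fh:
                return int(fh.read())
        # count every record once by streaming through the file...
        with pysam.AlignmentFile(bam_path, 'rb') as bam:
            n = sum(1 for _ in bam)
        # ...then dump the count so subsequent calls are instant
        with open(cache, 'w') as fh:
            fh.write(str(n))
        return n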

I would use samtools idxstats as it's the fastest solution...

(Leszek, 8.1 years ago)
Marvin wrote, 8.1 years ago:

Straightforward, without iterating through the whole file:

samtools idxstats file.bam

Naturally, that requires the file to be indexed. I don't know about a Python API to do that, but a pipe should work.
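For example, the pipe from Python could look like this (a sketch, assuming samtools is on the PATH and that file.bam.bai exists):

    import subprocess

    # idxstats prints one tab-separated line per reference:
    # name, length, # mapped reads, # unmapped reads
    out = subprocess.check_output(['samtools', 'idxstats', 'file.bam']).decode()
    total = sum(int(field) for line in out.splitlines()
                for field in line.split('\t')[2:4])
    print(total)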


This is the preferred solution when the BAM is coordinate sorted.

(lh3, 8.1 years ago)

+1, this is the fastest solution when your BAM is sorted and indexed.

(Leszek, 8.1 years ago)

And aggregate to obtain mapped + unmapped reads:

samtools idxstats file.bam | awk -F '\t' '{s+=$3+$4}END{print s}'
(pauca, 2.9 years ago)

I think different versions of idxstats produce different outputs.

(John, 2.9 years ago)
Chris Penkett (Cambridge, UK) wrote, 7.3 years ago:

Short Python/pysam answer:

import pysam

filename = 'test.bam'
# idxstats columns per reference: name, length, # mapped, # unmapped;
# sum the last two over every line (newer pysam returns one string, so split it)
print(sum(int(field) for line in pysam.idxstats(filename).splitlines()
          for field in line.split('\t')[2:4]))

Cool! That's a fast way for pysam.

If you want only mapped reads, you can change the final print line to:

print(sum(int(line.split('\t')[2]) for line in pysam.idxstats(filename).splitlines()))

and for the number of unmapped reads:

print(sum(int(line.split('\t')[3]) for line in pysam.idxstats(filename).splitlines()))
(Pengfei Yu, 5.6 years ago)
Marcelo wrote, 8.1 years ago:

The straightforward way:

samtools view -c file.bam

Unfortunately, that prints the number of alignments in the file, which can be (and usually is) distinct from the number of reads (because each read can align more than once). 

(Isaac Joseph, 4.2 years ago)
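If you want each read counted once, a hedged workaround is to skip secondary and supplementary records when counting, e.g. from Python (file.bam is a placeholder; anything beyond a read's primary record carries the 0x100 or 0x800 flag):

    import subprocess

    # 2304 = 0x100 (secondary) + 0x800 (supplementary); -F excludes those
    # records, so every read is counted exactly once (mapped or not)
    count = int(subprocess.check_output(
        ['samtools', 'view', '-c', '-F', '2304', 'file.bam']))
    print(count)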
Ying wrote, 9.2 years ago:

Multiple different ways (including the flagstat suggestion) are discussed here.

Matt Shirley (Cambridge, MA) wrote, 5.4 years ago:

I like Chris's answer, but if you don't have pysam, a function that parses the BAM index (.bai) directly will do the same thing, and it takes the same amount of time regardless of the number of reads in your file, as long as it's indexed.
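A rough sketch of that idea using only the standard library (a reconstruction, not necessarily the original function; it assumes a standard .bai index, in which the SAM spec stores per-reference read counts in the magic pseudo-bin 37450):

    import struct

    def count_bam_reads(bai_path):
        # Read mapped/unmapped totals straight out of a .bai index file.
        with open(bai_path, 'rb') as f:
            if f.read(4) != b'BAI\x01':
                raise ValueError('not a BAI index')
            n_ref, = struct.unpack('<i', f.read(4))
            mapped = unmapped = 0
            for _ in range(n_ref):
                n_bin, = struct.unpack('<i', f.read(4))
                for _ in range(n_bin):
                    bin_id, n_chunk = struct.unpack('<Ii', f.read(8))
                    if bin_id == 37450:
                        # pseudo-bin: second "chunk" holds n_mapped / n_unmapped
                        _, _, n_mapped, n_unmapped = struct.unpack('<4Q', f.read(32))
                        mapped += n_mapped
                        unmapped += n_unmapped
                    else:
                        f.seek(16 * n_chunk, 1)  # skip the real chunks
                n_intv, = struct.unpack('<i', f.read(4))
                f.seek(8 * n_intv, 1)            # skip the linear index
            tail = f.read(8)                     # optional n_no_coor at end of file
            n_no_coor = struct.unpack('<Q', tail)[0] if len(tail) == 8 else 0
        return mapped + unmapped + n_no_coor

    # e.g. count_bam_reads('test.bam.bai')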

John (Germany) wrote, 4.7 years ago:

There is unfortunately no efficient way to read a whole file without reading the whole file - there are, however, shortcuts with a time/accuracy trade-off. For the SeQC project, we calculate a read count estimate with:

import os
import subprocess

def bamCheck(fileName):
    # Sniff the first ~20KB for NUL bytes: BAM (BGZF) is binary, SAM is plain text.
    xamfile = open(fileName, 'rb')
    try:
        for i in range(10):
            section = xamfile.read(2048)
            if b'\0' in section: return True
            if len(section) < 2048: break
    finally:
        xamfile.close()
    return False

def guessReads(fileName, samtools):
    DEVNULL = open(os.devnull, 'wb')
    bytesUsed = 10000000  # sample size; the bigger it is, the more accurate the guess, but the slower the estimate
    sizeInBytes = os.path.getsize(fileName)
    if bamCheck(fileName):
        if bytesUsed * 2 < sizeInBytes:
            # Count the reads in the first bytesUsed bytes, then scale up by file size.
            readsInSample = int(subprocess.check_output(
                'head -c ' + str(bytesUsed) + ' "' + str(fileName) + '" | "' + samtools + '" view - | wc -l',
                stderr=DEVNULL, shell=True))
            estimatedReads = (readsInSample / float(bytesUsed)) * sizeInBytes
            # Because of the BAM header (or worse, the SAM header) we typically
            # underestimate the number of reads by 8 to 12 percent, so adding an
            # extra 10% gets the figure closer to accurate.
            estimatedReads = estimatedReads * 1.1
        else:
            # If we could read the whole file in twice the time it takes to sample
            # it, we might as well just read the whole file!
            estimatedReads = int(subprocess.check_output(
                '"' + samtools + '" view -c "' + str(fileName) + '"',
                stderr=DEVNULL, shell=True))
    else:
        # SAM: measure the byte size of 10000 lines (skipping the header via
        # head/tail), then scale by the total file size.
        bytesIn10000Lines = int(subprocess.check_output(
            'head -11000 "' + str(fileName) + '" | tail -10000 | wc -c',
            stderr=DEVNULL, shell=True))
        estimatedReads = (sizeInBytes / float(bytesIn10000Lines)) * 10000
        # Since the header was skipped, this value is already much more accurate,
        # so only add about 5% to keep the estimate conservative/unreachable ;)
        estimatedReads = int(estimatedReads * 1.05)
    return int(estimatedReads)

>>> guessReads('44_Mm01_WEAd_C23_H3K27me3_F.bam','/package/samtools/samtools')
237033936

Hope that helps someone - I originally wrote it to estimate read counts for use in a progress bar, so I actually much preferred a slight overestimate to an underestimate (which would cause "101% of file read..."), but you can tweak all of that pretty easily.

The only other possible answer to the OP, with regards to "efficient without reading the whole file", would be to get the total read count as a byproduct of reading the whole file during some other process. People have given the example of using the index (which is created by reading the whole file) as a source for getting the total read count for little extra work - but actually any process which reads the whole file (say, MD5'ing it) could also be used, with the chunked bytes also piped to samtools/pysam and the line count incremented as you go.
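A sketch of that byproduct idea (the function name is hypothetical; it feeds the same bytes it hashes into samtools view -c -, and relies on samtools auto-detecting BAM on stdin and printing the count only after stdin closes):

    import hashlib
    import subprocess

    def md5_and_read_count(bam_path, samtools='samtools'):
        md5 = hashlib.md5()
        proc = subprocess.Popen([samtools, 'view', '-c', '-'],
                                stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        with open(bam_path, 'rb') as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b''):
                md5.update(chunk)        # hash the chunk...
                proc.stdin.write(chunk)  # ...and count its records in parallel
        proc.stdin.close()
        count = int(proc.stdout.read())
        proc.wait()
        return md5.hexdigest(), count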

Since you're IO bound anyway, using pysam in such a situation is, strictly speaking, even more efficient than using the index ;) hahah
