Question: Is there a way to run samtools sort, index, and view on multiple files at once?
gravatar for
3.0 years ago by
ely.j.fish0 wrote:

I have a large number of .bam files, all sorted and indexed using samtools, so that the folder contains files in .bam, .sorted.bam, and .sorted.bam.bai formats. I need to extract some data about number of reads, both total and mitochondrial. Basically, I need to run the following commands:

samtools view -c XYZ.bam

samtools view -c XYZ.bam chrM

samtools view -c -q 30 XYZ.bam

samtools view -c -q 30 XYZ.bam chrM

Where chrM is the mitochondrial chromosome. My question is: is there a script that will allow me to run these commands on all the files at once, and export the resulting read numbers into a libreoffice spreadsheet? Also, is there a way to sort and index a lot of bam files with a single command? I'm totally new to programming, so I apologize if these questions are either insanely easy or ridiculously complicated. I would just really prefer not to have to run these commands and copy-paste 300-odd times, which is what I've been doing thus far.

ADD COMMENTlink modified 3.0 years ago by Zev.Kronenberg11k • written 3.0 years ago by ely.j.fish0
gravatar for John
3.0 years ago by
John12k wrote:

To answer your question directly, you can put your commands in a shell script which loops over the files like so:

for f in /path/to/bams/*.bam
 echo "Processing $f"
 samtools view -c $f 
 samtools view -c $f chrM
 samtools view -c -q 30 $f 
 samtools view -c -q 30 $f chrM

However, this method has 4 downsides:

  1. You're not working in parallel - you'll only use 1 cpu to do all the processing. This can be fixed with a more complicated shell script though.
  2. You read over every file 4 times to calculate what could be gained in a single pass, so it's going to be 4x slower than ideal.
  3. The results you'll get out won't be formatted nicely for a libreoffice table.
  4. If there's another BAM file on your hard drive tomorrow, you'll have to copy/paste some stuff, which is always something to best avoid if you can.

Alternatively you can use a script i wrote called SeQC, but you'll have to use/learn a bit of SQL to get the results you want in a nice table. First download it from here:

Then run the command: python /path/to/ --analysis MAPQ RNAME --input /path/to/bam1.bam /path/to/bam2.bam /path/to/all/bams/*.bam --output /path/to/output/stats.sql

You can also pop in a --cpu X in the above if you like to run over however many CPUs you have on your system. This will calculate the reads per chromosome per MAPQ, so will look something like:

MAPQ        RNAME       counts
----------  ----------  ----------
3           chr16       704290
1           chr4        134562
255         chr9        505645
255         chr19       347598
1           chr17       188383
3           chr1        239878

Then you can sum the reads for chrM, or sum the reads for chrM where the MAPQ > 30, or whatever you want, and it will be near-instantaneous after the sql table has been made. If you decide to go down that route, i'll help you with the SQL.

ADD COMMENTlink written 3.0 years ago by John12k
gravatar for Zev.Kronenberg
3.0 years ago by
United States
Zev.Kronenberg11k wrote:


ADD COMMENTlink written 3.0 years ago by Zev.Kronenberg11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 998 users visited in the last hour