Run Samtools on multiple files
5
2
Entering edit mode
4.4 years ago
V ▴ 320

I'm trying to run samtools sort in a folder containing about 380 bam files.

I've tried the following command and nothing happens. No errors, it just returns blank. Can anyone help me with what I've done wrong? Thank you :)

samtools sort -@ 7 *.bam > *.sorted.bam

samtools rnaseq • 11k views
11
Entering edit mode
4.4 years ago
Medhat 9.0k

Try this;

for f in *.bam; do filename="${f%%.*}"; samtools sort -@ 7$f ${filename}.sorted.bam; done  Or for f in *.bam; do samtools sort -@ 7$f   ${f/.bam/sorted.bam}; done  ADD COMMENT 0 Entering edit mode Seems to be working, hasn't thrown out an error! thanks alot!! ADD REPLY 0 Entering edit mode I modified a bit based on Medhat's script. Seems better. for f in *.bam; do samtools sort -@ 7$f ${f/bam/sorted}; done ADD REPLY 8 Entering edit mode 4.4 years ago Using gnu parallel in its simplest form: ls *.bam | parallel -j 8 'samtools sort -@ 7 {} > {.}_sorted.bam'  To show progress and expected time of completion ls *.bam | parallel --progress --eta -j 8 'samtools sort -@ 7 {} > {.}_sorted.bam'  ADD COMMENT 4 Entering edit mode --eta implies --progress, so --progress is not needed. Also try the newer --bar. If you want one per cpu core, leave out -j8. ADD REPLY 1 Entering edit mode Thanks! I am huge fan of parallel, but so many options :o ADD REPLY 4 Entering edit mode 4.4 years ago create the following 'Makefile' SHELL=/bin/bash BAMS=$(shell ls -1 *.bam)
define run

$$(addsuffix .sorted.bam,$$(basename $(1))):$(1)
samtools sort $$<$$(basename $$@) samtools index$$@
endef

all: $(addsuffix .sorted.bam,$(basename ${BAMS}))$(eval $(foreach B,${BAMS},$(call run,${B})))


and run it in parallel for you 380 bam using the option -j for the number of parallel operations

make -j 16

2
Entering edit mode

On a separate note, this is really cool, suggestions to learn how to use "make" (tutorial links) would be helpful for all.

0
Entering edit mode

May be I am asking silly question, does that "16" means the no. of threads?

0
Entering edit mode

does that "16" means the no. of threads?

yes

2
Entering edit mode
4.4 years ago
Charles Plessy ★ 2.7k

The * wildcard character can only expand to already existing files (*.bam in your example). The command-line interpreter can not guess your intent and provide contents for *.sorted.bam. Moreover, samtools sort is not designed to run on multiple files in parallel.

However, there are generic tools for doing what you want. See for instance the two following links.

0
Entering edit mode

Hi Charles,

Sorry My bad, I seem to have explained what I want to do wrong. I didn't mean I want to run it on all of them in parallel. I meant I want samtools to sort them out one by one. Usually what I would do would be:

samtools sort -@ 7 File1.bam File1.sorted.bam
samtools sort -@ 7 File2.bam File2.sorted.bam


and then as soon as it finishes with file1 it would go automatically to file2. But having 380 or so files, you can see the issue with having to write all of them out manually :)

0
Entering edit mode

Indeed, a for loop better answers to your question. Nevertheless, if you run parallel and limit the number of concurrent jobs to 1 (--jobs 1) you will also be able to run the commands sequentially, basically like in a for loop. And you will have learned a new tool :)

0
Entering edit mode
4.4 years ago

try this:

for in in *.bam; do sample_name=echo $i | awk -F "." '{print$1}'; samtools sort -@ 7 $i >${sample_name}.sorted.bam ; done

1
Entering edit mode

Just a comment, when you are defining output filename, it's always better to use basename in order to remove just the extension and add output format. Something like outFile=$(basename "$BAM" .bam | awk '{print $1".sorted.bam"}'). Why? Because sometimes the filenames include several . which explain a lot about the file. E.g. ICGC_File1.cohortA.rmdup.bam. With your approach you'll get an output which looses some information ICGC_File1.sorted.bam. ADD REPLY 0 Entering edit mode +1 Thanks venu for the suggestion, though I am clueless why the loop I suggested does not work for V ADD REPLY 0 Entering edit mode Hello, Thanks. I've tried it but keep getting this error appearing multiple times: awk: read error (Is a directory) Usage: samtools sort [options] <in.bam> <out.prefix> Options: -n sort by read name -f use <out.prefix> as full file name instead of prefix -o final output to stdout -l INT compression level, from 0 to 9 [-1] -@ INT number of sorting and compression threads [1] -m INT max memory per thread; suffix K/M/G recognized [768M]  ADD REPLY 0 Entering edit mode try this: Removed ">" and its for i in not for in in ; i edited for i in *.bam; do sample_name=echo$i | awk -F "." '{print $1}'; samtools sort -@ 7$i \${sample_name}.sorted; done


I assume you run it for a directory having files:

sample1.bam
sample2.bam
sample3.bam


and so on

0
Entering edit mode

Same Error coming up :/

0
Entering edit mode

yes, thats correct. They aren't named "sample" they have a serial number but yes all bam