Run Samtools on multiple files
5
2
Entering edit mode
4.4 years ago
V ▴ 320

I'm trying to run samtools sort in a folder containing about 380 bam files.

I've tried the following command and nothing happens. No errors, it just returns blank. Can anyone help me with what I've done wrong? Thank you :)

samtools sort -@ 7 *.bam > *.sorted.bam
samtools rnaseq • 11k views
ADD COMMENT
11
Entering edit mode
4.4 years ago
Medhat 9.0k

Try this;

for f in *.bam; do filename="${f%%.*}"; samtools sort -@ 7 $f ${filename}.sorted.bam; done

Or

for f in *.bam; do  samtools sort -@ 7 $f   ${f/.bam/sorted.bam}; done
ADD COMMENT
0
Entering edit mode

Seems to be working, hasn't thrown out an error! thanks alot!!

ADD REPLY
0
Entering edit mode

I modified a bit based on Medhat's script. Seems better. for f in *.bam; do samtools sort -@ 7 $f ${f/bam/sorted}; done

ADD REPLY
8
Entering edit mode
4.4 years ago

Using gnu parallel in its simplest form:

ls *.bam | parallel -j 8 'samtools sort -@ 7  {} > {.}_sorted.bam'

To show progress and expected time of completion

ls *.bam | parallel --progress --eta -j 8 'samtools sort -@ 7  {} > {.}_sorted.bam'
ADD COMMENT
4
Entering edit mode

--eta implies --progress, so --progress is not needed. Also try the newer --bar. If you want one per cpu core, leave out -j8.

ADD REPLY
1
Entering edit mode

Thanks! I am huge fan of parallel, but so many options :o

ADD REPLY
4
Entering edit mode
4.4 years ago

create the following 'Makefile'

SHELL=/bin/bash
BAMS=$(shell ls -1 *.bam)
define run

$$(addsuffix .sorted.bam,$$(basename $(1))): $(1)
    samtools sort $$< $$(basename $$@)
    samtools index $$@
endef

all: $(addsuffix .sorted.bam,$(basename ${BAMS}))

$(eval $(foreach B,${BAMS},$(call run,${B})))

and run it in parallel for you 380 bam using the option -j for the number of parallel operations

make -j 16
ADD COMMENT
2
Entering edit mode

On a separate note, this is really cool, suggestions to learn how to use "make" (tutorial links) would be helpful for all.

ADD REPLY
0
Entering edit mode

May be I am asking silly question, does that "16" means the no. of threads?

ADD REPLY
0
Entering edit mode

does that "16" means the no. of threads?

yes

ADD REPLY
2
Entering edit mode
4.4 years ago
Charles Plessy ★ 2.7k

The * wildcard character can only expand to already existing files (*.bam in your example). The command-line interpreter can not guess your intent and provide contents for *.sorted.bam. Moreover, samtools sort is not designed to run on multiple files in parallel.

However, there are generic tools for doing what you want. See for instance the two following links.

ADD COMMENT
0
Entering edit mode

Hi Charles,

Sorry My bad, I seem to have explained what I want to do wrong. I didn't mean I want to run it on all of them in parallel. I meant I want samtools to sort them out one by one. Usually what I would do would be:

samtools sort -@ 7 File1.bam File1.sorted.bam
samtools sort -@ 7 File2.bam File2.sorted.bam

and then as soon as it finishes with file1 it would go automatically to file2. But having 380 or so files, you can see the issue with having to write all of them out manually :)

ADD REPLY
0
Entering edit mode

Indeed, a for loop better answers to your question. Nevertheless, if you run parallel and limit the number of concurrent jobs to 1 (--jobs 1) you will also be able to run the commands sequentially, basically like in a for loop. And you will have learned a new tool :)

ADD REPLY
0
Entering edit mode
4.4 years ago

try this:

for in in *.bam; do sample_name=`echo $i | awk -F "." '{print $1}'`; samtools sort -@ 7 $i >  ${sample_name}.sorted.bam ; done
ADD COMMENT
1
Entering edit mode

Just a comment, when you are defining output filename, it's always better to use basename in order to remove just the extension and add output format. Something like outFile=$(basename "$BAM" .bam | awk '{print $1".sorted.bam"}').

Why? Because sometimes the filenames include several . which explain a lot about the file. E.g. ICGC_File1.cohortA.rmdup.bam. With your approach you'll get an output which looses some information ICGC_File1.sorted.bam.

ADD REPLY
0
Entering edit mode

+1 Thanks venu for the suggestion, though I am clueless why the loop I suggested does not work for V

ADD REPLY
0
Entering edit mode

Hello,

Thanks. I've tried it but keep getting this error appearing multiple times:

awk: read error (Is a directory)

Usage:   samtools sort [options] <in.bam> <out.prefix>

Options: -n        sort by read name
         -f        use <out.prefix> as full file name instead of prefix
         -o        final output to stdout
         -l INT    compression level, from 0 to 9 [-1]
         -@ INT    number of sorting and compression threads [1]
         -m INT    max memory per thread; suffix K/M/G recognized [768M]
ADD REPLY
0
Entering edit mode

try this: Removed ">" and its for i in not for in in ; i edited

for i in *.bam; do sample_name=`echo $i | awk -F "." '{print $1}'`; samtools sort -@ 7 $i ${sample_name}.sorted; done

I assume you run it for a directory having files:

sample1.bam
sample2.bam
sample3.bam

and so on

ADD REPLY
0
Entering edit mode

Same Error coming up :/

ADD REPLY
0
Entering edit mode

yes, thats correct. They aren't named "sample" they have a serial number but yes all bam

ADD REPLY

Login before adding your answer.

Traffic: 2521 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6