Question: Run Samtools on multiple files
1
gravatar for V
3.6 years ago by
V230
UK/London
V230 wrote:

I'm trying to run samtools sort in a folder containing about 380 bam files.

I've tried the following command and nothing happens. No errors, it just returns blank. Can anyone help me with what I've done wrong? Thank you :)

samtools sort -@ 7 *.bam > *.sorted.bam
samtools rnaseq • 8.2k views
ADD COMMENTlink modified 3.6 years ago by WouterDeCoster45k • written 3.6 years ago by V230
10
gravatar for Medhat
3.6 years ago by
Medhat8.8k
Texas
Medhat8.8k wrote:

Try this;

for f in *.bam; do filename="${f%%.*}"; samtools sort -@ 7 $f ${filename}.sorted.bam; done

Or

for f in *.bam; do  samtools sort -@ 7 $f   ${f/.bam/sorted.bam}; done
ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Medhat8.8k

Seems to be working, hasn't thrown out an error! thanks alot!!

ADD REPLYlink written 3.6 years ago by V230

I modified a bit based on Medhat's script. Seems better. for f in *.bam; do samtools sort -@ 7 $f ${f/bam/sorted}; done

ADD REPLYlink written 2.0 years ago by yolandaly0
5
gravatar for WouterDeCoster
3.6 years ago by
Belgium
WouterDeCoster45k wrote:

Using gnu parallel in its simplest form:

ls *.bam | parallel -j 8 'samtools sort -@ 7  {} > {.}_sorted.bam'

To show progress and expected time of completion

ls *.bam | parallel --progress --eta -j 8 'samtools sort -@ 7  {} > {.}_sorted.bam'
ADD COMMENTlink written 3.6 years ago by WouterDeCoster45k
2

--eta implies --progress, so --progress is not needed. Also try the newer --bar. If you want one per cpu core, leave out -j8.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by ole.tange4.0k
1

Thanks! I am huge fan of parallel, but so many options :o

ADD REPLYlink written 3.6 years ago by WouterDeCoster45k
4
gravatar for Pierre Lindenbaum
3.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum133k wrote:

create the following 'Makefile'

SHELL=/bin/bash
BAMS=$(shell ls -1 *.bam)
define run

$$(addsuffix .sorted.bam,$$(basename $(1))): $(1)
    samtools sort $$< $$(basename $$@)
    samtools index $$@
endef

all: $(addsuffix .sorted.bam,$(basename ${BAMS}))

$(eval $(foreach B,${BAMS},$(call run,${B})))

and run it in parallel for you 380 bam using the option -j for the number of parallel operations

make -j 16
ADD COMMENTlink written 3.6 years ago by Pierre Lindenbaum133k
2

On a separate note, this is really cool, suggestions to learn how to use "make" (tutorial links) would be helpful for all.

ADD REPLYlink written 3.6 years ago by lakhujanivijay5.4k

May be I am asking silly question, does that "16" means the no. of threads?

ADD REPLYlink written 3.6 years ago by lakhujanivijay5.4k

does that "16" means the no. of threads?

yes

ADD REPLYlink written 3.6 years ago by Pierre Lindenbaum133k
2
gravatar for Charles Plessy
3.6 years ago by
Charles Plessy2.7k
Japan
Charles Plessy2.7k wrote:

The * wildcard character can only expand to already existing files (*.bam in your example). The command-line interpreter can not guess your intent and provide contents for *.sorted.bam. Moreover, samtools sort is not designed to run on multiple files in parallel.

However, there are generic tools for doing what you want. See for instance the two following links.

ADD COMMENTlink written 3.6 years ago by Charles Plessy2.7k

Hi Charles,

Sorry My bad, I seem to have explained what I want to do wrong. I didn't mean I want to run it on all of them in parallel. I meant I want samtools to sort them out one by one. Usually what I would do would be:

samtools sort -@ 7 File1.bam File1.sorted.bam
samtools sort -@ 7 File2.bam File2.sorted.bam

and then as soon as it finishes with file1 it would go automatically to file2. But having 380 or so files, you can see the issue with having to write all of them out manually :)

ADD REPLYlink written 3.6 years ago by V230

Indeed, a for loop better answers to your question. Nevertheless, if you run parallel and limit the number of concurrent jobs to 1 (--jobs 1) you will also be able to run the commands sequentially, basically like in a for loop. And you will have learned a new tool :)

ADD REPLYlink written 3.6 years ago by Charles Plessy2.7k
0
gravatar for lakhujanivijay
3.6 years ago by
lakhujanivijay5.4k
India/Ahmedabad
lakhujanivijay5.4k wrote:

try this:

for in in *.bam; do sample_name=`echo $i | awk -F "." '{print $1}'`; samtools sort -@ 7 $i >  ${sample_name}.sorted.bam ; done
ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by lakhujanivijay5.4k
1

Just a comment, when you are defining output filename, it's always better to use basename in order to remove just the extension and add output format. Something like outFile=$(basename "$BAM" .bam | awk '{print $1".sorted.bam"}').

Why? Because sometimes the filenames include several . which explain a lot about the file. E.g. ICGC_File1.cohortA.rmdup.bam. With your approach you'll get an output which looses some information ICGC_File1.sorted.bam.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by venu6.7k

+1 Thanks venu for the suggestion, though I am clueless why the loop I suggested does not work for V

ADD REPLYlink written 3.6 years ago by lakhujanivijay5.4k

Hello,

Thanks. I've tried it but keep getting this error appearing multiple times:

awk: read error (Is a directory)

Usage:   samtools sort [options] <in.bam> <out.prefix>

Options: -n        sort by read name
         -f        use <out.prefix> as full file name instead of prefix
         -o        final output to stdout
         -l INT    compression level, from 0 to 9 [-1]
         -@ INT    number of sorting and compression threads [1]
         -m INT    max memory per thread; suffix K/M/G recognized [768M]
ADD REPLYlink written 3.6 years ago by V230

try this: Removed ">" and its for i in not for in in ; i edited

for i in *.bam; do sample_name=`echo $i | awk -F "." '{print $1}'`; samtools sort -@ 7 $i ${sample_name}.sorted; done

I assume you run it for a directory having files:

sample1.bam
sample2.bam
sample3.bam

and so on

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by lakhujanivijay5.4k

Same Error coming up :/

ADD REPLYlink written 3.6 years ago by V230

yes, thats correct. They aren't named "sample" they have a serial number but yes all bam

ADD REPLYlink written 3.6 years ago by V230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1066 users visited in the last hour
_