Executing a Python script in parallel for multiple files in a directory
20 months ago
osiemen ▴ 30

I am using a Python script provided by the DEXSeq package to count exons. I have to run the same script on 50 BAM files in my directory. Currently I do this with a for loop, iterating over the files one by one, but this step takes too long. Is there an easy way to run the same script in parallel for each file separately, so that I don't have to wait for each file to finish? I know this should be possible in bash, but I don't have any experience with it.

I am currently using the following code, and each file takes around one hour to finish:

#!/bin/bash
#$ -cwd
#$ -o $HOME/exonCount.out
#$ -e $HOME/exonCount.err
#$ -V
#$ -q all.q

for i in $( ls -v /mnt/RNA_seq/*.bam )
do
  x="$(basename "$i" | cut -d'.' -f1)"
  pathToFiles=$i
  # run python code to count exons
  python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
    /mnt/gff/gencodev26_DEXSeq.gff \
    "$pathToFiles" \
    /mnt/dexseq/"${x}"_ExonCount.out
done

P.S. I can run it on more cores.

Thanks in advance!

DEXSeq RNA-Seq bash parallel

I did not realize you were on a cluster. SLURM offers arrays, probably the scheduler you use has something similar. That is probably preferred here.
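For example, a minimal SLURM sketch (untested; the files.list name, the log paths, and the .bam-stripping are assumptions, with the DEXSeq paths reused from your post):

#!/bin/bash
#SBATCH --array=1-50
#SBATCH --output=logs/exonCount_%a.out
#SBATCH --error=logs/exonCount_%a.err

# Pick the N-th BAM path from a pre-built list (one path per line)
bam=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files.list)

python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
    /mnt/gff/gencodev26_DEXSeq.gff \
    "$bam" \
    /mnt/dexseq/$(basename "$bam" .bam)_ExonCount.out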


Yes indeed, but I have not been able to use an SGE array job successfully yet.


Looks like you are using SGE. So the trick here would be to use the for loop to submit an independent SGE job for each BAM file. You should be able to create a qsub command with the necessary parameters to do so. The jobs would start in parallel (to the extent allowed by the resources available to your account; the rest would pend but then complete over time).
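A rough sketch of what that could look like (assuming your site allows binary submission with -b y; paths and queue reused from your script):

for bam in /mnt/RNA_seq/*.bam; do
    sample=$(basename "$bam" | cut -d'.' -f1)
    # One independent SGE job per BAM; jobs run in parallel as slots allow
    qsub -cwd -V -q all.q -N "dexseq_${sample}" \
         -o "$HOME/${sample}_exonCount.out" \
         -e "$HOME/${sample}_exonCount.err" \
         -b y \
         python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
             /mnt/gff/gencodev26_DEXSeq.gff "$bam" /mnt/dexseq/"${sample}"_ExonCount.out
done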

20 months ago

If you want to distribute the jobs across multiple nodes you can use SGE array jobs. Set the number of tasks to the number of files you wish to process, i.e. #$ -t 1-10 for 10 files, and use the task ID as an index to access the BAM file name from a list.

ls /mnt/RNA_seq/*.bam > files.list

Example SGE script.

#!/bin/bash
#$ -N test
#$ -cwd
#$ -t 1-10
#$ -e logs/test.err 
#$ -o logs/test.out

# Get the n-th BAM file name/path
bam=$(awk 'NR==$SGE_TASK_ID' files.list)

To run the jobs on a single node with multiple files processed in parallel, use GNU Parallel as suggested by ATpoint.
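A minimal sketch of the GNU Parallel approach (the -j value is an assumption to match the cores you have; note that {/.} strips only the final extension, which matches your cut -d'.' -f1 only if the file names contain a single dot):

parallel -j 8 "python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
    /mnt/gff/gencodev26_DEXSeq.gff {} /mnt/dexseq/{/.}_ExonCount.out" :::: files.list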


Hi, thanks for the input!

I actually tried the following based on your code, but it doesn't seem to work:

#!/bin/bash
#$ -N test
#$ -cwd
#$ -t 1-3
#$ -e $HOME/test.err
#$ -o $HOME/test.out
#$ -q all.q@bla

# Get the n-th BAM file name/path
BAM=$( awk 'NR==$SGE_TASK_ID' /mnt/home1/project/bamFiles.list )
#run python code to count exons
python3.8 /mnt/DEXSeq/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
/mnt/nochr_gencodev29.gff \
$BAM  \
/mnt/xomics/osmana/dexseq/humandata/countData/RNA.$SGE_TASK_ID

Here it seems like dexseq_count.py doesn't get all the correct parameters, and I guess it is because of $SGE_TASK_ID.

The error: .../python_scripts/dexseq_count.py: Error: Please provide three arguments

Is it because of how I use the awk variable or $BAM? Providing just a file name directly, without $BAM or $SGE_TASK_ID, seems to work just fine.


Can you try the following? Inside single quotes the shell does not expand $SGE_TASK_ID, so awk never matches a line and $BAM ends up empty, which is why dexseq_count.py complains about missing arguments:

BAM=$(awk -v "line=$SGE_TASK_ID" 'NR==line {print $1}' /mnt/home1/project/bamFiles.list)
