Question

How to correctly parallelise RSeQC scripts with GNU parallel?

1

Entering edit mode

5.6 years ago

Tain Luquez ▴ 10

Dear all,

I have a .bam, as output from STAR aligner, from which I need to extract some info using RSeQC while using all the computational resources available to increase speed. To accomplish that I am using GNU parallel. Currently, I am doing it like this:

Split the main .bam per chromosome, generating files with the suffix Aligned.sortedByCoord.out.REF
```
bamtools split -in Aligned.sortedByCoord.out.bam  -reference
```

Calculate the clipping profile

ls Aligned.sortedByCoord.out.REF_* | parallel --joblog clipping_profile.log --eta --bar --max-procs 20 --jobs 20 'clipping_profile.py -i {} -s "SE" -o RSeQC.clipping.profile/clipping_profile.{.}'

This will generate a .r, .pdfand a .xlxs for each chromosome, thus ending up with a lot of files. I am forced to split the .bam per chromosome as I am not able to find a --recstart nor a --recend in the compressed .bam.

Question 1

Is there a better way to split a .bam file using GNU parallel and RSeQC' programs (e.g. split_bam.py), such that one gets the outputs from a single input file?

Inspired by ole.tange's thread, I have also tried this, though unsuccessfully:

cat star_outputAligned.sortedByCoord.out.bam | parallel --block 1G -k --joblog splitbam3.log --max-procs 20 --jobs 20 --pipe 'split_bam.py -i {} -r ../../My_genomes/GCA.GRCh38/GRCh38_wgEncodeGencodeAttrsV28_all_rRNA.bed12 -o splitbam3'

Question 2

If using the split option in the first step, is there a way to merge all output to get the results of the whole file (as if never had been split)?

RNA-Seq RSeQC GNU parallel split file • 2.1k views

ADD COMMENT • link updated 5.6 years ago by ATpoint 81k • written 5.6 years ago by Tain Luquez ▴ 10

0

Entering edit mode

such that one gets the outputs from a single input file?

I assume you are referring to getting a single output file set. You can make your own judgement by looking at this but it would not be a good thing to do since you are looking to generate specific stats on individual chromosomes. You may end up with an interleaved file (assuming the writing to single file works).

It would be safe to merge the resulting files after all processing is done. The process is described here for Excel files (and there are other solutions you can find by searching).

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

Entering edit mode

Hello @genomax

I assume you are referring to getting a single output file set.

Yes. The output should be as if I had run, but faster:

clipping_profile.py -i  Aligned.sortedByCoord.out.bam -s "SE" -o RSeQC.clipping.profile/clipping_profile

You may end up with an interleaved file (assuming the writing to single file works).

Right, then no merging while writing the output file.

The process is described here for Excel files

This will be a nice solution but, probably, the time I saved by using parallel will be spent in the merging step. Is there a way to use, for instance, parallel's blocks to do it in fewer steps?

Thanks for replying.

ADD REPLY • link 5.6 years ago by Tain Luquez ▴ 10

score 1 · Answer 1 · 2018-09-17

1

Entering edit mode

5.6 years ago

ATpoint 81k

I cannot comment on the RseQC, but splitting a BAM into its chromosomes using parallel could be done like this. It makes use of the samtools syntax samtools view options BAM <chromosome>

samtools idxstats in.bam | cut -f1 | grep -v '\*' | parallel "samtools view -b -o out_{}.Aligned.sortedByCoord.out.bam in.bam {}"

ADD COMMENT • link 5.6 years ago by ATpoint 81k

0

Entering edit mode

Actually this is a nice solution. Usually, I have to wait until bamtools split -in file.bam -reference finish, so there is some time that I can definitely save by using your code.

Thanks for replying.

ADD REPLY • link 5.6 years ago by Tain Luquez ▴ 10