Question

GNU parallel on multiple files to re-header bam files

1

Entering edit mode

6.2 years ago

pierre.justeau ▴ 10

Hi All,

The purpose of my question is to generate a new header for a bam files. For one sample, i defined two command lines and its works:

#filter header
samtools view -H file1.bam | awk '(/^@SQ/ && /chrM/) || (!/^@SQ/) {print $0} ' > file1_filtered.txt

#re-header
cat file1_filtered.txt <(samtools view file1.bam) | samtools view -hb - > file1_filtered_re-header.bam

I would like to do the same thing for a lot of samples

#filter header
ls *.bam | parallel -j 8 'samtools view -H -@ 7 {} > {.}_filtered.txt'

After this step i didn't find the solution, if someone has an idea? Many thanks in advance.

GNU parallel • 2.7k views

ADD COMMENT • link 6.2 years ago by pierre.justeau ▴ 10

1

Entering edit mode

Keep in mind that working with files in parallel may or may not be faster than doing each independently. If the processing is IO-bound, then running in parallel may not be faster (though not likely slower).

ADD REPLY • link 6.2 years ago by Sean Davis 26k

0

Entering edit mode

See also: https://oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster/

ADD REPLY • link 6.1 years ago by ole.tange ★ 4.4k

0

Entering edit mode

Curious if you have done any recent testing with new high performance file systems such as isilon/NetApp (versions with spinning disks and pure SSD's)? At some point one of the connections (pci-e bus) may become the bottleneck.

ADD REPLY • link 6.1 years ago by GenoMax 141k

0

Entering edit mode

I have not, but I am pretty sure the conclusion will be the same: It depends, and you should therefore test with different parallelization and measure.

ADD REPLY • link 6.1 years ago by ole.tange ★ 4.4k

0

Entering edit mode

Hi,

if it works with the command lines, why not putting these into a bash script and running this bash script with parallel?

Otherwise, you may use a tab-separated file where you have input and output variables stored like here.

Cheers, Michael

ADD REPLY • link 6.2 years ago by michael.ante ★ 3.8k

0

Entering edit mode

thanks for your help. I solve the issues for my first step "#filtered". I didn't find a solution to run the step "re-header" for a multiple files...

Many thanks for your help, Pierre

ADD REPLY • link 6.2 years ago by pierre.justeau ▴ 10

0

Entering edit mode

Hi all, Thanks a lot for your help and script, it's run perfectly. I better understand where was my mistake. I just discovered GNU parallel, I'm a bit confused with some syntaxes, i'm gonna to learn more about it.

Thanks again, Pierre

ADD REPLY • link 6.2 years ago by pierre.justeau ▴ 10

0

Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted (green check mark). Upvote|Bookmark|Accept

ADD REPLY • link 6.2 years ago by GenoMax 141k

score 6 · Answer 1 · 2018-02-12

doit() {
  file="$1"
  out="$2"
  (samtools view -H $file | awk '(/^@SQ/ && /chrM/) || (!/^@SQ/) {print $0} ';
   samtools view $file) | samtools view -hb - > $out
}
export -f doit
parallel doit {} {.}_filtered_re-header.bam ::: *.bam

Or if you want to use the header from the first file for all the rest:

samtools view -H file1.bam | awk '(/^@SQ/ && /chrM/) || (!/^@SQ/) {print $0} ' >header
doit() {
  file="$1"
  out="$2"
  (cat header;
   samtools view $file) | samtools view -hb - > $out
}
export -f doit
parallel doit {} {.}_filtered_re-header.bam ::: *.bam