Question: GNU parallel on multiple files to re-header bam files
3 months ago by
pierre.justeau10 wrote:

Hi All,

The purpose of my question is to generate a new header for BAM files. For one sample, I defined two command lines, and they work:

#filter header
samtools view -H file1.bam | awk '(/^@SQ/ && /chrM/) || (!/^@SQ/) {print $0} ' > file1_filtered.txt

#re-header
cat file1_filtered.txt <(samtools view file1.bam) | samtools view -hb - > file1_filtered_re-header.bam
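
To see what the awk filter keeps without needing a real BAM file, here is a minimal sketch that runs the same awk program on a toy header. The contig names, lengths, and read group are invented for illustration and do not come from the files in this question.

```shell
# Toy SAM header; contig lengths and the read group are invented for illustration.
header=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chrM\tLN:16569\n@RG\tID:s1\tSM:s1\n')

# Same awk program as above: keep chrM @SQ lines and every non-@SQ line.
filtered=$(printf '%s\n' "$header" | awk '(/^@SQ/ && /chrM/) || (!/^@SQ/) {print $0}')

# Show the header lines that survive the filter.
printf '%s\n' "$filtered"
```

The @HD, chrM @SQ, and @RG lines pass through, and the chr1 @SQ line is dropped. Note that /chrM/ matches anywhere on the line, so an @SQ line for a contig whose name merely contains "chrM" would also be kept.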

I would like to do the same thing for many samples:

#filter header
ls *.bam | parallel -j 8 'samtools view -H -@ 7 {} > {.}_filtered.txt'

After this step I could not find a solution for the re-header part. Does anyone have an idea? Many thanks in advance.


Keep in mind that processing files in parallel may or may not be faster than processing them one after another. If the processing is IO-bound, running in parallel may not be faster (though it is unlikely to be slower).

written 3 months ago by Sean Davis24k

See also: https://oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster/

written 8 weeks ago by ole.tange3.0k

Curious whether you have done any recent testing with newer high-performance file systems such as Isilon/NetApp (versions with spinning disks and pure SSDs)? At some point one of the connections (e.g. the PCIe bus) may become the bottleneck.

modified 8 weeks ago • written 8 weeks ago by genomax48k

I have not, but I am pretty sure the conclusion will be the same: it depends, and you should therefore test with different levels of parallelization and measure.

written 7 weeks ago by ole.tange3.0k

Hi,

If it works with those command lines, why not put them into a bash script and run that script with parallel?

Otherwise, you may use a tab-separated file where you have input and output variables stored like here.
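
A rough sketch of the tab-separated approach, assuming GNU parallel is the installed `parallel`. The file names and the `reheader.sh` wrapper script are hypothetical placeholders, not something from this thread; `--colsep` and `--dry-run` are GNU parallel options, and `--dry-run` only prints the commands instead of running them.

```shell
# Build a two-column, tab-separated table: input BAM, output BAM.
# file1.bam and file2.bam are placeholder names.
tsv=$(mktemp)
for bam in file1.bam file2.bam; do
  printf '%s\t%s\n' "$bam" "${bam%.bam}_filtered_re-header.bam"
done > "$tsv"
cat "$tsv"

# With GNU parallel, --colsep splits each input line into {1} and {2}.
# ./reheader.sh is a hypothetical script wrapping the two samtools commands.
# --dry-run prints the commands that would be run without executing them.
if command -v parallel >/dev/null 2>&1; then
  parallel --dry-run --colsep '\t' ./reheader.sh {1} {2} :::: "$tsv"
fi
```

Dropping `--dry-run` would actually run the script once per line of the table.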

Cheers, Michael

modified 3 months ago • written 3 months ago by michael.ante2.5k

Thanks for your help. I solved the issue with my first step, "#filter header". I still have not found a way to run the "re-header" step on multiple files...

Many thanks for your help, Pierre

written 3 months ago by pierre.justeau10

Hi all, thanks a lot for your help and the script; it runs perfectly. I now understand better where my mistake was. I have only just discovered GNU parallel and am still a bit confused by some of its syntax, so I am going to learn more about it.

Thanks again, Pierre

written 3 months ago by pierre.justeau10

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted (green check mark).

written 3 months ago by genomax48k
3 months ago by
ole.tange3.0k wrote:
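doit() {
  file="$1"
  out="$2"
  # Emit the filtered header (chrM @SQ line plus all non-@SQ lines),
  # then the reads, and convert the combined SAM stream back to BAM.
  # Variables are quoted so file names with spaces do not break the commands.
  (samtools view -H "$file" | awk '(/^@SQ/ && /chrM/) || (!/^@SQ/) {print $0}';
   samtools view "$file") | samtools view -hb - > "$out"
}
# Export the function so the shells started by parallel can see it.
export -f doit
parallel doit {} {.}_filtered_re-header.bam ::: *.bam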

Or if you want to use the header from the first file for all the rest:

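# Build the shared header once, from the first file.
samtools view -H file1.bam | awk '(/^@SQ/ && /chrM/) || (!/^@SQ/) {print $0}' > header
doit() {
  file="$1"
  out="$2"
  # Prepend the shared header to this file's reads and convert back to BAM.
  (cat header;
   samtools view "$file") | samtools view -hb - > "$out"
}
export -f doit
parallel doit {} {.}_filtered_re-header.bam ::: *.bam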
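
On the syntax: in the parallel lines above, `{}` stands for the input argument and `{.}` for the same argument with its last extension removed. As a small illustration in plain bash (the file names are placeholders, not files from this thread), `${f%.*}` strips the extension in the same way, so you can preview the output names before running anything:

```shell
# Preview the output name derived from each input name.
# ${f%.*} removes the shortest trailing ".something", which mirrors {.}.
for f in sample_A.bam run2/sample_B.bam; do   # placeholder names
  out_name="${f%.*}_filtered_re-header.bam"
  printf '%s -> %s\n' "$f" "$out_name"
done
```

With GNU parallel itself, adding `--dry-run` to the command prints the expanded command lines instead of executing them, which is another way to check the substitutions.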