Question: BWA -> samtools pipe while filtering mapped & unmapped reads
2
gravatar for fiona.newberry
2.1 years ago by
fiona.newberry80 wrote:

I am using BWA to align files. I have four directories: seqtk_1, seqtk_2, seqtk_3, seqtk_4. Within each of those directories I have 10 subdirectories: subsample_1, subsample_2, subsample_3, etc. Within each of those subdirectories I have 20 paired-end reads (so from 10 genomes).

I want to put all the files from the all the directories through a pipe and into an output directory (BWA). The structure of this directory is the same as described above. So I have 800 input files and 400 output files.

I have written a script (below):

echo "[info] creating filenames";

for filename in ./Mock_Run/seqtk_*/subsample_*/*_1.fq.gz; 
do file=`echo $filename|sed 's/_1.fq.gz//'`; 
filenopath=`basename $file`; 
for i in $(seq 10 $END); 
do echo subsample_$i; 
for u in $(seq 4 $END); 
do eval outpath=BWA/seqtk_$u/subsample_$i; 
echo "[info] starting BWA alignment..."; 
bwa mem -v 0 combine_reference.fa.gz ${filenopath}_1.fq.gz ${filenopath}_2.fq.gz > ${outpath}/${filenopath}_BWA.sam; 
echo "[info] converting sam file to bam file";
samtools view -bS ${outpath}/${filenopath}_BWA.sam > ${outpath}/${filenopath}_BWA.bam;
echo "[info]filtering unmapped reads...."; 
samtools view -h -f 4 ${outpath}/${filenopath}_BWA.bam > ${outpath}/${filenopath}_unmapped.bam; 
echo "[info] filtering mapped reads..."; 
samtools view -h -F 4 ${outpath}/${filenopath}_BWA.bam > ${outpath}/${filenopath}_mapped.bam; 
echo "[info] sorting files"; 
samtools sort -o ${outpath}/${filenopath}_mapped_sorted.bam ${outpath}/${filenopath}_mapped.bam ; 
samtools sort -o ${outpath}/${filenopath}_unmapped_sorted.bam ${outpath}/${filenopath}_unmapped.bam; 
echo "[info] finished...no error to report"; 
done; 
done; 
done

It loops through all the files (like I wanted) and puts them into the right output subdirectory (like I wanted). It all seems to work, except it continues to loop. Once it has gone through all the files, it then starts again.

Any help would be appreciated.

ADD COMMENTlink modified 2.1 years ago by h.mon27k • written 2.1 years ago by fiona.newberry80
1

have a look at e.g. nextflow it will pay off after some learning as soon as the workflow gets more complex

ADD REPLYlink written 2.1 years ago by Ido Tamir5.0k

Thank you, I will definitely look at it

ADD REPLYlink written 2.1 years ago by fiona.newberry80
1

Just a remark: There is (in my opinion) no advantage in storing the SAM files, as they are not binary and by this take a lot of space. As most downstream applications require sorted BAM files anyway, better pipe BWA directly into SAMtools sort:

bwa mem ref.fa 1.fq 2.fq | samtools sort -o out.bam
ADD REPLYlink written 2.1 years ago by ATpoint23k

I tried to do that but couldn't. I'll change my script. Thank you

ADD REPLYlink written 2.1 years ago by fiona.newberry80
3
gravatar for h.mon
2.1 years ago by
h.mon27k
Brazil
h.mon27k wrote:

You should parse the seqtk folder and subsample folder directly from $filename - see bellow for a suggestion. Your second and third for loops are already contained in the first:

First loop:

for filename in ./Mock_Run/seqtk_*/subsample_*/*_1.fq.gz;

Second loop:

for i in $(seq 10 $END); 
do echo subsample_$i;

You are already globbing subsample_* and seqtk_*, so there is no need for for i ... do subsample_$i and for u do seqtk_$u, is there?

Try something like (untested):

SEQTK=$(echo $filename | sed "s|\.\/Mock_Run\/\(seqtk.*\)\/subsample_.*|\1|")
echo $SEQTK

Same for subsample.

ADD COMMENTlink modified 2.0 years ago • written 2.1 years ago by h.mon27k

Thank you. Do you mind explaining what the SEQTK= command is doing?

ADD REPLYlink written 2.1 years ago by fiona.newberry80
1

$(command) is the same as `command` - both mean "interpret the command placed inside".

I am using sed to capture \(pattern\), which captures the pattern inside the backquoted parenthesis, and then replacing everything by just the captured pattern \1.

Anyway, here is a simpler solution:

for filename in ./Mock_Run/seqtk_*/subsample_*/*_1.fq.gz; 
do 
  filenopath=`basename -s _1.fq.gz $filename`
  outpath=`dirname $filename | sed "s/Mock_Run/BWA/"`
  echo  "$filenopath $outpath"
done
ADD REPLYlink modified 2.0 years ago • written 2.1 years ago by h.mon27k

Okay, thank you. That makes sense

ADD REPLYlink written 2.0 years ago by fiona.newberry80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 865 users visited in the last hour