BWA -> samtools pipe while filtering mapped & unmapped reads
1
2
Entering edit mode
6.6 years ago

I am using BWA to align files. I have four directories: seqtk_1, seqtk_2, seqtk_3, seqtk_4. Within each of those directories I have 10 subdirectories: subsample_1, subsample_2, subsample_3, etc. Within each of those subdirectories I have 20 paired-end reads (so from 10 genomes).

I want to put all the files from the all the directories through a pipe and into an output directory (BWA). The structure of this directory is the same as described above. So I have 800 input files and 400 output files.

I have written a script (below):

echo "[info] creating filenames";

for filename in ./Mock_Run/seqtk_*/subsample_*/*_1.fq.gz; 
do file=`echo $filename|sed 's/_1.fq.gz//'`; 
filenopath=`basename $file`; 
for i in $(seq 10 $END); 
do echo subsample_$i; 
for u in $(seq 4 $END); 
do eval outpath=BWA/seqtk_$u/subsample_$i; 
echo "[info] starting BWA alignment..."; 
bwa mem -v 0 combine_reference.fa.gz ${filenopath}_1.fq.gz ${filenopath}_2.fq.gz > ${outpath}/${filenopath}_BWA.sam; 
echo "[info] converting sam file to bam file";
samtools view -bS ${outpath}/${filenopath}_BWA.sam > ${outpath}/${filenopath}_BWA.bam;
echo "[info]filtering unmapped reads...."; 
samtools view -h -f 4 ${outpath}/${filenopath}_BWA.bam > ${outpath}/${filenopath}_unmapped.bam; 
echo "[info] filtering mapped reads..."; 
samtools view -h -F 4 ${outpath}/${filenopath}_BWA.bam > ${outpath}/${filenopath}_mapped.bam; 
echo "[info] sorting files"; 
samtools sort -o ${outpath}/${filenopath}_mapped_sorted.bam ${outpath}/${filenopath}_mapped.bam ; 
samtools sort -o ${outpath}/${filenopath}_unmapped_sorted.bam ${outpath}/${filenopath}_unmapped.bam; 
echo "[info] finished...no error to report"; 
done; 
done; 
done

It loops through all the files (like I wanted) and puts them into the right output subdirectory (like I wanted). It all seems to work, except it continues to loop. Once it has gone through all the files, it then starts again.

Any help would be appreciated.

BWA samtools mapped files unmapped files • 3.2k views
ADD COMMENT
1
Entering edit mode

have a look at e.g. nextflow it will pay off after some learning as soon as the workflow gets more complex

ADD REPLY
0
Entering edit mode

Thank you, I will definitely look at it

ADD REPLY
1
Entering edit mode

Just a remark: There is (in my opinion) no advantage in storing the SAM files, as they are not binary and by this take a lot of space. As most downstream applications require sorted BAM files anyway, better pipe BWA directly into SAMtools sort:

bwa mem ref.fa 1.fq 2.fq | samtools sort -o out.bam
ADD REPLY
0
Entering edit mode

I tried to do that but couldn't. I'll change my script. Thank you

ADD REPLY
3
Entering edit mode
6.6 years ago
h.mon 35k

You should parse the seqtk folder and subsample folder directly from $filename - see bellow for a suggestion. Your second and third for loops are already contained in the first:

First loop:

for filename in ./Mock_Run/seqtk_*/subsample_*/*_1.fq.gz;

Second loop:

for i in $(seq 10 $END); 
do echo subsample_$i;

You are already globbing subsample_* and seqtk_*, so there is no need for for i ... do subsample_$i and for u do seqtk_$u, is there?

Try something like (untested):

SEQTK=$(echo $filename | sed "s|\.\/Mock_Run\/\(seqtk.*\)\/subsample_.*|\1|")
echo $SEQTK

Same for subsample.

ADD COMMENT
0
Entering edit mode

Thank you. Do you mind explaining what the SEQTK= command is doing?

ADD REPLY
1
Entering edit mode

$(command) is the same as `command` - both mean "interpret the command placed inside".

I am using sed to capture \(pattern\), which captures the pattern inside the backquoted parenthesis, and then replacing everything by just the captured pattern \1.

Anyway, here is a simpler solution:

for filename in ./Mock_Run/seqtk_*/subsample_*/*_1.fq.gz; 
do 
  filenopath=`basename -s _1.fq.gz $filename`
  outpath=`dirname $filename | sed "s/Mock_Run/BWA/"`
  echo  "$filenopath $outpath"
done
ADD REPLY
0
Entering edit mode

Okay, thank you. That makes sense

ADD REPLY

Login before adding your answer.

Traffic: 2574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6