BWA mem on multiple samples
2
1
Entering edit mode
9.1 years ago
ruhirai ▴ 20

I have multiple samples with R1 and R2 reads in fastq.gz format (these files are complementary to each other) I want to run BWA mem paired end parallel on all the files once finished each R1 and R2 complementary file should produce one sam file. Right now I am making two sam file from the two reads

This is what I have come up with but it’s not doing what I need it to do

for i in find -maxdepth 2 -iname *fastq.gz -type f; do echo "bwa mem -t 12 /H.Sapiens/ucsc.hg19.fasta ${i}_R1_001.fastq.gz ${i}_R2_001.fastq.gz > ${i}_R1_R2.sam"; done

when it runs it looks like this

bwa mem -t 12 /H.Sapiens/ucsc.hg19.fasta ./Sample_0747/0747_CGG_L001_R2_001.fastq.gz_R1_001.fastq.gz ./Sample_0747/0747_CGG_L001_R2_001.fastq.gz_R2_001.fastq.gz > ./Sample_0747/0747_CGG_L001_R2_001.fastq.gz_R1_R2.sam

bwa mem -t 12 H.Sapiens/ucsc.hg19.fasta ./Sample_0748/0748_CCA_L001_R1_001.fastq.gz_R1_001.fastq.gz ./Sample_0748/0748_CCA_L001_R1_001.fastq.gz_R2_001.fastq.gz > ./Sample_0748/0748_CCA_L001_R1_001.fastq.gz_R1_R2.sam -bash-4.1$ I understand the problem is in iname but how do I fixit? Thank you so much

alignment • 18k views
ADD COMMENT
7
Entering edit mode
9.1 years ago

I would still process each fastq

$ find test -name "*.fastq.gz"
test/fastq/sample_1_02_R1.fastq.gz
test/fastq/sample_2_01_R2.fastq.gz
test/fastq/sample_1_01_R2.fastq.gz
test/fastq/sample_1_02_R2.fastq.gz
test/fastq/sample_2_01_R1.fastq.gz
test/fastq/sample_1_01_R1.fastq.gz

process each pair in paralllel using a Makefile (option -j) or GNU parallel:

$ find ./  -name "*.fastq.gz" | grep -v _R2.fastq.gz | sed 's/_R1.fastq.gz//' | parallel bwa mem test/ref/ref.fa {}_R1.fastq.gz {}_R2.fastq.gz '>' '{}'.sam
$ find test -name "*.sam"
test/fastq/sample_1_01.sam
test/fastq/sample_1_02.sam
test/fastq/sample_2_01.sam

and then merge all sam using picard

$ java -jarpicard.jar MergeSamFiles \
  I=test/fastq/sample_1_01.sam I=test/fastq/sample_1_02.sam \
  SO=coordinate O=test/fastq/sample_1.bam AS=false (...) 

INFO 2015-03-13 23:52:31 MergeSamFiles Sorting input files using temp directory [/tmp/lindenb]
INFO 2015-03-13 23:52:32 MergeSamFiles Finished reading inputs.
[Fri Mar 13 23:52:32 CET 2015] picard.sam.MergeSamFiles done. Elapsed time: 0.01 minutes.
ADD COMMENT
0
Entering edit mode

Hey Dr. Lindenbaum,

I have the same issue as the original question. I tried your pipeline, and I now have my sam files for my samples. When I try to merge the sam files together using picard, I get an error Cannot add sequence that already exists in SAMSequenceDictionary:, The sam files I am trying to merge are 4 lanes of the same sample, and each lane is paired end. For example the following lanes have a R1 and R2 read.

Sample 1 - Lane_01
Sample 1 - Lane_02
Sample 1 - Lane_03
Sample 1 - Lane_04

I ran BWA mem as you suggested, and I got Lane_01 - Lane_04 sam files. Since they are of the same initial sample, they should contain the same/very similar sequences. Picard won't allow me to merge these files because they share the same sequences. How do you recommend I advance from this issue?

ADD REPLY
0
Entering edit mode

I have 145 SAM files. How could I avoid to type the SAM file names as input to MergeSamFiles?

ADD REPLY
1
Entering edit mode

Google "file globbing".

ADD REPLY
0
Entering edit mode

and picard MergeSamFiles can read a list of files

I=my.list.of.145.files.txt
ADD REPLY
2
Entering edit mode
9.1 years ago
for i in find -maxdepth 2 -iname *_R1_001.fastq.gz
do
    i=${i%%_R1_001.fastq.gz}
    ...
done

or something along those lines. Given that there are probably 001.fastq.gz, 002.fastq.gz, etc. files, it's probably best to simply process all of $i in a single go (e.g., with process substitution in bash and <(cat $i_R1_???.fastq.gz)).

ADD COMMENT

Login before adding your answer.

Traffic: 1431 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6