Question

BWA mem on multiple samples

1

Entering edit mode

10.3 years ago

ruhirai ▴ 20

I have multiple samples with R1 and R2 reads in fastq.gz format (these files are complementary to each other) I want to run BWA mem paired end parallel on all the files once finished each R1 and R2 complementary file should produce one sam file. Right now I am making two sam file from the two reads

This is what I have come up with but it’s not doing what I need it to do

for i in find -maxdepth 2 -iname *fastq.gz -type f; do echo "bwa mem -t 12 /H.Sapiens/ucsc.hg19.fasta ${i}_R1_001.fastq.gz ${i}_R2_001.fastq.gz > ${i}_R1_R2.sam"; done

when it runs it looks like this

bwa mem -t 12 /H.Sapiens/ucsc.hg19.fasta ./Sample_0747/0747_CGG_L001_R2_001.fastq.gz_R1_001.fastq.gz ./Sample_0747/0747_CGG_L001_R2_001.fastq.gz_R2_001.fastq.gz > ./Sample_0747/0747_CGG_L001_R2_001.fastq.gz_R1_R2.sam

bwa mem -t 12 H.Sapiens/ucsc.hg19.fasta ./Sample_0748/0748_CCA_L001_R1_001.fastq.gz_R1_001.fastq.gz ./Sample_0748/0748_CCA_L001_R1_001.fastq.gz_R2_001.fastq.gz > ./Sample_0748/0748_CCA_L001_R1_001.fastq.gz_R1_R2.sam -bash-4.1$ I understand the problem is in iname but how do I fixit? Thank you so much

alignment • 20k views

ADD COMMENT • link updated 10.3 years ago by Pierre Lindenbaum 166k • written 10.3 years ago by ruhirai ▴ 20

Ram · Answer 1 · 2015-03-13

7

Entering edit mode

10.3 years ago

Pierre Lindenbaum 166k

I would still process each fastq

$ find test -name "*.fastq.gz"
test/fastq/sample_1_02_R1.fastq.gz
test/fastq/sample_2_01_R2.fastq.gz
test/fastq/sample_1_01_R2.fastq.gz
test/fastq/sample_1_02_R2.fastq.gz
test/fastq/sample_2_01_R1.fastq.gz
test/fastq/sample_1_01_R1.fastq.gz

process each pair in paralllel using a Makefile (option -j) or GNU parallel:

$ find ./  -name "*.fastq.gz" | grep -v _R2.fastq.gz | sed 's/_R1.fastq.gz//' | parallel bwa mem test/ref/ref.fa {}_R1.fastq.gz {}_R2.fastq.gz '>' '{}'.sam

$ find test -name "*.sam"
test/fastq/sample_1_01.sam
test/fastq/sample_1_02.sam
test/fastq/sample_2_01.sam

and then merge all sam using picard

$ java -jarpicard.jar MergeSamFiles \
  I=test/fastq/sample_1_01.sam I=test/fastq/sample_1_02.sam \
  SO=coordinate O=test/fastq/sample_1.bam AS=false (...) 

INFO 2015-03-13 23:52:31 MergeSamFiles Sorting input files using temp directory [/tmp/lindenb]
INFO 2015-03-13 23:52:32 MergeSamFiles Finished reading inputs.
[Fri Mar 13 23:52:32 CET 2015] picard.sam.MergeSamFiles done. Elapsed time: 0.01 minutes.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 10.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hey Dr. Lindenbaum,

I have the same issue as the original question. I tried your pipeline, and I now have my sam files for my samples. When I try to merge the sam files together using picard, I get an error Cannot add sequence that already exists in SAMSequenceDictionary:, The sam files I am trying to merge are 4 lanes of the same sample, and each lane is paired end. For example the following lanes have a R1 and R2 read.

Sample 1 - Lane_01
Sample 1 - Lane_02
Sample 1 - Lane_03
Sample 1 - Lane_04

I ran BWA mem as you suggested, and I got Lane_01 - Lane_04 sam files. Since they are of the same initial sample, they should contain the same/very similar sequences. Picard won't allow me to merge these files because they share the same sequences. How do you recommend I advance from this issue?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by satshil.r ▴ 50

0

Entering edit mode

I have 145 SAM files. How could I avoid to type the SAM file names as input to MergeSamFiles?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Ric ▴ 440

1

Entering edit mode

Google "file globbing".

ADD REPLY • link 9.8 years ago by Devon Ryan 105k

0

Entering edit mode

and picard MergeSamFiles can read a list of files

I=my.list.of.145.files.txt

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Pierre Lindenbaum 166k

score 2 · Answer 2 · 2015-03-13

for i in find -maxdepth 2 -iname *_R1_001.fastq.gz
do
    i=${i%%_R1_001.fastq.gz}
    ...
done

or something along those lines. Given that there are probably 001.fastq.gz, 002.fastq.gz, etc. files, it's probably best to simply process all of $i in a single go (e.g., with process substitution in bash and <(cat $i_R1_???.fastq.gz)).