error with BWA
0
0
Entering edit mode
3.7 years ago
storm1907 ▴ 30

Hello.

I'm encountering the problem with viral RNA ngs data alignement. I wrote two kinds of scripts; first:

INPATH=/home/dir/usr
OUTPATH=/home/dir/usr
reference=/dir/ref.fasta
cd /mnt/home/usr/tools/bwa-0.7.17
./bwa index $reference
for file1 in $INPATH/*.fastq.gz;
do
        bname=$(basename $file1 )
        sample=$( cut -d'_' -f 1,2,3<<<"$bname")
        echo $bname $sample
        input1=$INPATH/$sample"_R1_001.unpaired.fastq.gz"
        output1=$OUTPATH/$sample"_R1_001.unpaired.sai"
        input2=$INPATH/$sample"_R2_001.unpaired.fastq.gz"
        output2=$OUTPATH/$sample"_R2_001.unpaired.sai"
        outfile=$OUTPATH/$sample"_pe.sam"
        echo $input1 $output1 $input2 $output2
        ./bwa mem -t 4 $reference $input1 | gzip -3 > $output1
        ./bwa mem -t 4 $reference $input2 | gzip -3 > $output2
        ./bwa sampe -a $reference  $output1 $output2 $input1 $input2 > $outfile
        echo $outfile
done

gives output like:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1034 sequences (71414 bp)...
[M::mem_process_seqs] Processed 1034 reads in 0.058 CPU sec, 0.017 real sec
[gzclose] buffer error
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[gzclose] buffer error

Usage:   bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>

Options: -a INT   maximum insert size [0]
         -o INT   maximum occurrences for one end [100000]
         -n INT   maximum hits to output for paired reads [3]
         -N INT   maximum hits to output for discordant pairs [10]
         -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
         -f FILE  sam file to output results to [stdout]
         -r STR   read group header line such as `@RG\tID:foo\tSM:bar' [null]
         -P       preload index into memory (for base-space reads only)
         -s       disable Smith-Waterman for the unmapped mate
         -A       disable insert size estimate (force -s)

Notes: 1. For SOLiD reads, <in1.fq> corresponds R3 reads and <in2.fq> to F3.
       2. For reads shorter than 30bp, applying a smaller -o is recommended to
          to get a sensible speed at the cost of pairing accuracy.

But second:

INPATH=/home/dir/usr
    OUTPATH=/home/dir/usr
    reference=/dir/ref.fasta
cd /mnt/home/usr/tools/bwa-0.7.17
./bwa index $reference
for file1 in $INPATH/*.fastq.gz;
do
        bname=$(basename $file1 )
        sample=$( cut -d'_' -f 1,2,3<<<"$bname")
        echo $bname $sample
        input1=$INPATH/$sample"_R1_001.unpaired.fastq.gz"
        output1=$OUTPATH/$sample"_R1_001.unpaired.sai"
        input2=$INPATH/$sample"_R2_001.unpaired.fastq.gz"
        output2=$OUTPATH/$sample"_R2_001.unpaired.sai"
        outfile=$OUTPATH/$sample"_pe.sam"
        echo $input1 $output1 $input2 $output2
        ./bwa mem $reference $input1 $input2 | gzip -3 > $outfile
        echo $outfile
done

has output like

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[W::bseq_read] the 2nd file has fewer sequences. 
[M::process] read 4726 sequences (334488 bp)... 
[W::bseq_read] the 2nd file has fewer sequences.
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 654, 734, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR... 
[M::mem_pestat] (25, 50, 75) percentile: (2337,4712, 7183) 
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 16875)
[M::mem_pestat] mean and std.dev: (4752.09,2805.37)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 21721)
[M::mem_pestat] analyzing insert size distribution for orientation RF... 
[M::mem_pestat] (25, 50, 75) percentile: (2388,4788, 7274) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 17046) [M::mem_pestat] mean and std.dev: (4832.84, 2813.06)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 21932) [M::mem_pestat] skip orientation RR as there are not enough pairs [mem_sam_pe] paired reads have different names:
"M06108:35:000000000-J52R8:1:1101:9702:1150",
"M06108:35:000000000-J52R8:1:1101:19124:1560"

Is that somehow related with low quality of reads/wrong trimmommatic tool, or I just have to recheck the spelling of bname and sample variables?

rna-seq • 2.4k views
ADD COMMENT
0
Entering edit mode

Have you tried to use echo statements in multiple locations in this script to examine the command lines that are being generated? Did they all look ok?

The first issue: appears to be because of trying to use gzip via a pipe. Why do you need to do that? You can directly pipe output of bwa mem to samtools: https://biology.stackexchange.com/questions/59493/how-to-convert-bwa-mem-output-to-bam-format-without-saving-sam-file

For the second: It looks like your paired-end data files are out of sync. They don't have an equal number of reads. Did you scan/trim the data independently? PE data files need to be processed together.

ADD REPLY
0
Entering edit mode

The first script is not correct, discard it. mem does not work with sampe, the second script is what you need. Though, it says the 2nd file has fewer sequences. so make sure all fastq files of one sample have the identical number of reads. How did you trim, did you use trimmomatic in paired-end mode?

ADD REPLY
0
Entering edit mode

I used PE trimmomatics, yes

ADD REPLY
0
Entering edit mode

Your post formatting was messed up because you used block-quote instead of code-blocks. I've cleaned it up, but please be more careful in the future. The button you need for code formatting is the one highlighted in the image below:

code_formatting

ADD REPLY

Login before adding your answer.

Traffic: 2743 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6