Entering edit mode
3.7 years ago
storm1907
▴
30
Hello.
I'm encountering the problem with viral RNA ngs data alignement. I wrote two kinds of scripts; first:
INPATH=/home/dir/usr
OUTPATH=/home/dir/usr
reference=/dir/ref.fasta
cd /mnt/home/usr/tools/bwa-0.7.17
./bwa index $reference
for file1 in $INPATH/*.fastq.gz;
do
bname=$(basename $file1 )
sample=$( cut -d'_' -f 1,2,3<<<"$bname")
echo $bname $sample
input1=$INPATH/$sample"_R1_001.unpaired.fastq.gz"
output1=$OUTPATH/$sample"_R1_001.unpaired.sai"
input2=$INPATH/$sample"_R2_001.unpaired.fastq.gz"
output2=$OUTPATH/$sample"_R2_001.unpaired.sai"
outfile=$OUTPATH/$sample"_pe.sam"
echo $input1 $output1 $input2 $output2
./bwa mem -t 4 $reference $input1 | gzip -3 > $output1
./bwa mem -t 4 $reference $input2 | gzip -3 > $output2
./bwa sampe -a $reference $output1 $output2 $input1 $input2 > $outfile
echo $outfile
done
gives output like:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1034 sequences (71414 bp)...
[M::mem_process_seqs] Processed 1034 reads in 0.058 CPU sec, 0.017 real sec
[gzclose] buffer error
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[gzclose] buffer error
Usage: bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
Options: -a INT maximum insert size [0]
-o INT maximum occurrences for one end [100000]
-n INT maximum hits to output for paired reads [3]
-N INT maximum hits to output for discordant pairs [10]
-c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
-f FILE sam file to output results to [stdout]
-r STR read group header line such as `@RG\tID:foo\tSM:bar' [null]
-P preload index into memory (for base-space reads only)
-s disable Smith-Waterman for the unmapped mate
-A disable insert size estimate (force -s)
Notes: 1. For SOLiD reads, <in1.fq> corresponds R3 reads and <in2.fq> to F3.
2. For reads shorter than 30bp, applying a smaller -o is recommended to
to get a sensible speed at the cost of pairing accuracy.
But second:
INPATH=/home/dir/usr
OUTPATH=/home/dir/usr
reference=/dir/ref.fasta
cd /mnt/home/usr/tools/bwa-0.7.17
./bwa index $reference
for file1 in $INPATH/*.fastq.gz;
do
bname=$(basename $file1 )
sample=$( cut -d'_' -f 1,2,3<<<"$bname")
echo $bname $sample
input1=$INPATH/$sample"_R1_001.unpaired.fastq.gz"
output1=$OUTPATH/$sample"_R1_001.unpaired.sai"
input2=$INPATH/$sample"_R2_001.unpaired.fastq.gz"
output2=$OUTPATH/$sample"_R2_001.unpaired.sai"
outfile=$OUTPATH/$sample"_pe.sam"
echo $input1 $output1 $input2 $output2
./bwa mem $reference $input1 $input2 | gzip -3 > $outfile
echo $outfile
done
has output like
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[W::bseq_read] the 2nd file has fewer sequences.
[M::process] read 4726 sequences (334488 bp)...
[W::bseq_read] the 2nd file has fewer sequences.
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 654, 734, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (2337,4712, 7183)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 16875)
[M::mem_pestat] mean and std.dev: (4752.09,2805.37)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 21721)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (2388,4788, 7274) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 17046) [M::mem_pestat] mean and std.dev: (4832.84, 2813.06)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 21932) [M::mem_pestat] skip orientation RR as there are not enough pairs [mem_sam_pe] paired reads have different names:
"M06108:35:000000000-J52R8:1:1101:9702:1150",
"M06108:35:000000000-J52R8:1:1101:19124:1560"
Is that somehow related with low quality of reads/wrong trimmommatic tool, or I just have to recheck the spelling of bname and sample variables?
Have you tried to use
echo
statements in multiple locations in this script to examine the command lines that are being generated? Did they all look ok?The first issue: appears to be because of trying to use
gzip
via a pipe. Why do you need to do that? You can directly pipe output ofbwa mem
to samtools: https://biology.stackexchange.com/questions/59493/how-to-convert-bwa-mem-output-to-bam-format-without-saving-sam-fileFor the second: It looks like your paired-end data files are out of sync. They don't have an equal number of reads. Did you scan/trim the data independently? PE data files need to be processed together.
The first script is not correct, discard it.
mem
does not work withsampe
, the second script is what you need. Though, it saysthe 2nd file has fewer sequences.
so make sure all fastq files of one sample have the identical number of reads. How did you trim, did you use trimmomatic in paired-end mode?I used PE trimmomatics, yes
Your post formatting was messed up because you used block-quote instead of code-blocks. I've cleaned it up, but please be more careful in the future. The button you need for code formatting is the one highlighted in the image below: