Hi all,
Can somebody please help me understand why I am getting nearly 10 times higher depth of coverage from my in-house script compared to SNIPPY on the Galaxy server? In brief, I aligned the same fastq files to the reference sequence using the SNIPPY tool on Galaxy, and the average amplicon coverage is ~400 in that case. However, when I align with my in-house script, I get ~4000 average depth of coverage for the same amplicons (along with horn-like projections on either side of the coverage graph).
Can somebody please help me optimise the settings/stringency in the in-house script so that I get a depth of coverage similar to SNIPPY?
The in-house code is as follows:
# Step 1: Quality Trimming and Adapter Removal with fastp
fastp \
--in1 M7_M_nextseq_151bp_R1.fastq.gz --in2 M7_M_nextseq_151bp_R2.fastq.gz \
--out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
--adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
--n_base_limit 5 \
--length_required 50 \
--cut_front --cut_tail --cut_right \
--cut_window_size 4 \
--cut_mean_quality 30 \
--detect_adapter_for_pe \
--correction \
--overlap_len_require 30 \
--thread 8
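A quick way to see how many read pairs survive trimming (file names as in the commands above; this check is an optional addition, not part of the original pipeline):
# Count reads in the raw and trimmed R1 files (FASTQ stores 4 lines per read)
for f in M7_M_nextseq_151bp_R1.fastq.gz trimmed_R1.fastq.gz; do
  printf '%s: ' "$f"
  zcat "$f" | awk 'END { print NR/4, "reads" }'
done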
# Step 2: Stringent Alignment with bwa mem
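# Option notes (bwa mem defaults in parentheses):
# -M marks shorter split hits as secondary, for Picard compatibility
# -T 60 only outputs alignments with score >= 60 (default 30)
# -B 6 raises the mismatch penalty (default 4)
# -O 12,12 and -E 6,6 raise the gap open/extension penalties (defaults 6,6 and 1,1)
# -L 40 raises the soft-clipping penalty (default 5), discouraging clipped alignments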
bwa mem \
-t 8 \
-M \
-T 60 \
-B 6 \
-O 12,12 \
-E 6,6 \
-L 40 \
reference.fasta \
trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
> aligned_reads.sam
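An optional check at this stage: samtools flagstat summarises how many reads mapped and how many are secondary or supplementary, which feeds directly into the depth numbers:
# Summarise mapping outcomes straight from the SAM
samtools flagstat aligned_reads.sam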
# Step 3: Post-Alignment Processing
# samtools sort accepts SAM directly, so a separate view step is unnecessary
samtools sort -o sorted_reads.bam aligned_reads.sam
samtools index sorted_reads.bam
picard MarkDuplicates \
I=sorted_reads.bam \
O=dedup_reads.bam \
M=dup_metrics.txt \
REMOVE_DUPLICATES=true \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=STRICT
samtools index dedup_reads.bam
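# -F 2308 drops unmapped (4), secondary (256) and supplementary (2048) alignments;
# -q 40 keeps only reads with mapping quality >= 40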
samtools view -F 2308 -q 40 -b dedup_reads.bam > filtered_reads.bam
samtools index filtered_reads.bam
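To see where reads are being lost, one optional addition is to count the alignments that survive each post-processing stage (these are the intermediate files produced above):
# Read counts at each stage of post-alignment processing
for bam in sorted_reads.bam dedup_reads.bam filtered_reads.bam; do
  printf '%s: ' "$bam"
  samtools view -c "$bam"
done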
# Step 4: Calculate Per-Base Coverage
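# -a reports every position, including those with zero depth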
samtools depth -a filtered_reads.bam > per_base_coverage.txt
awk '{sum+=$3} END { print "Average = ",sum/NR}' per_base_coverage.txt
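Since the question concerns per-amplicon coverage, a per-amplicon breakdown may be more informative than a single mean. This is a sketch assuming an amplicons.bed file (one line per amplicon: chrom, start, end, name), which is not part of the original pipeline:
# samtools bedcov appends the summed per-base depth for each BED interval;
# dividing by the interval length gives the mean depth per amplicon
samtools bedcov amplicons.bed filtered_reads.bam |
  awk '{ print $4, $5 / ($3 - $2) }'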
If the steps/program(s)/options in your in-house script are not identical to the ones used at Galaxy, there is no logical way to compare the results. Since you have control over your in-house processing, you should have more trust in what you are doing there than in Galaxy.
Thanks for your suggestion and help. You are right in saying that having all the control is better. I am worried because I am new to amplicon-based sequencing and I am not sure whether SNIPPY or the in-house script is giving correct results. As you can see in the IGV screenshot, the results from the in-house script show irregular horn-like projections and very high depth of coverage, while the same analysis with SNIPPY presents a smooth graph with filtered data. I understand the two differ in their filtration strategies; I am just not sure whether, if I proceed with my in-house script, I will be reporting wrongly on the basis of bad-quality reads.
Thank you again for your help...
You may want to put some time into understanding what is happening in the two processes you are running. Bioinformatics is not just about running command lines and having them produce results. It is possible that both results are correct; they are perhaps using different thresholds to report the results.
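One way to test that, assuming you can download the Snippy BAM (usually named snps.bam) from your Galaxy history, is to run the same depth calculation on both BAMs, so that any remaining difference comes from the upstream filtering rather than the reporting step:
# Identical depth calculation applied to both pipelines' BAMs
for bam in filtered_reads.bam snps.bam; do
  printf '%s: ' "$bam"
  samtools depth -a "$bam" | awk '{ sum += $3 } END { print "mean =", sum / NR }'
done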