Question: Effect of trimming on alignment and the FastQC report.
kspata wrote, 14 months ago:

Hi All,

I have MiSeq PE250 reads for a viral vector sample. I trimmed the raw reads with Trim Galore (length >= 200 and Q >= 30). I aligned these reads using both BWA-MEM and Bowtie2 --local and found an AAA -> TTT variant in both alignment files, at different frequencies (9% and 71%, respectively). I further analysed the reads containing the variant and found that, in those reads, the TTT bases had base call quality scores below 30.
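
(For context, I inspected the variant-supporting reads roughly like this; the reference name and coordinates below are placeholders, not the actual ones, and the BAM is assumed sorted and indexed:)

samtools view aligned.bam ref_name:1234-1236 | less -S
samtools mpileup -f reference.fasta -r ref_name:1234-1236 aligned.bam    # column 6 of the pileup shows the base qualities at the site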

To check the effect of trimming on the variant call and on coverage, I further filtered the Trim Galore-trimmed reads to remove any read containing even a single base with a call quality below 30. I re-paired these reads using BBMap's repair.sh and aligned them again with Bowtie2. In this alignment the variant above was found at <10% frequency, consistent with the frequency reported by BWA-MEM. Coverage was also affected: almost 6.7% of bases had coverage below 10X with the super-trimmed reads, versus 0.1% with the reads trimmed by Trim Galore only.

I used freebayes for variant calling with the same parameters (Min Coverage 10, Min Alternate Fraction 0.01, Min Alternate Count 4) for both the BWA-MEM and Bowtie2 alignments.
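
(The freebayes call itself was along these lines; the reference and BAM file names are placeholders:)

freebayes -f reference.fasta --min-coverage 10 --min-alternate-fraction 0.01 --min-alternate-count 4 aligned.bam > variants.vcf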

  1. Why was there such a large difference in variant frequency between Bowtie2 and BWA-MEM?
  2. The coverage difference between the Trim Galore-trimmed reads and the super-trimmed reads is vast. Should I still use the super-trimmed reads (Trim Galore + further filtering), given that they give the correct variant frequency?

The FastQC results for the Trim Galore-trimmed reads and the super-trimmed reads were as follows:

[FastQC report screenshot: Trim Galore-trimmed reads]

[FastQC report screenshot: super-trimmed reads]

The FastQC per-base quality module fails for the super-trimmed reads, and the report indicates the wrong Illumina Phred score encoding version. Why did this happen?

Thanks!

modified 14 months ago by Friederike • written 14 months ago by kspata

That is a strange result. It looks like you took out all of the good data and left the bad data behind? Trimming at Q30 or more is pretty severe. Can you not filter your SNPs afterwards if you want to be stringent?

written 14 months ago by genomax

Hi Genomax,

I checked again to confirm that only reads in which every base has a quality score of 30 or above were selected. So I am confused about the FastQC results as well. I understand the trimming is severe, but the less stringent trimming leads to a false variant call at a much higher frequency.

Can you please point me to any peer-reviewed papers that perform NGS analysis of viral vector samples, especially SNP/indel analysis? This would help me streamline the analysis process.

written 14 months ago by kspata

You probably filtered out so many reads with the super-trimming that FastQC could not accurately detect the encoding. I am not even sure why you are trimming your reads twice; you are likely throwing away too much information.

You should trim once only, and it is fine to set cut-offs like the following (example command below):

  • reads > 70 bp
  • base qualities (BQs) at read ends > 20

NB: BWA-MEM is designed for reads > 70 bp
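
For example, a single Trim Galore pass along those lines might look something like this (adapting the command you already used; tweak the thresholds as you see fit):

trim_galore -q 20 --length 70 -a AGATCGGAAGAGC --paired Sample.R1.fastq.gz Sample.R2.fastq.gz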

modified 14 months ago • written 14 months ago by Kevin Blighe

What's the command you used for trimming?

written 14 months ago by Friederike

I first performed trimming with Trim Galore:

trim_galore -q 30 --length 200 -a AGATCGGAAGAGC --no_report_file --paired Sample.R1.fastq.gz Sample.R2.fastq.gz 2> Sample.trim_galore.err

Then I filtered these reads using the following commands on trimmed.R1.fq and trimmed.R2.fq:

awk '{unit=unit $0 ORS} NR%4==0{if (/^[?@ABCDEFGHIJK]+$/) printf "%s", unit; unit=""}' trimmed.R1.fq > super_trimmed.R1.fq

awk '{unit=unit $0 ORS} NR%4==0{if (/^[?@ABCDEFGHIJK]+$/) printf "%s", unit; unit=""}' trimmed.R2.fq > super_trimmed.R2.fq

https://stackoverflow.com/questions/55459982/remove-reads-that-have-low-quality-base-calls/55461995#55461995

I used repair.sh from BBMap to re-pair the reads:

  repair.sh in1=super_trimmed.R1.fq in2=super_trimmed.R2.fq out1=fixed.R1.fq out2=fixed.R2.fq outs=singletons.fq repair

With this command, about a quarter of the Trim Galore-trimmed reads were removed.

written 14 months ago by kspata

I am not sure why you don't do all of this within the BBMap suite? You can trim and align, and keep all reads properly paired, without leaving the suite, then filter your SNPs afterwards at whatever level you wish (a simple example follows the commands below). Only outline commands are provided here; look at the bbduk and bbmap guides for additional assistance.

bbduk.sh -Xmx4g in1=R1.fq.gz in2=R2.fq.gz out1=trim_R1.fq.gz out2=trim_R2.fq.gz literal=AGATCGGAAGAGC minq=30
bbmap.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz out=aligned.bam maxindel=0 ambig=random    # maxindel=0 if you don't want to allow splicing
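
If you want to be stringent after calling, the post-hoc SNP filtering can be as simple as this (illustrative thresholds, assuming a VCF from freebayes):

bcftools view -i 'QUAL>=20 && INFO/DP>=10' variants.vcf > variants.filtered.vcf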
modified 14 months ago • written 14 months ago by genomax
Friederike wrote, 14 months ago:

The FastQC per-base quality module fails for the super-trimmed reads, and the report indicates the wrong Illumina Phred score encoding version. Why did this happen?

Because with this line:

awk '{unit=unit $0 ORS} NR%4==0{if (/^[?@ABCDEFGHIJK]+$/) printf "%s", unit; unit=""}' trimmed.R1.fq > super_trimmed.R1.fq

you eliminated every read containing a base with a quality score below '?' (Q30 in Phred+33 encoding), which leads FastQC to assume that you're looking at files where the Phred scores have been encoded the Illumina 1.3 way.
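
You can confirm this by checking the lowest quality character left after your filter (a quick sanity check on the file from your command above):

awk 'NR%4==0' super_trimmed.R1.fq | grep -o . | LC_ALL=C sort -u | head -1    # should print '?' (Q30) or a higher character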

written 14 months ago by Friederike

Thank you for the explanation. I have learned a lot through this post!

written 14 months ago by kspata
Devon Ryan wrote, 14 months ago:

Why was there such a large difference in variant frequency between Bowtie2 and BWA-MEM?

If you're going to use local alignment with Bowtie2, you'll want --very-sensitive-local to get decent results. The difference in default settings is the primary cause of the difference in outputs.
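
Something along these lines (the index and file names are placeholders):

bowtie2 --very-sensitive-local -x ref_index -1 fixed.R1.fq -2 fixed.R2.fq -S aligned.sam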

As others have said, your "super trimming" seems to have gone wrong (it's also quite over the top).

written 14 months ago by Devon Ryan

Thank you! Changing the parameter to --very-sensitive-local did not alter the variant call result. I also tried --end-to-end, but that also gave the variant at the same frequency. I appreciate the help; I am beginning to understand better how the alignment algorithms work.

modified 14 months ago • written 14 months ago by kspata