Question: Effect of trimming on alignment and the FastQC report.
kspata wrote, 14 months ago:

Hi All,

I have MiSeq PE250 reads for a viral vector sample. I trimmed the raw reads with Trim Galore (length >= 200 and Q >= 30). I aligned these reads using both BWA-MEM and Bowtie2 --local and found an AAA -> TTT variant in both alignment files, at different frequencies (9% and 71%, respectively). I further analysed the reads containing the variant and found that, in those reads, the TTT bases had base call quality scores below 30.
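
(For context, I inspected the variant-supporting reads roughly like this; the reference name and coordinates below are placeholders, not the actual ones, and the BAM is assumed sorted and indexed:)

samtools view aligned.bam ref_name:1234-1236 | less -S
samtools mpileup -f reference.fasta -r ref_name:1234-1236 aligned.bam    # column 6 of the pileup shows the base qualities at the site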

To check the effect of trimming on the variant call and on coverage, I further filtered the Trim Galore-trimmed reads to remove any read containing even a single base with a call quality below 30. I re-paired these reads using BBMap's repair.sh and aligned them again with Bowtie2. In this alignment the variant above was found at <10% frequency, consistent with the frequency reported by BWA-MEM. Coverage was also affected: almost 6.7% of bases had coverage below 10X with the super-trimmed reads, versus 0.1% with the reads trimmed by Trim Galore only.

I used freebayes for variant calling with the same parameters (Min Coverage 10, Min Alternate Fraction 0.01, Min Alternate Count 4) for both the BWA-MEM and Bowtie2 alignments.
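
(The freebayes call itself was along these lines; the reference and BAM file names are placeholders:)

freebayes -f reference.fasta --min-coverage 10 --min-alternate-fraction 0.01 --min-alternate-count 4 aligned.bam > variants.vcf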

  1. Why was there such a large difference in variant frequency between Bowtie2 and BWA-MEM?
  2. The coverage difference between the Trim Galore-trimmed reads and the super-trimmed reads is vast. Should I still use the super-trimmed reads (Trim Galore + further filtering), given that they give the correct variant frequency?

The FastQC results for the Trim Galore-trimmed reads and the super-trimmed reads were as follows:

[FastQC report screenshot: Trim Galore-trimmed reads]

[FastQC report screenshot: super-trimmed reads]

The FastQC per-base quality module fails for the super-trimmed reads, and the report indicates the wrong Illumina Phred score encoding version. Why did this happen?

Thanks!

modified 14 months ago by Friederike • written 14 months ago by kspata

That is a strange result. It looks like you took out all of the good data and left the bad data behind? Trimming at Q30 or more is pretty severe. Can you not filter your SNPs afterwards if you want to be stringent?

written 14 months ago by genomax

Hi Genomax,

I checked again to confirm that only reads in which every base has a quality score of 30 or above were selected. So I am confused about the FastQC results as well. I understand the trimming is severe, but the less stringent trimming leads to a false variant call at a much higher frequency.

Can you please point me to any peer-reviewed papers that perform NGS analysis of viral vector samples, especially SNP/indel analysis? This would help me streamline the analysis process.

written 14 months ago by kspata

You probably filtered out so many reads with the super-trimming that FastQC could not accurately detect the encoding. I am not even sure why you are trimming your reads twice; you are likely throwing away too much information.

You should trim once only, and it is fine to set cut-offs like the following (example command below):

  • reads > 70 bp
  • base qualities (BQs) at read ends > 20

NB: BWA-MEM is designed for reads > 70 bp
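
For example, a single Trim Galore pass along those lines might look something like this (adapting the command you already used; tweak the thresholds as you see fit):

trim_galore -q 20 --length 70 -a AGATCGGAAGAGC --paired Sample.R1.fastq.gz Sample.R2.fastq.gz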

modified 14 months ago • written 14 months ago by Kevin Blighe

What's the command you used for trimming?

written 14 months ago by Friederike

I first performed trimming with Trim Galore:

trim_galore -q 30 --length 200 -a AGATCGGAAGAGC --no_report_file --paired Sample.R1.fastq.gz Sample.R2.fastq.gz 2> Sample.trim_galore.err

Then I filtered these reads using the following commands on trimmed.R1.fq and trimmed.R2.fq:

awk '{unit=unit $0 ORS} NR%4==0{if (/^[?@ABCDEFGHIJK]+$/) printf "%s", unit; unit=""}' trimmed.R1.fq > super_trimmed.R1.fq

awk '{unit=unit $0 ORS} NR%4==0{if (/^[?@ABCDEFGHIJK]+$/) printf "%s", unit; unit=""}' trimmed.R2.fq > super_trimmed.R2.fq

https://stackoverflow.com/questions/55459982/remove-reads-that-have-low-quality-base-calls/55461995#55461995

I used repair.sh from BBMap to re-pair the reads:

  repair.sh in1=super_trimmed.R1.fq in2=super_trimmed.R2.fq out1=fixed.R1.fq out2=fixed.R2.fq outs=singletons.fq repair

With this command, about a quarter of the Trim Galore-trimmed reads were removed.

written 14 months ago by kspata

I am not sure why you don't do all of this within the BBMap suite? You can trim and align, and keep all reads properly paired, without leaving the suite, then filter your SNPs afterwards at whatever level you wish (a simple example follows the commands below). Only outline commands are provided here; look at the bbduk and bbmap guides for additional assistance.

bbduk.sh -Xmx4g in1=R1.fq.gz in2=R2.fq.gz out1=trim_R1.fq.gz out2=trim_R2.fq.gz literal=AGATCGGAAGAGC minq=30
bbmap.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz out=aligned.bam maxindel=0 ambig=random    # maxindel=0 if you don't want to allow splicing
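
If you want to be stringent after calling, the post-hoc SNP filtering can be as simple as this (illustrative thresholds, assuming a VCF from freebayes):

bcftools view -i 'QUAL>=20 && INFO/DP>=10' variants.vcf > variants.filtered.vcf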
modified 14 months ago • written 14 months ago by genomax
Friederike wrote, 14 months ago:

The FastQC per-base quality module fails for the super-trimmed reads, and the report indicates the wrong Illumina Phred score encoding version. Why did this happen?

Because with this line:

awk '{unit=unit $0 ORS} NR%4==0{if (/^[?@ABCDEFGHIJK]+$/) printf "%s", unit; unit=""}' trimmed.R1.fq > super_trimmed.R1.fq

you eliminated every read containing a base with a quality score below '?' (Q30 in Phred+33 encoding), which leads FastQC to assume that you're looking at files where the Phred scores have been encoded the Illumina 1.3 way.
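
You can confirm this by checking the lowest quality character left after your filter (a quick sanity check on the file from your command above):

awk 'NR%4==0' super_trimmed.R1.fq | grep -o . | LC_ALL=C sort -u | head -1    # should print '?' (Q30) or a higher character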

written 14 months ago by Friederike

Thank you for the explanation. I have learned a lot through this post!

written 14 months ago by kspata
Devon Ryan wrote, 14 months ago:

Why was there such a large difference in variant frequency between Bowtie2 and BWA-MEM?

If you're going to use local alignment with Bowtie2, you'll want --very-sensitive-local to get decent results. The difference in default settings is the primary cause of the difference in outputs.
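
Something along these lines (the index and file names are placeholders):

bowtie2 --very-sensitive-local -x ref_index -1 fixed.R1.fq -2 fixed.R2.fq -S aligned.sam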

As others have said, your "super trimming" seems to have gone wrong (it's also quite over the top).

written 14 months ago by Devon Ryan

Thank you! Changing the parameter to --very-sensitive-local did not alter the variant call result. I also tried --end-to-end, but that also gave the variant at the same frequency. I appreciate the help; I am beginning to understand better how the alignment algorithms work.

modified 14 months ago • written 14 months ago by kspata