Question

Interpreting pre- and post-trim_galore FastQC results

0

6.8 years ago

Anand Rao ▴ 630

I could use some help understanding and interpreting FastQC results when comparing data before vs. after trim_galore. All data are Illumina HiSeq 4000 paired-end reads, on which I performed adapter trimming and base-quality-dependent trimming using "trim_galore", which is a wrapper around 'FastQC' and 'cutadapt'. The command I used was:

trim_galore --fastqc --illumina --paired --retain_unpaired EthFoc-2.S282_L007.1.txt EthFoc-2.S282_L007.2.txt

I also have some general questions, listed below:

1. In FastQC results, when a tile is not blue but orange or red, does tile-specific exclusion of Illumina reads from such tiles ever become necessary? If yes, what tools, if any, can perform such filtering / exclusion?
2. My most pressing question: I am really surprised by the change in Per Base Sequence Content after trimming at ~150 nt, where A content = 0! Isn't this abnormal? Also, are the bases at positions 1-10 nt worth trimming away, or can I / should I keep them? Why does this initial stretch not look like the rest of the read, with more or less equal A/T/G/C content?
3. The Per Sequence GC content is not discernibly different across the graphs in the composite image. For the fungal species being sequenced, overall GC content is commonly ~48-51%. I wonder if I should download Illumina files from NCBI SRA for related fungal species, generated by other research groups, to check whether this deviation from the theoretical distribution is common. BTW, on the basis of which genome or reads is this theoretical curve plotted?
4. For Per Base N content, there is a minor bump at position 1. Does this mean that my trimming was not performed as well as it should have been?
5. For the Sequence Duplication Level graphs, I am not sure I understand the difference between the red and blue lines in the sub-panels. Interestingly, the only bump is at duplication levels of roughly >10X, not at lower or higher levels. Could this be species-specific? I wonder if I should compare this to SRA reads for identical or similar species, sequenced by other research groups. Thoughts?
6. Adapter content is what started it all: I saw FastQC report Illumina Universal Adapter content at multiple positions in the original reads, increasing all the way up to the read end, so I decided to run this trim_galore / cutadapt step. It seems totally normal that the adapter content would go away after this step. Correct?

THANK YOU!

fastqc illumina adapter trimming • 5.0k views

updated 6.8 years ago by GenoMax 141k • written 6.8 years ago by Anand Rao ▴ 630

score 2 · Answer 1 · 2017-07-23

2

6.8 years ago

GenoMax 141k

1. There is probably no need to worry about a few tiles showing up in a color other than blue. If you are bothered by it, you could use filterbytile.sh from BBMap (Introducing FilterByTile: Remove Low-Quality Reads Without Adding Bias).
2. The initial base-composition bias is caused by the tagmentation enzyme / random primers. These reactions are supposed to be unbiased, but they are not. This generally does not cause any problems with alignments.
3. Numbers 3 through 6 are normal observations. In #5 (Sequence Duplication Levels), the red curve depicts the data that would remain if you were to dedupe the data.
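
To make the red vs. blue lines in the duplication plot more concrete, here is a minimal sketch of the idea, not FastQC's actual implementation (FastQC works on a sample of the file and bins the higher duplication levels, which this ignores). The function name `duplication_summary` and the toy reads are made up for illustration.

```python
from collections import Counter

def duplication_summary(sequences):
    """Simplified view of a duplication-level plot.

    Returns (percent_remaining, blue, red) where
      percent_remaining: % of reads left if every duplicate were removed
      blue[level]: % of ALL reads that sit at that duplication level
      red[level]:  % of DISTINCT sequences that sit at that duplication level
    """
    counts = Counter(sequences)          # copies seen per distinct sequence
    total = sum(counts.values())
    distinct = len(counts)

    reads_at_level = Counter()
    distinct_at_level = Counter()
    for n_copies in counts.values():
        reads_at_level[n_copies] += n_copies   # every copy counts (blue)
        distinct_at_level[n_copies] += 1       # one per sequence (red)

    blue = {lvl: 100.0 * n / total for lvl, n in reads_at_level.items()}
    red = {lvl: 100.0 * n / distinct for lvl, n in distinct_at_level.items()}
    return 100.0 * distinct / total, blue, red

# Toy data: one sequence present 10 times plus two singletons.
toy_reads = ["AAA"] * 10 + ["CCC", "GGG"]
percent_remaining, blue, red = duplication_summary(toy_reads)
print(percent_remaining, blue, red)
```

In the toy data a single highly duplicated sequence accounts for most of the reads (a large blue value at level 10) but for only one of the three distinct sequences (a much smaller red value), which is the kind of gap between the two lines that a bump at ~10X would show.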

FastQC has "limits" defined for each of the tests it does. They are based on normal genomic sequencing, so a test can "fail" (red X) simply because that is not what you are doing. These limits can be changed in a config file. I suggest that you move on to your actual analysis. If something does not work as expected, then backtrack.

0

Thank you genomax! Very useful replies indeed. For your answer #2 to my question #2, do you mean the initial base pair bias at ~1-10 alone? How about the nucleotide composition changes at ~140-150 before vs. after the trim_galore step? I am trying to understand what scenario(s) can cause "A" content to go down to zero upon trimming! Could you please shed some light on it? THANKS.

6.8 years ago by Anand Rao ▴ 630

2

You may want to change the stringency of trim_galore (the --stringency option). By default, trim_galore trims even a single-bp match to the adapter sequence ('AGATCGGAAGAGC'), which means a terminal A will always be trimmed. See Fig. 2: https://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf
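
The effect can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: exact matching with no mismatches allowed, random reads with no real adapter in them, and a made-up helper `trim_3prime` that is not cutadapt's algorithm. It trims any adapter prefix of at least `min_overlap` bases from the 3' end and then counts the bases at the last cycle among reads that are still full length, which is what a per-base plot shows at position 150.

```python
import random
from collections import Counter

ADAPTER = "AGATCGGAAGAGC"  # adapter sequence quoted in the reply above
READ_LEN = 150

def trim_3prime(read, adapter=ADAPTER, min_overlap=1):
    """Toy 3' adapter trimmer: exact matches only (assumption, no error tolerance)."""
    # Whole adapter inside the read: cut from its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Otherwise remove the longest adapter prefix that ends the read,
    # provided it is at least min_overlap bases long.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

def last_base_counts(reads, read_len):
    """Base counts at the final cycle, using only reads that kept full length."""
    return Counter(r[-1] for r in reads if len(r) == read_len)

random.seed(0)
reads = ["".join(random.choice("ACGT") for _ in range(READ_LEN)) for _ in range(2000)]

for overlap in (1, 3):
    trimmed = [trim_3prime(r, min_overlap=overlap) for r in reads]
    print("min overlap", overlap, dict(last_base_counts(trimmed, READ_LEN)))
```

With a minimum overlap of 1, every read ending in A loses that base, so no full-length read can end in A and the A fraction at the last position is zero. Requiring a longer overlap (for example `--stringency 3` in trim_galore) only removes reads ending in a longer adapter prefix such as AGA, so most terminal A's survive.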

6.8 years ago by Santosh Anand 5.7k

1

Sequencing bias in the first few bases is explained very clearly here: https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

6.8 years ago by Santosh Anand 5.7k