Question: Interpreting pre- and post-trim_galore FastQC results
0
gravatar for Anand Rao
12 weeks ago by
Anand Rao100
United States
Anand Rao100 wrote:

I could use some help in understanding and interpreting FastQC results, when comparing pre Vs. post trim_galore input data: All data is from Illumina HiSeq4000 paired end reads, following which I performed adapter trimming and base quality-dependent trimming using "trim_galore" - which is a wrapper around 'FastQC' and 'cutadapt'. The syntax I used was:

trim_galore --fastqc --illumina --paired --retain_unpaired EthFoc-2.S282_L007.1.txt EthFoc-2.S282_L007.2.txt

I also have some general questions, listed below:

  1. In FastQC results, when a tile is not blue, but orange or red etc, I am curious if tile-specific exclusion of Illumina reads from such tiles ever becomes necessary, and if yes, then what tools can perform such filtering / exclusion, if at all available. Per Tile Sequence Quality_Composite

  2. My most pressing question - I am really surprised by the change in Per Base Sequence Content after trimming at ~ 150nt, where A content =0!!! Isn't this abnormal? Also, at positions 1-10nt, are these sequences worth trimming away, or can I / should I keep them? Why does this initial stretch not look similar to the rest with more or less equal A/T/G/C content? Per Base Sequence Content_Composite

  3. The Per Sequence GC content is not discernibly different across the graphs in the composite image. For the fungal species being sequenced, overall GC content is commonly ~48-51%. I wonder if I should download Illumina files from NCBI SRA, for related fungal species, generated by other research groups, to check whether this deviation from the theoretical distribution is not uncommon. BTW, on basis of what genome reads is this theoretical curve plotted? Per Sequence GC Content_Composite

  4. For Per Base N content, there is a minor bump at position 1. Does this mean that my trimming was not performed as well as it should have been? Per Base N Content_Composite

  5. For the Sequence Duplication Level graphs, I am not sure I understand the difference between the red and blue lines in the sub-panels. Interestingly the only bump is for repeats ~ > 10X, not sequences with fewer or more numbers of repeats. Could this be species specific? And I wonder if I should compare this to SRA reads for identical or similar species, sequenced by other research groups. Thoughts? Per Sequence Duplication Level_Composite

  6. In terms of adapter content- this is what started it all, I saw FastQC return Illumina Universal Adapter content at multiple positions in the original reads, increasing all the way up to the read end. So I decided to run this trim_galore / cutadapt step. It seems totally normal that the adapter content would go away after this step. Correct? Adapter Content_Composite

THANK YOU!

ADD COMMENTlink modified 12 weeks ago by genomax34k • written 12 weeks ago by Anand Rao100
2
gravatar for genomax
12 weeks ago by
genomax34k
United States
genomax34k wrote:
  1. There is probably no need to worry about a few tiles showing up as color other than blue. If you are bothered by it then you could use filterbytile.sh from BBMap (Introducing FilterByTile: Remove Low-Quality Reads Without Adding Bias ).

  2. Initial base pair bias is caused by the tagmentation enzyme/random primers. These reactions are supposed to be non-biased but they are not. This generally does not cause any problems with alignments.

  3. Number 3 through 6 - Are normal observations. In #3 the red curve depicts amount of data that would remain if you were to dedupe the data.

FastQC has "limits" defined for the tests it does and they are based on normal genomic sequencing which causes "failure" (red X) if that is not what you are doing. These limits can be changed in a config file. I suggest that you move on to your actual analysis. If something does not work as expected, then back track.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by genomax34k

Thank you genomax! Very useful replies indeed. For your answer #2 to my question #2, do you mean initial base pair bias ~ 1-10 alone? How about how nucleotide compositional changes ~ 140-150 before Vs. after the trim_galore step? I am trying to understand what scenario(s) can cause "A" content to go down to zero upon trimming! Could you please throw some light on it? THANKS.

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Anand Rao100
1

You may like to change the stringency of trim-galore. By default, trim galore removes even a single bp match to adapter seq ('AGATCGGAAGAGC'), which mean a terminal A will be always trimmed. See Fig 2 https://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf

ADD REPLYlink written 12 weeks ago by Santosh Anand3.0k
1

Sequencing bias in the first few bases are explained very clearly here https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

ADD REPLYlink written 12 weeks ago by Santosh Anand3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1288 users visited in the last hour