Question

Multiple fail points in fastqc file

8 months ago

dr.deepakkukkar • 0

Dear community members, my FastQC report shows the following failed (red) and warning modules:

1] [FAIL] Per base sequence content
2] [FAIL] Per sequence GC content
3] [WARNING] Sequence Length Distribution
4] [FAIL] Sequence Duplication Levels
5] [FAIL] Overrepresented sequences

I have tried to optimize the Trimmomatic command by varying all the available parameters, but it has not helped. Can anyone please suggest a possible solution to this problem? Your suggestions will be much appreciated.

Thanks and regards,

Deepak

fastqc report • 1.1k views

You left out a critical piece of information: what kind of data is this? Is there a fixed sequence tag at the beginning of all reads? Are these amplicons?

8 months ago by GenoMax • 154k

Dear all,

Thank you so much for your valuable and critical inputs.

BR

Deepak

8 months ago by dr.deepakkukkar • 0

score 1 · Answer 1 · 2025-02-14

No one can comment intelligently on this without knowing what the data is. No one can understand what these graphs mean without knowing what the samples are.

I would totally expect a sample of Plasmodium falciparum (very AT-rich) or Mycobacterium tuberculosis (GC-rich) to "fail" the GC content test. But that doesn't mean anything is wrong! It means the assumptions of the test don't hold for samples like that, so the automated flagging of "bad" results is nonsense there and should be ignored.
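If you want to see where your own reads sit before worrying about the flag, a minimal sketch like the one below computes the GC fraction per read. It is only an illustration: the two read sets are made-up toy sequences, the function names are my own, and this is not how FastQC itself decides pass or fail.

```python
from statistics import mean


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the called (A/C/G/T) bases of one read."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        # Read contains only N (or is empty); report 0.0 rather than divide by zero.
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def mean_gc(reads) -> float:
    """Mean per-read GC fraction over an iterable of read sequences."""
    return mean(gc_fraction(read) for read in reads)


# Toy reads, invented for illustration only (not real data).
at_rich_reads = ["ATTATAAATTTAGAT", "TTTAAATATACATAA"]
gc_rich_reads = ["GCCGGCGACCGGCTC", "CGGCCGTCGGCGACG"]

print(f"AT-rich toy reads, mean GC: {mean_gc(at_rich_reads):.2f}")
print(f"GC-rich toy reads, mean GC: {mean_gc(gc_rich_reads):.2f}")
```

Comparing the numbers you get with the known GC content of your organism tells you more than the red cross does.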

RNA-seq data can also "fail" the duplicate sequence test, because some transcripts may be present in rather high abundance.

It looks like you are sequencing amplicons, so you should expect most of the reads to look alike. Why, then, are you dismayed to see that most of the reads look alike?
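To check whether the duplication really is just your amplicon, you can count identical reads and look at the most frequent ones. The sketch below does plain exact-match counting on a toy list of reads (made up for illustration); it is not FastQC's own duplication procedure, and `duplication_summary` is a name I picked.

```python
from collections import Counter


def duplication_summary(reads, top_n=3):
    """Return (fraction of distinct sequences, top_n sequences with their share of reads)."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    distinct_fraction = len(counts) / total
    top = [(seq, n / total) for seq, n in counts.most_common(top_n)]
    return distinct_fraction, top


# Toy amplicon-like data: one dominant sequence plus two variants (invented).
amplicon_reads = ["ACGTACGT"] * 8 + ["ACGTACGA", "TTGCATGC"]

distinct_fraction, top = duplication_summary(amplicon_reads)
print(f"distinct sequences / total reads: {distinct_fraction:.2f}")
for seq, share in top:
    print(f"{seq}\t{share:.0%} of reads")
```

If the top sequences turn out to be your expected amplicon, the "fail" is just the experiment working as designed.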

score 0 · Answer 2 · 2025-02-14

FastQC is a nice tool for taking a peek at the overall sanity of your data, but one should apply critical judgement to those failure reports. For example, a failed Per base sequence content module can be due to adapter sequences. These flags are more warnings to keep in mind than "no go" errors.
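A quick way to see whether a fixed tag or adapter is behind a Per base sequence content failure is to tabulate base frequencies at the first few read positions. This is a rough sketch on invented toy reads, with a hypothetical helper name; it only shows the idea and does not reproduce FastQC's plot or thresholds.

```python
from collections import Counter


def per_base_content(reads, n_positions=5):
    """For each of the first n_positions cycles, return the fraction of A, C, G and T."""
    table = []
    for pos in range(n_positions):
        # Only reads long enough to have a base at this position contribute.
        column = Counter(read[pos].upper() for read in reads if len(read) > pos)
        depth = sum(column.values())
        if depth == 0:
            table.append({})
        else:
            table.append({base: column[base] / depth for base in "ACGT"})
    return table


# Toy reads: a fixed "GATC" tag followed by one variable base (invented).
tagged_reads = ["GATCA", "GATCC", "GATCG", "GATCT"]

for pos, fractions in enumerate(per_base_content(tagged_reads), start=1):
    print(pos, fractions)
```

Positions where one base sits near 100% across all reads point to a shared tag, primer, or adapter rather than a sequencing problem.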

I would align those reads to the reference genome/transcriptome and check the percentage of aligned reads. If you think you are losing too many reads, check the unaligned reads and investigate why they are not aligning. In any case, whatever should not align to your reference will not align (for example, contamination that shows up in the GC content), duplicated reads can be flagged as such in the alignment file by a duplicate-marking step, and so on.
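As a rough sketch of that check, the snippet below counts mapped, unmapped, and duplicate-flagged reads from SAM text records using the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x400 duplicate, 0x800 supplementary). The records are invented toy lines, `alignment_summary` is my own name, and the duplicate count assumes a duplicate-marking tool has already set the flag.

```python
# SAM FLAG bits, as defined in the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def alignment_summary(sam_lines):
    """Count primary records and how many are mapped or duplicate-flagged."""
    total = mapped = duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
        if flag & DUPLICATE:
            duplicates += 1
    mapped_pct = 100.0 * mapped / total if total else 0.0
    return {"total": total, "mapped": mapped,
            "duplicates": duplicates, "mapped_pct": mapped_pct}


# Toy SAM records (invented): header, mapped, mapped duplicate, unmapped, secondary.
toy_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t1024\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCATGC\tIIIIIIII",
    "r1\t256\tchr2\t500\t0\t8M\t*\t0\t0\t*\t*",
]

summary = alignment_summary(toy_sam)
print(summary)
```

On real data you would feed it the text output of `samtools view`, or simply run `samtools flagstat`, which reports similar counts.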

Note: we cannot see anything in your images.