Whole-genome sequencing data QC problem (different QC tools give different results)
8.3 years ago

Hi,

I have been puzzled by how differently QC tools perform. I used to use NGSQCToolkit to filter my raw genome data into clean data. However, the filtered FASTQ files from NGSQCToolkit still do not pass FastQC, which flags "Per sequence GC content", "Adapter content" (the Illumina universal adapter is present at the ends of reads), and "Kmer content" (apparently caused by the 3' linkers at the ends of reads).

Platform: Illumina HiSeq 3000, PE 2x150 bp

5' Illumina universal adapter: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

3' adapter (with sample index sequence): GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG

I can use Trimmomatic to remove the Illumina universal adapter, but the "Kmer content" module in FastQC still fails. I still don't know how to deal with this problem. Could I just use a fastx_toolkit command like the following to solve it?

nohup fastx_clipper -a "3' linker sequence string" -D -n -v -i _.fastq -o _clipped.fastq &

Items marked with a "red x" in FastQC do not automatically signify that the data is bad. As long as the data is free of adapter contamination you can move ahead with analysis. There is no rule that says that every item in FastQC has to have a green check mark.
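If you want a quick way to confirm that the trimmed reads really are free of adapter sequence, something like the sketch below can help. It only counts reads that contain an exact copy of a short probe, so it will miss adapters with sequencing errors or partial adapters at the very end of a read. The function name `count_adapter_reads` is made up for this example, and the input is assumed to be an uncompressed FASTQ with four lines per record.

```shell
# Hypothetical helper: count reads containing an exact adapter probe.
# $1 = uncompressed FASTQ file (4 lines per record)
# $2 = probe sequence, e.g. the first bases of the 3' adapter
# Prints "<reads_with_probe> <total_reads>".
count_adapter_reads() {
    awk -v probe="$2" '
        # Line 2 of every 4-line record is the sequence line.
        NR % 4 == 2 {
            total++
            if (index($0, probe) > 0) hits++
        }
        END { printf "%d %d\n", hits, total }
    ' "$1"
}
```

For example, `count_adapter_reads clean_R1.fastq GATCGGAAGAGC` uses the first 12 bases of the 3' adapter quoted in the question as the probe (`clean_R1.fastq` is a placeholder file name). A count near zero suggests the trimming worked.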

If you think the programs you have used are not doing an adequate job of cleaning your data, you may also want to try bbduk.sh from the BBMap package to scan it.
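As a starting point, paired-end adapter trimming with BBDuk could look like the sketch below. The parameters follow the adapter-trimming example in the BBDuk usage guide; the wrapper function `bbduk_trim` and all file names are placeholders, and `adapters.fa` is assumed to be the adapter reference in the `resources` directory of your BBMap installation. Adjust `k`, `mink`, and `hdist` for your own data.

```shell
# Hypothetical wrapper around bbduk.sh for paired-end adapter trimming.
# $1/$2 = input R1/R2 FASTQ, $3/$4 = output R1/R2 FASTQ,
# $5    = adapter FASTA (e.g. resources/adapters.fa from BBMap)
bbduk_trim() {
    # ktrim=r : trim the adapter and everything to its right (3' side)
    # k=23    : k-mer length used to match adapters
    # mink=11 : allow shorter k-mers at read ends
    # hdist=1 : allow one mismatch per k-mer
    # tpe/tbo : trim both mates to equal length / trim by pair overlap
    bbduk.sh in1="$1" in2="$2" out1="$3" out2="$4" \
        ref="$5" ktrim=r k=23 mink=11 hdist=1 tpe tbo
}
```

Usage would be along the lines of `bbduk_trim raw_R1.fastq raw_R2.fastq clean_R1.fastq clean_R2.fastq /path/to/bbmap/resources/adapters.fa`, followed by rerunning FastQC on the cleaned files.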
