Question

Whole-genome sequencing data QC problem (different QC software performs differently)

8.3 years ago

zhengyunchaosky ▴ 20

Hi,

I have been puzzled by the different performance of QC software. I used to use NGSQCToolkit to filter my raw genome data into clean data. However, the filtered FASTQ files from NGSQCToolkit still do not pass FastQC, which flags "Per sequence GC content", "Adapter content" (the Illumina universal adapter is present at the ends of reads), and "Kmer content" (apparently due to the 3' linkers at the ends of reads).

Platform: Illumina HiSeq 3000, PE 2x150 bp

5' Illumina universal adapter: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

3' adapter with sample index sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG

I can use Trimmomatic to remove the Illumina universal adapter, but the "Kmer content" module in FastQC still fails. I still don't know how to deal with this. Can I just use a fastx_toolkit command like the following to solve the problem?

nohup fastx_clipper -a "3' linker sequence string" -D -n -v -i _.fastq -o _clipped.fastq &

sequencing genome • 2.1k views

updated 20 months ago by Ram 43k • written 8.3 years ago by zhengyunchaosky ▴ 20

Items marked with a "red x" in FastQC do not automatically signify that the data is bad. As long as the data is free of adapter contamination you can move ahead with analysis. There is no rule that says that every item in FastQC has to have a green check mark.

If you think the programs you have used are not cleaning your data adequately, you may also want to try bbduk.sh from the BBMap package to scan your data.
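One rough way to check the "free of adapter contamination" condition independently of FastQC is to count how many trimmed reads still contain the start of the 3' adapter. The following is a minimal Python sketch, not a replacement for a proper trimmer or for FastQC's own adapter module. It assumes plain 4-line FASTQ records, and it uses the first 12 bases of the 3' adapter quoted in the question as an exact-match seed; the function names, the seed length, and the 100,000-read sample size are illustrative choices rather than anything taken from the tools mentioned above.

```python
import gzip
from itertools import islice

# Assumption: first 12 bases of the 3' adapter quoted in the question,
# used here as an exact-match search seed.
ADAPTER_SEED = "GATCGGAAGAGC"


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # In a 4-line record the sequence is the 2nd line (index 1),
        # so take every 4th line starting from index 1.
        for line in islice(handle, 1, None, 4):
            yield line.strip().upper()


def adapter_fraction(path, seed=ADAPTER_SEED, max_reads=100_000):
    """Return the fraction of the first max_reads reads that contain seed."""
    total = 0
    hits = 0
    for seq in islice(read_sequences(path), max_reads):
        total += 1
        if seed in seq:
            hits += 1
    # An empty file yields 0.0 instead of dividing by zero.
    return hits / total if total else 0.0
```

Calling adapter_fraction on a trimmed file (for example a hypothetical sample_R1.clean.fastq) returns a value between 0 and 1. A value close to zero suggests the adapter has been removed. Exact matching ignores sequencing errors and adapter fragments shorter than the seed, so this is only a sanity check.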

updated 4.3 years ago by Ram 43k • written 8.3 years ago by GenoMax 141k