Bad quality fastq files for analysis
13 months ago
Gene_MMP8 ▴ 240

I am working on a project that requires me to identify multiple fastq files with low quality. What can be a possible starting point for this sort of data search?

quality bad DNA-seq fastq alignment • 1.4k views

Bad in what way?


Based on FastQC scores. The boxplots produced by FastQC show the distribution of quality scores at each base position. Usually, a score of 30 or above is considered good quality. Is there a way to pick out multiple files (~100) that don't pass this threshold?


Do you already have data and want to identify bad ones or do you need to download files that are bad, e.g. from GEO? I do not really get that.


I don't have the data available. I want to identify such datasets. The overall aim is to determine what factors influence FASTQ data quality. For that, I already have a set of features available. All I need is labeled measurements from hundreds to thousands of FASTQ files.

13 months ago
GenoMax 141k

You can simply make simulated files/data with any features you like. Use randomreads.sh from BBTools or a similar tool.

Illumina quality parameters:
maxq=36         Upper bound of quality values.
midq=28         Approximate average of quality values.
minq=20         Lower bound of quality values.
q=              Sets maxq, midq, and minq to the same value.
adderrors=t     Add substitution errors based on quality values, 
                after mutations.
qv=4            Vary the base quality of reads by up to this much
                to simulate tile effects.
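
To get labeled files out of this, one option is to generate a batch of randomreads.sh commands with different quality settings and use the setting as the label. The sketch below only builds the command lines and does not run them. The reference name, output file names, read count, read length and the "good"/"bad" profiles are all made-up placeholders; the "medium" profile just reuses the default values listed above. Check the options against the help text of your BBTools version before running anything.

```python
from itertools import product

# Hypothetical quality profiles: label -> (minq, midq, maxq).
# "medium" mirrors the defaults shown in the help text above;
# "good" and "bad" are arbitrary illustrative choices.
PROFILES = {
    "good": (30, 35, 40),
    "medium": (20, 28, 36),
    "bad": (5, 15, 25),
}


def randomreads_command(ref, out, reads, length, minq, midq, maxq):
    """Build (but do not execute) an argument list for randomreads.sh."""
    if not minq <= midq <= maxq:
        raise ValueError("expected minq <= midq <= maxq")
    return [
        "randomreads.sh",
        f"ref={ref}",
        f"out={out}",
        f"reads={reads}",
        f"length={length}",
        f"minq={minq}",
        f"midq={midq}",
        f"maxq={maxq}",
        "adderrors=t",
    ]


def build_commands(ref="reference.fa", reads=100000, length=150, replicates=2):
    """Return (label, command) pairs, one per profile and replicate."""
    commands = []
    for (label, quals), rep in product(PROFILES.items(), range(1, replicates + 1)):
        out = f"sim_{label}_{rep}.fq.gz"  # placeholder output name
        commands.append((label, randomreads_command(ref, out, reads, length, *quals)))
    return commands


if __name__ == "__main__":
    for label, cmd in build_commands():
        print(label, " ".join(cmd))
```

Each printed line can be pasted into a shell, or passed to subprocess.run, once the placeholders are replaced with real paths.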
13 months ago
shelkmike ★ 1.2k

You can run "seqkit stats" (https://bioinf.shenwei.me/seqkit/usage/#stats) on all of these files, then classify them as "good" or "bad" based on, for example, the Q30 percentage (reported when the -a/--all flag is used).
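
For anyone who wants to see what a Q30-based label boils down to, here is a minimal sketch in plain Python. It is not how seqkit computes its numbers, just the same idea: count the fraction of bases with a Phred score of at least 30. It assumes Phred+33 encoding and strict four-line FASTQ records, and the 0.8 cutoff for calling a file "good" is an arbitrary placeholder, not a standard.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def q30_fraction(lines, threshold=30, offset=33):
    """Fraction of bases whose Phred score is >= threshold.

    `lines` is any iterable of FASTQ lines; every fourth line
    (index 3, 7, 11, ...) is treated as a quality string.
    """
    total = 0
    passing = 0
    for i, line in enumerate(lines):
        if i % 4 == 3:
            quals = line.rstrip("\r\n")
            total += len(quals)
            passing += sum(1 for ch in quals if ord(ch) - offset >= threshold)
    return passing / total if total else 0.0


def label_file(path, min_q30=0.8):
    """Label a file 'good' or 'bad' using an assumed Q30-fraction cutoff."""
    with open_fastq(path) as handle:
        frac = q30_fraction(handle)
    return ("good" if frac >= min_q30 else "bad"), frac
```

Calling label_file on each of the files gives a (label, fraction) pair per file, which can go straight into a table next to the other features.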
