Question: Checking fastq is valid
2
3.4 years ago by
flyamer40
Russian Federation
flyamer40 wrote:

Hi, I suspect that one or more of my FASTQ files are corrupted: for example, some reads have sequence and quality strings of different lengths. Would this Python script detect such a problem?

import sys
from Bio import SeqIO
# Iterating forces Biopython to parse every record;
# a malformed record raises ValueError.
for record in SeqIO.parse(sys.stdin, "fastq"):
    pass

And I use it like this:

pigz -dc -p 3 serum-3_2.fq.gz | python check.py

Will it raise an error if one of the reads is invalid? Or is there a better quick way to check this?

sequencing next-gen • 6.4k views
modified 23 days ago by clay.l.mcleod20 • written 3.4 years ago by flyamer40
2

Why not try it with a test file?

written 3.4 years ago by shenwei3565.2k

Ha, I don't know why not. Good idea, thanks! I checked, and it seems to at least pick up differences between sequence and quality lengths.

written 3.4 years ago by flyamer40

A: Fastq Quality Read And Score Length Check

written 3.4 years ago by Medhat8.7k
1
3.4 years ago by
flyamer40
Russian Federation
flyamer40 wrote:

Yes, it raises an error if the sequence and quality strings have different lengths.

3
3.4 years ago by
apa@stowers470
Kansas City
apa@stowers470 wrote:

If you only care about read length differing from quality-score length, you could just run this:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | wc -l

That will give you a count of aberrant records.
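
For anyone who prefers to stay in Python, the same length comparison can be sketched with only the standard library. Like the `paste - - - -` one-liner, it assumes strict four-line records (no wrapped sequences). The function name `count_length_mismatches` is made up for this example.

```python
import io
from itertools import zip_longest


def count_length_mismatches(handle):
    """Count 4-line FASTQ records whose sequence and quality lengths differ.

    Assumes every record occupies exactly four lines, as the awk version does.
    """
    lines = (line.rstrip("\r\n") for line in handle)
    # Group the stream into chunks of four lines: header, sequence, '+', quality.
    # A truncated final record is padded with empty strings.
    records = zip_longest(*[lines] * 4, fillvalue="")
    return sum(1 for _, seq, _, qual in records if len(seq) != len(qual))


if __name__ == "__main__":
    # Second record has a 4-base sequence but only 3 quality characters.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIII\n")
    print(count_length_mismatches(demo))  # prints 1
```

For a gzipped file, a text-mode handle from `gzip.open(path, "rt")` can be passed in instead of the in-memory demo.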


Thanks to @apa. This variant writes the aberrant reads themselves to a file:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' |  tr '\t' '\n' > error_reads.fastq
modified 12 months ago by RamRS27k • written 3.4 years ago by Medhat8.7k
2
3.4 years ago by
John12k
Germany
John12k wrote:

It's very difficult to say what is and isn't a valid FASTQ file, as there is no definitive specification.

A better approach is to write some tests for properties you think should hold, for example that the SEQ and QUAL values are the same length, that there are always 4 lines per entry, etc. Do not assume that data is error-free just because a bioinformatics tool or parser reports no errors. That is very often not the case, particularly with FASTA/FASTQ, as the specs are so open to interpretation.
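
As a rough sketch of what such hand-written tests might look like, the following checks three properties per record: the header starts with `@`, the separator starts with `+`, and SEQ and QUAL have equal length. It assumes strict four-line records, and both the function name `check_fastq` and the choice of checks are illustrative, not an official definition of validity.

```python
import io
from itertools import islice


def check_fastq(handle):
    """Yield (line_number, message) for each failed check in a FASTQ stream.

    Assumes strict four-line records; the checks are illustrative choices.
    """
    lines = iter(handle)
    line_no = 0
    while True:
        # Pull the next (up to) four lines without loading the whole file.
        record = [line.rstrip("\r\n") for line in islice(lines, 4)]
        if not record:
            break
        start = line_no + 1  # 1-based line number of this record's header
        line_no += len(record)
        if len(record) < 4:
            yield start, "truncated record (fewer than 4 lines)"
            continue
        header, seq, sep, qual = record
        if not header.startswith("@"):
            yield start, "header does not start with '@'"
        if not sep.startswith("+"):
            yield start, "separator line does not start with '+'"
        if len(seq) != len(qual):
            yield start, "SEQ and QUAL lengths differ"


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\nr2\nACGT\n+\nIII\n@r3\nAC\n")
    for line_number, message in check_fastq(demo):
        print(f"line {line_number}: {message}")
```

Yielding problems rather than raising on the first one makes it possible to see how widespread the damage is before deciding what to do with the file.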

Once you have determined that the data is not "nonsense" but rather "missense", to borrow a term from genetics, you should perhaps analyse distributions of the data to see if they make sense. You can use uQ to do this very quickly with the --peek parameter. For example, the command:

python /Users/John/Desktop/uq.py -i /Users/John/Downloads/ENCFF000ZZU.fastq.txt --peek

outputs this:

http://pastebin.com/raw/6qTMwNTp

This also checks that SEQ and QUAL are the same length, and some other very basic things. uQ is just a proof of concept for compressing FASTQ files to be as small as possible; it is not intended for real use.
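
If uQ is not at hand, a crude version of the distribution idea can be sketched in plain Python. This is not a reimplementation of `--peek`; it only tallies read lengths and quality characters, assumes strict four-line records, and uses a made-up function name, `summarize_fastq`.

```python
import io
from collections import Counter
from itertools import islice


def summarize_fastq(handle):
    """Return (read-length counts, quality-character counts) for a FASTQ stream.

    Assumes strict four-line records; an incomplete trailing record is skipped.
    """
    read_lengths = Counter()
    quality_chars = Counter()
    lines = iter(handle)
    while True:
        record = [line.rstrip("\r\n") for line in islice(lines, 4)]
        if len(record) < 4:
            break
        _, seq, _, qual = record
        read_lengths[len(seq)] += 1
        quality_chars.update(qual)  # counts each quality character
    return read_lengths, quality_chars


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nI#I\n")
    lengths, quals = summarize_fastq(demo)
    print("read lengths:", dict(lengths))
    print("quality range:", min(quals), "to", max(quals))
```

Unexpected read lengths, or quality characters outside the range you expect for your encoding, would be a hint that something upstream went wrong even if every record parses.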

modified 12 months ago by RamRS27k • written 3.4 years ago by John12k
1
3.4 years ago by
guipagui10
guipagui10 wrote:

With this tool: FastQValidator


Please provide a link when you refer to a specific program. This is important because programs may have similar names, and searching the web may sometimes lead one down an undesired path (e.g. malware).

I will include a link for FastQValidator this time.

modified 3.4 years ago • written 3.4 years ago by genomax83k

I'll keep that in mind. Thanks.

written 3.4 years ago by guipagui10
1
23 days ago by
clay.l.mcleod20 wrote:

You can use fqlint, a Rust program that identifies a broad range of issues in Illumina-based FASTQ files. To install it, run the following after installing Rust:

cargo install --git https://github.com/stjude/fqlib.git
0
3.4 years ago by
YaGalbi1.5k
Biocomputing, MRC Harwell Institute, Oxford, UK
YaGalbi1.5k wrote:

Try FastQC

modified 12 months ago by RamRS27k • written 3.4 years ago by YaGalbi1.5k