Checking fastq is valid
6
2
6.1 years ago
flyamer ▴ 60

Hi, I suspect that one or more of my fastq files is corrupted: for example, some reads have sequence and quality strings of different lengths. Would this Python script detect such a problem?

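import sys
from Bio import SeqIO

# Iterating over every record makes Biopython parse each one;
# a malformed record raises ValueError.
for record in SeqIO.parse(sys.stdin, "fastq"):
    pass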


And I use it like this:

pigz -dc -p 3 serum-3_2.fq.gz | python check.py


Will it raise an error if one of the reads is invalid? Or is there a better quick way to check this?

next-gen sequencing • 15k views
2

Why not try it with a test file?

0

Ha, I don't know why not, good idea, thanks! I checked, seems like it at least picks up differences in sequence and quality lengths.

1
6.1 years ago
flyamer ▴ 60

Yes, it raises an error if sequence and quality strings have different length.

4
6.1 years ago
apa@stowers ▴ 580

If you only care about read length differing from quality-score length, you could just run this:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | wc -l

That will give you a count of aberrant records.

0

Thanks to @apa. This will give you those reads:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | tr '\t' '\n' > error_reads.fastq
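For anyone who would rather stay in Python, the same length check can be sketched roughly as below. This is only an illustration, not a tested replacement for the awk one-liner: the function names `mismatched_records` and `write_mismatches` are made up here, and it assumes the usual layout of exactly four lines per record (no wrapped sequences).

```python
import gzip
from itertools import zip_longest


def mismatched_records(lines):
    """Yield 4-line records whose sequence and quality lengths differ."""
    stripped = (line.rstrip("\n") for line in lines)
    # Group the stream into chunks of four lines, like `paste - - - -`.
    # A truncated final record is padded with empty strings.
    for record in zip_longest(*[stripped] * 4, fillvalue=""):
        header, seq, plus, qual = record
        if len(seq) != len(qual):
            yield record


def write_mismatches(in_path, out_path):
    """Copy aberrant records from a gzipped FASTQ to out_path; return the count."""
    count = 0
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for record in mismatched_records(src):
            dst.write("\n".join(record) + "\n")
            count += 1
    return count
```

Calling `write_mismatches("fastq.gz", "error_reads.fastq")` should then give both the count and the extracted reads in one pass.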

3
2.8 years ago

You can use fqlint, a Rust program that identifies a broad range of issues in Illumina-based FASTQ files. To install it, run the following after installing Rust:

cargo install --git https://github.com/stjude/fqlib.git

2
6.1 years ago
John 13k

It's very difficult to say what is and isn't an invalid FASTQ, as there is no definitive specification.

A better approach is to write some tests for properties that you think should hold, for example, that the SEQ and QUAL values are the same length, that there are always 4 lines per entry, etc. Do not assume that if a bioinformatics tool/parser does not give errors, there are no errors. That is very often not the case, particularly with FASTA/FASTQ, as the specs are so open to interpretation.
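As a rough sketch of what such tests might look like, the function below checks a few properties and reports the line number of each problem. The name `check_fastq` and the particular set of checks are assumptions chosen for illustration (four lines per entry, a header starting with `@`, a separator starting with `+`, equal SEQ/QUAL lengths); add or drop checks to match what you believe should be true of your own files.

```python
def check_fastq(lines):
    """Return a list of (line_number, message) problems found in a FASTQ stream."""
    problems = []
    record = []
    line_no = 0
    for line_no, line in enumerate(lines, start=1):
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        header, seq, plus, qual = record
        start = line_no - 3  # line number of this record's header
        if not header.startswith("@"):
            problems.append((start, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((start + 2, "separator does not start with '+'"))
        if len(seq) != len(qual):
            problems.append((start + 1, "SEQ and QUAL lengths differ"))
        record = []
    if record:
        # Leftover lines mean the file is not a whole number of 4-line entries.
        problems.append((line_no, "file does not end on a 4-line boundary"))
    return problems
```

Passing it an open file handle (for example from `gzip.open(path, "rt")`) and printing the returned list is enough for a quick look; an empty list only means these specific checks passed, not that the file is valid in every sense.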

Once you have determined that the data is not "nonsense" but rather "missense", to borrow a term from genetics, you should perhaps analyse the distributions of the data to see if they make sense. You can use uQ to do this very quickly with the --peek parameter. For example, the command:

python /Users/John/Desktop/uq.py -i /Users/John/Downloads/ENCFF000ZZU.fastq.txt --peek


outputs this:

http://pastebin.com/raw/6qTMwNTp

This also checks that SEQ/QUAL are the same length, along with some other very basic things. uQ is just a proof of concept for compressing FASTQ files to be as small as possible; it is not intended for real use.

1
6.1 years ago
guipagui ▴ 10

With this tool: FastQValidator

0

Please provide a link when you refer to a specific program. This is important because software programs may have similar names, and searching the web can sometimes lead one down an undesired path (e.g. to malware).

I will include a link for FastQValidator this time.

0

I'll keep that in mind. Thanks.

0
6.1 years ago
BioinfGuru ★ 1.6k

Try FastQC