Question: Checking fastq is valid
1
2.4 years ago by
flyamer30
Russian Federation
flyamer30 wrote:

Hi, I suspect that one or more of my FASTQ files is corrupted: for example, some reads have sequence and quality strings of different lengths. Would this Python script detect the problem?

import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.stdin, "fastq"):
    pass

And I use it like this:

pigz -dc -p 3 serum-3_2.fq.gz | python check.py

Will it raise an error if one of the reads is invalid? Or is there a better quick way to check this?

sequencing next-gen • 3.8k views
ADD COMMENTlink modified 2.4 years ago by apa@stowers420 • written 2.4 years ago by flyamer30
2

Why not try it with a test file?

ADD REPLYlink written 2.4 years ago by shenwei3564.6k

Ha, I don't know why not. Good idea, thanks! I checked, and it seems to at least pick up differences in sequence and quality lengths.

ADD REPLYlink written 2.4 years ago by flyamer30

A: Fastq Quality Read And Score Length Check

ADD REPLYlink written 2.4 years ago by Medhat8.3k
1
2.4 years ago by
flyamer30
Russian Federation
flyamer30 wrote:

Yes, it raises an error if the sequence and quality strings have different lengths.

ADD COMMENTlink written 2.4 years ago by flyamer30
2
2.4 years ago by
John12k
Germany
John12k wrote:

It's very difficult to say what is and isn't a valid FASTQ file, as there is no definitive specification.

A better approach is to write some tests for properties that you think should hold, for example, that the SEQ and QUAL values are the same length, that there are always 4 lines per entry, etc. Do not assume that because a bioinformatics tool/parser reports no errors, there are none. This is very often not the case, particularly with FASTA/FASTQ, as the specs are so open to interpretation.
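As a rough illustration of such tests, here is a minimal sketch using only the Python standard library. The function name `check_fastq` and the exact set of checks are assumptions chosen for this example, not an agreed definition of a valid FASTQ, and it assumes the common layout of exactly 4 lines per record (no wrapped sequence lines).

```python
import io
from itertools import zip_longest


def check_fastq(handle):
    """Yield (record_number, message) for each record failing a basic check.

    Hypothetical helper: assumes exactly 4 lines per record.
    """
    lines = (line.rstrip("\r\n") for line in handle)
    # Read the stream four lines at a time; a short final group is padded with None.
    for number, record in enumerate(zip_longest(*[lines] * 4), start=1):
        if None in record:
            yield number, "truncated record (fewer than 4 lines)"
            continue
        header, seq, plus, qual = record
        if not header.startswith("@"):
            yield number, "header does not start with '@'"
        elif not plus.startswith("+"):
            yield number, "third line does not start with '+'"
        elif len(seq) != len(qual):
            yield number, f"SEQ length {len(seq)} != QUAL length {len(qual)}"


# Small in-memory example: the second record has a short quality string.
sample = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIII\n"
for number, message in check_fastq(io.StringIO(sample)):
    print(f"record {number}: {message}")
```

For the sample above this prints `record 2: SEQ length 4 != QUAL length 3`. To run it on a decompressed stream as in the question, `sys.stdin` could be passed in place of the `io.StringIO` object.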

Once you have determined that the data is not "nonsense" but rather "missense", to borrow a term from genetics, you should perhaps analyse the distributions of the data to see if they make sense. You can use uQ to do this very quickly with the --peek parameter, for example the command:

python /Users/John/Desktop/uq.py -i /Users/John/Downloads/ENCFF000ZZU.fastq.txt --peek

outputs this: http://pastebin.com/raw/6qTMwNTp

This also checks that SEQ and QUAL are the same length, and some other very basic things. It's just a proof-of-concept for compressing FASTQ files to be as small as possible, but it's not intended to be used.

ADD COMMENTlink written 2.4 years ago by John12k
2
2.4 years ago by
apa@stowers420
Kansas City
apa@stowers420 wrote:

If you only care about read length differing from quality-score length, you could just run this:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | wc -l

That will give you a count of aberrant records.
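For anyone who prefers to stay in Python, roughly the same check can be sketched with the standard `gzip` module. The function name `mismatched_records` is hypothetical, and the sketch assumes a gzip-compressed file with exactly 4 lines per record. Unlike the `paste`/`awk` pipeline, it ignores an incomplete final record instead of comparing padded empty fields.

```python
import gzip
from itertools import islice


def mismatched_records(path):
    """Return the 4-line records whose SEQ and QUAL lengths differ.

    Hypothetical helper: assumes a gzip-compressed FASTQ, 4 lines per record.
    """
    bad = []
    with gzip.open(path, "rt") as handle:
        while True:
            # Take the next four lines (fewer at the end of the file).
            record = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not record:
                break
            # An incomplete trailing record is skipped; only lengths are compared.
            if len(record) == 4 and len(record[1]) != len(record[3]):
                bad.append(record)
    return bad
```

Under these assumptions, `len(mismatched_records("serum-3_2.fq.gz"))` would give the count of aberrant records, and joining each returned record with newlines would reproduce the reads themselves.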

ADD COMMENTlink written 2.4 years ago by apa@stowers420

Thanks to @apa.

This will give you those reads:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | tr '\t' '\n' > error_reads.fastq

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Medhat8.3k
1
2.4 years ago by
guipagui10
guipagui10 wrote:

With this tool: FastQValidator

ADD COMMENTlink written 2.4 years ago by guipagui10

Provide a link, when you are referring to a specific program. This is important since software programs may have similar names and searching the web may sometimes lead one down an undesired path (e.g. malware etc).

I will include a link for FastQValidator this time.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax67k

I'll keep that in mind. Thanks.

ADD REPLYlink written 2.4 years ago by guipagui10
0
2.4 years ago by
YaGalbi1.4k
Biocomputing, MRC Harwell Institute, Oxford, UK
YaGalbi1.4k wrote:

Try FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

ADD COMMENTlink written 2.4 years ago by YaGalbi1.4k