It's very difficult to say what is and isn't a valid FASTQ, as there is no definitive specification.
A better approach is to write some tests for properties that you think should hold, for example, that the SEQ and QUAL values are the same length, that there are always 4 lines per entry, etc. Do not assume that because a bioinformatics tool/parser gives no errors, there are no errors. That is very often not the case, particularly with FASTA/FASTQ, as the specs are so open to interpretation.
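As a rough illustration of the kind of tests meant here, the sketch below checks a few properties in plain Python. The function name `check_fastq` is made up for this example, and it assumes the common unwrapped layout (exactly 4 lines per entry, a header starting with `@`, a separator line starting with `+`). Those are conventions, not guarantees, so adjust the checks to whatever you believe should be true of your data.

```python
import io


def check_fastq(handle):
    """Return a list of (record_number, message) for each failed check.

    Assumes unwrapped 4-line records; that is a convention, not a rule.
    """
    problems = []
    record = []
    n = 0
    for line in handle:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        n += 1
        header, seq, plus, qual = record
        if not header.startswith("@"):
            problems.append((n, "header line does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((n, "third line does not start with '+'"))
        if len(seq) != len(qual):
            problems.append(
                (n, f"SEQ length {len(seq)} != QUAL length {len(qual)}")
            )
        record = []
    if record:
        # Leftover lines mean the file does not have 4 lines per entry.
        problems.append((n + 1, f"incomplete record with {len(record)} line(s)"))
    return problems


if __name__ == "__main__":
    example = io.StringIO("@r1\nACGT\n+\nIII\n@r2\nAC\n")
    for record_number, message in check_fastq(example):
        print(record_number, message)
```

For a real file, pass an open text handle instead of the `io.StringIO` example. An empty list only means these particular checks passed, not that the file is valid.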
Once you have determined that the data is not "nonsense" but rather "missense", to borrow a term from genetics, you should perhaps analyse the distributions of the data to see if they make sense. You can use uQ to do this very quickly with the --peek parameter, for example with the command:
python /Users/John/Desktop/uq.py -i /Users/John/Downloads/ENCFF000ZZU.fastq.txt --peek
This also checks that SEQ and QUAL are the same length, along with some other very basic things. uQ is just a proof-of-concept for compressing FASTQ files to be as small as possible, and is not intended for real use.
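If you would rather look at the distributions yourself, a minimal sketch follows. It is not how uQ works internally. `summarise_fastq` is a hypothetical helper that again assumes unwrapped 4-line records. It counts read lengths and the raw ASCII codes in the QUAL lines, and deliberately does not convert them to Phred scores, because the offset (33 or 64) is itself something you may need to work out from the data.

```python
import io
from collections import Counter
from itertools import islice


def summarise_fastq(handle):
    """Count read lengths and raw QUAL character codes.

    Assumes unwrapped 4-line records; a trailing partial record is ignored.
    """
    lengths = Counter()
    qual_codes = Counter()
    while True:
        record = list(islice(handle, 4))  # next 4 lines, fewer at end of file
        if len(record) < 4:
            break
        seq = record[1].rstrip("\r\n")
        qual = record[3].rstrip("\r\n")
        lengths[len(seq)] += 1
        qual_codes.update(ord(c) for c in qual)
    return lengths, qual_codes


if __name__ == "__main__":
    example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n#I\n")
    lengths, qual_codes = summarise_fastq(example)
    print("read lengths:", sorted(lengths.items()))
    print("QUAL code range:", min(qual_codes), "to", max(qual_codes))
```

A lowest QUAL code below 59 suggests a +33 offset, since the older +64 encodings do not go that low. Unexpected read lengths or a very narrow range of quality codes are the sort of thing worth a closer look.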
3.4 years ago by John ♦ 12k, modified 12 months ago by RamRS ♦ 27k