I'm aligning paired-end RNA-seq reads from the ENCODE project, but some FASTQ files contain reads whose Phred quality lines are longer than their sequences, which makes my pipeline fail with a FASTQ format error. Here is an example of the problematic reads:
These "bad reads" are not very frequent, but there are too many of them to remove by hand from such large FASTQ files. Do you know of a tool that identifies reads with invalid FASTQ format, so that I can remove them from both paired-end FASTQ files?
I tried to write a Python script, but I normally use the SeqIO module from Biopython, and it also fails on the problematic reads.
Beyond the length of the qual lines being odd, line 12 looks peculiar. It doesn't seem like you should have the identifier repeated as the first part of your qual string. I would try to re-download the raw data. Maybe you got a quality filtered file and the quality line was not trimmed correctly, but that does not explain the other anomalies.
Instead of trying to solve this problem with a custom script, you may be able to simply find better-formatted data.
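If a fresh download shows the same problem, a plain-text filter that avoids Biopython is still possible. The following is only a minimal sketch, not a tested tool. It assumes every record occupies exactly four lines and that both mate files list their reads in the same order; if the corruption adds or removes lines, those assumptions break and this approach will not work. The function names `read_records`, `is_valid` and `filter_pairs` are made up for illustration.

```python
from itertools import zip_longest


def read_records(handle):
    """Yield (header, seq, plus, qual) tuples, assuming strict four-line records."""
    lines = (line.rstrip("\r\n") for line in handle)
    # Passing the same iterator four times groups consecutive lines into 4-tuples.
    # A truncated final record is padded with empty strings and later rejected.
    for record in zip_longest(*[lines] * 4, fillvalue=""):
        yield record


def is_valid(record):
    """Basic structural check: '@' header, '+' separator, equal seq/qual length."""
    header, seq, plus, qual = record
    return (
        header.startswith("@")
        and plus.startswith("+")
        and len(seq) > 0
        and len(seq) == len(qual)
    )


def filter_pairs(in1, in2, out1, out2):
    """Write a pair only if both mates pass is_valid; return (kept, dropped)."""
    kept = dropped = 0
    # zip stops at the shorter file, so unmatched trailing records are ignored.
    for rec1, rec2 in zip(read_records(in1), read_records(in2)):
        if is_valid(rec1) and is_valid(rec2):
            out1.write("\n".join(rec1) + "\n")
            out2.write("\n".join(rec2) + "\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Example usage (file names are hypothetical):
# import gzip
# with gzip.open("reads_1.fastq.gz", "rt") as i1, \
#      gzip.open("reads_2.fastq.gz", "rt") as i2, \
#      gzip.open("clean_1.fastq.gz", "wt") as o1, \
#      gzip.open("clean_2.fastq.gz", "wt") as o2:
#     print(filter_pairs(i1, i2, o1, o2))
```

Dropping both mates whenever either one is malformed keeps the two output files in sync, which paired-end aligners generally expect. It is worth checking the dropped count: a very large number would suggest the four-line framing itself is off rather than just a few bad reads.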
Once you have made sure that the errors are in the original data downloaded from ENCODE, I think the most important step is to notify the ENCODE project about the problem. Be polite, show them the problematic areas as you did here, and ask whether they could fix it ("pretty please?").
This helps you get what you want, it helps them fix errors in their pipelines and keep their own data high quality, and it helps the community, as others will inevitably trip over the same problem.
I cannot comment on ENCODE, but I have had good experiences with people from the NCBI Trace Archive, who responded to such inquiries quite quickly.