Hello fellas,
A week ago I made another post about an error I was getting while trying to run BBDuk on a number of FASTQ files. In that case, some records were missing the "+" character.
After looking a bit more I found the following:
Not only are there records missing the "+" character, but there are also records missing the "@" character. On top of that, some records have sequence and quality strings of different lengths.
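To get a sense of how widespread each problem is, a rough check like the one below could be used. This is only a sketch: the function name `check_fastq_lines` is made up for illustration, and it assumes every record still occupies exactly four lines (no wrapped sequences, no lost lines). If whole lines are missing, everything after that point will be reported as broken.

```python
from collections import Counter

def check_fastq_lines(lines):
    """Count format problems, assuming strict 4-line FASTQ records."""
    problems = Counter()
    lines = [ln.rstrip("\r\n") for ln in lines]
    for i in range(0, len(lines), 4):
        rec = lines[i:i + 4]
        if len(rec) < 4:
            # File ends in the middle of a record
            problems["truncated_record"] += 1
            break
        header, seq, plus, qual = rec
        if not header.startswith("@"):
            problems["missing_at"] += 1
        if not plus.startswith("+"):
            problems["missing_plus"] += 1
        if len(seq) != len(qual):
            problems["length_mismatch"] += 1
    return problems
```

The function accepts any iterable of lines, so an open file handle can be passed directly. The counts only say how many records look wrong, not why they got that way.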
I assume all of this could be fixed with a script, but my question is: can we trust these files? Is it worth spending the time and effort to fix them?
Thanks in advance!
It's not my data. An external researcher is visiting and she wanted us to do some bioinformatics... The samples were not sequenced by us but by her institution.
Then tell her to bring proper data. In the lab you also don't use buffers with mold growing in them, and in bioinformatics we don't start with corrupted data. Just my take on that...
I'd have to agree there... while you can generally recover a FASTQ file to the point that it is spec-compliant, you don't know how or why the file was corrupted. For example, perhaps two programs were writing to the same file at the same time. In that case, half of your reads might be from the wrong experiment, and any conclusion you draw could be false.
Perhaps ask her about the origin of the data and how it was post-processed.
FASTQ data is a bit more resilient than other types of data, as each sequence is on a separate line and a corrupted line does not affect all the rest.
In the end only the sequence is important; the headers and qualities are not all that relevant.
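Because the format is line-based, a salvage pass can re-synchronise after a damaged record. Here is a minimal sketch of that idea, not a vetted tool: `salvage_records` is a hypothetical name, it assumes unwrapped four-line records and plain A/C/G/T/N sequences, and it simply drops anything that does not look like a complete record.

```python
import re

# Assumption: sequences contain only A, C, G, T or N (either case)
SEQ_RE = re.compile(r"^[ACGTNacgtn]+$")

def salvage_records(lines):
    """Return (good_records, skipped_line_count) from possibly damaged FASTQ lines."""
    lines = [ln.rstrip("\r\n") for ln in lines]
    good, skipped = [], 0
    i = 0
    while i < len(lines):
        window = lines[i:i + 4]
        if (len(window) == 4
                and window[0].startswith("@")
                and SEQ_RE.match(window[1])
                and window[2].startswith("+")
                and len(window[3]) == len(window[1])):
            # Four consecutive lines form a plausible record: keep it
            good.append(tuple(window))
            i += 4
        else:
            # Not a valid record start: drop this line and try the next one
            skipped += 1
            i += 1
    return good, skipped
```

Quality strings can legitimately start with "@", so the check looks at all four lines together rather than at the header alone. Even so, this only restores the format. It cannot tell you whether the surviving reads belong to the right sample.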
Sometimes salvaging hopeless data can make you feel like a hero. I like to feel like a hero.
Thanks for the feedback, fellas! I agree with all of the above!