Problematic FASTQ files... How can we trust them?
11 months ago
blackadder ▴ 30

Hello fellas,

A week ago I made another post about an error I was getting while trying to run BBDuk on a number of FASTQ files. In that case, some records were missing the "+" character on the separator line.

After looking a bit more I found the following:

Not only are there records that are missing the "+" character, but there are also records that are missing the "@" character at the start of the header. On top of that, for some records the sequence and the quality string have different lengths.
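For anyone who wants to count how often each of these three problems occurs, here is a minimal sketch. It assumes strict four-line records (header, sequence, separator, quality) with no wrapped sequences, so a record that has lost a whole line will throw off every record after it. The function name `check_fastq` and the counter keys are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import zip_longest


def check_fastq(lines):
    """Count structural problems, assuming strict 4-line FASTQ records."""
    problems = Counter()
    stripped = (ln.rstrip("\n") for ln in lines)
    # Group the lines into chunks of four: header, sequence, separator, quality.
    chunks = zip_longest(*[iter(stripped)] * 4, fillvalue=None)
    for header, seq, sep, qual in chunks:
        problems["records"] += 1
        if None in (header, seq, sep, qual):
            # The file ended in the middle of a record.
            problems["truncated"] += 1
            continue
        if not header.startswith("@"):
            problems["missing_at"] += 1
        if not sep.startswith("+"):
            problems["missing_plus"] += 1
        if len(seq) != len(qual):
            problems["length_mismatch"] += 1
    return problems


# Usage on a real file (the path is a placeholder):
# with open("sample.fastq") as fh:
#     print(check_fastq(fh))
```

A high count in any category says how widespread the damage is, but not what caused it.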

I assume that maybe all these are fixable with a script but my question is, can we trust these files? Is it worth spending time and effort to fix them?

Thanking you in advance!

fastq • 1.1k views
11 months ago
ATpoint 85k

I assume that maybe all these are fixable with a script but my question is, can we trust these files? Is it worth spending time and effort to fix them?

No, definitely not trustworthy. None of this should happen; it indicates that the files were corrupted, either during file transfer or in some processing step. I would delete everything and start from the earliest backup that exists (you have a backup, right, right??)
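If the sequencing facility supplied checksums alongside the files (an assumption; nothing in this thread says they did), comparing them against the local copies is a quick way to tell whether the damage happened during transfer. A minimal sketch, with `file_md5` as a hypothetical helper name and MD5 chosen only because it is a common choice for delivery checksums:

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Keep reading until read() returns an empty bytes object at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (path and expected value are placeholders):
# assert file_md5("sample.fastq.gz") == "<checksum from the facility>"
```

A matching checksum means the file arrived intact, so any corruption was already present at the source. A mismatch points to the transfer or to something that touched the file afterwards.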


It's not my data. An external researcher is visiting and she wanted us to do some bioinformatics... The samples were sequenced not by us but by her institution.


Then tell her to bring proper data. In the lab you also don't use buffers with mold growing in them, and in bioinformatics we don't start with corrupted data. Just my take on that...


I'd have to agree there. While you can generally recover a FASTQ file to the point that it is spec-compliant, you don't know how or why the file was corrupted. Perhaps, for example, two programs were writing to the same file at the same time. In that case, half of your reads might be from the wrong experiment, and any conclusion you draw could be false.


Perhaps ask her about the origin of the data and how it was post-processed.

FASTQ data is a bit more resilient than other types of data, as each sequence is on a separate line and a corrupted line does not affect all the rest.

In the end only the sequence is important; the headers and qualities are not all that relevant.

Sometimes salvaging hopeless data can make you feel like a hero, and I like to feel like a hero.
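For the salvage route, a rough sketch of what "a corrupted line does not affect the rest" can look like in practice: slide through the file, keep every run of four lines that looks like a well-formed record, and skip anything else one line at a time. The name `salvage_fastq` is hypothetical, and the approach assumes unwrapped four-line records. It is a heuristic, since a quality string may itself start with "@", and it cannot tell whether a well-formed read belongs to the right experiment.

```python
def salvage_fastq(lines):
    """Keep well-formed 4-line records; return (records, skipped_line_count)."""
    lines = [ln.rstrip("\n") for ln in lines]
    good, skipped = [], 0
    i = 0
    while i < len(lines):
        window = lines[i:i + 4]
        looks_valid = (
            len(window) == 4
            and window[0].startswith("@")           # header line
            and len(window[1]) > 0                  # non-empty sequence
            and window[2].startswith("+")           # separator line
            and len(window[1]) == len(window[3])    # sequence/quality lengths agree
        )
        if looks_valid:
            good.append(tuple(window))
            i += 4
        else:
            # Not the start of a valid record: drop this line and resynchronise.
            skipped += 1
            i += 1
    return good, skipped
```

The skipped-line count is worth reporting alongside the results, because a large number is a sign that the file is beyond this kind of repair.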


Thanks for the feedback fellas! I agree with the aforementioned!
