Question

Problematic FASTQ files... How can we trust them?

0

4 months ago

blackadder ▴ 30

Hello fellas,

A week ago I made another post about an error I was getting while trying to run BBDuk on a number of FASTQ files. In that case, some records were missing the "+" line.

After looking a bit more I found the following:

Not only are there records missing the "+" character, there are also records missing the "@" character. On top of that, for some records the sequence and the quality string differ in length.
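For reference, the three problems above can be counted with a short check along these lines. This is only a sketch: it assumes strict four-line records (no wrapped sequences), and `check_fastq` and the demo reads are made up for illustration. Note that it reads in fixed frames of four lines, so a single dropped line shifts every later record and inflates the counts.

```python
import io
from collections import Counter
from itertools import zip_longest

def check_fastq(handle):
    """Count structural problems in a 4-line-per-record FASTQ stream."""
    problems = Counter()
    stripped = (line.rstrip("\r\n") for line in handle)
    # Pull four lines at a time from the same iterator; a short final
    # group is padded with None.
    for header, seq, plus, qual in zip_longest(*[stripped] * 4):
        if qual is None:
            problems["truncated_record"] += 1
            continue
        if not header.startswith("@"):
            problems["missing_at"] += 1
        if not plus.startswith("+"):
            problems["missing_plus"] += 1
        if len(seq) != len(qual):
            problems["length_mismatch"] += 1
    return problems

# Made-up demo: r2 lacks "@" and has a short quality string, r3 lacks "+".
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "r2\nACGT\n+\nIII\n"
    "@r3\nACGT\n-\nIIII\n"
)
report = check_fastq(demo)
print(dict(report))
```

For a real file, pass an open text handle instead of the `StringIO` demo (for gzipped input, something like `gzip.open(path, "rt")`).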

I assume that maybe all these are fixable with a script but my question is, can we trust these files? Is it worth spending time and effort to fix them?

Thanking you in advance!

fastq • 853 views

ADD COMMENT • link 4 months ago by blackadder ▴ 30

score 3 · Accepted Answer · 2023-12-01

3

4 months ago

ATpoint 82k

I assume that maybe all these are fixable with a script but my question is, can we trust these files? Is it worth spending time and effort to fix them?

No, these files are definitely not trustworthy. None of this should happen, and it indicates that corruption occurred either during file transfer or in some processing step. I would delete everything and start from the earliest backup that exists (you have a backup, right, right??)
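If the data provider supplied checksums with the files, comparing them is a quick way to tell whether the damage happened during transfer. A minimal sketch, assuming MD5 sums are available (that is an assumption, nothing in this thread says they are); `md5_of_file` and the throwaway demo file are illustrative only.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo with a throwaway file; in practice `expected` would come from the data provider.
payload = b"@r1\nACGT\n+\nIIII\n"
with tempfile.NamedTemporaryFile(suffix=".fastq", delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

expected = hashlib.md5(payload).hexdigest()
matches = md5_of_file(demo_path) == expected
os.remove(demo_path)
print("checksum matches:", matches)
```

A mismatch points at the transfer or at a later processing step; a match means the files were already damaged at the source.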

ADD COMMENT • link 4 months ago by ATpoint 82k

0

It's not my data. An external researcher is visiting and she wanted us to do some bioinformatics... The samples were not sequenced by us but by her institution.

ADD REPLY • link 4 months ago by blackadder ▴ 30

4

Then tell her to bring proper data. In the lab you also don't use buffers with mold growing in them, and in bioinformatics we don't start with corrupted data. Just my take on that...

ADD REPLY • link 4 months ago by ATpoint 82k

4

I'd have to agree there... while you can generally recover a FASTQ file to the point that it is spec-compliant, you don't know how or why the file was corrupted. For example, two programs may have been writing to the same file at the same time. In that case, half of your reads might be from the wrong experiment, and thus any conclusion you draw would be false.

ADD REPLY • link 4 months ago by Brian Bushnell 20k

1

Perhaps ask her about the origin of the data and how it was post-processed.

FASTQ data is a bit more resilient than other types of data because each sequence is on a separate line, so a corrupted line does not ruin all the rest.

In the end only the sequence is important; the headers and qualities are not all that relevant.

Sometimes salvaging hopeless data can make you feel like a hero, and I like to feel like a hero.
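As a rough illustration of what such a salvage could look like, the sketch below slides over the file line by line and keeps only groups of four lines that form a well-formed record, skipping everything else. It assumes four-line records and an A/C/G/T/N alphabet; `salvage_fastq`, `VALID_BASES` and the demo reads are hypothetical. It only restores the format and cannot tell whether the surviving reads belong to the right experiment.

```python
import io
from collections import deque

VALID_BASES = set("ACGTN")  # assumed alphabet; extend if IUPAC codes are expected

def salvage_fastq(handle):
    """Yield (header, seq, qual) for well-formed 4-line records, skipping damaged lines."""
    window = deque(maxlen=4)  # sliding window over the last four lines
    for raw in handle:
        window.append(raw.rstrip("\r\n"))
        if len(window) < 4:
            continue
        header, seq, plus, qual = window
        # Checking the whole frame matters: a quality line may itself start with "@".
        if (header.startswith("@") and plus.startswith("+")
                and seq and len(seq) == len(qual)
                and set(seq.upper()) <= VALID_BASES):
            yield header, seq, qual
            window.clear()  # start fresh after a good record

# Made-up demo: r2 has lost its "@", r3 has a short quality string.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "r2\nACGT\n+\nIIII\n"
    "@r3\nACGT\n+\nIII\n"
    "@r4\nGGCC\n+\nFFFF\n"
)
kept = list(salvage_fastq(demo))
print([header for header, _, _ in kept])
```

Comparing the number of kept records with the number of lines skipped gives a feel for how badly a file is damaged before deciding whether it is worth the effort.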