Corrupted FASTQ files with missing "+" under some sequences.
10 weeks ago
akh22 ▴ 50

Hi,

I have been trying to recover corrupted FASTQ files. I had a decompression error:

invalid compressed data--crc error.

I got around the CRC error by using gzrecover and then used seqkit sana to fix sequence inconsistencies. Now the issue is that when I run FastQC, it complains that some sequences lack a "+" line under the sequence. I thought about using sed, but I am not sure how to add the missing "+" where it should be.

Any help will be appreciated.

Update:

I ran ValidateFastq and found an issue:

INFO  [2021-05-21 16:13:40,878] [ValidateFastqanonfun$main$1] - 107300000 reads processed


Should I be able to add "+" right below this line with sed?

fastq RNAseq corruption recover • 313 views
This sounds like a lost cause. Trying to fix corrupt data is not a good strategy. You can't be certain of the results you will generate by doing this. Please go back and re-download the data.

If this was your only copy and it is now corrupt then you learned a valuable lesson. Always keep backup copies of all data.
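Before relying on a re-downloaded or backup copy, it may be worth confirming that the gzip stream decompresses cleanly, since the original symptom was a CRC error. The following is a minimal Python sketch of such a check, roughly what `gzip -t` does; the function name `gzip_is_intact` and the chunk size are assumptions made for illustration.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses without error.

    Hypothetical helper: it reads the file to the end so that the CRC
    stored in the gzip trailer is actually checked.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Discard the data; we only care whether reading succeeds.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or CRC mismatch)
        # and also unreadable files; EOFError means a truncated stream;
        # zlib.error means a damaged deflate stream.
        return False
    return True
```

A file that fails this check should be replaced from the source rather than repaired. Comparing against a checksum published by the sequencing provider, if one exists, is a stronger test.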

Can you post a small extract of some of those corrupted lines?

This is the output around the problem line:

gsed -n '429343624,429343626p' rec.test.fastq
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
P�;8���>-�T��T
L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�PPPPP�XQ@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC


I have to delete this garbage:

P�;8���>-�T��T
L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�PPPPP�XQ


gsed -n '429343623,429343628p' rec.test.fastq
+
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC
+

FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF

Yep, sorry, seeing this I have to go with GenoMax's point of view, I'm afraid.

Moreover, you don't need to change it to a '+'; you need to change it to the read header line, which starts with @ and contains crucial info for correct processing of your FASTQ file.
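The four-line record layout at issue here (a header starting with @, the sequence, a separator starting with +, and a quality string as long as the sequence) can be checked mechanically. Below is a minimal Python sketch that reports the first record breaking that layout; the function name `first_bad_record` is hypothetical, and the sketch assumes an uncompressed FASTQ with exactly four lines per record (no wrapped sequences). It only diagnoses the problem and does not attempt a repair.

```python
from itertools import islice


def first_bad_record(path):
    """Return (line_number, reason) for the first malformed record, or None.

    Hypothetical helper. The file is read as bytes so that stray binary
    data cannot trigger a decoding error.
    """
    with open(path, "rb") as handle:
        lines_seen = 0
        while True:
            record = [ln.rstrip(b"\r\n") for ln in islice(handle, 4)]
            if not record:
                return None  # reached end of file without problems
            if len(record) < 4:
                return (lines_seen + 1, "truncated record")
            header, seq, sep, qual = record
            if not header.startswith(b"@"):
                return (lines_seen + 1, "header does not start with @")
            if not sep.startswith(b"+"):
                return (lines_seen + 3, "separator does not start with +")
            if len(seq) != len(qual):
                return (lines_seen + 4, "sequence and quality lengths differ")
            lines_seen += 4
```

The scan stops at the first violation because every later record is out of frame once a line is missing or extra, so later reports would not be meaningful.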

Better not to waste any more time on this. Those files are lost!

I noticed you changed your post, omitting the replacement with '+'.

How do you know the line should be: @A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT?

This is not garbage; these are chunks of binary data that somehow got mixed in with the uncompressed text. It looks like data are lost. Seconding genomax and lieven.sterck here: give it an rm *, there is not much you can reliably do about it.
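As a rough way to see how widespread such binary chunks are in a recovered file, one could list the lines that contain bytes outside printable ASCII. This is a minimal Python sketch; the function name `non_text_lines` and the definition of "text" as printable ASCII plus tab are assumptions made for illustration. It only reports damage and cannot recover the reads that were overwritten.

```python
# Bytes treated as ordinary text: printable ASCII (space through ~) and tab.
PRINTABLE = bytes(range(0x20, 0x7F)) + b"\t"


def non_text_lines(path):
    """Return the 1-based numbers of lines containing non-text bytes.

    Hypothetical helper. Lines are split on newline bytes, so the
    numbering matches what line-oriented tools such as sed report.
    """
    bad = []
    with open(path, "rb") as handle:
        for line_no, line in enumerate(handle, start=1):
            body = line.rstrip(b"\r\n")
            # translate() deletes every printable byte; anything left
            # over is a byte that should not appear in a FASTQ file.
            if body.translate(None, PRINTABLE):
                bad.append(line_no)
    return bad
```

A long list of affected lines spread across the file would support the conclusion above that the data cannot be trusted.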