Corrupted FASTQ files with missing "+" under some sequences
10 weeks ago
akh22 ▴ 50

Hi,

I have been trying to recover corrupted FASTQ files. Decompression failed with this error:

invalid compressed data--crc error.

I got around the CRC error by using gzrecover, and then used seqkit sana to fix sequence inconsistencies. Now, when I run FastQC, it complains that some sequences lack a "+" line under the sequence. I thought about using sed, but I am not sure how to add the missing "+" where it should be.

Any help will be appreciated.

Update:

I ran ValidateFastq and found an issue:

INFO  [2021-05-21 16:13:40,878] [ValidateFastqanonfun$main$1] - 107300000 reads processed


Should I be able to add "+" right below this line with sed?

fastq RNAseq corruption recover • 313 views

This sounds like a lost cause. Trying to fix corrupt data is not a good strategy: you can't be certain of any results you generate from it. Please go back and re-download the data.

If this was your only copy and it is now corrupt, then you have learned a valuable lesson: always keep backup copies of all data.
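
After re-downloading, it may be worth confirming that the archive decompresses cleanly before running anything on it. The following is only a minimal sketch using Python's standard gzip module; the function name `gzip_is_intact` and the chunk size are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses and passes its CRC check.

    Hypothetical helper for illustration only.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end: the CRC is only verified once the stream is exhausted.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # CRC mismatches surface as OSError (gzip.BadGzipFile on Python 3.8+),
        # truncated files as EOFError, damaged deflate data as zlib.error.
        return False
    return True
```

A file that fails this check has the same kind of problem as the original crc error and should be fetched again rather than repaired.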


Can you post a small extract of some of those corrupted lines?


This is the output around the problem line:

gsed -n '429343624,429343626p' rec.test.fastq
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
P�;8���>-�T��T
L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�PPPPP�XQ@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC


I have to delete this garbage:

P�;8���>-�T��T
L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�PPPPP�XQ


gsed -n '429343623,429343628p' rec.test.fastq
+
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC
+

FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF


Yep, sorry, but seeing this I have to go with GenoMax's point of view, I'm afraid.

Moreover, you don't need to change it to a '+'. You need to change it to the read header line, which starts with @ and contains crucial info for correct processing of your FASTQ file.

Better not to waste any more time on this. Those files are lost!
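
To see how far the damage goes, the record layout described above (a header starting with @, the sequence, a '+' separator, then qualities) can be checked mechanically. This is a rough sketch that assumes strict 4-line records with no wrapped sequences; `check_fastq` and `max_report` are hypothetical names chosen for illustration.

```python
from itertools import islice


def check_fastq(path, max_report=10):
    """Return up to max_report (line_number, reason) pairs for broken records.

    Assumes strict 4-line FASTQ records. Reads bytes so that stray binary
    data does not raise a decoding error.
    """
    problems = []
    line_no = 0
    with open(path, "rb") as handle:
        while len(problems) < max_report:
            record = [line.rstrip(b"\r\n") for line in islice(handle, 4)]
            if not record:
                break  # clean end of file
            first = line_no + 1
            line_no += len(record)
            if len(record) < 4:
                problems.append((first, "truncated record"))
                break
            header, seq, plus, qual = record
            if not header.startswith(b"@"):
                problems.append((first, "header does not start with '@'"))
            elif not plus.startswith(b"+"):
                problems.append((first + 2, "separator line does not start with '+'"))
            elif len(seq) != len(qual):
                problems.append((first + 3, "sequence and quality lengths differ"))
    return problems
```

Only the first reported line is really informative: once the 4-line framing is lost, every later record tends to be flagged as well, which is why the report is capped.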


I noticed you changed your post, omitting the replacement with '+'.

How do you know the line should be: @A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT ?


This is not garbage; these are chunks of binary data that somehow got mixed in with the uncompressed text. It looks like the data are lost. Seconding genomax and lieven.sterck here: give it a rm *, there is not much you can reliably do about it.
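
To get a feel for how many such binary chunks are in the recovered file, one could list the lines that contain bytes outside printable ASCII. This is a sketch only; `lines_with_binary` and `max_report` are made-up names, and treating anything other than printable ASCII and tabs as "binary" is an assumption that suits FASTQ but not text files in general.

```python
# Bytes that are expected in a plain FASTQ line: printable ASCII plus tab.
ALLOWED = bytes(range(0x20, 0x7F)) + b"\t"


def lines_with_binary(path, max_report=10):
    """Return 1-based numbers of lines containing bytes outside ALLOWED."""
    hits = []
    with open(path, "rb") as handle:
        for line_no, line in enumerate(handle, start=1):
            body = line.rstrip(b"\r\n")
            # translate(None, ALLOWED) deletes every allowed byte;
            # anything left over is unexpected binary content.
            if body.translate(None, ALLOWED):
                hits.append(line_no)
                if len(hits) >= max_report:
                    break
    return hits
```

Lines are split on newline bytes, so the numbers should match what sed -n prints. Every hit marks a spot where reads were overwritten, not just misformatted, so there is nothing to restore them from.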