Corrupted FASTQ files with missing "+" under some sequences
10 weeks ago
akh22 ▴ 50

Hi,

I have been trying to recover corrupted FASTQ files. Decompression failed with this error:

invalid compressed data--crc error.

I got around the CRC error by using gzrecover and then ran seqkit sana to fix sequence inconsistencies. Now, when I run FastQC, it complains that some sequences lack a "+" line under the sequence. I thought about using sed but am not sure how to add the missing "+" where it belongs.

Any help will be appreciated.

Update:

I ran ValidateFastq and found an issue:

INFO  [2021-05-21 16:13:40,878] [ValidateFastq$$anonfun$main$1] - 107300000 reads processed
Exception in thread "main" htsjdk.samtools.SAMException: Quality header must start with +: GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC at line 429343625 in fastq /Volumes/Aura/rec.test.fastq

Should I be able to add "+" right below this line with sed?
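
The check behind that exception can be approximated without htsjdk. Below is a minimal sketch, assuming a strict four-lines-per-record FASTQ with no wrapped sequence lines; `check_fastq` is a hypothetical helper name, not part of any tool mentioned here. It stops at the first line that breaks the header/sequence/separator/quality pattern, which helps show whether the problem is really a missing "+" or a record that has lost its alignment.

```shell
#!/bin/sh
# check_fastq is a hypothetical helper, not part of any tool in this thread.
# Assumes exactly four lines per record (no wrapped sequence lines).
# Prints the first structural problem and returns non-zero, or returns 0.
check_fastq() {
    LC_ALL=C awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            print "line " NR ": header must start with @"; bad = 1; exit 1
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            print "line " NR ": separator must start with +"; bad = 1; exit 1
        }
        NR % 4 == 0 && length($0) != seqlen {
            print "line " NR ": quality length differs from sequence length"; bad = 1; exit 1
        }
        END {
            if (bad) exit 1
            if (NR % 4 != 0) { print "file ends mid-record at line " NR; exit 1 }
        }
    ' "$1"
}
```

Usage would be `check_fastq rec.test.fastq`. If the first report is a header error rather than a separator error, lines were probably lost or replaced, and inserting a "+" would not restore the record.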

fastq RNAseq corruption recover • 313 views

This sounds like a lost cause. Trying to fix corrupt data is not a good strategy. You can't be certain of the results you will generate from it. Please go back and re-download the data.

If this was your only copy and it is now corrupt then you learned a valuable lesson. Always keep backup copies of all data.
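
Before relying on a re-downloaded file or a backup, it is worth confirming that the archive is intact. Here is a minimal sketch, assuming the data are stored as ordinary gzip files; `verify_gz` is a hypothetical helper name. `gzip -t` decompresses the stream without writing any output and fails on truncation or a CRC mismatch, the same class of error reported in the question.

```shell
#!/bin/sh
# verify_gz is a hypothetical helper: it succeeds only if the gzip stream
# decompresses completely and passes gzip's own integrity check.
verify_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "CORRUPT: $1"
        return 1
    fi
}
```

Running `verify_gz sample.fastq.gz` (a placeholder file name) right after each download or copy catches a damaged archive while the original source is still available. If the data provider publishes checksums, comparing against them is a complementary check.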


Can you post a small extract of some of those corrupted lines?


This is the output around the problem line:

gsed -n '429343624,429343626p' rec.test.fastq
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
P�;8���>-�T��T
              L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�`P`P`P`P`P`�XQ@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC

I have to delete this garbage

P�;8���>-�T��T
                  L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�`P`P`P`P`P`�XQ

and add "@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT".

gsed -n '429343623,429343628p' rec.test.fastq
+
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC
+


FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF

Yep, sorry, seeing this I have to go with GenoMax's point of view, I'm afraid.

Moreover, you don't need to change it to a '+'; you need to change it to the read header line, which starts with @ and contains crucial info for correct processing of your FASTQ file.

Better not to waste any more time on this. Those files are lost!


I noticed you changed your post, omitting the replacement with '+'.

How do you know the line should be "@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT"?


This is not garbage; these are chunks of binary data that somehow got mixed in with the uncompressed text. It looks like data are lost. Seconding genomax and lieven.sterck here: give it a rm *, there is not much you can reliably do about it.
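
To gauge how much of the file is affected, one option is to list the lines that contain bytes a FASTQ file should never hold. Here is a rough sketch, assuming the file is plain ASCII with Unix line endings; `find_binary_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# find_binary_lines is a hypothetical helper: it prints the line number of
# every line containing a byte that is not printable ASCII, space, or tab.
# LC_ALL=C makes both tools work on raw bytes; -a stops grep from treating
# the file as binary and suppressing the per-line output.
find_binary_lines() {
    LC_ALL=C grep -a -n -v '^[[:print:][:blank:]]*$' "$1" | LC_ALL=C cut -d: -f1
}
```

For example, `find_binary_lines rec.test.fastq | head` would show the first affected line numbers. This only locates the visibly damaged lines; it says nothing about reads that may have been dropped during recovery, so it cannot make the file trustworthy.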
