Fastq With Format Errors
3
2
Entering edit mode
9.1 years ago

I'm aligning RNA-seq paired-end reads from ENCODE project, but some FASTQ has reads with Phred error lines longer than their sequence, what makes my pipeline fails, due to a FASQT format error. Here is an example of the conflictive reads:

@30DY0AAXX_HWI-EAS229_75:7:1:1:1661/1
TTCTACTACCAGCCCTTGGGGCACTCACCCCTGTGATCAAGCAATCATTGTCAATGACAAAGTGACTATTGAAGTT
+
IIIIIIIIIIIIIIIIIIIIII<IIIIIIIIII3I*II:IIIIGIIIIIIII/5IIIIB9&F+IIII*I<I/+>9F
@30DY0AAXX_HWI-EAS229_75:7:1:1:1818/1
TTAACCACGATTATGTGCACGTACCTATGTATTATTTCTATGCGTGTCGTTGTCGAGTGCAACAACTAACTGTGCG
+
+I$##I%%'5/I#($(&"$%$-+/)%+(,(%%*/%4@$%*8&%+/F+.%(6#%%#2(I&#%$&%'$%;&#%*+%1II@GI+(&&%ID$#&I#&$.#11'IIGI%&+'5%&&&?%$.&+%#($%'&5#+%%)'%%&&-#%%*0%)&$'&$@30DY0AAXX_HWI-EAS229_75:7:1:1:1976/1 TTTGTGTCTTGTTCACGTTTCTGGTCTCGTAGCTTCTCCTCTCATCTCTTTGCATTTCTGTCCTTCCATGTCTGTG + @30DY0AAXX_IIIIIIIII&I;0&IIIII%III1,I#I0II'II,BI=$48IIB&II33&III5I3C+H?+6,'"&I-.#4./$6IIIIIIIIGIII&III/IIGII'I28I%-GAI(II123#G*$,8III3B-/,B221(3%?$;((#+2<$*.'%C"/1
@30DY0AAXX_HWI-EAS229_75:7:1:1:791/1
CTCAAGATGACATCAGTCCCATTTGTCTTAAGTCCTGGTGTTGTGTGGATGACAAGCAGAAGCCAGTTATGATGAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII/IIIIIIIIIIIIIII?IIIIIIICIII7III@IIIII


The frequency of this "bad reads" isn't very high, but neither small enough to manually remove these from the big FASTQ files. Do you know a tool to identify reads with bad FASTQ fortmat in order to remove those from both paired-end FASTQ file?.

I tried to do a script in python, but I'm used to use SeqIO module from biopython libraries and it also fail due to the conflictive reads.

Any advice will be welcome, Thanks for for time.

fastq format python rna biopython • 4.7k views
1
Entering edit mode

I also met such problem before. and be careful, if need recorded the line number of the error, since for the pair-end fastq, you need remove the same lines in another paired fastq file.

6
Entering edit mode
9.1 years ago
SES 8.4k

Beyond the length of the qual lines being odd, line 12 looks peculiar. It doesn't seem like you should have the identifier repeated as the first part of your qual string. I would try to re-download the raw data. Maybe you got a quality filtered file and the quality line was not trimmed correctly, but that does not explain the other anomalies.

Instead of trying to solve this problem with a custom script, you may be able to just find better formatted data.

4
Entering edit mode
9.1 years ago
brentp 23k

If all you want to do is truncate the qual string to the length of the seq string, then it's simple with awk:

awk '(NR % 4 == 2){ l=length($1); } (NR % 4 != 0){ print$0 }
(NR %4 ==0){ print substr($1, 1, l)}' in.fastq > out.fastq  but it looks like there's something wrong with those lines. You can just get rid of bad records with: awk 'BEGIN{OFS="\n"} { a[NR % 4] =$0;
if(NR % 4 == 0 && length(a[2]) == length(a[0])){
print a[1],a[2],a[3],a[0]
}
}'


that assumes that every 4th line is the start of a new fastq record (which may not be the case if you fastq is really messed up).

4
Entering edit mode
9.1 years ago
Bach ▴ 550

After you made sure that the errors are in the original data downloaded from ENCODE, I think the most important step you have to take is to notify the ENCODE project about this problem. Be polite, show them the problematic areas like you did here and ask them whether they could fix it ("pretty please?").

It helps you to get what you want, it helps them by fixing errors in their pipelines and own high quality data ... and it helps the community as other will invariably trip over the same problem as you.

I cannot comment on ENCODE, but had some good experience with people from the NCBI Tracearchive who would respond to such inquiries quite quickly indeed.