Question: Fastq With Format Errors
8.2 years ago by
Geparada1.4k wrote:

I'm aligning paired-end RNA-seq reads from the ENCODE project, but some FASTQ files contain reads whose Phred quality lines are longer than their sequences, which makes my pipeline fail with a FASTQ format error. Here is an example of the problematic reads:


These "bad reads" are not very frequent, but there are too many of them to remove manually from the big FASTQ files. Do you know of a tool that identifies reads with bad FASTQ format, so that I can remove them from both paired-end FASTQ files?

I tried to write a script in Python, but I usually rely on the SeqIO module from Biopython, and it also fails on the problematic reads.

Any advice is welcome. Thanks for your time.

python fastq format rna biopython • 4.3k views
modified 8.2 years ago by Bach550 • written 8.2 years ago by Geparada1.4k

I have run into this problem before. Be careful: you need to record the line numbers of the bad records, because with paired-end FASTQ you must remove the same records from the other file of the pair.

written 3.7 years ago by Shicheng Guo8.2k
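
The point about keeping both files in sync can be sketched in plain Python, without Biopython. This is only an illustration under stated assumptions: both files hold the same number of records in the same order, every record occupies exactly four lines, and the names `read_records`, `is_well_formed` and `filter_pairs` are hypothetical helpers, not part of any library.

```python
def read_records(handle):
    """Yield (header, seq, sep, qual) tuples, assuming 4 lines per record."""
    while True:
        raw = [handle.readline() for _ in range(4)]
        if raw[0] == "":  # readline() returns "" only at end of file
            return
        yield tuple(line.rstrip("\r\n") for line in raw)


def is_well_formed(record):
    """True if the header/separator look right and seq and qual lengths match."""
    header, seq, sep, qual = record
    return (header.startswith("@")
            and sep.startswith("+")
            and len(seq) == len(qual))


def filter_pairs(in1, in2, out1, out2):
    """Write only pairs where BOTH mates are well formed.

    Assumes the two inputs list mates in the same order.
    Returns (kept, dropped) pair counts.
    """
    kept = dropped = 0
    for r1, r2 in zip(read_records(in1), read_records(in2)):
        if is_well_formed(r1) and is_well_formed(r2):
            out1.write("\n".join(r1) + "\n")
            out2.write("\n".join(r2) + "\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

`filter_pairs` takes four open text-mode file handles (the two inputs and the two outputs). Because a pair is dropped as soon as either mate is malformed, the two output files stay in step with each other.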
8.2 years ago by
Vancouver, BC
SES8.3k wrote:

Beyond the length of the qual lines being odd, line 12 looks peculiar. It doesn't seem like you should have the identifier repeated as the first part of your qual string. I would try to re-download the raw data. Maybe you got a quality filtered file and the quality line was not trimmed correctly, but that does not explain the other anomalies.

Instead of trying to solve this problem with a custom script, you may be able to just find better formatted data.

8.2 years ago by
Salt Lake City, UT
brentp23k wrote:

If all you want to do is truncate the qual string to the length of the seq string, then it's simple with awk:

awk '(NR % 4 == 2){ l=length($1); }
     (NR % 4 != 0){ print $0 }
     (NR %4 ==0){ print substr($1, 1, l)}' in.fastq > out.fastq

but it looks like there's something wrong with those lines. You can just get rid of bad records with:

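awk 'BEGIN{OFS="\n"} {
        # buffer the current record: a[1]=header, a[2]=seq, a[3]=separator, a[0]=qual
        a[NR % 4] = $0;
        # on the 4th line, print the record only if seq and qual lengths match
        if(NR % 4 == 0 && length(a[2]) == length(a[0])){
            print a[1],a[2],a[3],a[0]
        }
     }' in.fastq > out.fastq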

That assumes that every 4th line is the start of a new FASTQ record (which may not be the case if your FASTQ is really messed up).
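
Whether the four-lines-per-record assumption holds can be checked before running the awk commands above. The following is a rough sketch, not a full FASTQ validator; `misframed_records` is a hypothetical helper name, and it only checks that the first line of each group of four starts with `@` and the third with `+`.

```python
def misframed_records(lines):
    """Return 1-based record numbers whose header or separator looks wrong.

    Only line positions are checked: line 1 of each group of four must
    start with '@' and line 3 with '+'. Quality strings may legitimately
    start with either character, so they are not inspected here.
    """
    bad = []
    for i in range(0, len(lines), 4):
        group = lines[i:i + 4]
        ok = (len(group) == 4
              and group[0].startswith("@")
              and group[2].startswith("+"))
        if not ok:
            bad.append(i // 4 + 1)
    return bad
```

Here `lines` is a list of lines with the newlines already stripped. If framing is lost partway through the file, the records after that point will usually be flagged as well, so the first number returned is the useful one for locating the damage.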

8.2 years ago by
Bach550 wrote:

Once you have made sure that the errors are in the original data downloaded from ENCODE, I think the most important step is to notify the ENCODE project about the problem. Be polite, show them the problematic areas like you did here, and ask them whether they could fix it ("pretty please?").

It helps you get what you want, it helps them fix errors in their pipelines and keep their data high quality ... and it helps the community, as others will invariably trip over the same problem.

I cannot comment on ENCODE, but I have had good experiences with people from the NCBI Trace Archive, who responded to such inquiries quite quickly.
