Question

Modification of fastq header

7.0 years ago

seta ★ 1.9k

Hi all,

I'm trying to use the PAL finder script, but it returns the error "Non-valid paired end read" even though my fastq files are paired. I think the problem is related to the fastq header. The header in the example data (one of the PE reads) looks like this:

@ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/1

and the header of one of my fastq PE reads is here:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1

Could you please help me modify the headers of my fastq reads so that they match the headers of the example data?

Thanks in advance

header fastq read modification • 4.3k views

ADD COMMENT • link updated 7.0 years ago by Charles Plessy ★ 2.9k • written 7.0 years ago by seta ★ 1.9k

score 1 · Answer 1 · 2017-04-12

In pal_finder's source code, one can see that:

Read names from the same pair are required to be identical,

sub validPEread {
    my $title1 = shift;
    my $title2 = shift;
[cut parts for brevity]
    return 0 unless ($title1 eq $title2);

And failures are reported by an error message where /1 and /2 are added to the read names.

if (not validPEread ($title1, $title2, $seq1, $seq2, $qual1, $qual2) ) {
    print "Non-valid paired end read:\n$title1/1:$seq1:$qual1\n$title2/2:$seq2:$qual2\n";
    exit(1);
}
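
The string comparison quoted above can be approximated outside pal_finder to see which pair of names trips it. The following is only a sketch: the function name `first_name_mismatch` is made up for illustration, and it assumes uncompressed FASTQ files with exactly four lines per record and mates in the same order. It does not reproduce whatever the cut parts of `validPEread` do.

```shell
# Hypothetical helper (not part of pal_finder): print the first record whose
# header lines differ between two FASTQ files and return a non-zero status.
# Assumes plain-text FASTQ, 4 lines per record, mates in the same order.
first_name_mismatch() {
    # $1 = forward reads, $2 = reverse reads
    awk -v r2="$2" '
        {
            # Read the matching line from the reverse file, one per input line.
            if ((getline other < r2) <= 0) {
                print "reverse file ended early"
                bad = 1
                exit
            }
        }
        # Header lines are lines 1, 5, 9, ... of the file.
        NR % 4 == 1 && $0 != other {
            printf "record %d: %s != %s\n", (NR + 3) / 4, $0, other
            bad = 1
            exit
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Run as `first_name_mismatch reads_1.fq reads_2.fq` (file names are placeholders); silence and a zero exit status mean every header pair compared equal.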

In your reply to shenwei356, you show an error message with read names ending in /1/1 and /2/2. Since pal_finder itself appends one /1 or /2 when it prints the error, the names in your source files already end in /1 and /2, so they already differ. Differing that way is totally valid, but pal_finder was last updated 5 years ago and does not expect this naming convention. Perhaps a command such as sed '/^[@+]/s/\/[12]$//' will help you remove the trailing parts of the read names that make them differ.
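
A variant of that idea, offered as a sketch rather than a tested fix for pal_finder: selecting header lines by position instead of by their first character avoids touching a quality line that happens to start with @ or + and end in /1 or /2. The function name `strip_mate_suffix` is hypothetical, and the code assumes four-line FASTQ records (no wrapped sequences).

```shell
# Hypothetical helper: remove a trailing /1 or /2 from FASTQ header lines only.
# Reads FASTQ on stdin and writes the result to stdout.
# Assumes exactly 4 lines per record, so headers are lines 1, 5, 9, ...
strip_mate_suffix() {
    awk 'NR % 4 == 1 { sub(/\/[12]$/, "") } { print }'
}
```

For gzipped input it can sit in the same kind of pipeline shown elsewhere in this thread, for example `gzip -d -c old_1.fq.gz | strip_mate_suffix | gzip -c > new_1.fq.gz`.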

score 0 · Answer 2 · 2017-04-12

7.0 years ago

shenwei356 8.4k

gzip -d -c old_1.fq.gz | sed 's/ /-/g' | gzip -c > new_1.fq.gz

Replace gzip with pigz if you have pigz, which is much faster.

ADD COMMENT • link 7.0 years ago by shenwei356 8.4k

Thank you for your reply. I tried both with this modification and with the original header. I don't know why the header changes while the script is running, but that is probably the problem. The actual error is:

Non-valid paired end read:
SRR707811.1-FCD0CDRABXX:4:1101:1290:2174/1/1:TCAGCATCAGTGACAGAGGGCCAGCAGAACGAGCAGTGACAAGACAGGTGGGGCCTGGCTCCCCCCCCGCCAGCTCCANNNNNNCCCCTTGCTGCATCTG:eeeddeeeeeeeeecddd\dbdWVdddWcdc_c`bd_dcdeeeaec^c^^cdbddL_]^^ddddUdddVdBBBBBBBBBBBBBBBBBBBBee\eedecee
SRR707811.1-FCD0CDRABXX:4:1101:1290:2174/2/2:TAACTCCTCCTGGGAAAATAATCCTGTTGGAGTTGGGGGCTCTTCCCAGTTGTCTGGTTAGTTGGCCCAGGAAGGGGCAG:dae\ddddcd\ddddefdWffegffdefbd`bZ\c`O_]ZX]]L]b]acbcbZccd\Tdf_]]SbZ^__^fae^BBBBBB

As you can see, the end of the header changed to 2174/1/1 or 2174/2/2. Why was /1 or /2 added? Could you please help me out with this issue? The problem did not occur with the example data.

ADD REPLY • link 7.0 years ago by seta ★ 1.9k

The space inside the fastq header is probably the problem:

sed '/^@SRR/ s/ /_/g' infile >outfile

ADD REPLY • link 7.0 years ago by Rohit ★ 1.5k

Your solution didn't work; the same error appeared. Any suggestions, please?

ADD REPLY • link 7.0 years ago by seta ★ 1.9k

Can you provide the first four lines of the fastq file, both for the forward and reverse?

ADD REPLY • link 7.0 years ago by Rohit ★ 1.5k

Yes, it's here for forward:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1
TCAGCATCAGTGACAGAGGGCCAGCAGAACGAGCAGTGACAAGACAGGTGGGGCCTGGCTCCCCCCCCGCCAGCTCCANNNNNNCCCCTTGCTGCATCTG
+
eeeddeeeeeeeeecddd\dbdWVdddWcdc_c`bd_dcdeeeaec^c^^cdbddL_]^^ddddUdddVdBBBBBBBBBBBBBBBBBBBBee\eedecee

and for reverse:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/2
TAACTCCTCCTGGGAAAATAATCCTGTTGGAGTTGGGGGCTCTTCCCAGTTGTCTGGTTAGTTGGCCCAGGAAGGGGCAG
+
dae\ddddcd\ddddefdWffegffdefbd`bZ\c`O_]ZX]]L]b]acbcbZccd\Tdf_]]SbZ^__^fae^BBBBBB

ADD REPLY • link 7.0 years ago by seta ★ 1.9k

Having spaces in fastq headers may be another issue. If you had fastq-dumped this data using the -F option (to recover the original Illumina headers), you would not have the extra SRR707811.1 bit in your headers.
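
If re-dumping is not convenient, a rough equivalent can be applied to files that already exist. This is a sketch under assumptions: the function name `drop_sra_prefix` is hypothetical, headers are assumed to look like the ones posted above (accession, one space, original name), and records are assumed to span exactly four lines. It is not guaranteed to match what fastq-dump -F would write.

```shell
# Hypothetical helper: drop the leading "SRRxxxx.N " token from FASTQ headers,
# leaving "@" followed by the original read name.
# Reads FASTQ on stdin and writes the result to stdout.
# Headers without a space are passed through unchanged.
drop_sra_prefix() {
    awk 'NR % 4 == 1 { sub(/^@[^ ]+ /, "@") } { print }'
}
```

With the forward read posted above, the header would become @FCD0CDRABXX:4:1101:1290:2174/1, which also removes the space.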

ADD REPLY • link 7.0 years ago by GenoMax 141k