Question: Modification of fastq header
2.9 years ago by
seta1.2k
Sweden
seta1.2k wrote:

Hi all,

I'm trying to use the PAL_finder script, but it returns the error "Non-valid paired end read" even though my fastq files are paired. I think the problem is related to the fastq header. The header of the example data (one of the PE reads) looks like this:

@ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/1

and the header of one of my fastq PE reads is here:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1

Could you please help me modify the headers of my fastq reads so that they resemble the headers of the example data?

Thanks in advance

modification header fastq read • 1.2k views
ADD COMMENTlink modified 2.9 years ago by Charles Plessy2.7k • written 2.9 years ago by seta1.2k
2.9 years ago by
shenwei3565.0k
China
shenwei3565.0k wrote:
gzip -d -c old_1.fq.gz | sed 's/ /-/g' | gzip -c > new_1.fq.gz

Replace gzip with pigz if you have pigz, which is much faster.

ADD COMMENTlink written 2.9 years ago by shenwei3565.0k

Thank you for your reply. I tried both with this modification and with the original header. I don't know why the header changes while the script is running; that is probably the problem. The actual error is:

Non-valid paired end read:
SRR707811.1-FCD0CDRABXX:4:1101:1290:2174/1/1:TCAGCATCAGTGACAGAGGGCCAGCAGAACGAGCAGTGACAAGACAGGTGGGGCCTGGCTCCCCCCCCGCCAGCTCCANNNNNNCCCCTTGCTGCATCTG:eeeddeeeeeeeeecddd\dbdWVdddWcdc_c`bd_dcdeeeaec^c^^cdbddL_]^^ddddUdddVdBBBBBBBBBBBBBBBBBBBBee\eedecee
SRR707811.1-FCD0CDRABXX:4:1101:1290:2174/2/2:TAACTCCTCCTGGGAAAATAATCCTGTTGGAGTTGGGGGCTCTTCCCAGTTGTCTGGTTAGTTGGCCCAGGAAGGGGCAG:dae\ddddcd\ddddefdWffegffdefbd`bZ\c`O_]ZX]]L]b]acbcbZccd\Tdf_]]SbZ^__^fae^BBBBBB

As you can see, the end of the header changed to 2174/1/1 or 2174/2/2. Why was /1 or /2 added? Could you please help me with this issue? The problem did not occur with the example data.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by seta1.2k

There is probably a space somewhere outside the fastq header. To replace spaces only in the header lines:

sed '/^@SRR/ s/ /_/g' infile >outfile
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Rohit1.4k

Your solution didn't work; the same error appeared. Any suggestions, please?

ADD REPLYlink written 2.9 years ago by seta1.2k

Can you provide the first four lines of the fastq file, both for the forward and reverse?

ADD REPLYlink written 2.9 years ago by Rohit1.4k

Yes, it's here for forward:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1
TCAGCATCAGTGACAGAGGGCCAGCAGAACGAGCAGTGACAAGACAGGTGGGGCCTGGCTCCCCCCCCGCCAGCTCCANNNNNNCCCCTTGCTGCATCTG
+
eeeddeeeeeeeeecddd\dbdWVdddWcdc_c`bd_dcdeeeaec^c^^cdbddL_]^^ddddUdddVdBBBBBBBBBBBBBBBBBBBBee\eedecee

and for reverse:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/2
TAACTCCTCCTGGGAAAATAATCCTGTTGGAGTTGGGGGCTCTTCCCAGTTGTCTGGTTAGTTGGCCCAGGAAGGGGCAG
+
dae\ddddcd\ddddefdWffegffdefbd`bZ\c`O_]ZX]]L]b]acbcbZccd\Tdf_]]SbZ^__^fae^BBBBBB
ADD REPLYlink written 2.9 years ago by seta1.2k

Having spaces in the fastq headers may be another issue. If you had fastq-dumped this data using the -F option (to recover the original Illumina headers), you would not have the extra SRR707811.1 bit in your headers.
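
If re-dumping the data is not convenient, a similar result can probably be approximated on the existing files by dropping the accession token from each header line. The sketch below is only an illustration: the function name drop_accession_prefix and the file names are made up, and it assumes strict 4-line fastq records with a bare "+" separator line, as in the records posted above.

```shell
# Hypothetical helper: remove the leading "SRRxxxx.N " token from header lines.
# Assumes strict 4-line records, so headers are lines 1, 5, 9, ...
drop_accession_prefix() {
    awk 'NR % 4 == 1 { sub(/^@[^ ]+ /, "@") } { print }'
}

# Usage (placeholder file names):
# drop_accession_prefix < reads_1.fastq > reads_1.renamed.fastq
```

Selecting headers by line position rather than by a leading @ avoids touching quality lines, which can also start with @.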

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax78k
2.9 years ago by
Charles Plessy2.7k
Japan
Charles Plessy2.7k wrote:

In pal_finder's source code, one can see that:

Read names from the same pair are required to be identical,

sub validPEread {
    my $title1 = shift;
    my $title2 = shift;
[cut parts for brevity]
    return 0 unless ($title1 eq $title2);

And failures are reported by an error message where /1 and /2 are added to the read names.

if (not validPEread ($title1, $title2, $seq1, $seq2, $qual1, $qual2) ) {
    print "Non-valid paired end read:\n$title1/1:$seq1:$qual1\n$title2/2:$seq2:$qual2\n";
    exit(1);
}
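
To find out which pairs would fail this comparison before running pal_finder, the header lines of the two files can be compared directly. This is a rough sketch, not part of pal_finder: the function name check_pair_names is hypothetical, and it assumes that both files contain 4-line records in the same order and that the whole header line is what gets compared.

```shell
# Hypothetical pre-flight check: report the first pair whose header lines differ.
# Returns 0 if all header pairs are identical, 1 otherwise.
check_pair_names() {
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 && $1 != $2 {
            printf "record %d: %s vs %s\n", (NR + 3) / 4, $1, $2
            bad = 1
            exit
        }
        END { exit bad + 0 }
    '
}

# Usage (placeholder file names):
# check_pair_names reads_1.fastq reads_2.fastq && echo "all read names match"
```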

In your reply to shenwei356, you show an error message with read names ending in /1/1 and /2/2. Since pal_finder appends one /1 or /2 itself when it prints the error, the names in your source files must already end in /1 and /2, so they differ between the two files. Differing that way is perfectly valid, but pal_finder was last updated 5 years ago and does not expect this naming convention. Perhaps a command such as sed '/^[@+]/s|/[12]$||' will help you remove the trailing parts of the read names that make them differ.
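
A sed address such as /^[@+]/ also matches quality lines that happen to begin with @ or +, so a slightly more defensive variant selects header lines by their position in the file. The following is a minimal sketch under the assumption of strict 4-line records; the function name strip_pair_suffix and the file names are placeholders, not anything provided by pal_finder.

```shell
# Hypothetical helper: remove a trailing /1 or /2 from header lines only.
# Assumes strict 4-line records, so headers are lines 1, 5, 9, ...
strip_pair_suffix() {
    awk 'NR % 4 == 1 { sub(/\/[12]$/, "") } { print }'
}

# Usage (placeholder file names), once per mate file:
# strip_pair_suffix < reads_1.fastq > reads_1.stripped.fastq
# strip_pair_suffix < reads_2.fastq > reads_2.stripped.fastq
```

After this step both mates of a pair carry the same header, which is what the comparison in validPEread requires. The space inside the header is left untouched.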

ADD COMMENTlink written 2.9 years ago by Charles Plessy2.7k

Thanks, you're right that read names from the same pair must be identical. I manually removed /1 and /2 from a very small data set and the script worked. However, the headers in the script's example data look like this:

@ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/1 and @ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/2

But when I changed the end of my headers from /1 or /2 to #0/1 or #0/2, as in the example data, the same error appeared. So I have to remove /1 and /2 from my headers, right? Could you please let me know how I can do that?

Thank you

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by seta1.2k