Fastq Splitter For Paired End Reads
4
15
Entering edit mode
10.8 years ago

Hi!!

I have to analyze paired-end RNA-seq read that are in an unusual format: both pair-end reads are joined in one FASTQ. I need to split the file in two separated FASTQ paried-end files.

There are a galaxy tool named FASTQ splitter that can do this:

FASTQ splitter

What it does

Splits a single fastq dataset representing paired-end run into two datasets (one for each end). This tool works only for datasets where both ends have the same length.

Sequence identifiers will have /1 or /2 appended for the split left-hand and right-hand reads, respectively.

Input format

A multiple-fastq file, for example:

@HWI-EAS91_1_30788AAXX:7:21:1542:1758
GTCAATTGTACTGGTCAATACTAAAAGAATAGGATCGCTCCTAGCATCTGGAGTCTCTATCACCTGAGCCCA
+HWI-EAS91_1_30788AAXX:7:21:1542:1758
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhfhhVZSWehR


Outputs

@HWI-EAS91_1_30788AAXX:7:21:1542:1758/1
GTCAATTGTACTGGTCAATACTAAAAGAATAGGATC
+HWI-EAS91_1_30788AAXX:7:21:1542:1758/1
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh


@HWI-EAS91_1_30788AAXX:7:21:1542:1758/2
GCTCCTAGCATCTGGAGTCTCTATCACCTGAGCCCA
+HWI-EAS91_1_30788AAXX:7:21:1542:1758/2
hhhhhhhhhhhhhhhhhhhhhhhhhfhhVZSWehR


Do you know any other standard alone script that can do this job?

rna paired tool • 44k views
7
Entering edit mode

Just as a side note, this output is what you get when you use 'fastq-dump' of the SRA tools to download paired-end data from the SRA archive. In that case, there is a nice option: 'fastq-dump --split-files' which will output one file for all the /1 pairs and another file for the /2 pairs.

0
Entering edit mode

I tried this command on interleaved fq containing paired reads from both ends, but it did not work even if I changed .fq to .sra. Should it have worked?

2
Entering edit mode

Hi trakhtenberg,

Hopefully you were able to figure this out given how long ago this was, but for posterity of others looking for answers, the fastq-dump does not act on local files. It instead downloads files from SRA and converts them into fastq files for you. If the SRA accession number (usually SRR#######) stores paired-end reads, you should use the following command:

fastq-dump --split-files -O path/to/output SRR#######


(edit to make code clearer)

14
Entering edit mode
10.8 years ago

copy this code to a file, e.g. splitPairedEndReads.pl

use strict;
use warnings;
my $file =$ARGV[0];
open(FILE, "<$file") || die "cannot open$file\n";
open(OUT1, ">$file\_1") || die "cannot open$file\_1\n";
open(OUT2, ">$file\_2") || die "cannot open$file\_2\n";
while(<FILE>){
chomp;
print OUT1 "$_\/1\n"; print OUT2 "$_\/2\n";
my $newline = <FILE>; chomp($newline);
print OUT1 substr($newline, 0, length($newline)/2)."\n";
print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";$newline = <FILE>; chomp($newline); print OUT1 "$newline\/1\n";
print OUT2 "$newline\/2\n";$newline = <FILE>; chomp($newline); print OUT1 substr($newline, 0, length($newline)/2)."\n"; print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
}
close(FILE);


Now call it with ./splitPairedEndReads.pl input.fastq.

I hope it works.... haven't checked it.

0
Entering edit mode

@david : nice simple script: its splits to half: what if datasets where both ends have the not same length? thats what Geparada might be looking for?

0
Entering edit mode

Well, if they don't have the same length, then there has to be an information about the position to split. In his example, I cannot find anything like that. Furthermore, this format (split in the middle) is well known.

0
Entering edit mode

But it's simple to change the script and split on different positions.

0
Entering edit mode

I agree with that!!

0
Entering edit mode

@David: Thanks a lot for the Script. It is exactly wha was looking for. However, unfortunately it does not really do what it is supposed to. Here is the head of the original file:

@HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG
GCTAGGGATGACTCCAGAGACTGATAATATTTCTGAACAAAATGATATCAATTCCTTATTACTACCAAACTCAGAAAATAAAGATTTACCTCGACCAT
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG
DFFFFHHHHHJJJJJJJJJJJJJIJJJIJJJJJJJIJJJJJJJJJJJJIIJJJIIJJJJJJJJJJJJJJJJJHIJHHHHHFFFFFFFEEECEBDDDDD
@HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG
GATTCCATTCATTCATTGTAGTTTCACTTGAATTTAGCTCAGAATTCTGTGTATAAGCAGGTGAAGGCATATCACTTTGGTACCAGGTGGAAGAACTC
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG
FDFFFHGHHHJJJJJJJJJIJJJJJJJJIJHIIJIIJJJJJJJIIJJGHIHHJIJJJJJJJBFGHIJCGGIJIJJJJJFIHIJHFHHCDFEDEEEEEE
@HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG
CGTTGGAGTGTTTGCGTTGCGAAGCTGCTGCAACCGTTGGCAGCTGATTTGAACTGTTCAGTTGTTGGCAGAAGGTAGGAGGCATTCCCGGGCTGG


and here after splitting:

File1:

@HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/1
GCTAGGGATGACTCCAGAGACTGATAATATTTCTGAACAAAATGATATC
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/1
DFFFFHHHHHJJJJJJJJJJJJJIJJJIJJJJJJJIJJJJJJJJJJJJI
@HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/1
GATTCCATTCATTCATTGTAGTTTCACTTGAATTTAGCTCAGAATTCTG
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/1
FDFFFHGHHHJJJJJJJJJIJJJJJJJJIJHIIJIIJJJJJJJIIJJGH
@HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG/1
CGTTGGAGTGTTTGCGTTGCGAAGCTGCTGCAACCGTTGGCAGCTGAT


File2:

@HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/2
AATTCCTTATTACTACCAAACTCAGAAAATAAAGATTTACCTCGACCAT
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/2
IJJJIIJJJJJJJJJJJJJJJJJHIJHHHHHFFFFFFFEEECEBDDDDD
@HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/2
TGTATAAGCAGGTGAAGGCATATCACTTTGGTACCAGGTGGAAGAACTC
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/2
IHHJIJJJJJJJBFGHIJCGGIJIJJJJJFIHIJHFHHCDFEDEEEEEE
@HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG/2
TTGAACTGTTCAGTTGTTGGCAGAAGGTAGGAGGCATTCCCGGGCTGG


I guess that the script did not realize the paired information that is given with the number (1 or 2) after the first space (CASAVA 1.8). I know that this is exactly the problem with other scipts that do not work anymore after the Illumina CASAVA 1.8 Update. What would I have to change to split the above shown format? Thanks a lot for any help on this issue!

0
Entering edit mode

At a first glance, it seems there is a problem with your format. Directly after the quality values in the third line, there seems to be no newline. Is that a problem of the editor here, or is it in you file?

There have to be 4 lines for each entry!

0
Entering edit mode

I tried the head of your file (after manually inserting the newlines) and the output seems to be correct.

8
Entering edit mode
10.8 years ago
Marvin ▴ 880

awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }'  for the right read: awk 'NR%2==1 { print$0 "/2" } ; NR%2==0 { print substr($0,length($0)/2) }'

4
Entering edit mode

Very elegant, but string indices in awk are 1-based so that as is, above code duplicates the last base of the first read as the first base of the second read. To avoid that, instead run:

awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length(\$0)/2+1) }'


for 2nd/right read. (Code for 1st/left read can remain unchanged since awk substr returns the same result for both 0 and 1 as the start index.)

0
Entering edit mode

Thank you for this indexing fix, @rschulz, and for the general method, @Marvin!

I used this method to separate some PE reads downloaded through NCBI's sratools fastq-dump. This works just great with the fixed indexing at the end of the "right read" function.

0
Entering edit mode

much more simpler and efficient than perl or python

2
Entering edit mode
10.8 years ago

Thanks for your script David Langenberger!

I don't understand perl scripts, so I decided to make my own python script, but again thanks for your time!

import sys
from Bio import SeqIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqRecord import SeqRecord

def main(fastq):

Rd1_out_name = fastq.replace(".fastq", "Rd1.fastq")
Rd2_out_name = fastq.replace(".fastq", "Rd2.fastq")
Rd1_out = open(Rd1_out_name, 'w')
Rd2_out = open(Rd2_out_name, 'w')

for record in SeqIO.parse(fastq, "fastq"):

seq = str(record.seq)
Rd1_seq = seq[:len(seq)/2]
Rd2_seq = seq[len(seq)/2:]
Q = record.letter_annotations["phred_quality"]
Rd1_Q = Q[:len(Q)/2]
Rd2_Q = Q[len(Q)/2:]
Rd1_id = record.id.strip("/1").strip("/2") + "/1"
Rd2_id = record.id.strip("/1").strip("/2") + "/2"

Rd1 = SeqRecord( Rd1_seq , id = Rd1_id, description = "" )
Rd1.letter_annotations["phred_quality"] = Rd1_Q
Rd2 = SeqRecord( Rd2_seq , id = Rd2_id, description = "" )
Rd2.letter_annotations["phred_quality"] = Rd2_Q

Rd1_out.write(Rd1.format("fastq"),)
Rd2_out.write(Rd2.format("fastq"),)

if __name__ == '__main__':
main(sys.argv[1])

0
Entering edit mode

Great idea! I think in Galaxy they ask you to cite them, when using the tool, don't they? Haha! I think you can publish everything....

0
Entering edit mode

I just try to don't use galaxy because I run my pipelines by shell scripts ...

0
Entering edit mode

I just use it sometimes to test some tools... when I'm too lazy to install it. But for high-throughput analysis it's completely useless.

1
Entering edit mode
8.3 years ago
smilefreak ▴ 420

Using List Comprehension in python

fastq_1 = open('fastq_r1.fastq')
fastq_2 = open('fastq_r2.fastq')

[r1.write(line) if (i % 8 < 4) else r2.write(line) for i, line in enumerate(open('test.fastq'))]

fastq_1.close()
fastq_2.close()

1
Entering edit mode

Elegant and simple, but you necro'd a 2.6-year-old old topic :)

Edit: might as well contribute anyway. I'd recommend wrapping using 'with' to make it even more idiomatic:

fastq_1 = open('fastq_r1.fastq')


becomes

with open('fastq_r1.fastq', 'a') as fastq_1:


etc., in case you forget to close your file handles.