Randomly Split A Fastq File
10.6 years ago
Assa Yeroslaviz ★ 1.8k

Hi,

We have one fastq file, which we would like to split into three smaller fastq files. This could probably be done with the split command (using a line count that is a multiple of 4).

But what we would like to do is create 10 sets of triplicates from this one fastq file. So I would like to know whether there is a way to split a fastq file randomly while still keeping the four-line structure of each record.

Another way to do it is to just use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

Thanks in advance for any idea.

Assa

fastq split • 5.6k views
10.6 years ago
brentp 24k

Here is one solution:


Thanks for the script. It seems to work, though I am getting an error after a few minutes.

As the fastq file is gzipped, this is the command I'm using:

python  SplitReads.py. fastq.gz 10 3

After a few minutes I get a chunk size message:

chunk_size: 3436054

But then the script stops without an error message, showing only this traceback:

 Traceback (most recent call last):
   File "SplitFastqFile.py", line 61, in <module>
        fqsplit(fq, nchunks, nreps)
   File "SplitFastqFile.py", line 49, in fqsplit
        for i, fqr in zip(ints, fqiter(fq)):
   File "SplitFastqFile.py", line 24, in fqiter
        with xopen(fq) as fh:

Is it a memory problem? I hope you can help.

Thanks, Assa


I updated the script just now (to use izip in place of zip). Give it another try.
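For readers without the script at hand, the pattern visible in the traceback (pairing a sequence of chunk labels with a read iterator) might look roughly like the sketch below. This is not the script discussed above: `fq_iter`, `random_split`, and the output file naming are hypothetical, and it assumes Python 3 (where the built-in `zip` is already lazy, the role `itertools.izip` plays in Python 2), uncompressed input, and strict four-line records.

```python
import random


def fq_iter(path):
    """Yield each FASTQ record as a list of its four lines."""
    with open(path) as fh:
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            yield record


def random_split(path, n_chunks, prefix, seed=None):
    """Write every read to exactly one of n_chunks files, chosen at random."""
    n_reads = sum(1 for _ in fq_iter(path))  # first pass: count the reads
    # One label per read, spread as evenly as possible over the chunks.
    labels = [i % n_chunks for i in range(n_reads)]
    random.Random(seed).shuffle(labels)
    names = ["%s.%d.fastq" % (prefix, c) for c in range(n_chunks)]
    outs = [open(name, "w") for name in names]
    try:
        # zip is lazy in Python 3, so reads are streamed one at a time;
        # only the list of labels is held in memory.
        for label, record in zip(labels, fq_iter(path)):
            outs[label].writelines(record)
    finally:
        for out in outs:
            out.close()
    return names
```

Because each read receives exactly one label, no read can appear in two files of the same triplicate. Calling `random_split("reads.fastq", 3, "rep%d" % rep, seed=rep)` for `rep` in `range(10)` would give ten independent triplicates (the file name here is a placeholder).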


No, it is still not working. I can run it with the unzipped files, but not with the gzipped ones. I can't understand why.
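The thread does not establish why the gzipped input fails. For comparison, one common way to read plain and gzipped FASTQ through the same code path is sketched below. `open_maybe_gzip` is a hypothetical helper, not the `xopen` that appears in the traceback, and it assumes Python 3 and that gzipped files carry a `.gz` suffix.

```python
import gzip


def open_maybe_gzip(path, mode="rt"):
    """Open path in text mode, decompressing on the fly if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)
```

A gzip stream is best read sequentially, so any approach that counts the reads first and then reads them again decompresses the file twice.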

10.6 years ago
cts ★ 1.7k

You could select random samples of the reads using seqtk.


Yes, but I don't want to just extract a specific number of reads from a file. I would like to split the file into three parts, so that the same read never appears in two different samples of one triplicate. With seqtk I can extract a subsample, but if I do it twice the two files might share reads.


This answer is wrong and should be given -1.

10.6 years ago

Another way to do it is to just use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

To retrieve any read in constant time, you could load the file into memory as a single array and record, for each read, the byte offset of the newline character that precedes its start.

In the course of reading the FASTQ file into memory, you can strip newlines between reads, as you are storing offsets in an index-to-offset hash table.

Then, generally:

  1. Having counted the number of lines (4n) in the file, divide by four (n).
  2. Build a list of indices from {1..n}.
  3. Permute that list.
  4. To extract reads, iterate through the list and, for a given index i, extract four lines from the byte offset after index i to the byte offset before index i+1.
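The steps above can be sketched as follows. This variant keeps only the byte offsets in memory and seeks into the file on disk instead of loading the whole file, which is a substitution made here for brevity. `index_reads` and `shuffle_fastq` are hypothetical names, and the input is assumed to be uncompressed, with strict four-line records and a newline at the end of every line, including the last.

```python
import random


def index_reads(path):
    """Return the byte offset at which each four-line record starts."""
    offsets = []
    with open(path, "rb") as fh:
        while True:
            pos = fh.tell()
            if not fh.readline():
                break  # end of file
            offsets.append(pos)
            for _ in range(3):  # skip the sequence, '+' and quality lines
                fh.readline()
    return offsets


def shuffle_fastq(path, out_path, seed=None):
    """Write the reads of path to out_path in random order."""
    offsets = index_reads(path)
    order = list(range(len(offsets)))
    random.Random(seed).shuffle(order)  # permute the read indices
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        for i in order:
            src.seek(offsets[i])  # jump straight to read i
            for _ in range(4):
                dst.write(src.readline())
    return len(order)
```

The shuffled file can then be cut into three consecutive parts, each a multiple of four lines long, as the question suggests.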

A lot of scripting languages have efficient permutation libraries (example).
