Question: Randomly Split A Fastq File
0
5.5 years ago by
Assa Yeroslaviz1.2k
Munich
Assa Yeroslaviz1.2k wrote:

Hi,

We have one fastq file, which we would like to split into three smaller fastq files. This could probably be done with the split command (using a line count that is a multiple of 4).

But what we would like to do is create ten sets of triplicates from this one fastq file. So I would like to know if there is a way of splitting a fastq file randomly while still keeping the four-line structure of each read.

Another way to do it is to just use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

Thanks in advance for any idea.

Assa

fastq split • 3.2k views
modified 5.5 years ago by Alex Reynolds27k • written 5.5 years ago by Assa Yeroslaviz1.2k
3
5.5 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

Here is one solution:

(embedded script: gist 6625544)

written 5.5 years ago by brentp22k

Thanks for the script. It seems to work, though I am getting an error after a few minutes.

As the fastq file is gzipped, this is the command I'm using:

python  SplitReads.py. fastq.gz 10 3

After a few minutes I get a chunk size message:

chunk_size: 3436054

But then the script stops without an error message, showing only this traceback:

 Traceback (most recent call last):
   File "SplitFastqFile.py", line 61, in <module>
        fqsplit(fq, nchunks, nreps)
   File "SplitFastqFile.py", line 49, in fqsplit
        for i, fqr in zip(ints, fqiter(fq)):
   File "SplitFastqFile.py", line 24, in fqiter
        with xopen(fq) as fh:

Is it a memory problem? I hope you can help.

Thanks, Assa

modified 5.4 years ago • written 5.4 years ago by Assa Yeroslaviz1.2k

I updated the script just now (to use izip in place of zip). Give it another try.

written 5.4 years ago by brentp22k
1

No, it is still not working. I can run it with the unzipped files, but not with the gzipped ones. I can't understand why.

written 5.4 years ago by Assa Yeroslaviz1.2k
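
For readers who cannot see the embedded script: the traceback above suggests it pairs a list of chunk labels with the reads as they are streamed. The following is a minimal sketch of that general idea, not the script from the gist. It assumes Python 3, strict four-line records, and that gzipped files end in .gz; the names open_fastq, read_records and split_once are hypothetical. It does not diagnose why the original script failed on gzipped input.

```python
import gzip
import random


def open_fastq(path, mode="rt"):
    """Open a FASTQ file in text mode; files ending in .gz go through gzip."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_records(path):
    """Yield each read as one string holding its four lines."""
    with open_fastq(path) as fh:
        while True:
            rec = [fh.readline() for _ in range(4)]
            if not rec[0]:          # end of file
                return
            yield "".join(rec)


def split_once(path, out_paths, seed=None):
    """Write every read to exactly one of out_paths, chosen at random."""
    n = sum(1 for _ in read_records(path))   # first pass: count the reads
    k = len(out_paths)
    labels = [i % k for i in range(n)]       # balanced labels 0..k-1
    random.Random(seed).shuffle(labels)      # random assignment of reads
    outs = [open_fastq(p, "wt") for p in out_paths]
    try:
        # second pass: stream the reads and route each by its label
        for label, rec in zip(labels, read_records(path)):
            outs[label].write(rec)
    finally:
        for fh in outs:
            fh.close()
```

Calling split_once ten times with ten different seeds and ten sets of output names would give ten triplicates. Within each triplicate no read appears in more than one file, because each read receives exactly one label. Only the list of labels is held in memory, not the reads themselves.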
1
5.5 years ago by
cts1.6k
Pasadena
cts1.6k wrote:

You could select random samples of the reads using seqtk

written 5.5 years ago by cts1.6k
1

Yes, but I don't want to just extract a specific number of reads from a file. I would like to split the file into three parts, so that the same read never appears in two different samples of one triplicate. With seqtk I can extract a subsample, but if I do it twice the two files may share reads.

written 5.5 years ago by Assa Yeroslaviz1.2k

This answer is wrong and should be given -1.

written 18 months ago by SmallChess480
1
5.5 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

Another way to do it is to just use split on the fastq file, then shuffle the order of the reads and split again. Is there a way to re-order the reads in a fastq file randomly?

To retrieve random reads in constant time, you could read the file into memory as an array and store, for each read, the byte offset of the newline character that precedes it.

While reading the FASTQ file into memory, you can strip the newlines between reads, since the offsets are being stored in an index-to-offset hash table.

Then, generally:

  1. Count the number of lines in the file (4n) and divide by four to get the number of reads (n).
  2. Build a list of the indices {1..n}.
  3. Permute that list.
  4. To extract reads, iterate through the list and, for a given index i, extract four lines from the byte offset after index i to the byte offset before index i+1.

A lot of scripting languages have efficient permutation libraries (example).

modified 5.5 years ago • written 5.5 years ago by Alex Reynolds27k
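
The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not code from the answer: it keeps only the byte offsets in memory and seeks into the file on disk rather than loading the reads, it assumes an uncompressed file with strict four-line records and no trailing blank lines, and the names index_fastq and random_split are hypothetical.

```python
import random


def index_fastq(path):
    """Return the byte offset at which each four-line record starts."""
    offsets = []
    with open(path, "rb") as fh:
        while True:
            pos = fh.tell()
            lines = [fh.readline() for _ in range(4)]
            if not lines[0]:        # end of file
                break
            if not lines[3]:
                raise ValueError("truncated record at byte %d" % pos)
            offsets.append(pos)
    return offsets


def random_split(path, out_paths, seed=None):
    """Permute the reads and deal them out across out_paths."""
    offsets = index_fastq(path)
    random.Random(seed).shuffle(offsets)     # step 3: permute the index list
    k = len(out_paths)
    with open(path, "rb") as src:
        for j, out_path in enumerate(out_paths):
            with open(out_path, "wb") as out:
                # every k-th entry of the permuted list goes to file j
                for off in offsets[j::k]:
                    src.seek(off)
                    for _ in range(4):
                        line = src.readline()
                        if not line.endswith(b"\n"):
                            line += b"\n"    # last line of file may lack one
                        out.write(line)
```

Because the permuted list is partitioned rather than sampled from repeatedly, each read ends up in exactly one output file. Repeating the call with a different seed gives a new, independent triplicate.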