Question: Split a FastQ file into smaller files
0
gravatar for Matthias Zepper
3 months ago by
M√ľnster, Germany
Matthias Zepper70 wrote:

Hello folks,

I happen to have a small problem which seemed to be trivial at first, but keeps me busy for a while now already. Maybe you can help anew, because most Biostars posts on this issue are quite outdated:

Problem:

I need to split 615M paired reads currently in two FastQ files (forward and reversed reads separate) into two file pairs with 308M reads.

Solution attempt A:

I unsuccessfully tried to use line count based tools like split or awk, but the wc -l does not even count them correctly (line counts differ), although they are processed nicely by several tools. By googling, I found that a possible reason for this behavior might be that newline characters may occur in the quality scores, which thus hinder the core utilities.

Solution attempt B:

I tried to use my BBTools, my favorite Swiss army knife of everything sequence related, but...

bbmap/reformat.sh  in=...  in2=... out=... out2=... reads=308000000
bbmap/reformat.sh  in=...  in2=... out=... out2=... skipreads=308000000

resulted in

Input is being processed as paired
Input:                          615307122 reads                 91326404105 bases
Output:                         615307122 reads (100.00%)       91326404105 bases (100.00%)

respectively

Input is being processed as paired
Input:                          615307122 reads                 91326404105 bases
Output:                         0 reads (0.00%)         0 bases (0.00%)

for the second command, such that I got a copy of my data in the first case and an empty file in the second.

Solution attempt C:

Here on Biostars, somebody had suggested famas for a similar problem, so I installed and tried it:

famas --in=...  --in2=... --out=...XXXXXX.fq.gz  --out2=...XXXXXX.fq.gz  -x 308000000

However, the program flooded to output directory with thousands of subfiles files (instead of the actually needed two files each) until the file system couldn't cope with the number of open files anymore and ran out of file descriptors.

ERROR(famas.c|open_output_one:1056): Couldn't open =...compressed.065534.fq.gz
ERROR(famas.c|main:1163): Couldn't open output files. Exiting...

In case somebody is wondering why I mixed the long and short notation here: the parameter --split-every=308000000 was not recognized (unknown parameter).

Since it took me a while to clean that mess up again, I am somewhat reluctant to try out more. Does sombeody have any ideas where I screwed up the commands or has suggestions which tools work better?

Thanks a lot for reading, pondering and help!

Matthias

awk bbmap fastq famas • 269 views
ADD COMMENTlink modified 3 months ago by genomax91k • written 3 months ago by Matthias Zepper70
2

For reformat.sh the deinterleave (split) syntax is:

reformat.sh in=reads.fq out1=read1.fq out2=read2.fq

While the interleave (combine) syntax is:

reformat.sh in1=read1.fq in2=read2.fq out=reads.fq

Note the difference between (in vs in1/in2).

I think you confused and therefore combined the syntax of interleave/deinterleave.

Also there's the option of seqkit (which I absolutely love). It has seqkit grep and seqkit split2 which I believe both support fastq files. It allows for seemless pipelining between standard bash functions and seqkit specific functions.

edit: I think I miss read the question! oops

ADD REPLYlink modified 3 months ago • written 3 months ago by Mark800

Thank you very much for your response! Indeed, seqkit seems to be the solution I was looking for. Currently it is still running, but the output looks quite promising. The bbmap thing seems to be a glitch since also counting reads while deinterleaving a temporarily interleaved file didn't work.

ADD REPLYlink written 3 months ago by Matthias Zepper70

What's wrong with a really stupid zcat | head -n 1232000000

ADD REPLYlink written 3 months ago by swbarnes28.9k

The request was to combine and split by F/R reads.

ADD REPLYlink written 3 months ago by Mark800
2
gravatar for cpad0112
3 months ago by
cpad011214k
India
cpad011214k wrote:

with seqkit from the manual:

$ seqkit split2 -1 reads_1.fq.gz -2 reads_2.fq.gz -p 2 -O out

out is output directory, reads_1 is R1 and reads_2 is R2. This function is to break the file into two parts. try -s option for sequence number based split.

ADD COMMENTlink modified 3 months ago • written 3 months ago by cpad011214k

Thank you very much. This works like a charm!

ADD REPLYlink written 3 months ago by Matthias Zepper70
2
gravatar for Mensur Dlakic
3 months ago by
Mensur Dlakic7.0k
USA
Mensur Dlakic7.0k wrote:

First combine your reads:

bbmap/reformat.sh in=reads1.fastq in2=reads2.fastq out=paired.fastq

Then split them:

bbmap/reformat.sh in=paired.fastq out=paired_part1.fastq reads=308000000
bbmap/reformat.sh in=paired.fastq out=paired_part2.fastq skipreads=308000000
ADD COMMENTlink modified 3 months ago • written 3 months ago by Mensur Dlakic7.0k

I tried your solution to generate an interleaved file temporarily and deinterleave it afterwards again with the read numbers. However, while the interleaving worked nicely, the deinterleaving again produced a file containing all reads and an empty file as before. So I could convert the formats but not split it into smaller files. Nonetheless, thank you very much for your help!

ADD REPLYlink written 3 months ago by Matthias Zepper70
0
gravatar for genomax
3 months ago by
genomax91k
United States
genomax91k wrote:

I am not sure why this did not work for you. I started with a 1000 PE reads. I am using BBMap v. 38.82.

$ reformat.sh in1=test1000.R1.fq.gz in2=test1000.R2.fq.gz out1=rest1000.R1.fq.gz out2=rest1000.R2.fq.gz reads=151

Set INTERLEAVED to false
Input is being processed as paired
Input:                          302 reads               75500 bases
Output:                         302 reads (100.00%)     75500 bases (100.00%)

$ reformat.sh in1=test1000.R1.fq.gz in2=test1000.R2.fq.gz out1=rest_2.R1.fq.gz out2=rest_2.R2.fq.gz skipreads=151 

Set INTERLEAVED to false
Input is being processed as paired
Input:                          2000 reads              500000 bases
Output:                         1698 reads (84.90%)     424500 bases (84.90%)
ADD COMMENTlink modified 3 months ago • written 3 months ago by genomax91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1882 users visited in the last hour