Question: Split Fastq Files Into Chunks Of 1M Reads
Bioscientist wrote, 8.9 years ago:

Each of my fastq files is about 20M reads, while I need to split the big fastq files into chunks of 1M reads.

Is there any available tool that can do such jobs?

My thought is to just count lines and start a new output file after every 1M reads (4M lines). But how can I do that with Python? Thanks.
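That line-counting idea can be sketched in Python directly. This is a minimal illustration, not from the thread: it assumes a plain-text fastq with exactly 4 unwrapped lines per record, and the function and file names are made up for the example.

```python
# Sketch: split a plain-text fastq into chunks of N reads, assuming
# exactly 4 lines per record (no line wrapping). Names are illustrative.
from itertools import islice

def split_fastq(path, reads_per_chunk=1_000_000, prefix="chunk"):
    lines_per_chunk = 4 * reads_per_chunk
    outputs = []
    with open(path) as fh:
        part = 0
        while True:
            # grab the next chunk of lines; empty list means end of file
            chunk = list(islice(fh, lines_per_chunk))
            if not chunk:
                break
            part += 1
            out = f"{prefix}_{part}.fastq"
            with open(out, "w") as o:
                o.writelines(chunk)
            outputs.append(out)
    return outputs
```

For 20M-read files this streams through the input without loading it into memory, since `islice` only pulls `lines_per_chunk` lines at a time.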

edit: My input fastq file is actually in .gz compressed form. I tried

split -l 4000000 XXX.recal.fastq.gz prefix

however, I just got a single prefix-aa file which is exactly the same size as the input. I don't know if it's because of the .gz form that the lines cannot be counted?

when I tried

split -b 46m XXX.recal.fastq.gz prefix

it works well!!! The fastq.gz is successfully split into several smaller fastq.gz files.

so why can't we use -l 4000000 here?

thx

Another question: split only has a prefix option (plus a suffix_length option); is there a suffix option? With a prefix, the output is XXX.fastq.gz-ab, which destroys the .gz naming. I want something like XXX_1.fastq.gz (changing the suffix) instead. How can I do that? Thanks.
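Both problems in the edit (gzipped input and the broken suffix) can be handled in one Python sketch: read the compressed file with the gzip module and write numbered chunks that keep the .fastq.gz extension. This is an illustrative sketch, not an existing tool; the naming scheme is an assumption.

```python
# Sketch: read a gzipped fastq and write chunks named base_1.fastq.gz,
# base_2.fastq.gz, ... Assumes unwrapped 4-line records.
import gzip

def split_fastq_gz(path, reads_per_chunk=1_000_000):
    base = path[:-len(".fastq.gz")] if path.endswith(".fastq.gz") else path
    lines_per_chunk = 4 * reads_per_chunk
    outputs, out, n, part = [], None, 0, 0
    with gzip.open(path, "rt") as fh:
        for line in fh:
            # open a fresh compressed output every lines_per_chunk lines
            if n % lines_per_chunk == 0:
                if out:
                    out.close()
                part += 1
                name = f"{base}_{part}.fastq.gz"
                out = gzip.open(name, "wt")
                outputs.append(name)
            out.write(line)
            n += 1
    if out:
        out.close()
    return outputs
```

Because both reading and writing go through gzip, the chunks are valid .gz files, unlike the raw byte-split from `split -b`.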

fastq split • 21k views
— modified 4.7 years ago by Andreas • written 8.9 years ago by Bioscientist

split will only work on text (not gzipped files), and with -b 46m you will likely get truncated records at the end of each file. You can use:

zless recal.fastq.gz | split -l 4000000 prefix

to get a bunch of uncompressed files.

— written 8.9 years ago by brentp

Yeah, it works well, until it doesn't ;) As I wrote, your file will always get split, but the result is not necessarily valid fastq. Since it works for you, just check that every generated file starts with a '@'. It all depends on the sequence lines not being wrapped (not containing newlines), which they could be by the definition of the format.

— written 8.9 years ago by Michael Dondrup
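Michael's sanity check ("every file starts with '@'") is easy to script. A minimal sketch, with the glob pattern as a placeholder for whatever prefix was passed to split; note that '@' can also legally start a quality line, so this is only a heuristic:

```python
# Sketch: flag any split chunk whose first character is not '@',
# i.e. a chunk that does not start on a record boundary.
import glob

def check_chunks(pattern="prefix*"):
    bad = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            first = fh.read(1)
        if first != "@":
            bad.append(path)
    return bad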

hi brentp, I tried zless ... | split -l 4000000 prefix. The error is: cannot open `prefix' for reading: No such file or directory. It seems prefix is treated as the input here.

— written 8.9 years ago by Bioscientist

my bad, you need '-' to tell it to read stdin: zless ... | split -l 4000000 - prefix

— written 8.9 years ago by brentp

sorry, it doesn't work...error is:"split: invalid option -- E"

— written 8.9 years ago by Bioscientist

my advice is to paste the entire command you're using into your question; then let @Jeremy update his answer since that's the clearest solution.

— written 8.9 years ago by brentp
Marvin wrote, 8.9 years ago:

A gzipped file cannot be "split by lines" without decompressing it. Split doesn't decompress, but it can read from a pipe:

zcat XXX.recal.fastq.gz | split -l 4000000 - prefix

Naturally, the output will be uncompressed.

— written 8.9 years ago by Marvin
Jeremy Leipzig (Philadelphia, PA) wrote, 8.9 years ago:

on the command line:

split -l 4000000 myFastq.fq

behold xaa, xab, xac etc...

— modified 8.9 years ago • written 8.9 years ago by Jeremy Leipzig

my split (coreutils 8.5) uses -l/--lines for number of lines per file.

— written 8.9 years ago by brentp

Unfortunately, that will not work for fastq files with entries containing newlines, so it's definitely not safe. The fastq format definition allows that; I had the same misconception and was happy to be corrected here: http://biostar.stackexchange.com/questions/3405/processing-fastq-files-in-r-bioconductor-without-reading-entire-files-into-memory/4083#4083

— written 8.9 years ago by Michael Dondrup

Note that this is definitely wrong, see here: http://biostar.stackexchange.com/questions/11957/how-common-are-multi-line-fastq-files

— written 8.7 years ago by Michael Dondrup

This is a good solution, because a fastq record is defined to be 4 lines long, and using a command-line program to do it is fast and reliable.

— written 8.9 years ago by Lee Katz

brentp, mine does too, not sure where -m came from

— written 8.9 years ago by Jeremy Leipzig

Thanks guys, I got the file xaa, which seems to be a binary file. How can I use this file for further analysis like alignment?

— written 8.9 years ago by Bioscientist

Also, I need to split hundreds of fastq files, so I need to distinguish the outputs by assigning different names (whereas splitting different fastq files here always produces the same xaa file).

— written 8.9 years ago by Bioscientist

you can specify the prefix (type "man split" for the manual); presumably you would just use the name of the fastq file as the prefix

— written 8.9 years ago by Jeremy Leipzig

It shouldn't be a binary file if your input was a fastq file. Did you check the output from split yet?

— written 8.9 years ago by Mitch Bekritsky

hi guys, please look at my edit, thanks

— written 8.9 years ago by Bioscientist
Ian (University of Manchester, UK) wrote, 8.9 years ago:

If you want an already written script:

SHRIMP has a script called 'splitreads.py'. [part of installation]

PerM has a script called 'splitReads.sh'. [available here]

Take your pick :-)

— written 8.9 years ago by Ian
Andreas (Singapore) wrote, 4.7 years ago:

For the sake of completeness: the latest version of famas (commit 04ed15e) can do this as well. The split is based on number of reads though and not file size.

 

Andreas

— written 4.7 years ago by Andreas
Michael Dondrup (Bergen, Norway) wrote, 8.9 years ago:

I don't know python, but some python pros told me it can do string handling and regular expressions (almost ;)) as powerfully as Perl, so here is what I would do in Perl; I guess it would be easy to translate if you must:

If you have enough memory to read the whole file into one string, do something like:

my @records = split /\n@/, $filecontent;

That way you get an array of fastq records. Or, if that doesn't fit in memory, change the input record separator (in Perl that's $/, default "\n") to "\n@", then read the file record by record:

local $/ = "\n@";
while (<FASTQ>) {
  # do stuff
}
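Since the question asked for Python, which has no input record separator, a rough equivalent of the same idea is to split the text on "\n@". This is a hedged sketch with the same caveat as the Perl version (a quality line starting with '@' will misparse); the function name is invented for illustration:

```python
# Sketch: split fastq text into records on "\n@", then restore the '@'
# that split() strips from every record after the first. Like the Perl
# version, this breaks if a quality line happens to start with '@'.
def fastq_records(text):
    parts = text.split("\n@")
    return [parts[0]] + ["@" + p for p in parts[1:]]
```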

Or, in case you are 100% sure, there are no wrapped lines, use Jeremy's method.

Edit: Even this is not totally safe: '@' can appear in the quality string, and then in theory a "\n@" is possible inside the quality string. Then you have to keep track of the number of opening markers "\n+" and closing markers "\n@", making the coding just a little more cumbersome.

So, please, do not use this ill defined FASTQ format! ... What, there is nothing else? You must be kidding me!

— modified 8.9 years ago • written 8.9 years ago by Michael Dondrup

Has anyone ever seen a FASTQ file where there was an errant newline? (from a file created by software used by more than 2 people)

— written 8.9 years ago by brentp

No, I haven't, but when I proposed split -l, people objected vigorously, so I assumed this was to be taken seriously. Btw: "The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string)." (Wikipedia). So, use split -l 4X, but please, please, take very good care...

— written 8.9 years ago by Michael Dondrup

hi michael, I checked some files, and it does start with @. But you are right, definitely should be cautious with using split this way.

— written 8.9 years ago by Bioscientist

I should have ignored the comments I got back then , from now on I will again assume that every record has 4 lines and that split is safe :P

— written 8.9 years ago by Michael Dondrup

Oh yes, that's actually the best answer ever: DON'T USE FASTQ! You can always replace FastQ by BAM, and you get a reasonably well-defined, compressed format with good supporting tools.

— written 8.9 years ago by Marvin
Powered by Biostar version 2.3.0