Split Super Large Files
3
3
10.9 years ago
Bioscientist ★ 1.7k

I think we all come across large files at some point, say, fastq files. Now I have some really large fastq files, around 10GB each. I need to first split them into smaller files. I've written a python script to do this, but it reads the whole file into memory (similar to readlines(), or capturing the output of zcat file), so it runs out of memory (60GB on the cluster node, but still not enough).

Just wondering, is there an approach that doesn't load the whole file but reads it line by line? Would anyone like to share a script for splitting?

BTW, I'm not doing BWA paired-end mapping, but I'm curious: is it possible to run BWA with an input file of around 10GB? Thanks.

Actually I'm using python to split. One of the key commands here is:

input = commands.getoutput('zcat ' + fastqfile).splitlines(True)


Seems a bit faster than readlines(), but the idea is still to build a "list" (called an "array" in perl). Then I can access a specific line of the list, say list[999] (the 1000th line).
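For what it's worth, picking out one specific line does not require building the list first. A minimal sketch, assuming a gzipped text file; the function name `nth_line` is made up for illustration. It uses `itertools.islice` to skip ahead on the stream, so only one line is held in memory at a time:

```python
import gzip
from itertools import islice


def nth_line(path, n):
    """Return line n (1-based) of a gzipped text file, or None if it is shorter."""
    with gzip.open(path, "rt") as fh:
        # islice consumes the first n-1 lines lazily and yields only line n.
        return next(islice(fh, n - 1, n), None)
```

The trade-off is that every lookup rescans the file from the start, so this suits a handful of lookups rather than random access to many lines.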

split • 6.8k views
0

I've successfully run BWA with files over 20 GB (compressed) in size.

0

Also, what are you trying to do? 60GB is more than enough for most alignment needs unless you have a really large genome, whereas it may fall short for assembly and splitting your reads would not help much here.

0

I've successfully run BWA on compressed fastq files over 20GB in size (around 40 GB uncompressed) on a machine with 18GB of RAM.

0

python has a gzip module, so there's no need for the zcat call.
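A rough sketch of how that could look, assuming plain 4-line fastq records (no wrapped sequence lines). The function name `split_fastq_gz`, the output naming scheme, and the default chunk size are all made up for illustration. It reads the compressed file line by line with `gzip.open` and starts a new gzipped output file every `records_per_file` records:

```python
import gzip


def split_fastq_gz(path, out_prefix, records_per_file=250_000):
    """Stream a gzipped FASTQ file into gzipped chunks of whole records."""
    lines_per_file = 4 * records_per_file  # assumes 4 lines per FASTQ record
    out_paths = []
    dst = None
    try:
        with gzip.open(path, "rt") as src:
            for i, line in enumerate(src):  # one line in memory at a time
                if i % lines_per_file == 0:
                    # Chunk boundary: close the previous part, open the next.
                    if dst is not None:
                        dst.close()
                    out_paths.append(f"{out_prefix}{len(out_paths):04d}.fastq.gz")
                    dst = gzip.open(out_paths[-1], "wt")
                dst.write(line)
    finally:
        if dst is not None:
            dst.close()
    return out_paths
```

Something like `split_fastq_gz("reads.fastq.gz", "reads_part_")` would then produce `reads_part_0000.fastq.gz`, `reads_part_0001.fastq.gz`, and so on. Memory use stays flat regardless of input size, at the cost of recompressing the output.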

10
10.9 years ago
brentp 24k

I agree with Wen.Huang that it's not that large, but it will help parallelization if you split.

You definitely do not need to read the entire file into memory. There's a unix command, split, that does what you want. Here's an example.

# make a fake example file.
for i in $(seq 10000); do echo $i >> t.fake; done

# should be a multiple of 4 for fastq
LINES_PER_FILE=1000
split -l $LINES_PER_FILE t.fake output_prefix
ls output_prefix*


Will create:

output_prefixaa  output_prefixac  output_prefixae  output_prefixag  output_prefixai
output_prefixab  output_prefixad  output_prefixaf  output_prefixah  output_prefixaj

6

There is also csplit, which splits files based on line context.

0

A small perl code as an example:

## Split the file into 10 KB pieces for sorting
system("split -b 10k $file_path$output_file $output_dir$split_file_name");

## Split the file into 100000-line pieces for sorting
system("split -l 100000 $file_path$output_file $output_dir$split_file_name");


hope this helps.

0

hmm, didn't know about csplit. thanks.

0

Thanks. But my files are compressed, so I can't use the "split" command on them directly.

0

@bioscientist

gunzip -c input.fastq.gz | split -l 2000 - output_prefix

0

great thanks! works super!

2
10.9 years ago
Wen.Huang ★ 1.2k

1) 10GB is not super large. 2) Almost every programming language can read line by line; for example, in perl you could open a few filehandles, read one line at a time, and write the lines out through the filehandles in turn. 3) Yes, bwa can handle 10GB.
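Point 2 could be sketched roughly like this, in python rather than perl since that is what the question uses. It assumes uncompressed input with 4-line fastq records; the function name `round_robin_split` and its parameters are invented for illustration. It opens several output handles up front and deals out one whole record at a time, so only a single record is ever in memory:

```python
from contextlib import ExitStack
from itertools import islice


def round_robin_split(path, out_paths, lines_per_record=4):
    """Deal whole records from `path` across `out_paths` in turn."""
    n_records = 0
    with ExitStack() as stack:  # closes every handle, even on error
        src = stack.enter_context(open(path))
        outs = [stack.enter_context(open(p, "w")) for p in out_paths]
        while True:
            # Pull the next record (up to lines_per_record lines) off the stream.
            record = list(islice(src, lines_per_record))
            if not record:
                break
            outs[n_records % len(outs)].writelines(record)
            n_records += 1
    return n_records
```

Unlike split, this gives output files of nearly equal size without knowing the total line count in advance, which can be convenient when the goal is to run one job per file.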

1
10.9 years ago
Gjain 5.7k

A small perl example of what Brentp mentioned:

## Split the file into 10 KB pieces for sorting
system("split -b 10k $file_path$output_file $output_dir$split_file_name");

## Split the file into 100000-line pieces for sorting
system("split -l 100000 $file_path$output_file $output_dir$split_file_name");


hope this helps.

3

this is not really a Perl example, it's a shell example being called by perl ...

0

Thank you for clearing that up. I know it's a shell example called from perl. I will state that clearly in future.