Splitting fastq from stdin into multiple files?
8.0 years ago by flyamer ▴ 60

Hi, I am trying to split a fastq stream with a Python script into files with a defined number of reads, and to immediately start a mapping job on each newly created file. All of this happens on an SGE cluster, which is why I want to map the reads in small chunks in parallel. The problem I encounter is that one of the files I generate always starts somewhere inside a read, not at the beginning of one. The split point can fall at different positions within the read, and it can happen with different file numbers. Because of that, mapping obviously fails. This is my script:

import argparse
import os
from os import path
from subprocess import call
import sys

parser = argparse.ArgumentParser()
parser.add_argument("outfolder", help="folder to store new files")
parser.add_argument("prefix", help="prefix for the new filename")
parser.add_argument("N_reads", help="number of reads per file",
                    default=1000000, type=int)

args = parser.parse_args()

def read_write():
    i = 0  # lines written to the current chunk so far
    n = 0  # index of the current chunk file
    filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
    f = open(filename, 'w')
    print('Writing to ' + filename)
    try:
        for line in sys.stdin:
            f.write(line)
            i += 1
            # a fastq record is 4 lines, so a full chunk is N_reads*4 lines
            if i == args.N_reads*4:
                f.close()
                print('Mapping', filename)
                # submit a mapping job for the finished chunk via the head node
                call('ssh headnode1.ecdf.ed.ac.uk "cd /exports/igmm/datastore/wendy-lab/ilia/scripts; qsub launcher.sh ' + filename + '"', shell=True)
                n += 1
                filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
                i = 0
                f = open(filename, 'w')
                print('Writing to ' + filename)
    finally:
        # iterating over sys.stdin simply stops at EOF and never raises
        # EOFError, so close the last file in a finally block instead
        f.close()
    if i == 0:
        os.remove(filename)
        print('Removed empty file')
    print('Done')

read_write()

The only assumption I make is that every read is written as exactly 4 lines, with no blank lines or any weirdness. The files were generated by Illumina software; I then concatenated the gzipped files, and now I decompress them with pigz to stdout and my script reads that from stdin. So I think the assumption should hold. Is there anything else I am missing here?
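
For concreteness, the whole pipeline looks roughly like this (a sketch: the file names and the script name split_reads.py are placeholders):

# concatenated gzip files form a single valid gzip stream
cat sample_L001.fastq.gz sample_L002.fastq.gz > sample.fastq.gz
# decompress to stdout with pigz and feed the stream to the splitter
pigz -dc sample.fastq.gz | python split_reads.py outdir sample 1000000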

fastq python

Thanks, I know about this, but I want to start mapping while I'm still splitting the file. Maybe that's too much to ask, of course...


Do you think the time lost splitting then mapping is going to be smaller or greater than the time spent figuring out how to split-and-map on the fly? :)

Also, plan for what happens when a mapper/node goes down in the middle of processing: if you map and split on the fly, you'll have to start from the beginning.


My guess is it will be smaller, and you are probably right that I should just forget this. But quite apart from the benefit of splitting on the fly, I am really surprised this is not working as it should, and I would like to understand why...


How about splitting the file with xargs?


Can you please provide an example? I am not very familiar with bash, unfortunately.

$ cat file
1a
2a
3a
4a
1b
2b
3b
4b
$ xargs -n4 < file echo "4 args:"
4 args: 1a 2a 3a 4a
4 args: 1b 2b 3b 4b
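
Note that echo joins the four lines with spaces, so this exact command would mangle a real fastq record. To keep the newlines, you could have xargs hand each group of four lines to printf instead (a sketch, assuming GNU xargs for the -d option):

$ xargs -n4 -d '\n' printf '%s\n%s\n%s\n%s\n' < file
1a
2a
3a
4a
1b
2b
3b
4b

You would still need something like split to send each group to its own file.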

Thanks! That's interesting!


I don't know; my guess is that it is a network filesystem problem. However, these days the better way to map reads is to process them in one go with multiple threads. As for reading compressed files, many mappers deal with that natively or can read the data from a stream.
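
For example, a streaming invocation could look like the sketch below (bwa mem is just one mapper that accepts reads on stdin; the file names are placeholders):

# decompress on the fly and map with 16 threads; '-' makes bwa mem read from stdin
pigz -dc reads.fastq.gz | bwa mem -t 16 ref.fa - > aln.sam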

8.0 years ago

A very fast alternative is Heng Li's seqtk (-s100 sets the seed for the random subsampling):

seqtk sample -s100 read1.fq 10000 > sub1.fq

Thanks for the idea! But can it split the whole file, rather than only subsample N reads?


I don't think this is natively implemented in seqtk, but once you pass the fastq file through seqtk you can be completely sure about its output format, and you can then split it with bash into groups of 4 lines:

seqtk seq file.fastq.gz | split -l 4000

This would split your original fastq file into smaller fastq files of 1000 reads each.
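
And if you also want to launch a mapping job as soon as each chunk is complete (the original goal of the question), GNU split can run a command per finished chunk through --filter. A minimal sketch, assuming a reasonably recent GNU coreutils and reusing the launcher.sh submit script from the question:

# split the decompressed stream into 1,000,000-read (4,000,000-line) chunks;
# inside --filter, split sets $FILE to the name of the current chunk
pigz -dc file.fastq.gz | split -l 4000000 --additional-suffix=.fastq \
    --filter='cat > $FILE; qsub launcher.sh $FILE' - chunk_

This avoids the custom Python splitter entirely and never submits a job before its chunk has been fully written and closed.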
