Question: Splitting fastq from stdin into multiple files?

flyamer wrote:

Hi, I am trying to split a fastq stream with a Python script into files with a defined number of reads each, and to immediately start a mapping job on each newly created file. All of this runs on an SGE cluster, which is why I want to map the reads in small chunks in parallel. The problem I encounter is that, from some file number onwards, one of the files I generate starts somewhere inside a read rather than at a record boundary. The break can fall at different positions within the read, and it can happen at different file numbers. Because of that, mapping obviously fails. This is my script:

import argparse
import os
from os import path
from subprocess import call
import sys

parser = argparse.ArgumentParser()
parser.add_argument("outfolder", help="folder to store new files")
parser.add_argument("prefix", help="prefix for the new filename")
# nargs='?' is needed for the default to apply to a positional argument
parser.add_argument("N_reads", help="number of reads per file",
                    nargs='?', default=1000000, type=int)

args = parser.parse_args()

def submit(filename):
    # Submit a mapping job for a finished chunk via qsub on the head node.
    print('Mapping', filename)
    call('ssh headnode1.ecdf.ed.ac.uk "cd /exports/igmm/datastore/wendy-lab/ilia/scripts; '
         'qsub launcher.sh ' + filename + '"', shell=True)

def read_write():
    i = 0  # lines written to the current chunk
    n = 0  # index of the current chunk
    filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
    f = open(filename, 'w')
    print('Writing to ' + filename)
    # Iterating over sys.stdin never raises EOFError -- the loop simply ends
    # at EOF -- so cleanup happens after the loop, not in an except clause.
    for line in sys.stdin:
        f.write(line)
        i += 1
        if i == args.N_reads * 4:  # a fastq record is exactly 4 lines
            f.close()  # flush the chunk to disk before submitting the job
            submit(filename)
            n += 1
            filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
            i = 0
            f = open(filename, 'w')
            print('Writing to ' + filename)
    f.close()
    if i == 0:
        # The last file was opened but nothing was written to it: remove it.
        os.remove(filename)
        print('Removed empty file')
    else:
        # Submit the final, partially filled chunk too.
        submit(filename)
    print('Done')

read_write()

The only assumption I make is that every read occupies exactly 4 lines, with no blank lines or other oddities. The files were generated by Illumina software; I concatenated the gzipped files and now decompress them with pigz to stdout, and my script reads the stream from stdin. So I think the assumption should hold. Is there anything else I am missing?
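
One quick way to test that assumption is to scan the decompressed stream and stop at the first record whose header line does not start with '@' (a heuristic sketch; reads.fastq.gz stands in for the real file):

pigz -dc reads.fastq.gz | awk 'NR % 4 == 1 && substr($0, 1, 1) != "@" {print "broken record at line " NR; exit 1}'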

Tags: fastq, python
— flyamer

Use split: http://linux.die.net/man/1/split

— Pierre Lindenbaum

Thanks, I know about split, but I want to start mapping while I'm still splitting the file. Maybe that's too much to ask, of course...

— flyamer
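
For what it's worth, GNU split can split and launch in one pass via its --filter option: each chunk is piped to a command, with $FILE set to the chunk's name. A sketch, assuming GNU coreutils (8.13 or later) and that qsub is reachable where split runs; otherwise wrap the submission in the same ssh call as in the question:

pigz -dc reads.fastq.gz | split -l 4000000 --filter='cat > $FILE && qsub launcher.sh $FILE' - chunk_

Each million-read chunk (4,000,000 lines) is submitted as soon as split closes it, so mapping starts while later chunks are still being written.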

Do you think the time lost splitting then mapping is going to be smaller or greater than the time spent figuring out how to split-and-map on the fly? :)

Also, plan for what happens when a mapper/node goes down in the middle of processing: if you split and map on the fly, you'll have to start from the beginning.

— John

My guess is it will be smaller, and you are probably right that I should just forget this. But quite apart from the benefit of splitting on the fly, I am really surprised this is not working as it should, and I would like to understand why...

— flyamer

How about splitting the file with xargs?

— 5heikki

Can you please provide an example? Unfortunately I am not very familiar with bash.

— flyamer
$ cat file
1a
2a
3a
4a
1b
2b
3b
4b
$ xargs -n4 < file echo "4 args:"
4 args: 1a 2a 3a 4a
4 args: 1b 2b 3b 4b
— 5heikki
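
Note that xargs passes each group of 4 lines as arguments on a single line, so to rebuild 4-line fastq records you would need to restore the newlines, e.g. with printf (a sketch, assuming a standalone printf binary as provided by coreutils, since xargs cannot run shell builtins):

$ xargs -n4 < file printf '%s\n%s\n%s\n%s\n'
1a
2a
3a
4a
1b
2b
3b
4b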

Thanks! That's interesting!

— flyamer

I don't know. My guess is that it is a network filesystem problem. However, these days the better way to map reads is to process them in one go with multiple threads. As for reading compressed files, many mappers handle that natively or can read data from a stream.

— lh3
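
For example, bwa mem parallelizes the alignment itself and reads gzipped fastq directly, so no splitting is needed at all (a sketch; ref.fa and the file names are placeholders, and the index must be built once beforehand):

bwa index ref.fa
bwa mem -t 16 ref.fa reads.fastq.gz > aln.sam

bwa mem also accepts - as the read file, so it can map straight from a pigz stream.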

Answer by Jorge Amigo:

A very fast alternative is Heng Li's seqtk:

seqtk sample -s100 read1.fq 10000 > sub1.fq
— Jorge Amigo
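
If the data are paired-end, running the same command with the same -s seed on the mate file keeps the pairs in sync (read2.fq/sub2.fq are the obvious counterparts to the example above):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq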

Thanks for the idea! But can it split the file, rather than just sample N reads from it?

— flyamer

I don't think this is natively implemented in seqtk, but once you run the fastq file through seqtk you can be completely sure about its output format, and you can then split it with bash in groups of 4 lines:

seqtk seq file.fastq.gz | split -l 4000

This would split your original fastq file into smaller fastq files of 1000 reads each (4000 lines at 4 lines per read).

— Jorge Amigo
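
If the mapper insists on a .fastq extension, GNU split can also name the chunks accordingly (a sketch, assuming a reasonably recent GNU coreutils):

seqtk seq file.fastq.gz | split -l 4000 -d --additional-suffix=.fastq - chunk_

This produces chunk_00.fastq, chunk_01.fastq, ... with 1000 reads each.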