Hi, I am trying to split a fastq stream with a Python script into files with a defined number of reads each, and to immediately start a mapping job on each newly created file. All this happens on an SGE cluster, which is why I want to map the reads in small chunks in parallel. The problem I encounter is that one of the files I generate always starts somewhere inside a read, not at a read boundary. The offset within the read varies, and it can happen with a different file number each time. Because of that, mapping for that file obviously fails. This is my script:
import argparse
import os
from os import path
from subprocess import call
import sys

parser = argparse.ArgumentParser()
parser.add_argument("outfolder", help="folder to store new files")
parser.add_argument("prefix", help="prefix for the new filenames")
parser.add_argument("N_reads", help="number of reads per file", default=1000000, type=int)
args = parser.parse_args()

def read_write():
    i = 0  # lines written to the current chunk so far
    n = 0  # index of the current chunk file
    filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
    f = open(filename, 'w')
    print('Writing to ' + filename)
    for line in sys.stdin:
        f.write(line)
        i += 1
        if i == args.N_reads * 4:  # a full chunk: N_reads records of 4 lines each
            f.close()
            print('Mapping', filename)
            call('ssh headnode1.ecdf.ed.ac.uk "cd /exports/igmm/datastore/wendy-lab/ilia/scripts; '
                 'qsub launcher.sh ' + filename + '"', shell=True)
            n += 1
            filename = path.join(args.outfolder, args.prefix + '_' + str(n) + '.fastq')
            i = 0
            f = open(filename, 'w')
            print('Writing to ' + filename)
    # iterating over sys.stdin simply ends at EOF (it never raises EOFError),
    # so the last, possibly partial, chunk is closed and cleaned up here
    f.close()
    if i == 0:
        os.remove(filename)
        print('Removed empty file')
    print('Done')

read_write()
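For context, this is roughly how I invoke it; the file names and the output path are placeholders, and split_fastq.py just stands for the script above:

pigz -dc combined.fastq.gz | python split_fastq.py /path/to/chunks sample1 1000000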
The only assumption I make is that every read occupies exactly 4 lines, with no blank lines or any other weirdness. The files were generated by Illumina software; I then concatenated the gzipped files and now decompress them with pigz to stdout, which my script reads from stdin. So I think the assumption should be met.
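To convince myself that the framing assumption actually holds, I could pipe the decompressed stream through a minimal checker like the sketch below before splitting (check_framing.py is just a name I made up; a failure proves the framing is broken, though a pass cannot fully prove it is intact, since a quality line may legitimately start with '@'):

import sys

# Sanity check: FASTQ records must come in 4-line blocks where
# the 1st line starts with '@' and the 3rd line starts with '+'.
lineno = 0
for line in sys.stdin:
    pos = lineno % 4
    if pos == 0 and not line.startswith('@'):
        sys.exit('Framing broken at line %d: %r' % (lineno + 1, line))
    if pos == 2 and not line.startswith('+'):
        sys.exit('Framing broken at line %d: %r' % (lineno + 1, line))
    lineno += 1
if lineno % 4 != 0:
    sys.exit('Total line count %d is not a multiple of 4' % lineno)
print('OK: %d reads in clean 4-line blocks' % (lineno // 4))

I would run it the same way as the splitter, i.e. pigz -dc combined.fastq.gz | python check_framing.py. If a check like that passes but the split files are still broken, is there anything else I am missing here?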