How can I split SampleName_R1.fastq so that, if read_X ends up in SampleName_R1-part1.fastq, the mate of read_X ends up in SampleName_R2-part1.fastq?
Normally, if read_X is the Nth record in SampleName_R1.fastq, then the mate of read_X is the Nth record in SampleName_R2.fastq.
So you can simply split both files by taking the same number of lines (a multiple of 4) from each fastq file (I am assuming you have 4-line-style fastq files).
For basic splitting in this fashion, have a look at the Linux command split.
For instance, to split your files in 2: let N be the number of lines in your fastq files (it should be the same for both files), so you have N/4 fastq records. Take K = (floor(N/8) + 1) * 4 lines for the first part, where floor() is the integer part, and put the rest in the second part.
split -l K SampleName_R1.fastq SampleName_R1_part
split -l K SampleName_R2.fastq SampleName_R2_part
More generally, to split into more than two fastq files, just give the -l option a multiple of 4, representing the maximum number of lines you want in each file.
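The same lockstep split can be sketched in Python, which also catches the case where the two files do not have the same number of lines. This is only a minimal sketch: the function name split_paired_fastq, its parameters, and the "-partN.fastq" output naming are made up for illustration, and it assumes uncompressed 4-line-style fastq files.

```python
import itertools


def split_paired_fastq(r1_path, r2_path, records_per_part, prefix1, prefix2):
    """Split two paired FASTQ files in lockstep; return the number of parts written.

    Hypothetical helper: assumes uncompressed, 4-line-style FASTQ files in which
    the Nth record of R1 is the mate of the Nth record of R2.
    """
    if records_per_part < 1:
        raise ValueError("records_per_part must be at least 1")
    lines_per_part = 4 * records_per_part  # always a multiple of 4
    part = 0
    out1 = out2 = None
    try:
        with open(r1_path) as r1, open(r2_path) as r2:
            pairs = itertools.zip_longest(r1, r2)
            for i, (line1, line2) in enumerate(pairs):
                # zip_longest pads the shorter file with None
                if line1 is None or line2 is None:
                    raise ValueError("R1 and R2 have different numbers of lines")
                if i % lines_per_part == 0:
                    # start a new pair of output files
                    for handle in (out1, out2):
                        if handle is not None:
                            handle.close()
                    part += 1
                    out1 = open(f"{prefix1}-part{part}.fastq", "w")
                    out2 = open(f"{prefix2}-part{part}.fastq", "w")
                out1.write(line1)
                out2.write(line2)
    finally:
        for handle in (out1, out2):
            if handle is not None:
                handle.close()
    return part
```

To split in 2 with N lines per file, records_per_part would be math.ceil(N / 8), e.g. split_paired_fastq("SampleName_R1.fastq", "SampleName_R2.fastq", records_per_part, "SampleName_R1", "SampleName_R2").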
Yes, I have 4-line-style fastq files. I have 95622055 lines in SampleName_R1.fastq and 95269156 lines in SampleName_R2.fastq. Wouldn't this create a problem when splitting?
Yes, it will create a problem. You need to have the same number of lines in each fastq file, with matched fastq records at the same positions. Where do your fastq files come from? Has any preprocessing already been done on them?
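A quick way to check this condition before splitting is to count the lines in both files. The sketch below is only an illustration (count_lines and check_pairing are hypothetical helper names) and assumes uncompressed files. Note that 95622055 is not a multiple of 4, so with those counts the R1 file would be flagged on its own, even before comparing it with R2.

```python
def count_lines(path):
    """Count the lines in a file without loading it into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_pairing(r1_path, r2_path):
    """Return a list of problems; an empty list means the line counts are consistent.

    Hypothetical helper: it only checks line counts, not that the read
    names at each position actually match.
    """
    n1, n2 = count_lines(r1_path), count_lines(r2_path)
    problems = []
    for name, n in ((r1_path, n1), (r2_path, n2)):
        if n % 4 != 0:
            problems.append(f"{name}: {n} lines is not a multiple of 4")
    if n1 != n2:
        problems.append(f"line counts differ: {n1} vs {n2}")
    return problems
```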
They are from an Illumina HiSeq machine. I have not done any preprocessing. I am searching for fusion genes in these RNA-Seq data. Since a single compressed fastq file is about 28GB, I need to split it, because I had memory issues using large files with TopHat-Fusion. Do you have any idea how I can remove reads that do not have mates?
As far as I know, there is no existing tool that removes orphan reads. So I would do it myself by parsing both files at the same time and removing any record for which I do not find a mate. You may have a better chance of getting an answer on this point by opening a new thread (something might exist to do this).
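One possible way to do this is sketched below. It is not the simultaneous parse described above but a simpler two-pass variant: first collect the read names present in both files, then rewrite each file keeping only those reads, in their original order. Everything here is an assumption for illustration: the function names are made up, the files are taken to be uncompressed 4-line-style fastq, mates are assumed to share a name once a trailing /1 or /2 and anything after the first whitespace are dropped, and the set of names has to fit in memory.

```python
import itertools


def read_id(header):
    """Mate-independent name: drop '@', the comment after whitespace, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def fastq_records(lines):
    """Yield 4-line records (as lists of lines) from an iterable of lines."""
    it = iter(lines)
    while True:
        record = list(itertools.islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def remove_orphans(r1_path, r2_path, out1_path, out2_path):
    """Write only reads whose mate exists in the other file; return reads kept per file."""
    # Pass 1: names seen in R1, then the subset also seen in R2.
    with open(r1_path) as r1:
        ids1 = {read_id(rec[0]) for rec in fastq_records(r1)}
    with open(r2_path) as r2:
        shared = {rid for rid in (read_id(rec[0]) for rec in fastq_records(r2))
                  if rid in ids1}
    # Pass 2: copy the paired records, preserving each file's order.
    kept = 0
    with open(r1_path) as r1, open(out1_path, "w") as out1:
        for rec in fastq_records(r1):
            if read_id(rec[0]) in shared:
                out1.writelines(rec)
                kept += 1
    with open(r2_path) as r2, open(out2_path, "w") as out2:
        for rec in fastq_records(r2):
            if read_id(rec[0]) in shared:
                out2.writelines(rec)
    return kept
```

If the surviving reads appear in the same relative order in both inputs, the two output files are position-matched again and can be split by line count.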
I generated only two FASTQ files after processing the raw data from the sequencer. This eliminates the headache of splitting a single FASTQ file, but thanks a lot for your help.
Hi, Samsara,
I have the same question as you. Have you found an effective way to do this?
Best,
Xiao