Although I came up with a solution using sed and awk, it takes an extremely long time to finish, since each of these files holds raw reads from an RNA-seq experiment and is around 3.6 GB.
Are there any better ways to merge fastQ files like this?
Can take an arbitrary number of fastq files, which need not have the same length. Requires Biopython. Picks one read from each file in turn until that file is emptied.
import sys
from Bio import SeqIO

fqs = [SeqIO.parse(f, "fastq") for f in sys.argv[1:]]
while fqs:
    for fq in list(fqs):  # iterate over a copy so exhausted files can be dropped
        try:
            sys.stdout.write(next(fq).format("fastq"))
        except StopIteration:
            fqs.remove(fq)
Why is the reads' order so important? If the read ID is ordered per file, you can try:
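The command itself seems to have been lost from this comment; a guess at the sort-based pipeline meant here (file names f1.fastq/f2.fastq are placeholders, and it assumes standard 4-line records):

```shell
# Linearize each 4-line record onto one tab-joined line, sort by the
# read-ID field, then restore the newlines.
cat f1.fastq f2.fastq | paste - - - - | sort -k1,1 | tr '\t' '\n' > sorted.fastq
```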
If the read ID contains the barcode, you may need to fiddle around.
It is important since I am going to run Kallisto on the merged file. Kallisto estimates the fragment length distribution, but it uses only a certain number of reads from the top of the file to do that. So if I use cat, the reads in the other files are not used to estimate the fragment length. I need equal representation from all the files at each position in my merged file.
Then replace the sort command with shuf. You wouldn't get your requested order but a random one, which is, AFAIK, better suited for Kallisto or Salmon. You can also have a look here
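A sketch of that shuffle, assuming GNU shuf and standard 4-line records (f1.fastq/f2.fastq are placeholder names):

```shell
# Linearize 4-line records, shuffle record-wise (not line-wise, which would
# break records apart), then restore the newlines.
cat f1.fastq f2.fastq | paste - - - - | shuf | tr '\t' '\n' > shuffled.fastq
```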
Unless there's a quite significant difference between the files, it won't matter whether the fragment length distribution is estimated from a single file or all of them. I mean, presumably these are all from the same library, or else merging them at this level would be problematic for the statistics performed on the quantification.
Check this older thread out
Be sure to check the replies within it. Also keep in mind you can merge the data after alignment as BAM files with SAMtools.
The previous thread talks mostly about using cat and glob patterns. There is also a mention of a particular tool, but it does not seem to merge in this specific way. Regarding alignment, yeah, I can merge it later, but I need to merge the fastQ files before alignment :)
Unless I am missing something this looks like straight concatenation of the files. Why do you need sed/awk for this?
Using cat on the files results in a file with all the reads from file1 followed by the reads from file2, and so on and so forth.
However, I want the first seven reads in my merged file to be the first reads from all seven files, the second seven reads to be the second reads from the seven files, and so on. The problem is that each entry in a fastQ file has four lines. That's why I had to use awk/sed to convert each entry into one line using a delimiter. Then I used the paste command to get the exact merge, and finally substituted the delimiters back with newline characters. But unfortunately this takes forever.
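For what it's worth, those linearize/paste/restore steps can be done with paste and tr alone, no sed or awk, which tends to be much faster. A sketch for two equal-length files (the seven-file case just adds more inputs to the final paste; file names are placeholders):

```shell
# Turn each 4-line record into a single tab-joined line, per file.
paste - - - - < f1.fastq > f1.lin
paste - - - - < f2.fastq > f2.lin
# Join record i of every file onto one line, then split the tabs back into
# newlines: record i of f1 (4 lines) is followed by record i of f2, for every i.
paste f1.lin f2.lin | tr '\t' '\n' > merged.fastq
```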
Looking at the example you posted above, this was not apparent.
You should edit the post above and add this important text there. You could even remove the example altogether.
Those are the last four lines of the example, after which I put a couple of dots to indicate that the pattern continues. It is important to note that one read consists of four lines in a file. The first lines of the example explain the merging process quite clearly.
It would be easier to do it this way: talk about reads instead of lines. A fastq record is four lines; that's the standard.