Question: Merge FastQ Files

I have 7 FastQ files and I want to merge them into one in the following way:

>File1 line1

>File1 line2

>File1 line3

>File1 line4

>File2 line1

>File2 line2

>File2 line3

>File2 line4

>File3 line1

>File3 line2

>File3 line3

>File3 line4

>.

>.

>.

>File7 line1

>File7 line2

>File7 line3

>File7 line4

>File1 line5

>File1 line6
>.

>.

Although I came up with a solution using sed and awk, it takes an extremely long time to finish, since each of these files contains raw reads from an RNA-seq experiment and is around 3.6 GB.

Are there any better ways to merge fastQ files like this?

Thanks, Shounak

rna-seq fastq • 664 views
ADD COMMENTlink modified 24 days ago by Biostar ♦♦ 20 • written 11 months ago by shounak.chakraborty19900

Why is the reads' order so important? If the read ID is ordered per file, you can try:

cat *fastq | paste - - - - | sort -k1 | sed 's/\t/\n/g'

If the read ID contains the barcode, you may need to fiddle around.

ADD REPLYlink written 11 months ago by michael.ante2.8k

It is important because I am going to run Kallisto on the merged file. Kallisto estimates the fragment length distribution, but it uses only a certain number of reads from the top to do that. So if I use cat, the reads in the other files are not used to estimate the fragment length. I need equal representation from all the files at each position in my merged file.

ADD REPLYlink written 11 months ago by shounak.chakraborty19900

Then replace the sort command with shuf. You won't get your requested order but a random one (which is, AFAIK, better suited to Kallisto or Salmon). [edit] You can also have a look here

ADD REPLYlink modified 11 months ago • written 11 months ago by michael.ante2.8k
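
For readers less familiar with the shell idiom, the cat | paste - - - - | shuf pipeline suggested above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: every record is exactly four lines, the files are uncompressed, and all records fit in memory. The names read_records and shuffle_fastq are made up for this sketch.

```python
import random
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as 4-tuples of lines (the role of `paste - - - -`).

    Assumes every record is exactly four lines long.
    """
    while True:
        # Normalise line endings so a missing final newline cannot glue
        # two records together after shuffling.
        record = tuple(line.rstrip("\n") + "\n" for line in islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def shuffle_fastq(paths, out, seed=None):
    """Write all records from `paths` to `out` in random order.

    Every record is held in memory, so this only suits inputs that fit in RAM.
    """
    records = []
    for path in paths:
        with open(path) as handle:
            records.extend(read_records(handle))
    random.Random(seed).shuffle(records)
    for record in records:
        out.writelines(record)
```

It could be called as shuffle_fastq(sys.argv[1:], sys.stdout) after importing sys. Passing a seed makes the order reproducible between runs.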

Unless there's a quite significant difference between the files, it won't matter whether the fragment length distribution is estimated from a single file or from all of them. I mean, presumably these are all from the same library, or else merging them at this level would be problematic for the statistics performed on the quantification.

ADD REPLYlink written 11 months ago by Devon Ryan86k

Check this older thread out

Be sure to check the replies within it. Also keep in mind you can merge the data after alignment, as BAM files, with SAMtools.

ADD REPLYlink written 11 months ago by lshepard130

The previous thread talks mostly about using cat and glob patterns. There is also a mention of a particular tool, but it does not seem to merge in this specific way. Regarding alignment, yeah, I can merge it later, but I need to merge the fastQ files before alignment :)

ADD REPLYlink written 11 months ago by shounak.chakraborty19900

Unless I am missing something this looks like straight concatenation of the files. Why do you need sed/awk for this?

ADD REPLYlink written 11 months ago by genomax59k

Using cat on the files results in a file with all the reads from file1 followed by the reads from file2, and so on.

However, I want the first seven reads in my merged file to be the first reads from all seven files, the second seven reads to be the second reads from the seven files, and so on. The problem is that each entry in the fastQ file has four lines. That's why I had to use awk/sed to convert each entry into one line using a delimiter. Then I used the paste command to get the exact merge, and then substituted the delimiters with a newline character. But unfortunately this takes forever.

ADD REPLYlink written 11 months ago by shounak.chakraborty19900

Looking at the example you posted above this was not apparent. See

>File7 line3

>File7 line4

>File1 line1

>File1 line2

You should edit the post above and add this important text there. You could even remove the example altogether.

However, I want the first seven reads in my merged file to be the first reads from all seven files, the second seven reads to be the second reads from the seven files.

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax59k

Those are the last four lines of the example after which I put a couple of dots to indicate that the process continues. It is important to note that one read consists of four lines in a file. The first lines of the example explain the merging process quite clearly.

ADD REPLYlink written 11 months ago by shounak.chakraborty19900

It would be easier to describe it this way: reads instead of lines. A FASTQ record is 4 lines, by convention.

>File7 read3

>File7 read4

>File1 read5

>File1 read6

>File1 read7

>File1 read8

>File2 read5

>File2 read6..
ADD REPLYlink modified 11 months ago • written 11 months ago by genomax59k
WouterDeCoster35k wrote:

A solution in Python(3). Save as interleave_fqs.py and execute as

python interleave_fqs.py file1.fastq file2.fastq .... fileN.fastq > mynewfile.fastq

It can take an arbitrary number of fastq files, which need not have the same length. Requires Biopython. It picks one read from each file in turn until that file is exhausted.

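import sys

from Bio import SeqIO

# One lazy FASTQ record iterator per input file.
fqs = [SeqIO.parse(f, "fastq") for f in sys.argv[1:]]
while fqs:
    # Iterate over a copy: removing from the list being looped over
    # would skip the next file for one round.
    for fq in list(fqs):
        try:
            # Emit the next record of this file in FASTQ format.
            print(next(fq).format("fastq"), end="")
        except StopIteration:
            # This file is exhausted; drop it from later rounds.
            fqs.remove(fq)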
ADD COMMENTlink written 11 months ago by WouterDeCoster35k

Thanks. This works really well and is fast.

ADD REPLYlink written 11 months ago by shounak.chakraborty19900
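
If Biopython is not available, the same round-robin merge can be sketched with the standard library alone. This is a minimal sketch, not a drop-in replacement: it assumes every record is exactly four lines and that the files are uncompressed, and it skips the format validation that a real parser performs. The function name interleave is made up for this example.

```python
from contextlib import ExitStack
from itertools import islice


def interleave(paths, out):
    """Write one 4-line record from each file in turn until all are exhausted.

    Assumes plain-text FASTQ with exactly four lines per record.
    """
    with ExitStack() as stack:
        # Keep every input open at once; ExitStack closes them all on exit.
        handles = [stack.enter_context(open(p)) for p in paths]
        while handles:
            remaining = []
            for handle in handles:
                record = list(islice(handle, 4))
                if not record:
                    continue  # this file is exhausted, drop it
                if len(record) != 4:
                    raise ValueError("truncated FASTQ record in " + handle.name)
                if not record[-1].endswith("\n"):
                    record[-1] += "\n"  # guard against a missing final newline
                out.writelines(record)
                remaining.append(handle)
            handles = remaining
```

It could be called as interleave(sys.argv[1:], sys.stdout) after importing sys. Building a fresh list of the files that still have reads on each round avoids modifying a list while looping over it, so shorter files simply drop out and the order of the remaining files is preserved.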