Question

How to merge two big files faster

4.1 years ago

zhangdengwei ▴ 210

Hi, all

I am merging two FASTQ files, each larger than 10 GB, into one file. I use cat for this, but it takes a long time, more than 3 hours. Is there any way to speed it up? BTW, my command is cat read1.fq read2.fq > merged.fq

Thanks in advance!

shell cat • 1.8k views

ADD COMMENT • link 4.1 years ago by zhangdengwei ▴ 210

That's as fast as you can go. If it's still too slow, that probably means the storage you're running on doesn't have sufficient I/O, which could either be because it is of low quality or because other processes (other users or yourself) are using up all the I/O.
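One rough way to check whether storage is the limit is to time a plain sequential read of one input file. The sketch below assumes a POSIX shell; read_seconds is a hypothetical helper name, not an existing tool. If this read alone takes a large share of the total merge time, the disk is the likely bottleneck. Note that a second run on the same file may be much faster because of the operating system's file cache.

```shell
# Hypothetical helper: time a plain sequential read of a file.
# Usage: read_seconds FILE
# Prints the elapsed wall-clock time in whole seconds.
read_seconds() (
    start=$(date +%s)
    # Read the whole file and discard the data; only the read speed matters here.
    cat "$1" > /dev/null
    end=$(date +%s)
    echo $(( end - start ))
)
```

For example, read_seconds read1.fq would report how long the disk needs just to deliver the first input.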

ADD REPLY • link 4.1 years ago by liorglic ★ 1.4k

Thanks. The cause may indeed be insufficient I/O: the other processes are using up nearly all of it. I naively assumed that cat needs very little CPU and overlooked the I/O side. I will try again after shutting down the other processes.

ADD REPLY • link 4.1 years ago by zhangdengwei ▴ 210

You can monitor this with ls -lh to see how fast the new file grows. Normally this should be several dozen megabytes per second. Large files take time; there is not much you can do about it.
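Instead of comparing ls -lh output by eye, the growth rate can be computed directly. This is a minimal sketch, assuming a POSIX shell; growth_rate is a hypothetical helper name, and the 5-second default interval is an arbitrary choice. It samples the file size twice and reports the difference.

```shell
# Hypothetical helper: report how fast a file is growing.
# Usage: growth_rate FILE [INTERVAL_SECONDS]   (interval defaults to 5)
growth_rate() (
    file=$1
    interval=${2:-5}
    # Sample the size in bytes, wait, then sample again.
    before=$(wc -c < "$file")
    sleep "$interval"
    after=$(wc -c < "$file")
    # Integer arithmetic: bytes per second, converted to whole megabytes.
    echo "$(( (after - before) / interval / 1048576 )) MB/s"
)
```

Running growth_rate merged.fq 10 in a second terminal while cat is working prints the average write rate over 10 seconds. The result is rounded down, so rates below 1 MB/s show as 0 MB/s.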

ADD REPLY • link 4.1 years ago by ATpoint 82k

Thanks very much. Your comment is pretty helpful to me.

ADD REPLY • link 4.1 years ago by zhangdengwei ▴ 210

No, this is already as low-level as it gets. If the files are big, you have to be patient. Maybe you are hitting an I/O bottleneck. Is this on an HDD?

ADD REPLY • link 4.1 years ago by ATpoint 82k

Thanks for your advice.

ADD REPLY • link 4.1 years ago by zhangdengwei ▴ 210

What are you merging? Are read1.fq and read2.fq the forward and reverse reads of a sequencing run? Why are you merging them? What downstream analyses do you want to perform?

ADD REPLY • link 4.1 years ago by h.mon 35k

metaphlan2 and humann2

ADD REPLY • link 4.1 years ago by zhangdengwei ▴ 210

I would reserve the term merging for cases where R1 and R2 are merged based on their overlap, as this is the currently adopted practice.

Keep in mind:

you are throwing out information when you concatenate the files (although it is true metaphlan2 and humann2 do not use this information).
these concatenated files have very restricted uses, as they break the pairing between R1 and R2, and most programs expect this pairing.

edit: I would use just the R1 file for analyses like metaphlan2 and humann2.

edit 2: it seems the humann2 recommendation is the opposite of what I suggested above:

Penalizing such cases would be overly strict: in the absence of a the gene's genomic context, this looks like a perfectly reasonable alignment (READ2 may fall in a non-coding region and not align, or it may align to another [isolated] coding sequence). As a result, the best way to use paired-end sequencing data with HUMAnN2 is simply to concatenate all reads into a single FASTA or FASTQ file.
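After concatenating, a quick sanity check is to confirm that the combined file holds as many records as both inputs together. The sketch below assumes a POSIX shell and FASTQ files with exactly 4 lines per record (no wrapped sequence lines); fastq_records is a hypothetical helper name. It reads each file in full, so on files of this size it adds noticeable I/O of its own.

```shell
# Hypothetical helper: count FASTQ records in a file.
# Assumes the common layout of exactly 4 lines per record.
# Usage: fastq_records FILE
fastq_records() (
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
)

# Example with the file names from the question (commented out so the
# block can be sourced without those files being present):
# cat read1.fq read2.fq > merged.fq
# r1=$(fastq_records read1.fq)
# r2=$(fastq_records read2.fq)
# [ "$(fastq_records merged.fq)" -eq $(( r1 + r2 )) ] && echo "record counts match"
```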

ADD REPLY • link 4.1 years ago by h.mon 35k

Indeed, humann2 recommends the merged file. I'm not clear about the difference between using one read file and using both. Does it affect the result? Have you tested it?

ADD REPLY • link 4.1 years ago by zhangdengwei ▴ 210

It is explained in the manual, immediately before the snippet I pasted above:

HUMAnN2 and paired-end sequencing data

End-pairing relationships are currently not taken into account during HUMAnN2's alignment steps. This is due to the fact that HUMAnN2 strictly aligns reads to isolated coding sequences: either at the nucleotide level or through translated search. As such, it will frequently be the case that one read (READ1) will map inside a given coding sequence while its mate-pair (READ2) will not.

I haven't tested it, though. As bacterial genomes are very dense in coding material, I would expect the difference between using just R1 and using R1+R2 to be rather small. Maybe the HUMAnN2 authors tested this in their manuscript?

ADD REPLY • link 4.1 years ago by h.mon 35k