Question

How to sort fastq.gz files efficiently

8.1 years ago

komal.rathi ★ 4.1k

Hello everyone,

I am working on some RNASeq data and after merging the raw fastq.gz files into sample_R1.fastq.gz and sample_R2.fastq.gz, I sort the fastq files so that they can be accepted as input by STAR.

I am currently using pairfq but it is taking a really long time to sort the files. Is there any other tool that does the same but quicker?

This is my command for pairfq:

pairfq makepairs \
-c gzip \
-f sample_R1.fastq.gz \
-r sample_R2.fastq.gz \
-fp sample_sorted_R1.fastq \
-rp sample_sorted_R2.fastq \
-fs sample_forwards.fastq \
-rs sample_reverses.fastq

I know there are many posts about it on Biostars suggesting the use of awk, zcat etc. I just wanted to know if there is any tool that sorts the fastq files quickly.

Thanks!!

fastq sort • 4.2k views

updated 13 months ago by Ram 43k • written 8.1 years ago by komal.rathi ★ 4.1k

I'm not sure you need to use this tool unless the reads have been trimmed and the pairs are out of sync. Otherwise, you can just concatenate the files (in the same order for each pair) and you should have no problems. By the way, how many reads do you have?

8.1 years ago by SES 8.6k

For one of the projects, we did not trim the reads and concatenated the files in the same order for each pair, but the reads were still out of sync. I want to integrate this sorting step into the pipeline so I don't have to worry about checking whether the reads are in the correct order for other projects that may have the same problem.
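A quick way to check whether a pair of files is in sync before deciding to re-pair them is to compare the read IDs of both files in order. The sketch below is only an illustration under a few assumptions: the files are gzip-compressed, every record is exactly four lines, and the mates share the same ID apart from an optional trailing /1 or /2. The function names fastq_ids and check_sync are made up for this example.

```shell
# Print one read ID per record (first token of each header line),
# with any trailing /1 or /2 mate suffix removed.
# Assumes 4-line FASTQ records.
fastq_ids() {
    gzip -cd "$1" | awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }'
}

# Print "in sync" if both files list the same read IDs in the same order,
# "out of sync" otherwise.
check_sync() {
    tmpdir=$(mktemp -d) || return 2
    fastq_ids "$1" > "$tmpdir/r1.ids"
    fastq_ids "$2" > "$tmpdir/r2.ids"
    if cmp -s "$tmpdir/r1.ids" "$tmpdir/r2.ids"; then
        result="in sync"
    else
        result="out of sync"
    fi
    rm -rf "$tmpdir"
    echo "$result"
}
```

It could be called as check_sync sample_R1.fastq.gz sample_R2.fastq.gz, and the pairing step run only when it reports "out of sync". This still reads both files once, but it only decompresses and compares IDs, without sorting.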

I have about 80-85 million reads in this particular project.

8.1 years ago by komal.rathi ★ 4.1k

Okay, this makes sense. What I would do for now is pair the individual files before merging them, and then concatenate all the paired reads. That should go much faster than pairing the merged files. 80-85 million reads is a lot, so I can't offer a better solution right now. I am working on a new version of Pairfq that will be much faster, so this should be a non-issue in the near future. Thanks.
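The pair-first, concatenate-second approach could look roughly like the sketch below. It reuses the pairfq makepairs options from the command in the question. The function name pair_then_concat, the lane prefixes, and the output file names are hypothetical, and it assumes each lane prefix has PREFIX_R1.fastq.gz and PREFIX_R2.fastq.gz and that the sample name differs from every lane prefix.

```shell
# Pair each lane separately, then concatenate the paired outputs.
# Usage: pair_then_concat SAMPLE LANE_PREFIX [LANE_PREFIX ...]
# File naming here is an assumption made for this example.
pair_then_concat() {
    sample=$1
    shift
    # Start from empty outputs so reruns do not append to old results.
    : > "${sample}_paired_R1.fastq"
    : > "${sample}_paired_R2.fastq"
    for lane in "$@"; do
        pairfq makepairs \
            -f "${lane}_R1.fastq.gz" \
            -r "${lane}_R2.fastq.gz" \
            -fp "${lane}_paired_R1.fastq" \
            -rp "${lane}_paired_R2.fastq" \
            -fs "${lane}_singles_R1.fastq" \
            -rs "${lane}_singles_R2.fastq" || return 1
        # Append in the same lane order for both mates.
        cat "${lane}_paired_R1.fastq" >> "${sample}_paired_R1.fastq"
        cat "${lane}_paired_R2.fastq" >> "${sample}_paired_R2.fastq"
    done
    # Compress the merged files once at the end.
    gzip -f "${sample}_paired_R1.fastq" "${sample}_paired_R2.fastq"
}
```

For example, pair_then_concat sample lane1 lane2 would leave sample_paired_R1.fastq.gz and sample_paired_R2.fastq.gz, with unpaired reads kept per lane in the singles files. Because each lane is paired independently, the loop could also be run in parallel per lane if memory allows.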

8.1 years ago by SES 8.6k