7.2 years ago
komal.rathi ★ 4.1k
I am working on some RNASeq data and after merging the raw fastq.gz files into sample_R1.fastq.gz and sample_R2.fastq.gz, I sort the fastq files so that they can be accepted as input by STAR.
I am currently using pairfq but it is taking a really long time to sort the files. Is there any other tool that does the same but quicker?
This is my command for pairfq:
pairfq makepairs \
  -c gzip \
  -f sample_R1.fastq.gz \
  -r sample_R2.fastq.gz \
  -fp sample_sorted_R1.fastq \
  -rp sample_sorted_R2.fastq \
  -fs sample_forwards.fastq \
  -rs sample_reverses.fastq
I know there are many posts about it on Biostars suggesting the use of awk, zcat etc. I just wanted to know if there is any tool that sorts the fastq files quickly.
I'm not sure you need to use this tool unless the reads have been trimmed and the pairs are out of sync. Otherwise, you can just concatenate the files (in the same order for each pair) and you should have no problems. By the way, how many reads do you have?
For one of the projects, we did not trim the reads and we concatenated the files in the same order for each pair, but the reads were still out of sync. I want to integrate this sorting step into the pipeline so I don't have to worry about checking whether the reads are in the correct order for other projects that may have the same problem.
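A quick way to find out whether a pair of files needs re-pairing at all is to compare the read IDs of the two files in order. The following is only a minimal sketch, not part of pairfq: the function names read_ids and pairs_in_sync are hypothetical, and it assumes gzip-compressed FASTQ with exactly four lines per record and mate IDs that differ only by an optional /1 or /2 suffix or by the comment after the first space.

```sh
#!/bin/sh
# Hypothetical helpers, not part of pairfq.

# Print one read ID per FASTQ record (the first line of each 4-line record),
# dropping the leading "@", anything after the first whitespace,
# and a trailing /1 or /2 mate suffix.
read_ids() {
    gzip -cd "$1" | awk 'NR % 4 == 1 {
        id = $1
        sub(/^@/, "", id)
        sub(/\/[12]$/, "", id)
        print id
    }'
}

# Return 0 if both files list the same read IDs in the same order,
# non-zero otherwise (including when one file has more reads).
pairs_in_sync() {
    ids1=$(mktemp) || return 2
    ids2=$(mktemp) || { rm -f "$ids1"; return 2; }
    read_ids "$1" > "$ids1"
    read_ids "$2" > "$ids2"
    cmp -s "$ids1" "$ids2"
    status=$?
    rm -f "$ids1" "$ids2"
    return "$status"
}

# Example call (file names taken from the question):
#   pairs_in_sync sample_R1.fastq.gz sample_R2.fastq.gz || echo "re-pair needed"
```

The check only streams through both files once and writes the IDs to temporary files, so it should be much cheaper than re-pairing; the slower pairing step then only has to run when the check fails.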
I have about 80-85 mil reads in this particular project.
Okay, this makes sense. What I would do for now is pair the reads in each individual file pair before concatenating, and then concatenate all the paired reads. That should go much faster, because each pairing run handles far fewer reads. 80-85 million reads is a lot, so I can't offer a better solution right now. I am working on a new version of Pairfq that will be much faster, so this should be a non-issue in the near future. Thanks.