Trim Reads In Bz2 Fastq Files
10.2 years ago
Vikas Bansal ★ 2.4k

Hi,

I have 20 bzip2-compressed fastq files. Each compressed file is ~4 GB and the reads are 50 bp long. I want to trim the reads to 36 bp in the compressed files.

I have tried bioawk but it does not accept bz2 files.

I tried FASTX-toolkit, which also does not accept bz2 files, so I then tried:

bzcat input.fastq.bz2 | fastx_trimmer -l 36 -i - | gzip > trimmed.fastq.gz


I used gzip here (or maybe pigz, to make it faster) because my next step is mapping with BWA, which does not accept bz2 files but does accept gz files.

The above command works, but it takes around 30 minutes per file. Without the gzip step it takes about 22 minutes per file, but then the output files are large. For 20 files this is going to take a lot of time, and in the future I will be receiving 40-45 files like this.

Can anyone please suggest an alternative that is more efficient and less time-consuming?

fastq trimming • 4.6k views

Why don't you write a shell script to execute the trimming on all fastq files in parallel? The effective time is then 30 minutes...
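
A minimal sketch of such a script, assuming the same pipeline as in the question. The function name trim_all and the output naming are made up for illustration, and it starts every file at once, which only makes sense if the machine has enough cores and disk bandwidth for that.

```shell
#!/bin/bash
# Sketch: trim every .fastq.bz2 file in a directory in parallel.
# trim_all and the *.trimmed.fastq.gz naming are illustrative assumptions.
trim_all() {
    local in_dir="$1" out_dir="$2" f base
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.fastq.bz2; do
        [ -f "$f" ] || continue            # skip if the glob matched nothing
        base=$(basename "$f" .fastq.bz2)
        # Same pipeline as in the question; the trailing & backgrounds it.
        bzcat "$f" | fastx_trimmer -l 36 -i - | gzip > "$out_dir/$base.trimmed.fastq.gz" &
    done
    wait                                   # block until all pipelines finish
}

# Usage (paths are placeholders):
#   trim_all /path/to/bz2_files /path/to/trimmed
```

With 20 files and fewer cores than that, it would probably be better to start the jobs in batches rather than all together.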

Yes, you are right, I can do that. But is this the right way, or is there a more efficient solution? I have a bad feeling that I am doing something wrong by using bzcat and then compressing again with gzip into a new file.

If your command (the one you have shown) works nicely, and you've got a cluster waiting for chunks of data to swallow and spit, then why not? :)

10.2 years ago
Weronika ▴ 300

Looks to me like you're basically doing it right: you have to use bzcat and then gzip to convert from .bz2 to .gz files, and since BWA takes .gz files, that seems to be the best way to go. Doing it in parallel on multiple files (see the other answer/comments) sounds like a good idea. In general, if you're dealing with a lot of deep-sequencing data, you have to expect the processing to take a while.

Is there any chance of asking your data provider to give you .gz instead of .bz2 files next time?

Thanks. I am doing the same thing now. I thought maybe there was some trick to trim reads inside a compressed file without writing a new compressed file.
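
As far as I know there is no such trick: the stream has to be decompressed, trimmed, and compressed again. The trimming step itself does not need a dedicated tool, though. Below is a rough sketch using plain awk; trim_fastq is a made-up helper name, and it assumes strict 4-line fastq records (one sequence line and one quality line per read). Whether it is any faster than fastx_trimmer would have to be timed, since decompressing bz2 is likely a large part of the total.

```shell
#!/bin/bash
# trim_fastq is a hypothetical helper, not part of any toolkit.
# Assumes 4-line FASTQ records: header, sequence, '+', quality.
trim_fastq() {
    # Even-numbered lines hold the sequence and quality strings;
    # cut those to the first len characters and pass other lines through.
    awk -v len="$1" 'NR % 2 == 0 { print substr($0, 1, len); next } { print }'
}

# Usage (file names are placeholders):
#   bzcat input.fastq.bz2 | trim_fastq 36 | gzip > trimmed.fastq.gz
```

Because sequence and quality lines are cut to the same length, each record stays consistent for the mapper.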

10.2 years ago
Arun 2.4k

Just in case, here's a sample script I use to run my perl script on all files in a particular folder simultaneously (note the & at the end of the perl line, which sends each job to the background). I borrowed the set command at the top from websites/forums; set -o errexit makes the script stop as soon as a command fails. It does the trick for me neatly!

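#!/bin/bash
# Stop the script as soon as any command fails.
set -o errexit
BAM_PATH="$1"
OUT_PATH="$2"
for BAM in "$BAM_PATH"/*.bam
do
    if [ -f "$BAM" ]
    then
        # The trailing & runs each job in the background.
        perl myscript.pl "$BAM" "$OUT_PATH" &
    fi
done
# If the script is interrupted or exits, signal the whole process group
# so that no background job is left running.
trap "kill 0" SIGINT SIGTERM EXIT
# Block until all background jobs have finished.
wait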