What are the compute requirements to process whole-genome paired-end data where each FASTQ file is 170 GB?
5.9 years ago
shuksi1984 ▴ 60

How much disk space, RAM, and internet bandwidth are required to process whole-genome paired-end data where each FASTQ file is 170 GB? My machine has the following configuration:

HDD: 1 TB

RAM: 16 GB

But it takes over an hour for even this simple command to finish:

    wc -l SRR6876052_1.fastq

Also, please brief me on online server options.

next-gen sequencing genome • 3.9k views
5.9 years ago
GenoMax 141k

Your question could use some clarity. How many of these files are you referring to? Just two for one sample, or more?

I am not sure why you need internet connectivity to process genome data (assuming you have the reference downloaded and indexed).

RAM is going to be limiting if this is human-genome (or similar-sized) data: many aligners need ~30 GB of free RAM. Your best bet may be bwa, which has one of the lightest memory requirements (~6 GB free for human data).

Counting lines in a FASTQ file can't realistically be considered processing the data, and it doesn't give you any idea of how long it may take to scan/trim/align the data. You should also keep the FASTQ files compressed to save space. Most NGS programs understand compressed data and will work with it seamlessly.
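For example, you can count reads in a gzipped FASTQ without ever writing the decompressed file to disk (the tiny file created here is purely illustrative):

```shell
# A FASTQ record is 4 lines, so reads = lines / 4.
# tiny.fastq.gz is a throwaway single-read example file made for this demo.
printf '@r1\nACGT\n+\nFFFF\n' | gzip > tiny.fastq.gz
lines=$(zcat tiny.fastq.gz | wc -l)
echo "$(( lines / 4 )) reads"
rm tiny.fastq.gz
```

The same `zcat ... | wc -l` pattern works on a real 170 GB FASTQ, though it will still take a while since every byte must be decompressed once.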

You can look into Amazon AWS and Google Cloud to get an idea of pricing for online compute resources.


It is paired-end genomic data for a single human sample, so two FASTQ files of 170 GB each. Internet connectivity is required only to download the dataset.

5.9 years ago

Just a rough estimate... You have 340 GB of FASTQ (170 GB × 2; I assume this is uncompressed). Aligned and in BAM format this may be ~50 GB. To sort it you need another ~50 GB for the temporary files and ~50 GB for the final sorted BAM. You could pipe the output of the aligner (say bwa mem) into samtools sort to save time and ~50 GB of disk space. Once done, you can delete the FASTQ files and the unaligned BAM, if any, and you finish with ~50 GB of BAM at roughly 1/2 TB of peak disk usage. Of course, the 340 GB of uncompressed FASTQ could shrink to perhaps 1/10 of that size with gzip.
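The piping described above can be sketched like this; the reference and file names, thread counts, and the `-m` per-thread sort-memory cap are placeholders, not values from the answer:

```shell
# Align paired-end reads and sort in one pass: the unsorted BAM is never
# written to disk. ref.fa and SRR6876052_*.fastq.gz are placeholder names.
bwa mem -t 8 ref.fa SRR6876052_1.fastq.gz SRR6876052_2.fastq.gz \
  | samtools sort -@ 4 -m 1G -o sample.sorted.bam -
samtools index sample.sorted.bam
```

Keeping `-m` modest matters on a 16 GB machine: samtools sort allocates that much per sorting thread, spilling the rest to temporary files on disk.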

5.9 years ago
ATpoint 82k

For a very rough estimate:

On a Broadwell Xeon node (2.4 GHz, I think) with 128 GB RAM, processing a 2×100 bp WGS sample with 635,658,231 read pairs using BWA mem with 24 threads, piped into SAMBLASTER for duplicate marking and a SAMBAMBA sort with 30 GB of memory, takes 7-9 hours. As you are limited to 16 GB RAM, you'll probably need to limit BWA to 8 threads or so, if your machine even has that capacity. Either way, it will probably take an entire day.
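A scaled-down sketch of that pipeline for a 16 GB machine might look like the following; the tool chain (bwa mem → samblaster → sambamba sort) is from the answer above, but the reference path, FASTQ names, and the exact thread/memory values are illustrative assumptions:

```shell
# Hypothetical 16 GB-friendly version: align, mark duplicates in the
# stream, convert to BAM, and sort with a capped memory budget.
# ref.fa, R1/R2.fastq.gz, and the -t/-m values are placeholders.
bwa mem -t 8 ref.fa R1.fastq.gz R2.fastq.gz \
  | samblaster \
  | samtools view -b - \
  | sambamba sort -m 8G -t 4 -o sample.sorted.bam /dev/stdin
```

Marking duplicates in the stream (samblaster) avoids a separate pass over a multi-hundred-gigabyte intermediate file, which is the main reason this kind of pipeline fits in a day.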
