Bioinformatics cost of whole genome sequencing
21 months ago · Tohid ▴ 10

I plan to use an Illumina sequencer to conduct WGS on 200 patients, and I am having trouble estimating the annual cost of the bioinformatics part. For example, what kind of computer does the data analysis require (RAM? CPU? GPU?)

How much hard drive space will I need to save the 200 patients' data?

Are there any databases I need to create?

Is there a need for cloud servers?

WGS • Whole-Genome-Sequencing • NGS
21 months ago · ATpoint 82k

Data of that size cannot meaningfully be analysed on a single computer (i.e. a workstation or laptop). Say a sample needs 13 hours (https://www.nature.com/articles/nmeth.3505), and I think that is quite optimistic; that would be 13 h × 200 = ~108 days of serial/sequential processing. Obviously not an option. You need parallel processing, either on an HPC cluster or on cloud instances.

In the linked paper (not saying this is the standard for anything) they used a 128 GB, 32-core node for their analysis. You need something like that, but many of them. Check whether your institution has an HPC cluster; these often exist without people knowing about them. Check the documentation of cloud providers such as AWS for processing costs, and be sure your analysis pipeline stands and is well tested before going big.

Storage will probably be well in the range of 50 TB for the raw data and alignments alone, more if the pipeline produces any other intermediate files (How many storage needs for Whole genome sequencing?). All of this needs existing high-performance infrastructure that you don't buy yourself; that would be uneconomical. Sorry that I cannot give a cost estimate, I was always spoiled to have that infrastructure provided by my university free of charge.
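To illustrate the serial-versus-parallel arithmetic above, here is a minimal back-of-envelope sketch in Python. The 13 h/sample figure is taken from the linked paper; the number of nodes is purely a hypothetical assumption, not a recommendation.

```
# Back-of-envelope wall-clock estimate for 200 WGS samples.
# Assumptions: ~13 h per sample on a 128 GB / 32-core node (figure from the
# linked Nature Methods paper); the node count below is hypothetical.
import math

hours_per_sample = 13
n_samples = 200
n_nodes = 20  # hypothetical size of an HPC allocation or cloud fleet

serial_days = hours_per_sample * n_samples / 24
parallel_days = math.ceil(n_samples / n_nodes) * hours_per_sample / 24

print(f"Serial:   {serial_days:.0f} days")    # ~108 days
print(f"Parallel: {parallel_days:.1f} days")  # ~5.4 days on 20 nodes
```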

21 months ago · Prash ▴ 270

I agree with ATpoint. Apart from cores/processors, what always matters is what kind of application you are looking at for your downstream analyses. For example, for 200 WGS samples, assuming they are all human data, each sample would have roughly 600 million reads of 150 bp (i.e. 300 million read pairs), so the raw reads, alignments and downstream annotation files accumulate to about 300 GB of data per sample. Of course, you could then keep only the raw reads and the final VCF or sorted BAM files and delete the rest (after you back them up or upload them to the SRA!).

So now comes the math: 300 GB × 200 samples = 60 TB of data to keep permanently.
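To make that arithmetic explicit, here is a minimal sketch. The ~300 GB/sample footprint is the estimate above; the split between FASTQ, BAM and VCF sizes in the "after cleanup" case is an assumption for illustration only.

```
# Rough storage estimate for 200 human WGS samples, based on the
# ~300 GB/sample figure above. The per-file-type sizes used for the
# "after cleanup" case are assumptions for illustration.
n_samples = 200

full_per_sample_gb = 300          # raw reads + intermediates + annotation
kept_per_sample_gb = {
    "fastq.gz": 100,              # assumed compressed raw reads
    "sorted.bam": 90,             # assumed aligned reads
    "vcf.gz": 1,                  # assumed final variant calls
}

full_tb = n_samples * full_per_sample_gb / 1000
kept_tb = n_samples * sum(kept_per_sample_gb.values()) / 1000

print(f"Everything kept:              {full_tb:.0f} TB")  # ~60 TB
print(f"After deleting intermediates: {kept_tb:.0f} TB")  # ~38 TB
```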

And with parallel processing, you could accomplish this in 3–4 hours per sample on average, depending on how well you can analyse the data.

Any academic lab has to sail through Moore's law of genomics. Redundancy checks and good bioinformatics practices are the need of the hour!

Best, Prash
