How to subsample a SARS-CoV-2 dataset with limited computational resources?
3.2 years ago
fhsantanna ▴ 610

I need to run a phylogenetic analysis of 300 SARS-CoV-2 samples, but this is proving challenging because of the enormous GISAID dataset (> 500k genomes). I removed sequences that fall outside the temporal window of my samples, which reduced the dataset to ~350k genomes. Even so, the Nextstrain pipeline and genome-sampler (https://caporasolab.us/genome-sampler/intro.html) both crash, since I only have 64 GB of RAM available.

Given that, I am willing to adapt my analysis to my computational resources. GISAID provides metadata for all genomes, and I am thinking of subsampling the GISAID dataset by date, country and pangolin_lineage.
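
To make the idea concrete, here is a minimal sketch of such a stratified subsample, using only the Python standard library. It streams the metadata row by row and keeps a fixed-size random reservoir per (month, country, lineage) stratum, so memory use depends on the number of kept rows rather than on the size of the full table. The column names `strain`, `date`, `country` and `pangolin_lineage`, the tab delimiter, and the function name `stratified_sample` are assumptions for illustration and should be checked against the actual metadata file; the tiny in-memory table stands in for it.

```python
import csv
import io
import random
from collections import defaultdict


def stratified_sample(rows, per_stratum, seed=0):
    """Keep at most `per_stratum` rows per (month, country, lineage) stratum.

    Reservoir sampling: only the kept rows are held in memory, and every
    row of a stratum has the same chance of being kept.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # stratum -> rows currently kept
    seen = defaultdict(int)         # stratum -> rows seen so far
    for row in rows:
        # Assumed column names; "date"[:7] reduces YYYY-MM-DD to YYYY-MM.
        key = (row["date"][:7], row["country"], row["pangolin_lineage"])
        seen[key] += 1
        bucket = reservoirs[key]
        if len(bucket) < per_stratum:
            bucket.append(row)
        else:
            # Replace a kept row with probability per_stratum / seen[key].
            j = rng.randrange(seen[key])
            if j < per_stratum:
                bucket[j] = row
    return reservoirs


# Toy metadata standing in for the real TSV (hypothetical records).
demo_tsv = (
    "strain\tdate\tcountry\tpangolin_lineage\n"
    "s1\t2021-01-03\tBrazil\tP.1\n"
    "s2\t2021-01-10\tBrazil\tP.1\n"
    "s3\t2021-01-17\tBrazil\tP.1\n"
    "s4\t2021-01-24\tBrazil\tP.1\n"
    "s5\t2021-01-05\tBrazil\tB.1.1.28\n"
    "s6\t2021-02-02\tUSA\tB.1.1.7\n"
    "s7\t2021-02-09\tUSA\tB.1.1.7\n"
)

reader = csv.DictReader(io.StringIO(demo_tsv), delimiter="\t")
sample = stratified_sample(reader, per_stratum=2)
kept = sorted(r["strain"] for rows in sample.values() for r in rows)
print(len(kept), "genomes kept across", len(sample), "strata")
```

For a real run, the `io.StringIO` object would be replaced by an open file handle on the metadata TSV, and the kept `strain` names could then be used to pull the matching sequences out of the FASTA file.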

My main problem is that I do not know how many genomes I should select for a computer with 64 GB of RAM. Another question is how many genomes per stratum I need to achieve significant results.
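
The two numbers are linked: once a total genome budget is chosen, the per-stratum cap follows from the stratum sizes. The sketch below shows only that arithmetic. It does not say how many genomes fit in 64 GB; the budget of 2000 and the stratum sizes are hypothetical values, and `largest_cap` is an illustrative helper name. In practice the budget would have to be found empirically, for example by running the pipeline on increasingly large subsets while watching peak memory.

```python
def largest_cap(stratum_sizes, budget):
    """Largest per-stratum cap c such that sum(min(n, c)) <= budget.

    Small strata are kept whole; large strata are truncated to c genomes.
    The total is non-decreasing in c, so a binary search finds the answer.
    """
    lo, hi = 0, max(stratum_sizes, default=0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(min(n, mid) for n in stratum_sizes) <= budget:
            lo = mid       # mid fits within the budget, try larger caps
        else:
            hi = mid - 1   # mid overshoots the budget, try smaller caps
    return lo


# Hypothetical stratum sizes (genomes per date/country/lineage group).
sizes = [5000, 1200, 40, 3]
cap = largest_cap(sizes, budget=2000)
total = sum(min(n, cap) for n in sizes)
print(cap, total)
```

With these toy numbers, the two small strata are kept in full and the two large ones are each cut down to the same cap, which keeps rare lineages represented while staying under the budget.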

Could you please give me some ideas? Thank you very much.

sars-cov-2 subsample ram genome gisaid • 686 views