Question

Closed: What would be the most plausible explanation for why USA WA has more SARS-CoV-2 diversity than any other place on Earth?


4.1 years ago

5heikki 11k

My methods briefly:

1. Download all the complete genomes from GISAID
2. Remove trailing N's
3. Remove genomes with more than 10 N's post step 2
4. Remove a few crap-quality outlier genomes, all the pangolin genomes and the bat genome
5. Build a distance matrix from Mash distances (k = 17, s = 5000); s gets exhausted. Since all the genomes have the same orientation, I'm using the -n option here
6. Cluster with AP (I'm adjusting q here to reduce the number of clusters: sm_ap <- apcluster(negDistMat(r=2),sm,q=0.05))
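
For anyone wanting to reproduce the trimming and N-count filtering steps above, here is a minimal sketch in Python. It is not the code actually used for this post. It assumes the genomes are already loaded into a dict of name to sequence, that "trailing" means the 3' end only, and that a genome with exactly 10 N's is kept. The function names and the toy records are made up for illustration.

```python
def trim_trailing_n(seq: str) -> str:
    """Drop any run of N/n characters at the 3' end of the sequence."""
    return seq.rstrip("Nn")


def count_n(seq: str) -> int:
    """Count ambiguous N bases, case-insensitively."""
    return seq.upper().count("N")


def filter_genomes(genomes: dict, max_n: int = 10) -> dict:
    """Trim trailing N's, then keep genomes with at most max_n N's left.

    Assumption: the threshold is inclusive (more than max_n N's is removed).
    """
    kept = {}
    for name, seq in genomes.items():
        trimmed = trim_trailing_n(seq)
        if count_n(trimmed) <= max_n:
            kept[name] = trimmed
    return kept


# Toy records (hypothetical names, not real GISAID entries).
toy = {
    "genome_a": "ACGTNNNN",                 # only trailing N's
    "genome_b": "AC" + "N" * 11 + "GT",     # 11 internal N's, gets removed
    "genome_c": "ACGNTNN",                  # 1 internal N after trimming
}
filtered = filter_genomes(toy)
```

Trimming before counting matters here: genomes that only have a run of N's at the end are not penalised for it.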

So there are 570 genomes here, grouped into 12 clusters. Somehow 8 of the 12 clusters contain genomes from USA WA (49 genomes altogether in my dataset). I'm also a little baffled by how different that Australia NSW lineage is from everything else.
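
To put a figure like "8 of 12 clusters" next to each location's sample size, a tally along these lines could help. This is only a sketch: it assumes you already have one (location, cluster id) pair per genome, and the location labels and cluster ids below are toy values, not the real assignments from this dataset.

```python
from collections import defaultdict


def clusters_per_location(assignments):
    """Return {location: (genome_count, distinct_cluster_count)}.

    assignments: iterable of (location, cluster_id) pairs, one per genome.
    """
    clusters = defaultdict(set)
    counts = defaultdict(int)
    for location, cluster_id in assignments:
        clusters[location].add(cluster_id)
        counts[location] += 1
    return {loc: (counts[loc], len(clusters[loc])) for loc in clusters}


# Toy assignments (hypothetical, for illustration only).
toy_assignments = [
    ("USA/WA", 1),
    ("USA/WA", 2),
    ("USA/WA", 2),
    ("Australia/NSW", 3),
]
summary = clusters_per_location(toy_assignments)
```

Sorting the result by genome count makes it easy to see whether locations with more genomes also show up in more clusters.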

Here's a PDF if someone wants to Ctrl+F something:

570 genomes

Edit: Here is one without the NSW lineage. Its sister lineage (maybe originally from Shandong?) is still a big outlier and has members from multiple countries. This is the main reason why I believe that the NSW lineage is real as well. This one has 13 clusters instead of the 12 in the OP.

covid-19 SARS-CoV-2 • 340 views

updated 10 months ago by Ram 43k • written 4.1 years ago by 5heikki 11k