4.1 years ago
5heikki
11k
My methods briefly:
- Download all the complete genomes from gisaid
- Remove trailing N's
- Remove genomes that still have more than 10 N's after the trailing N's are removed
- Remove a few poor-quality outlier genomes, plus all the pangolin genomes and the bat genome
- Build a distance matrix from Mash distances (k 17, s 5000); s gets exhausted. Since all the genomes have the same orientation, I'm using the -n option here
- Cluster with AP (I'm adjusting q here to reduce the number of clusters: sm_ap <- apcluster(negDistMat(r=2),sm,q=0.05))
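The two N-filtering steps above could be sketched roughly like this in Python. This is only an illustration, not the code I actually ran: the function names (`read_fasta`, `trim_trailing_n`, `filter_genomes`) and the `max_n` parameter are made up for the example, and it assumes lower-case n should be treated the same as N.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_trailing_n(seq: str) -> str:
    """Strip N's (either case) from the 3' end only."""
    return seq.rstrip("Nn")


def filter_genomes(records: Iterable[Tuple[str, str]],
                   max_n: int = 10) -> Dict[str, str]:
    """Trim trailing N's, then drop genomes with more than max_n N's left."""
    kept = {}
    for header, seq in records:
        trimmed = trim_trailing_n(seq)
        if trimmed.upper().count("N") <= max_n:
            kept[header] = trimmed
    return kept
```

Something like `filter_genomes(read_fasta(open("genomes.fasta")))` would then give a dict of the genomes that pass, where `genomes.fasta` is a placeholder file name.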
So there are 570 genomes here, grouped into 12 clusters. Somehow 8 of the 12 clusters contain genomes from USA WA (49 genomes altogether in my dataset). I'm also a little baffled by how different that Australia NSW lineage is from everything else.
Here's a PDF if someone wants to Ctrl+F something.
Edit: Here is one without the NSW lineage. Its sister lineage (maybe originally from Shandong?) is still a big outlier and has members from multiple countries. This is the main reason why I believe that the NSW lineage is real as well. This one has 13 clusters instead of the 12 in the OP.