Hi! I am trying to do SNP calling and clustering on 20 strains of amoeba and I am following the methods described in this paper: paper
SAMfiles of all samples were further processed using picard tools version 1.106 (http://picard.sourceforge.net). SAMfiles were sorted (SortSam.jar), duplicates marked (MarkDuplicates.jar) before read groups were added (AddOrReplaceReadGroups.jar) and BAM indices build (BuildBamIndex.jar). The GATK57 was then applied to the resulting BAMfiles to realign indels and remove duplicates. SNPs were called for all samples combined using the UnifiedGenotyper with parameters -ploidy 1, -glm SNP and a minimum phred-scaled confidence threshold of 100 to call and emit variants. Called variants were hard filtered using filters ‘QD<2.0’, ‘FS>60.0’ and ‘MQ<30.0’. Variants passing the filter were annotated using SNPeff. Filtered SNPs were LD pruned (LD threshold: 1.0) and Identity-By-State (IBS) proportions computed from remaining 30,444 SNPs using the Bioconductor package SNPrelate and R. These were then clustered using complete hierarchical clustering.
I have done all the steps until the SNPeff annotation since the downloaded annotation from their database has a long list of chromosome names (that I could not find in any other annotation files on the same organism). Here is an example
# Chromosomes : Format 'chromo_name size codon_table' # 'CH709182' 39796 Standard # 'CH709153' 12018 Standard # 'CH709148' 11086 Standard # 'CH709158' 9512 Standard # 'CH709161' 8929 Standard # 'CH709160' 8346 Standard
The chromosome names in my vcf file look like this:
Chromosome DDB0215018 Chromosome DDB0215151: Chromosome DDB0220052: 0.50%, 1/200 Chromosome DDB0232428: 0.04%, 4/11,217 Chromosome DDB0232429: 0.03%, 5/17,119 Chromosome DDB0232430: 0.03%, 4/15,291 Chromosome DDB0232431: 0.03%, 4/11,931 Chromosome DDB0232432 Chromosome DDB023243
Since I am not interested in which genes are polymorphic but in making a dendrogram of the similarity between strains (how divergent are the strains not specific genes) I skipped this step and loaded the dataset in R.
I merged the 20 vcf files into one vcf with the command:
vcf-merge 1NC105.1_filtered_snps.vcf.gz 2NC28.1_filtered_snps.vcf.gz ....(more strains)... | bgzip -c > output.vcf.gz
Then I loaded the dataset in R and used SNPRelate package with the following code:
library(SNPRelate) library(gdsfmt) vcf.fn <- "output.vcf" snpgdsVCF2GDS(vcf.fn,"test1.gds", method="biallelic.only") snpgdsSummary("test1.gds") genofile<- snpgdsOpen("test1.gds") snpset <- snpgdsLDpruning(genofile, autosome.only=FALSE, ld.threshold=0.2) ibd <- snpgdsIBDMoM(genofile,maf=0.05, missing.rate=0.05, autosome.only = FALSE)
The result is like this:
SNP pruning based on LD: Excluding 78,147 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN) Working space: 20 samples, 40 SNPs using 1 (CPU) core sliding window: 500,000 basepairs, Inf SNPs |LD| threshold: 0.2 method: composite Chromosome DDB0215018: 0.24%, 1/425 Chromosome DDB0215151: 0.82%, 1/122 Chromosome DDB0220052: 0.50%, 1/200 Chromosome DDB0232428: 0.04%, 4/11,217 Chromosome DDB0232429: 0.03%, 5/17,119 Chromosome DDB0232430: 0.03%, 4/15,291 Chromosome DDB0232431: 0.03%, 4/11,931 Chromosome DDB0232432: 0.03%, 4/12,963 Chromosome DDB0232433: 0.03%, 3/8,595 27 markers are selected in total.
Does anyone know why it excludes almost all my SNPs and how I can get around this problem? At first I thought it was a problem with the autosome filtering but I added that (autosome.only=FALSE) and it still throws them away. I am using the same data as in the paper and in the paper the resulting SNPS were 30 000 so I am clearly doing something wrong...