Question

All SNPs excluded with LD pruning

0

Entering edit mode

5.2 years ago

popescuiofelia ▴ 10

Hi! I am trying to do SNP calling and clustering on 20 strains of amoeba and I am following the methods described in this paper: paper

SAMfiles of all samples were further processed using picard tools version 1.106 (http://picard.sourceforge.net). SAMfiles were sorted (SortSam.jar), duplicates marked (MarkDuplicates.jar) before read groups were added (AddOrReplaceReadGroups.jar) and BAM indices build (BuildBamIndex.jar). The GATK57 was then applied to the resulting BAMfiles to realign indels and remove duplicates. SNPs were called for all samples combined using the UnifiedGenotyper with parameters -ploidy 1, -glm SNP and a minimum phred-scaled confidence threshold of 100 to call and emit variants. Called variants were hard filtered using filters ‘QD<2.0’, ‘FS>60.0’ and ‘MQ<30.0’. Variants passing the filter were annotated using SNPeff. Filtered SNPs were LD pruned (LD threshold: 1.0) and Identity-By-State (IBS) proportions computed from remaining 30,444 SNPs using the Bioconductor package SNPrelate and R. These were then clustered using complete hierarchical clustering.

I have done all the steps until the SNPeff annotation since the downloaded annotation from their database has a long list of chromosome names (that I could not find in any other annotation files on the same organism). Here is an example

# Chromosomes                : Format 'chromo_name size codon_table'
#  'CH709182' 39796 Standard
#  'CH709153' 12018 Standard
#  'CH709148' 11086 Standard
#  'CH709158' 9512 Standard
#  'CH709161' 8929 Standard
#  'CH709160' 8346 Standard

The chromosome names in my vcf file look like this:

Chromosome DDB0215018 Chromosome DDB0215151: Chromosome DDB0220052: 0.50%, 1/200 Chromosome DDB0232428: 0.04%, 4/11,217 Chromosome DDB0232429: 0.03%, 5/17,119 Chromosome DDB0232430: 0.03%, 4/15,291 Chromosome DDB0232431: 0.03%, 4/11,931 Chromosome DDB0232432 Chromosome DDB023243

Since I am not interested in which genes are polymorphic but in making a dendrogram of the similarity between strains (how divergent are the strains not specific genes) I skipped this step and loaded the dataset in R.

I merged the 20 vcf files into one vcf with the command:vcf-merge 1NC105.1_filtered_snps.vcf.gz 2NC28.1_filtered_snps.vcf.gz ....(more strains)... | bgzip -c > output.vcf.gz

Then I loaded the dataset in R and used SNPRelate package with the following code:

    library(SNPRelate)
    library(gdsfmt)
    vcf.fn <- "output.vcf"
    snpgdsVCF2GDS(vcf.fn,"test1.gds", method="biallelic.only")
    snpgdsSummary("test1.gds")
    genofile<- snpgdsOpen("test1.gds")
    snpset <- snpgdsLDpruning(genofile, autosome.only=FALSE, ld.threshold=0.2)
    ibd <- snpgdsIBDMoM(genofile,maf=0.05, missing.rate=0.05, autosome.only = FALSE)

The result is like this:

     SNP pruning based on LD:
  Excluding 78,147 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 20 samples, 40 SNPs
using 1 (CPU) core
sliding window: 500,000 basepairs, Inf SNPs
|LD| threshold: 0.2
method: composite
Chromosome DDB0215018: 0.24%, 1/425
Chromosome DDB0215151: 0.82%, 1/122
Chromosome DDB0220052: 0.50%, 1/200
Chromosome DDB0232428: 0.04%, 4/11,217
Chromosome DDB0232429: 0.03%, 5/17,119
Chromosome DDB0232430: 0.03%, 4/15,291
Chromosome DDB0232431: 0.03%, 4/11,931
Chromosome DDB0232432: 0.03%, 4/12,963
Chromosome DDB0232433: 0.03%, 3/8,595
27 markers are selected in total.

Does anyone know why it excludes almost all my SNPs and how I can get around this problem? At first I thought it was a problem with the autosome filtering but I added that (autosome.only=FALSE) and it still throws them away. I am using the same data as in the paper and in the paper the resulting SNPS were 30 000 so I am clearly doing something wrong...

LD pruning SNP analysis • 2.3k views

ADD COMMENT • link updated 5.1 years ago by Biostar 20 • written 5.2 years ago by popescuiofelia ▴ 10

1

Entering edit mode

Hmmm, you have ld.threshold=0.2, whilst in text you reference they say LD pruned (LD threshold: 1.0). Also, I wonder if there should be some sort of correction for multiple tests (i.e., using a false discovery rate to lower the LD threshold)?

ADD REPLY • link 5.1 years ago by jean.elbers ★ 1.7k