Question: Retrieve list of kept SNP's after LD pruning
1
gravatar for leeandroid
9 months ago by
leeandroid80
leeandroid80 wrote:

Hi everyone!

I followed the SNPRelate package tutorial in order to filter a VCF by LD and finally pursued PCA. My problem is that by converting the VCF into GDS SNP's and Chromosomes ID's are also converted into a different nomenclature.

sample.id, a unique identifier for each sample.

snp.id, a unique identifier for each SNP.

How can I get the original SNP ID's that are kept after LD pruning?

Thanks in advance.

setwd("~/Desktop/snprelate")
    library(gdsfmt)
    library(SNPRelate)

    cap.vcf <- "all_148.vcf"
    snpgdsVCF2GDS(cap.vcf, "cap.gds", method = "biallelic.only")
    snpgdsSummary("cap.gds")

    genofile <- snpgdsOpen("cap.gds")

    pop_code <- scan("pop.txt", sep = "\t", what = list("pop.txt"))
    pop_code

    set.seed(1000)
    cap.LD <- snpgdsLDpruning(genofile, remove.monosnp = TRUE, ld.threshold = .1, maf = .1, missing.rate = 0.1, method = "composite", slide.max.bp = 500000, verbose = TRUE)

    snp_id_list <- unlist(cap.LD)
snprelate snp gds R vcf • 504 views
ADD COMMENTlink modified 18 days ago by jessivannatta10 • written 9 months ago by leeandroid80

Hi there,

I'm also having trouble finding which SNPs have been pruned after LD pruning. I want to be able to list those loci that are being removed so I can remove them from my dataset for future analyses. I'll be honest that I am very new to this stuff!

Here is my code (forgive me for any errors!):

library(gdsfmt) library(SNPRelate)

File Conversion - VCF to GDS

vcf.fn <- "H:PhD Research/UNEAK_Bat_West1+3+4/Filtered Data/CORA 50%.vcf" snpgdsVCF2GDS(vcf.fn, "CORA.gds", method="biallelic.only") snpgdsSummary("CORA.gds")

setwd("H:/PhD Research/UNEAK_Bat_West1+3+4/SNPRelate/")

Clean-up Linkage Disequilibrium

CORA.gds <- snpgdsOpen("CORA.gds", readonly=FALSE, allow.duplicate=TRUE, allow.fork=FALSE) set.seed(1000) snpset <- snpgdsLDpruning(CORA.gds, ld.threshold=0.8) names(snpset) head(snpset$chr0) snp.id <- unlist(snpset) read.gdsn(index.gdsn(CORA.gds, "snpset"))

I'm trying to read the GDS field in the last line but can't seem to get an output. Instead I get an error: Error in inherits(node, "gdsn.class") : No such GDS node "snpset"!

I'm also trying to calculate HWE equilibrium and have come across very little documentation on how to do this, but here is my code:

Clean-up HWE

CORA.gds <- snpgdsOpen("CORA.gds", readonly=FALSE, allow.duplicate=FALSE, allow.fork=FALSE) sample.id <- read.gdsn(index.gdsn(CORA.gds, "sample.id")) p <- snpgdsHWE(CORA.gds, sample.id=NULL, snp.id=NULL, with.id=TRUE) summary(p) plot(-log10((1:length(p))/length(p)), -log10(p[order(p)]), xlab="-log10(expected P)", ylab="-log10(observed P)", main="QQ Plot")

Any advice at all on any of these aspects would be much appreciated!

ADD REPLYlink written 18 days ago by jessivannatta10

Please post as a new question. In doing so, please highlight your code (with mouse) and select the 101 010 button to format it as code. This helps to improve visualisation.

ADD REPLYlink written 18 days ago by Kevin Blighe35k

Done. Thank you for your help, Kevin.

ADD REPLYlink written 18 days ago by jessivannatta10
2
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe35k
Republic of Ireland
Kevin Blighe35k wrote:

Thank you for adding the code.

A few things to note:

  • if the chromosomes in your VCF are labelled with a 'chr' prefix, then you should use ignore.chr.prefix = "chr" with the snpgdsVCF2GDS() function
  • SNPRelate retains both the row index of each variant and also the ID (presumably whatever was set in the VCF ID field). The documentation states: "the variable snp.id stores the original the row index of variants, and the variable snp.rs.id stores the rs id". So, if you want to map back from the LD pruning object to the variant ID, then go by the row index.
  • for samples, I can only assume that they are taken from left to right as they appear in the VCF. You should run rigorous checks to confirm this by looking at variants in the GDS object and how they match up to your VCF
ADD COMMENTlink written 9 months ago by Kevin Blighe35k
1

Thank you for replying, Kevin.

I was aware of the importance of chromosome nomenclature so I previously changed it to numeric values. Regarding variant ID's you're right. I was able to retrieve its original name by corresponding the row index in the snp_id_list to the row with the same index of the snp.rs.id stored in the GDS file. (In this case sample name was not an issue, I copied it by mistake when I wrote the question. But, its stored the same way as the snps). Any of GDS fields can be checked by read.gdsn(index.gdsn(genofile, "field_name"))

Thank you!

ADD REPLYlink written 9 months ago by leeandroid80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2277 users visited in the last hour