Question

How to get RSIDs of indels from GWAS summary statistics

0

Entering edit mode

3.2 years ago

Fede_Santos95 ▴ 10

Hi,

I have a GWAS summary statistics file with SNPs and indels positions, and I'd like to get their rsIDs. My .txt file has the following format:

CHR        BP             A1           A2
 8        101592213       T            C
 8        106973048       A            G
 8        108690829       T            G
 8        103128181       D            I
 8        104098842       I            D
...         ...           ...        ...

where CHR is the specific chromosome, BP is the chromosome start position of variants, A1 and A2 are reference and alternative alleles. For indels, the .txt file reports D (=deletion) and I(=insertion) symbols for A1 and A2.

Once in R,I created a new dataframe by adding a new column with end positions, which are the same of the start positions. So, the dataframe has the following aspect:

    CHR     chr_start       chr_end                     A1     A2
     8        101592213     101592213                    T     C
     8        106973048     106973048                    A     G
     8        108690829     108690829                    T     G
     8        103128181     103128181                    D     I               
     8        104098842     104098842                    I     D 
 ...           ...           ...                        ...   ...

Then, I run the following commands in R:

###commands to install biomaRt
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("biomaRt")

#loading biomaRt
library(biomaRt)

#creating mart object to search in ensembl
snp_mart <- useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org", 
                    path="/biomart/martservice", dataset="hsapiens_snp")


#creating initial storage vector
results <- c()

###loop for getting rsIDs
for (snp in 1:dim(ds1)[1]) { 
  ###ds1 is the object name of the SNP file


  temp <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'), 
                filters = c('chr_name', 'start', 'end'), 
                values = list(ds1[snp,1], ds1[snp, 2], ds1[snp, 3]), 
                mart = snp_mart)

  ds1[snp, 4] <- snp 
  ds1[snp, 5] <- temp[snp, 1]
  # Store SNP file index row next to each answer

  results <- rbind(results, temp)

}

The output of this script is the following:

refsnp_id         allele          chrom_start
rs62513865         C/T          101592213
rs79643588         G/A          106973048
rs17396518        T/C/G         108690829
...                 ...             ...

This script seems that does not work for indels of my .txt file. I'm supposing it does not work since I created the end positions column, in which end positions are the same of start positions also for indels, and this should not be correct. However, I did it because I found as possible filters for my getBM function 'chr_name', 'start' and 'end', where 'end' is known for SNPs, since end and start positions are identical, but I have no further information about indels position, apart from start positions.

Could someone help me in finding rsIDs for indels by improving getBM function in R, or by using a different strategy?

Thank u!!!

R gene snp • 1.0k views

ADD COMMENT • link 3.2 years ago by Fede_Santos95 ▴ 10