Question

SNP annotation with R and SnpEff

3

Entering edit mode

7.6 years ago

wpierrick ▴ 90

Hello everyone,

My question comes in 2 part.

First, I have a R data frame of 1.3M human SNPs (all with a RS number) and would like to:

Update their genomic position (the position I have is based on an old hapmap reference, it would be nice to have it with Hg38 coordinates)
Annotate their effect (as in the CONTEXT column in the GWAS catalog, including for example : 3_prime_UTR_variant 5_prime_UTR_variant, downstream_gene_variant, intergenic_variant, intron_variant, missense_variant, non_coding_transcript_exon_variant , regulatory_region_variant and a few other categories)

I looked at many R packages, including rsnps and BSgenome, but none of them was able to extract dbSNP information (only rsnps could extract their position but on a very small subset). I am aware SnpEff could (maybe, not sure) work with rsID but I would like to stick with R for that part.

Second question, I am using SnpEff on a vcf-type file of SNP list, and I would like to get the annotation on only the coding variants and for them whether if it is Synonymous or Missense/LoF for the non synonymous.

I ran SnpEff and got a shitload of extra information and it's hard to make sense of it. So I used the filters -no-downstream -no-intergenic -no-intron -no-upstream -no-utr -no EffectType (low). Fore some reason, the output still contained a lot of the stuff I wasn't interested about, apparently the filters weren't the good ones.

I anybody can help me with those questions, that would be great! I am starting to run a bit dry on the answers Google can tell me.

Many thanks,

SNP SnpEff R • 9.0k views

ADD COMMENT • link updated 6.2 years ago by zx8754 12k • written 7.6 years ago by wpierrick ▴ 90

0

Entering edit mode

Did you find a solution? I have to convert 8milion rs# to CHR:position, and I'm looking for a solution.

ADD REPLY • link 6.1 years ago by bigfoot ▴ 10

score 4 · Accepted Answer · 2017-11-23

4

Entering edit mode

7.6 years ago

cpad0112 21k

R code:

rsid=c("rs123","rs150")
library(biomaRt)
ensembl_snp=useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp")
getBM(attributes=c("refsnp_source",'refsnp_id','chr_name','chrom_start','chrom_end',"consequence_type_tv", "clinical_significance"), filters = 'snp_filter', values = rsid,mart = ensembl_snp)

output:

> getBM(attributes=c("refsnp_source",'refsnp_id','chr_name','chrom_start','chrom_end',"consequence_type_tv", "clinical_significance"), filters = 'snp_filter', values = rsid,mart = ensembl_snp)

 refsnp_source refsnp_id chr_name chrom_start chrom_end consequence_type_tv
1         dbSNP     rs123        7    24926827  24926827      intron_variant
2         dbSNP     rs150        7    24971033  24971033      intron_variant
  clinical_significance
1                    NA
2                    NA

coordinates seem to match with dbSNP records.

ADD COMMENT • link 7.6 years ago by cpad0112 21k

0

Entering edit mode

Thanks a lot!!! That worked well and made my day. It is weird that I have multiple entries for each SNP effect, I guess it depends on versions of the genome/multiple adds to the dbSNP database, but I'll clean that up and keep only one of the many predicted effects. Thanks a lot again.

ADD REPLY • link 7.6 years ago by wpierrick ▴ 90

0

Entering edit mode

No. It doesn't depend on the genome/multiple adds to the dbSNP database. Calculated effect is dependent on the gene and number of transcripts at the SNP position. Multiple effects for a single gene is expected if the gene has multiple transcripts and/or the position has overlapping genes. @ wpierrick

ADD REPLY • link 6.1 years ago by cpad0112 21k