Question: retrieve complete gene sequence (exons and introns) from SNP name (rsid)
0
gravatar for rdlady
11 days ago by
rdlady0
rdlady0 wrote:

I have a list of SNPs that consists of their rsids or SNP names, like rs6699871, for example. From that list, I would like to retrieve a list of gene variations that result from that SNP. I would expect something like this as output:

rs6699871 var1 CTACATCGATTTGCAGCACCCAGCTGA[C]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA <- (this is the COMPLETE gene or ORF sequence, not the X bp flanking sequence) rs6699871 var2 CTACATCGATTTGCAGCACCCAGCTGA[T]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA rs6699871 var3 CTACATCGATTTGCAGCACCCAGCTGA[G]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA

where the [X] refers to the location of the SNP and it's different forms (C, T, or G). Each line represents a variation in the sequence of a given gene resulting from the SNP rs6699871. This variation could be either in Introns or Exons.

So my question is: what database would have this SNP versus gene variations map that I'm looking for, or what database could have enough information on SNPs that could allow me to retrieve this output easily?

I have experience in linux terminal, perl, and c++, so if I know where to look I can make a script to retrieve this data automatically for a large list of SNPs.

Thanks!

ADD COMMENTlink modified 10 days ago • written 11 days ago by rdlady0
1
gravatar for rdlady
10 days ago by
rdlady0
rdlady0 wrote:

I think I got it!!!!!

R's biomaRt library is amazing! I was able to retrieve the complete gene sequences (introns+exons) for the gene where my SNP is located.

I made an R script that retrieves the complete gene sequence and its variations using a SNP name or rsid as input. The script is available at Github:

https://github.com/raqueldias/retrieve_gene_variations_from_SNP/blob/master/retrieve_SNP_info.R

The script requires the biomaRt package, so you gotta install it first in R:

source("https://bioconductor.org/biocLite.R")

biocLite("biomaRt")

After installing the biomaRt package and downloading the script retrieve_SNP_info.R , load and run the R script:

source("retrieve_SNP_info.R")

my_result <- retrieve_SNP_info("rs10828658")

Then save the results to a tabular text file:

write.table(my_result, file="retrieve_gene_result.txt", quote=F, row.names=F, sep="\t")

A full example of result file is available at Github:

https://raw.githubusercontent.com/raqueldias/retrieve_gene_variations_from_SNP/master/retrieve_result_example.txt

The result contains the complete gene sequence (exons and introns) variations for the gene that belongs to the same region as the SNP rs10828658. Now I just need to run the R script for all my SNPs, I can call the retrieve_SNP_info() function inside a for loop to do this.

Problem solved! Thanks to everyone that helped me providing information on the SNP databases.

ADD COMMENTlink written 10 days ago by rdlady0
0
gravatar for harold.smith.tarheel
11 days ago by
United States
harold.smith.tarheel3.3k wrote:

Variants with 'rs' identifiers are available from dbSNP.

If you're primarily interested in annotations, see Variant Tools (Applications > Annotations > Variant-based databases > dbSNP and dbNSFP).

ADD COMMENTlink written 11 days ago by harold.smith.tarheel3.3k

Thanks for answering, but in the dbSNP I can't find the complete gene sequence (introns and exons included) of the gene belonging to the SNPs. Where exactly do you click to find that?

ADD REPLYlink written 11 days ago by rdlady0
1

The dbSNP report for rs6699871 includes 30bp of flanking sequence on each side of the SNP (section 'Submitter records for this refSNP cluster'), which matches your example. The 'Gene View' section says 'Function class: rs6699871 is located in the intron region of XM_017002858.1'. The sequence and annotations (introns/exons) for XM_017002858.1 are available from the NCBI Nucleotide Database, if you truly require the complete gene sequence. Instructions for batch processing of dbSNP and NCBI are available on their respective websites.

ADD REPLYlink modified 10 days ago • written 10 days ago by harold.smith.tarheel3.3k

Using dbSNP you can get the coordinates of your SNP. You can add some padding to those coordinates (say 20 nucleotides up- and downstream) and create a bed file containing these intervals. Then you can use bedtools getfasta. That should work. You'll have to figure out how to modify the [X] but that would be quite straightforward since you already know the forms (from the dbSNP download).

But perhaps @harold.smith.tarheel had a different approach in mind.

ADD REPLYlink written 10 days ago by WouterDeCoster11k

Thanks for your reply! Sorry maybe I didn't expained myself well in my example: I want the COMPLETE GENE sequence, including introns and exons, not the Xbp flanking sequence. And I want an automatic method, not a website that I have to click, because I will have to retrieve this info for more than 100000 SNPs.

Thanks!

ADD REPLYlink modified 10 days ago • written 10 days ago by rdlady0

Does that mean the spliced gene sequence? What about intronic variants?
You'll need the coordinates and then some bedtools magic to get the right interval(s), then the bedtools getfasta to get the fasta

ADD REPLYlink written 10 days ago by WouterDeCoster11k

I mean the unspliced gene sequences (intros+exons). I found the biomaRt R package that seems to do what I want, but I'm learning about it yet. I will share the info here if biomaRt is able to extract the complete ORF/gene.

ADD REPLYlink modified 10 days ago • written 10 days ago by rdlady0

And I want an automatic method, not a website that I have to click, because I will have to retrieve this info for more than 100000 SNPs.

As stated previously, both dbSNP and NCBI support batch queries.

ADD REPLYlink written 10 days ago by harold.smith.tarheel3.3k

Oh I see it. Thanks! But I still couldn't find a way to automatically download the complete ORF using these queries.

ADD REPLYlink written 10 days ago by rdlady0

What did you attempt?

ADD REPLYlink modified 10 days ago • written 10 days ago by harold.smith.tarheel3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 732 users visited in the last hour