retrieve complete gene sequence (exons and introns) from SNP name (rsid)
3
0
Entering edit mode
7.3 years ago
rdlady ▴ 40

I have a list of SNPs that consists of their rsids or SNP names, like rs6699871, for example. From that list, I would like to retrieve a list of gene variations that result from that SNP. I would expect something like this as output:

rs6699871 var1 CTACATCGATTTGCAGCACCCAGCTGA[C]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA <- (this is the COMPLETE gene or ORF sequence, not the X bp flanking sequence) rs6699871 var2 CTACATCGATTTGCAGCACCCAGCTGA[T]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA rs6699871 var3 CTACATCGATTTGCAGCACCCAGCTGA[G]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA

where the [X] refers to the location of the SNP and it's different forms (C, T, or G). Each line represents a variation in the sequence of a given gene resulting from the SNP rs6699871. This variation could be either in Introns or Exons.

So my question is: what database would have this SNP versus gene variations map that I'm looking for, or what database could have enough information on SNPs that could allow me to retrieve this output easily?

I have experience in linux terminal, perl, and c++, so if I know where to look I can make a script to retrieve this data automatically for a large list of SNPs.

Thanks!

SNP gene genome sequence sequencing • 2.7k views
ADD COMMENT
3
Entering edit mode
7.3 years ago
rdlady ▴ 40

I think I got it!!!!!

R's biomaRt library is amazing! I was able to retrieve the complete gene sequences (introns+exons) for the gene where my SNP is located.

I made an R script that retrieves the complete gene sequence and its variations using a SNP name or rsid as input. The script is available at Github:

https://github.com/raqueldias/retrieve_gene_variations_from_SNP/blob/master/retrieve_SNP_info.R

The script requires the biomaRt package, so you gotta install it first in R:

source("https://bioconductor.org/biocLite.R")

biocLite("biomaRt")

After installing the biomaRt package and downloading the script retrieve_SNP_info.R , load and run the R script:

source("retrieve_SNP_info.R")

my_result <- retrieve_SNP_info("rs10828658")

Then save the results to a tabular text file:

write.table(my_result, file="retrieve_gene_result.txt", quote=F, row.names=F, sep="\t")

A full example of result file is available at Github:

https://raw.githubusercontent.com/raqueldias/retrieve_gene_variations_from_SNP/master/retrieve_result_example.txt

The result contains the complete gene sequence (exons and introns) variations for the gene that belongs to the same region as the SNP rs10828658. Now I just need to run the R script for all my SNPs, I can call the retrieve_SNP_info() function inside a for loop to do this.

Problem solved! Thanks to everyone that helped me providing information on the SNP databases.

ADD COMMENT
0
Entering edit mode
7.3 years ago

Variants with 'rs' identifiers are available from dbSNP.

If you're primarily interested in annotations, see Variant Tools (Applications > Annotations > Variant-based databases > dbSNP and dbNSFP).

ADD COMMENT
0
Entering edit mode

Thanks for answering, but in the dbSNP I can't find the complete gene sequence (introns and exons included) of the gene belonging to the SNPs. Where exactly do you click to find that?

ADD REPLY
1
Entering edit mode

The dbSNP report for rs6699871 includes 30bp of flanking sequence on each side of the SNP (section 'Submitter records for this refSNP cluster'), which matches your example. The 'Gene View' section says 'Function class: rs6699871 is located in the intron region of XM_017002858.1'. The sequence and annotations (introns/exons) for XM_017002858.1 are available from the NCBI Nucleotide Database, if you truly require the complete gene sequence. Instructions for batch processing of dbSNP and NCBI are available on their respective websites.

ADD REPLY
0
Entering edit mode

Using dbSNP you can get the coordinates of your SNP. You can add some padding to those coordinates (say 20 nucleotides up- and downstream) and create a bed file containing these intervals. Then you can use bedtools getfasta. That should work. You'll have to figure out how to modify the [X] but that would be quite straightforward since you already know the forms (from the dbSNP download).

But perhaps @harold.smith.tarheel had a different approach in mind.

ADD REPLY
0
Entering edit mode

Thanks for your reply! Sorry maybe I didn't expained myself well in my example: I want the COMPLETE GENE sequence, including introns and exons, not the Xbp flanking sequence. And I want an automatic method, not a website that I have to click, because I will have to retrieve this info for more than 100000 SNPs.

Thanks!

ADD REPLY
0
Entering edit mode

Does that mean the spliced gene sequence? What about intronic variants?
You'll need the coordinates and then some bedtools magic to get the right interval(s), then the bedtools getfasta to get the fasta

ADD REPLY
0
Entering edit mode

I mean the unspliced gene sequences (intros+exons). I found the biomaRt R package that seems to do what I want, but I'm learning about it yet. I will share the info here if biomaRt is able to extract the complete ORF/gene.

ADD REPLY
0
Entering edit mode

And I want an automatic method, not a website that I have to click, because I will have to retrieve this info for more than 100000 SNPs.

As stated previously, both dbSNP and NCBI support batch queries.

ADD REPLY
0
Entering edit mode

Oh I see it. Thanks! But I still couldn't find a way to automatically download the complete ORF using these queries.

ADD REPLY
0
Entering edit mode

What did you attempt?

ADD REPLY

Login before adding your answer.

Traffic: 1963 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6