Question

Determine reading frame of SNP

0

9.8 years ago

arronslacey ▴ 320

Hi, I have downloaded a FASTA file from dbSNP of all SNPs for a certain gene. Unfortunately, that FASTA file contains nucleotide sequences, whereas I need the peptide sequences (I would have thought there would be an option to specify this when downloading). I see there are programs such as transeq that can do the conversion, but they require the reading frame for each SNP. I assume that the sequences in my FASTA file will be spread across different reading frames. How can I obtain this information? Ideally I'd like to pass the entire file to transeq, but I guess I will have to split the file into individual SNPs and feed each one to transeq once I know its reading frame.

That is, unless anyone has another suggestion for getting the protein sequence. I will not be able to use UCSC, unfortunately.

SNP • 2.6k views

Updated 2.4 years ago by Ram 43k • written 9.8 years ago by arronslacey ▴ 320

score 0 · Answer 1 · 2014-07-21

9.8 years ago

Devon Ryan 104k

Just download the GTF/GFF annotation file from UCSC and determine where the SNP nucleotide falls within the coding sequence from that. You can then determine the affected amino acid and/or get the appropriate reading frame in R or with Biopython/BioPerl.
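
As a rough illustration of the annotation-based approach, here is a minimal Python sketch that maps a SNP's genomic coordinate onto a codon number and a position within that codon. The function name `codon_position` and the example coordinates are hypothetical. It assumes you have already extracted the CDS intervals of one transcript from the GTF (1-based, inclusive) and that the CDS is complete, i.e. it starts at the first base of the start codon; for partial CDS records you would also need the frame/phase column.

```python
def codon_position(cds_segments, strand, snp_pos):
    """Return (codon_number, position_in_codon), both 1-based.

    cds_segments: list of (start, end) genomic intervals, 1-based inclusive,
                  as found on the CDS lines of a GTF for a single transcript.
    strand:       "+" or "-".
    snp_pos:      genomic coordinate of the SNP.
    Returns None if the SNP does not fall inside any CDS segment.
    """
    # Walk the segments in transcription order (reversed on the minus strand).
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    coding_bases_before = 0
    for start, end in ordered:
        if start <= snp_pos <= end:
            # Distance from the 5' end of this segment, strand-aware.
            within = snp_pos - start if strand == "+" else end - snp_pos
            cds_index = coding_bases_before + within  # 0-based offset in the CDS
            return cds_index // 3 + 1, cds_index % 3 + 1
        coding_bases_before += end - start + 1
    return None


# Hypothetical two-exon CDS, for illustration only.
segments = [(101, 109), (201, 206)]
print(codon_position(segments, "+", 203))
print(codon_position(segments, "-", 203))
```

The position within the codon (1, 2 or 3) is what tells you how to phase the flanking sequence before translating it.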

Ram · Answer 2 · 2015-10-31

I am not exactly sure of your goal, but as far as I can see you can use a tool I recently published called I-PV. I-PV takes four files as input: a FASTA file of your amino acid sequence (NP_.... etc), an mRNA corresponding to that sequence (NM_...etc), a single array of conservation scores (you can supply an array of random numbers if you are not interested in conservation), and lastly a variation file in nucleotide format, which can be downloaded from BioMart (http://www.ensembl.org/, second option from the top). The program looks at your protein sequence, determines which frame of the mRNA encodes it, and crops the rest, leaving only the coding sequence. Then, taking into account the strand information and the transcript of your choice (the variation file usually contains multiple SNPs for each transcript), it plots the SNPs. While plotting, the software will tell you which frame matches your protein sequence. There are 3 possible frames, so you will get 1, 2 or 3. Here is a video of making an example graph: http://i-pv.org/intro_ipv_alt4.html
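
The frame-matching idea described above can be sketched in a few lines of Python. This is only an illustration of the principle, not I-PV's actual code (I-PV itself is written in Perl); the names `translate` and `find_coding_frame` and the toy sequences are hypothetical, and the standard genetic code is assumed.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate a nucleotide sequence codon by codon; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_coding_frame(mrna, protein):
    """Return (frame, coding_sequence) where frame is 1, 2 or 3, or None.

    Each forward frame of the mRNA is translated and searched for the
    protein; the matching stretch of mRNA is returned as the coding sequence.
    """
    for frame in (1, 2, 3):
        hit = translate(mrna[frame - 1:]).find(protein)
        if hit != -1:
            cds_start = frame - 1 + 3 * hit
            return frame, mrna[cds_start:cds_start + 3 * len(protein)]
    return None


# Toy example: 2 bases of 5' UTR, a short ORF, and 2 bases of 3' UTR.
print(find_coding_frame("GGATGCGCCACTAATT", "MRH"))
```

Once the coding sequence is cropped out this way, every SNP position inside it can be assigned to a codon by integer division by 3.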

Since the entire protein is coded within one of 3 possible frames, the SNPs residing in that protein also inherit that frame. But if by 'reading frame' you mean the codon that codes for that SNP, then you can use the graph you generated to inspect it as well. Here is an example from MYOSIN2: http://i-pv.org/gifs/readingFrame.gif

Here I select a SNP, for instance p.R1682H, then I turn on only the arginines in the sequence. Then I mouse over residue 1682 and open the codon view. I can now see the possible point mutations in that codon. One of them (the fourth one) results in histidine. The reading frame of that SNP, shown at the center, is "CGC".
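
To make the codon view concrete, the following Python sketch lists every single-nucleotide change of a codon together with the amino acid it produces. It is an independent illustration rather than I-PV code: the function name `point_mutations` is hypothetical, the standard genetic code is assumed, and the order of the mutants here will not necessarily match the order shown in the I-PV graph.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def point_mutations(codon):
    """Return a list of (mutant_codon, amino_acid) for all 9 single-base changes."""
    codon = codon.upper()
    results = []
    for pos in range(3):
        for base in "ACGT":
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                results.append((mutant, CODON_TABLE[mutant]))
    return results


# The arginine codon from the p.R1682H example.
for mutant, amino_acid in point_mutations("CGC"):
    print(mutant, amino_acid)
```

For "CGC" (arginine), the only single-base change that gives histidine is the G-to-A substitution at the second position, producing "CAC".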

To use I-PV, you will need Circos and Perl installed locally.

I hope this helps,

Good luck with your research,