Problem:
I have a spreadsheet that contains a list of mostly point mutations in a particular type of cancer. The sheet has columns that correspond to the chromosome number, the position, the reference base pair, and the alternative base pair. It also has a column called "uploaded variation" that summarizes this information in one statement. Last, there is another "location" column that contains both the chromosome number and position, and a column with the gene ID. An example is shown below:
CHROM POS REF ALT Uploaded_variation Location Gene
chr17 45229228 A C 17_45229228_A/C 17:45229228 ENSG00000004897
chr17 45229234 A C 17_45229234_A/C 17:45229234 ENSG00000004897
chr17 45234706 G C 17_45234706_G/C 17:45234706 ENSG00000004897
chr17 45232043 T C 17_45232043_T/C 17:45232043 ENSG00000004897
chr17 45229254 T G 17_45229253_T/G 17:45229253 ENSG00000004897
chr17 45229253 T G 17_45229253_T/G 17:45229253 ENSG00000004897
I have two questions that I would like to answer:
- First, is the reference base pair a member of a stop codon, i.e., TAG, TAA, or TGA?
- Second, does the alternative base pair introduce such a stop codon?
Possible solution:
Using the first entry from the above table, I thought perhaps that I could take the location of the base pair, 17:45229228, and add and subtract 2 from it. This would give me the two base pairs on either side of the variant, i.e., 17:45229226-45229230, which covers all three triplets that could contain it. I would then download the sequence that corresponds to these five base pairs and check whether a stop codon appears in any of those triplets, both with the reference base and with the alternative base substituted in. I suppose only the positive strand would be of interest in this case.
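The window idea above can be sketched in a few lines. This is only a rough illustration in Python (the same logic ports directly to R), and it makes several assumptions: the five-base sequence has already been fetched by some other means, only the plus strand is considered, and the function names `region_around`, `triplets`, and `check_variant` are made up for this example. It also ignores the reading frame, so a hit only means that a stop triplet is present in the window, not that the transcript actually reads it as a stop codon.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def region_around(chrom, pos, flank=2):
    """Build a 'chrom:start-end' region string covering pos +/- flank."""
    # Drop a leading "chr" so the name matches the Location column (e.g. "17").
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return f"{name}:{pos - flank}-{pos + flank}"


def triplets(window):
    """Return the three triplets of a 5-base window; each contains the centre base."""
    return [window[i:i + 3] for i in range(3)]


def check_variant(window, ref, alt):
    """Check a 5-base plus-strand window centred on the variant.

    ref_in_stop:      the reference base lies in a stop triplet.
    alt_creates_stop: the substitution produces a stop triplet at a
                      window offset that had none in the reference.
    """
    window, ref, alt = window.upper(), ref.upper(), alt.upper()
    if len(window) != 5:
        raise ValueError("expected a 5-base window centred on the variant")
    if window[2] != ref:
        raise ValueError("centre base of the window does not match REF")

    mutated = window[:2] + alt + window[3:]
    ref_hits = [t in STOP_CODONS for t in triplets(window)]
    alt_hits = [t in STOP_CODONS for t in triplets(mutated)]
    return {
        "ref_in_stop": any(ref_hits),
        "alt_creates_stop": any(a and not r for r, a in zip(ref_hits, alt_hits)),
    }


if __name__ == "__main__":
    # First row of the table: chr17, position 45229228.
    print(region_around("chr17", 45229228))  # 17:45229226-45229230
    # "CCAAG" is an invented window for illustration, not real genome sequence.
    print(check_variant("CCAAG", "A", "T"))
```

Because the frame and strand of the overlapping transcript are not taken into account, this is best treated as a quick filter for candidates rather than a final answer.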
Help:
Can someone help guide me on how to do this? I know I can get the sequence from Ensembl's genome browser, but I am not sure how to automate this task (I have a lot of entries that I need to process). Once I download the sequence, what would be the best way to assess for a stop codon? Could I easily program R to do this? Does anyone know of a simpler way to go about this? Thanks to anyone who can help!
(+1) Thanks so much for your answer! Just logged in and saw it today. Am currently on the road but this week will sit down and work through your suggestions.
@Sean, I downloaded and installed Bioconductor, including the GenomicFeatures, VariantAnnotation, BSgenome.Hsapiens.UCSC.hg19, and TxDb.Hsapiens.UCSC.hg19.knownGene packages. However, when I type results = predictCoding(dat,db,Hsapiens,varallele) I get the following: Error in function (classes, fdef, mtable) : unable to find an inherited method for function "predictCoding", for signature "GRanges", "TranscriptDb", "BSgenome", "DNAStringSet". Any ideas?
Probably best to send your example to the Bioconductor mailing list. Be sure to include the output of sessionInfo().