Problem:
I have a spreadsheet that contains a list of mostly point mutations in a particular type of cancer. The sheet has columns that correspond to the chromosome number, the position, the reference base pair, and the alternative base pair. It also has a column called "uploaded variation" that summarizes this information in one statement. Last, there is another "location" column that contains both the chromosome number and position, and a column with the gene ID. An example is shown below:
CHROM  POS       REF  ALT  Uploaded_variation  Location     Gene
chr17  45229228  A    C    17_45229228_A/C     17:45229228  ENSG00000004897
chr17  45229234  A    C    17_45229234_A/C     17:45229234  ENSG00000004897
chr17  45234706  G    C    17_45234706_G/C     17:45234706  ENSG00000004897
chr17  45232043  T    C    17_45232043_T/C     17:45232043  ENSG00000004897
chr17  45229254  T    G    17_45229253_T/G     17:45229253  ENSG00000004897
chr17  45229253  T    G    17_45229253_T/G     17:45229253  ENSG00000004897
I have two questions that I would like to answer:
- First, is the reference base part of a stop codon, i.e., TAG, TAA, or TGA?
- Second, does the alternative base pair introduce such a stop codon?
Possible solution:
Using the first entry from the above table, I thought perhaps that I could take the location of the base, 17:45229228, and add and subtract 2 from it. This would give me the two bases on either side of the variant, i.e., 17:45229226-45229230, which covers every three-base window that could contain it. I would then download the sequence that corresponds to these five bases and look to see whether or not a stop codon appears. I suppose only the positive strand would be of interest in this case.
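The window idea above can be sketched roughly as follows, assuming the five-base sequence has already been retrieved. This is only an illustration in Python; the function names and the example windows are hypothetical and are not real sequence from chr17. Note that without the transcript's reading frame this only reports stop trinucleotides in any of the three frames overlapping the variant, not confirmed in-frame stop codons.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def stops_covering_center(window, strand="+"):
    """List stop trinucleotides that include the center base of a 5-base window.

    `window` is the plus-strand sequence for positions pos-2 .. pos+2.
    With strand="-", the window is reverse-complemented first; the variant
    base still sits at index 2 afterwards.
    """
    if len(window) != 5:
        raise ValueError("expected a 5-base window (position +/- 2)")
    seq = window.upper()
    if strand == "-":
        seq = reverse_complement(seq)
    # Only the trinucleotides starting at index 0, 1 and 2 contain index 2.
    return [seq[i:i + 3] for i in range(3) if seq[i:i + 3] in STOP_CODONS]


def check_variant(window, ref, alt, strand="+"):
    """Return (stops with the reference base, stops with the alternative base)."""
    window = window.upper()
    if window[2] != ref.upper():
        raise ValueError("center base of the window does not match REF")
    mutated = window[:2] + alt.upper() + window[3:]
    return stops_covering_center(window, strand), stops_covering_center(mutated, strand)


# Hypothetical windows, for illustration only.
ref_stops, alt_stops = check_variant("CTCAG", "C", "A")
print("reference:", ref_stops, "alternative:", alt_stops)
```

A non-empty first list would answer the first question for that window, and a non-empty second list with an empty first list would suggest the alternative base introduces a stop trinucleotide.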
Help:
Can someone help guide me on how to do this? I know I can get the sequence from Ensembl's genome browser, but I am not sure how to automate this task (I have a lot of entries that I need to process). Once I download the sequence, what would be the best way to assess for a stop codon? Could I easily program R to do this? Does anyone know of a simpler way to go about this? Thanks to anyone who can help!
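As a starting point for the automation, the region string for each row could be generated from the CHROM and POS columns. The sketch below assumes the spreadsheet has been exported as tab-delimited text with the column names shown in the example; `codon_window` and `windows_from_table` are hypothetical helper names, and retrieving the sequence for each region is left to whichever source is used.

```python
import csv
import io


def codon_window(chrom, pos, flank=2):
    """Build a 'chrom:start-end' region spanning pos +/- flank."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return f"{name}:{pos - flank}-{pos + flank}"


def windows_from_table(handle):
    """Read a tab-delimited table and return one region string per row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [codon_window(row["CHROM"], int(row["POS"])) for row in reader]


# Two rows from the example table, written as tab-delimited text.
example = "CHROM\tPOS\tREF\tALT\nchr17\t45229228\tA\tC\nchr17\t45234706\tG\tC\n"
print(windows_from_table(io.StringIO(example)))
```

For a real file, `windows_from_table(open(path, newline=""))` would take the place of the `io.StringIO` example, where `path` is the exported spreadsheet.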
(+1) Thanks so much for your answer! Just logged in and saw it today. Am currently on the road but this week will sit down and work through your suggestions.
@Sean, I downloaded and installed Bioconductor, including the GenomicFeatures, VariantAnnotation, BSgenome.Hsapiens.UCSC.hg19, and TxDb.Hsapiens.UCSC.hg19.knownGene packages. However, when I type results = predictCoding(dat,db,Hsapiens,varallele) I get the following: Error in function (classes, fdef, mtable) : unable to find an inherited method for function "predictCoding", for signature "GRanges", "TranscriptDb", "BSgenome", "DNAStringSet". Any ideas?
Probably best to send your example to the Bioconductor mailing list. Be sure to include the output of sessionInfo().