Question

Assessing Whether Point Mutations Introduce Stop Codons Into Chromosomal Dna Using Ensembl And R

4

Entering edit mode

13.1 years ago

Alexander ▴ 40

Problem:

I have a spreadsheet that contains a list of mostly point mutations in a particular type of cancer. The sheet has columns that correspond to the chromosome number, the position, the reference base pair, and the alternative base pair. It also has a column called "uploaded variation" that summarizes this information in one statement. Last, there is another "location" column that contains both the chromosome number and position, and a column with the gene ID. An example is shown below:

CHROM    POS            REF    ALT        Uploaded_variation    Location        Gene
chr17    45229228    A    C        17_45229228_A/C        17:45229228        ENSG00000004897
chr17    45229234    A    C        17_45229234_A/C        17:45229234        ENSG00000004897
chr17    45234706    G    C        17_45234706_G/C        17:45234706        ENSG00000004897
chr17    45232043    T    C        17_45232043_T/C        17:45232043        ENSG00000004897
chr17    45229254    T    G        17_45229253_T/G        17:45229253        ENSG00000004897
chr17    45229253    T    G        17_45229253_T/G        17:45229253        ENSG00000004897

I have two questions that I would like to answer:

First, is the reference base pair a member of a stop codon, i.e., TAG, TAA, or TGA.
Second, does the alternative base pair introduce such a stop codon?

Possible solution:

Using the first entry from the above table, I thought perhaps that I could take the location of the base pair, 17:45229228, and add and subtract 2 from it. This would give me the base pairs on either side of the codon, i.e., 17:45229226-45229230. I would then download the sequence that corresponds to these five base pairs and look to see whether or not a stop codon appears. I suppose only the positive strand would be of interest in this case.

Help:

Can someone help guide me on how to do this? I know I can get the sequence from Ensembl's genome browser, but I am not sure how to automate this task (I have a lot of entries that I need to process). Once I download the sequence, what would be the best way to assess for a stop codon? Could I easily program R to do this? Does anyone know of a simpler way to go about this? Thanks to anyone who can help!

ensembl r biomart • 5.0k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 13.1 years ago by Alexander ▴ 40

score 7 · Answer 1 · 2012-05-25

In R, something like this might work:

library(GenomicFeatures)
library(VariantAnnotation)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

#Using your data
dat = GRanges(seqnames=rep('chr17',3),ranges=IRanges(start=c(45229228,45229234,45234706),width=1))
varallele = DNAStringSet(c('C','C','C'))
db = TxDb.Hsapiens.UCSC.hg19.knownGene

results = predictCoding(dat,db,Hsapiens,varallele)

The results look like this:

DataFrame with 12 rows and 13 columns
     queryID consequence         refSeq         varSeq         refAA
   <integer>    <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
1          1  synonymous         AATAGC         AACAGC            NS
2          1  synonymous         AATAGC         AACAGC            NS
3          1  synonymous         AATAGC         AACAGC            NS
4          1  synonymous         AATAGC         AACAGC            NS
5          2  synonymous         AGTGGA         AGCGGA            SG
6          2  synonymous         AGTGGA         AGCGGA            SG
7          2  synonymous         AGTGGA         AGCGGA            SG
8          2  synonymous         AGTGGA         AGCGGA            SG
9          3  synonymous            CAG            CAG             Q
10         3  synonymous            CAG            CAG             Q
11         3  synonymous            CAG            CAG             Q
12         3  synonymous            CAG            CAG             Q
           varAA  seqTxLoc  varTxLoc varCdsLoc subjStrand        txID     cdsID
   <AAStringSet> <integer> <integer> <integer>      <Rle> <character> <integer>
1             NS      1030      1032        74          -       61662    191676
2             NS      1012      1014        56          -       61663    191685
3             NS      1012      1014        56          -       61664    191685
4             NS       829       831        56          -       61665    191685
5             SG      1024      1026        68          -       61662    191676
6             SG      1006      1008        50          -       61663    191685
7             SG      1006      1008        50          -       61664    191685
8             SG       823       825        50          -       61665    191685
9              Q       520       520        44          -       61662    191679
10             Q       520       520        44          -       61663    191679
11             Q       520       520        44          -       61664    191679
12             Q       337       337        44          -       61665    191679
     geneID
   <factor>
1       996
2       996
3       996
4       996
5       996
6       996
7       996
8       996
9       996
10      996
11      996
12      996

And some sessionInfo:

> sessionInfo()
R Under development (unstable) (2012-01-19 r58141)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.6.2
 [2] BSgenome.Hsapiens.UCSC.hg19_1.3.17     
 [3] BiocInstaller_1.5.7                    
 [4] BSgenome_1.23.4                        
 [5] GenomicFeatures_1.7.30                 
 [6] AnnotationDbi_1.17.23                  
 [7] Biobase_2.15.4                         
 [8] VariantAnnotation_1.1.58               
 [9] Rsamtools_1.7.37                       
[10] Biostrings_2.23.6                      
[11] GenomicRanges_1.7.30                   
[12] IRanges_1.13.28                        
[13] BiocGenerics_0.1.12                    

loaded via a namespace (and not attached):
 [1] biomaRt_2.11.1     bitops_1.0-4.1     DBI_0.2-5          grid_2.15.0       
 [5] lattice_0.20-0     Matrix_1.0-3       RCurl_1.91-1       RSQLite_0.11.1    
 [9] rtracklayer_1.15.7 snpStats_1.5.4     splines_2.15.0     stats4_2.15.0     
[13] survival_2.36-12   tools_2.15.0       XML_3.9-4          zlibbioc_1.1.1

score 4 · Answer 2 · 2012-05-25

4

Entering edit mode

13.1 years ago

Alastair Kerr 5.3k

Check out the Variant Effect Predictor on the Ensembl Tools Site.

There is a nice tutorial on it by Bert Overduin on the bioinformatics knowledge blog

ADD COMMENT • link 13.1 years ago by Alastair Kerr 5.3k

0

Entering edit mode

(+1) Thanks for your suggestion! Am looking forward to reading through the tutorial later this week.

ADD REPLY • link 13.1 years ago by Alexander ▴ 40