Assessing Whether Point Mutations Introduce Stop Codons Into Chromosomal Dna Using Ensembl And R
2
4
Entering edit mode
11.9 years ago
Alexander ▴ 40

Problem:

I have a spreadsheet that contains a list of mostly point mutations in a particular type of cancer. The sheet has columns that correspond to the chromosome number, the position, the reference base pair, and the alternative base pair. It also has a column called "uploaded variation" that summarizes this information in one statement. Last, there is another "location" column that contains both the chromosome number and position, and a column with the gene ID. An example is shown below:

CHROM    POS            REF    ALT        Uploaded_variation    Location        Gene
chr17    45229228    A    C        17_45229228_A/C        17:45229228        ENSG00000004897
chr17    45229234    A    C        17_45229234_A/C        17:45229234        ENSG00000004897
chr17    45234706    G    C        17_45234706_G/C        17:45234706        ENSG00000004897
chr17    45232043    T    C        17_45232043_T/C        17:45232043        ENSG00000004897
chr17    45229254    T    G        17_45229253_T/G        17:45229253        ENSG00000004897
chr17    45229253    T    G        17_45229253_T/G        17:45229253        ENSG00000004897

I have two questions that I would like to answer:

  1. First, is the reference base pair a member of a stop codon, i.e., TAG, TAA, or TGA.
  2. Second, does the alternative base pair introduce such a stop codon?

Possible solution:

Using the first entry from the above table, I thought perhaps that I could take the location of the base pair, 17:45229228, and add and subtract 2 from it. This would give me the base pairs on either side of the codon, i.e., 17:45229226-45229230. I would then download the sequence that corresponds to these five base pairs and look to see whether or not a stop codon appears. I suppose only the positive strand would be of interest in this case.

Help:

Can someone help guide me on how to do this? I know I can get the sequence from Ensembl's genome browser, but I am not sure how to automate this task (I have a lot of entries that I need to process). Once I download the sequence, what would be the best way to assess for a stop codon? Could I easily program R to do this? Does anyone know of a simpler way to go about this? Thanks to anyone who can help!

ensembl r biomart • 4.4k views
ADD COMMENT
7
Entering edit mode
11.9 years ago

In R, something like this might work:

library(GenomicFeatures)
library(VariantAnnotation)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

#Using your data
dat = GRanges(seqnames=rep('chr17',3),ranges=IRanges(start=c(45229228,45229234,45234706),width=1))
varallele = DNAStringSet(c('C','C','C'))
db = TxDb.Hsapiens.UCSC.hg19.knownGene

results = predictCoding(dat,db,Hsapiens,varallele)

The results look like this:

DataFrame with 12 rows and 13 columns
     queryID consequence         refSeq         varSeq         refAA
   <integer>    <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
1          1  synonymous         AATAGC         AACAGC            NS
2          1  synonymous         AATAGC         AACAGC            NS
3          1  synonymous         AATAGC         AACAGC            NS
4          1  synonymous         AATAGC         AACAGC            NS
5          2  synonymous         AGTGGA         AGCGGA            SG
6          2  synonymous         AGTGGA         AGCGGA            SG
7          2  synonymous         AGTGGA         AGCGGA            SG
8          2  synonymous         AGTGGA         AGCGGA            SG
9          3  synonymous            CAG            CAG             Q
10         3  synonymous            CAG            CAG             Q
11         3  synonymous            CAG            CAG             Q
12         3  synonymous            CAG            CAG             Q
           varAA  seqTxLoc  varTxLoc varCdsLoc subjStrand        txID     cdsID
   <AAStringSet> <integer> <integer> <integer>      <Rle> <character> <integer>
1             NS      1030      1032        74          -       61662    191676
2             NS      1012      1014        56          -       61663    191685
3             NS      1012      1014        56          -       61664    191685
4             NS       829       831        56          -       61665    191685
5             SG      1024      1026        68          -       61662    191676
6             SG      1006      1008        50          -       61663    191685
7             SG      1006      1008        50          -       61664    191685
8             SG       823       825        50          -       61665    191685
9              Q       520       520        44          -       61662    191679
10             Q       520       520        44          -       61663    191679
11             Q       520       520        44          -       61664    191679
12             Q       337       337        44          -       61665    191679
     geneID
   <factor>
1       996
2       996
3       996
4       996
5       996
6       996
7       996
8       996
9       996
10      996
11      996
12      996

And some sessionInfo:

> sessionInfo()
R Under development (unstable) (2012-01-19 r58141)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.6.2
 [2] BSgenome.Hsapiens.UCSC.hg19_1.3.17     
 [3] BiocInstaller_1.5.7                    
 [4] BSgenome_1.23.4                        
 [5] GenomicFeatures_1.7.30                 
 [6] AnnotationDbi_1.17.23                  
 [7] Biobase_2.15.4                         
 [8] VariantAnnotation_1.1.58               
 [9] Rsamtools_1.7.37                       
[10] Biostrings_2.23.6                      
[11] GenomicRanges_1.7.30                   
[12] IRanges_1.13.28                        
[13] BiocGenerics_0.1.12                    

loaded via a namespace (and not attached):
 [1] biomaRt_2.11.1     bitops_1.0-4.1     DBI_0.2-5          grid_2.15.0       
 [5] lattice_0.20-0     Matrix_1.0-3       RCurl_1.91-1       RSQLite_0.11.1    
 [9] rtracklayer_1.15.7 snpStats_1.5.4     splines_2.15.0     stats4_2.15.0     
[13] survival_2.36-12   tools_2.15.0       XML_3.9-4          zlibbioc_1.1.1
ADD COMMENT
0
Entering edit mode

(+1) Thanks so much for your answer! Just logged in and saw it today. Am currently on the road but this week will sit down and work through your suggestions.

ADD REPLY
0
Entering edit mode

@Sean, I downloaded and installed bioconductor, including the GenomicFeatures, VariantAnnotation, BSgenome.Hsapiens.UCSC.hg19, and TxDb.Hsapiens.UCSC.hg19.knownGene packages. However, when I type results = predictCoding(dat,db,Hsapiens,varallele) I get the following: Error in function (classes, fdef, mtable) : unable to find an inherited method for function "predictCoding", for signature "GRanges", "TranscriptDb", "BSgenome", "DNAStringSet". Any ideas?

ADD REPLY
1
Entering edit mode

Probably best to send your example to the Bioconductor mailing list. Be sure to include the output of sessionInfo().

ADD REPLY
4
Entering edit mode
11.9 years ago

Check out the Variant Effect Predictor on the Ensembl Tools Site.

There is a nice tutorial on it by Bert Overduin on the bioinformatics knowledge blog

ADD COMMENT
0
Entering edit mode

(+1) Thanks for your suggestion! Am looking forward to reading through the tutorial later this week.

ADD REPLY

Login before adding your answer.

Traffic: 3434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6