Question: How to add biological annotations using R
1
gravatar for neranjan
5 weeks ago by
neranjan20
US
neranjan20 wrote:

Hi All,

Is there a R package that I can use to get the biological annotations?

I am using NOISeq package to analyze RNASeq data, and I am interested in adding additional biological annotations such as:

  • feature length
  • GC content
  • biological classification features
  • chromosome position

Question: Is it possible to direct me to how to get these information ? Can you let me know which packages that I can use to get the information.

Thank you very much.

ADD COMMENTlink modified 4 weeks ago • written 5 weeks ago by neranjan20

Thank you I was able to use biomaRt package to grab the required sequences using:

g_ids=c("ENSG00000177757", "ENSG00000187634", "ENSG00000188976")
human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
gene_coords=getBM(attributes=c("ensembl_gene_id", "start_position","end_position","percentage_gene_gc_content"), filters="ensembl_gene_id", values=g_ids, mart=human)
gene_coords$length=gene_coords$end_position - gene_coords$start_position

Which gave me:

  ensembl_gene_id start_position end_position percentage_gene_gc_content length
1 ENSG00000177757         817371       819837                      48.52   2466
2 ENSG00000187634         923928       944581                      66.58  20653
3 ENSG00000188976         944204       959309                      59.55  15105

But the gene start and end positions as well as length of gene is well away from what it should be as shown below: (where i am following the NOISeq tutorial)

                Length    GC        Biotype Chromosome GeneStart GeneEnd
ENSG00000177757 2464.0 48.58        lincRNA          1    742614  745077
ENSG00000187634 4985.0 65.99 protein_coding          1    850393  869824
ENSG00000188976 3870.5 59.55 protein_coding          1    869459  884494
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by neranjan20

You are using GRCh38, NOISeq tutorial GRCh37?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Benn6.6k

it is possible, but in NOISeq it does not say which it is been using.

But again, the difference is huge, so it is obvious that I am missing something in calculating the gene-length.

I also tried to calculate the mean, median and sum of transcript length, and which was way off than the number they provided in NOISeq

  transcript_length.median transcript_length.mean transcript_length.sum
1                 1947.000               1947.000              1947.000
2                 2159.000               1699.353             28889.000
3                 1244.500               1749.000             10494.000
ADD REPLYlink written 4 weeks ago by neranjan20
1

I think your tutorial is very old. Add "version=75" to your useMart command, and you'll see gene coordinates close to the tutorials.

ADD REPLYlink written 4 weeks ago by swbarnes25.3k
1

As a side note, you calculate the total length of the gene (exons and introns). I guess NOISeq is using only sum of exon length.

ADD REPLYlink written 4 weeks ago by Benn6.6k

Thanks @swbarnes2 and @b.nota

I was able to replicate the NOISeq data length using a older version of mart, and also using transcript start and end locations to calculate the length.

So my final code would be:

en_2=useMart("ensembl", host = "http://feb2014.archive.ensembl.org")
human_2 = useDataset("hsapiens_gene_ensembl",mart=en_2)

g_ids_t=c("ENSG00000177757", "ENSG00000187634", "ENSG00000188976")
gene_coords=getBM(attributes=c("ensembl_gene_id", "start_position","end_position", "transcript_start", "transcript_end"), filters="ensembl_gene_id", values=g_ids_t, mart=human_2)
gene_coords$t_length = abs(gene_coords$transcript_start - gene_coords$transcript_end)

t_median_2 <- aggregate( t_length ~ ensembl_gene_id, gene_coords, plyr::each(median))

which gave me:

  ensembl_gene_id t_length
1 ENSG00000177757   2463.0
2 ENSG00000187634   4984.0
3 ENSG00000188976   3869.5

So I can take this as an answer.

As a side note: Which is off-set by 1 in each case, to the NOISeq data set: ( why ?)

                Length    GC        
ENSG00000177757 2464.0 
ENSG00000187634 4985.0 
ENSG00000188976 3870.5
ADD REPLYlink written 4 weeks ago by neranjan20
2

The off-set of one is probably because of the one-based coordinate system used by ensembl. Add + 1 to your equation.

ADD REPLYlink written 4 weeks ago by Benn6.6k

What are the three most common sources of programming errors? Poor naming, and 1-off errors.

ADD REPLYlink written 4 weeks ago by swbarnes25.3k
2
gravatar for Benn
5 weeks ago by
Benn6.6k
Netherlands
Benn6.6k wrote:

I think biomaRt can get you far, please read the manual.

ADD COMMENTlink written 5 weeks ago by Benn6.6k
1
gravatar for Ashastry
5 weeks ago by
Ashastry60
Ashastry60 wrote:

BiomaRt is my first option. Although, I occasionally use AnnotationDBI package.

ADD COMMENTlink written 5 weeks ago by Ashastry60
0
gravatar for neranjan
4 weeks ago by
neranjan20
US
neranjan20 wrote:
en_2=useMart("ensembl", host = "http://feb2014.archive.ensembl.org")
human_2 = useDataset("hsapiens_gene_ensembl",mart=en_2)

So as a last question to close this , is there a way to save this "mart" object for later use. (or as a reference for later analysis) ?

ADD COMMENTlink written 4 weeks ago by neranjan20

mart<-useMart(biomart ="ensemble", dataset = "")

ADD REPLYlink written 4 weeks ago by Ashastry60

not the assignment operator !!!

What I meant is to save it as object for later use, so I can load it when I do need it. (since these versions change what they have with time)

So is there a way to save this object in my hard drive? (some sort of similar way as they do in TxDb objects)

samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(samplefile)
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by neranjan20
1

Save the R session:

save.image("Analysis.Rdata")

load it again another time:

load("Analysis.Rdata")
ADD REPLYlink written 4 weeks ago by Benn6.6k

Thanks, I think i was looking at more of only saving the ensemble database, and I think it can be done using:

rat_ensembl = useDataset("rnorvegicus_gene_ensembl",mart=ensembl)
save(rat_ensembl, file = "rat_ensembl.RData")

Then loading using:

load("rat_ensembl.RData")
ADD REPLYlink written 28 days ago by neranjan20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1656 users visited in the last hour