Question

Trying to add ensembl_transcript_id using Biomart

0

Entering edit mode

5.9 years ago

sandKings ▴ 40

Hi all,

I'm trying to run a differential gene expression analysis where I want the report to contain ensemble_transcript_id as well.

direc<- 'E:/counts_sep/MSC/' 
countFile<-paste(direc,'htseq.all.counts.txt',sep='')
print("Loading count data")
counts <- read.delim(file=countFile,header=TRUE,sep="\t")
dim(counts)

mart = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

#get the ensemble id you want to match
gene_anno_filtered = rownames(counts)
#now get the mart
dat = getBM(
  values = gene_anno_filtered,
  filters = c("ensembl_gene_id"),
  attributes = c("ensembl_gene_id", "external_gene_name", "description", "ensembl_transcript_id", "ensembl_peptide_id"),
  mart = mart
)
#now match the dat and your data
idx_gene = match(rownames(counts), dat$ensembl_gene_id)
counts$external_gene_name=dat$external_gene_name[idx_gene]
counts$Description=dat$description[idx_gene]
counts$ensembl_transcript_id =dat$ensembl_transcript_id[idx_gene]
counts$ensembl_peptide_id =dat$ensembl_peptide_id[idx_gene]

#getting data out 
geneData <- data.frame(counts[,1:5])
print ("Gene data from counts table")
colnames(geneData)[4] <- "width"
print(head(geneData))

#OUTPUT EXAMPLE
#                   Chr     Start       End width Strand
#ENSMUSG00000000001 chr3 108107280 108146146  3262      -
#ENSMUSG00000000003 chrX  77837901  77853623  1599      -
#ENSMUSG00000000028 chr16  18780447  18811987  4722      -
#ENSMUSG00000000031 chr7 142575529 142578143  6343      -
#ENSMUSG00000000037 chrX 161117193 161258213 23080      +
#ENSMUSG00000000049 chr11 108343354 108414396  2662      +

countData<-data.frame(counts[,6:length(colnames(counts))])  
print ("Counts data from counts table")

# Require at least 25% of samples to have count > 25
quant <- apply(countData,1,quantile,0.75)
keep <- which((quant >= 4) == 1)
countData <- countData[keep,]
dim(countData)
method <-'htseq'

the line:

 quant <- apply(countData,1,quantile,0.75)

gives the following error:

Error in quantile.default(newX[, i], ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

When I execute this line :

countData<-data.frame(counts[,6:length(colnames(counts))])

and check >countData, it shows several attributes are NA.

MSC329_RU486 MSC334_CTR MSC336_TPA external_gene_name Description ensembl_transcript_id ensembl_peptide_id
ENSMUSG00000000001.4          3889       5379       5080                 NA          NA                    NA                 NA
ENSMUSG00000000003.15            0          0          0                 NA          NA                    NA                 NA
ENSMUSG00000000028.14           79         67         50                 NA          NA                    NA                 NA
ENSMUSG00000000031.15           17          0         14                 NA          NA                    NA                 NA
ENSMUSG00000000037.16           18         10          8                 NA          NA                    NA                 NA

Can someone please help me fix this?

rnaseq biomart • 3.1k views

ADD COMMENT • link updated 5.9 years ago by 1769mkc ★ 1.2k • written 5.9 years ago by sandKings ▴ 40

0

Entering edit mode

By the looks of your code, I suspect you are trying to adapt an edgeR tutorial. You should store your data on a DGEList object, which is a list-based class with different slots for each king of data. This way, the counts are stored separately from the metadata, and it is easier to manipulate both:

library( edgeR )
y <- DGEList( counts = counts )
mart = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
gene_anno_filtered = rownames(counts)
dat = getBM( values = gene_anno_filtered, filters = c("ensembl_gene_id"), attributes = c("ensembl_gene_id", "external_gene_name", "description", "ensembl_transcript_id", "ensembl_peptide_id"), mart = mart )
genes <- match( rownames( counts ), dat$ensembl_gene_id)
y$genes <- genes

ADD REPLY • link 5.9 years ago by h.mon 35k

0

Entering edit mode

Hi h.mon, thanks so much for the feedback.

the line :

y <- DGEList( counts = counts )

returns:

Error in .isAllZero(counts) : counts must be positive finite values

is the error because I have nonnumerical rows and columns values? I'll do some more research and try to troubleshoot this. Thanks again!

ADD REPLY • link 5.9 years ago by sandKings ▴ 40

0

Entering edit mode

WARNING

Please be aware that you are arbitrarily adding a transcript_id to each of your genes. Each gene can have multiple transcripts associated with it and in your case you are simply appending the transcript_id associated with the first occurrence of the gene_id in your biomaRt dataframe using match. Only for genes with a single transcript can you be sure that your association is correct.

mart = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

## get the ensemble id you want to match
## commented out your old code
## gene_anno_filtered = rownames(counts)
gene_anno_filtered <- c("ENSMUSG00000000001","ENSMUSG00000000003","ENSMUSG00000000028","ENSMUSG00000000031","ENSMUSG00000000037","ENSMUSG00000000049")

## now get the mart
dat = getBM(
  values = gene_anno_filtered,
  filters = c("ensembl_gene_id"),
  attributes = c("ensembl_gene_id", "external_gene_name", "description", "ensembl_transcript_id", "ensembl_peptide_id"),
  mart = mart
)

head(dat)[,c(1,2,4)]
     ensembl_gene_id external_gene_name ensembl_transcript_id
1 ENSMUSG00000000001              Gnai3    ENSMUST00000000001
2 ENSMUSG00000000003               Pbsn    ENSMUST00000000003
3 ENSMUSG00000000003               Pbsn    ENSMUST00000114041
4 ENSMUSG00000000028              Cdc45    ENSMUST00000000028
5 ENSMUSG00000000028              Cdc45    ENSMUST00000096990
6 ENSMUSG00000000028              Cdc45    ENSMUST00000115585

match(gene_anno_filtered[1:3],dat$ensembl_gene_id)
[1] 1 2 4

ADD REPLY • link 3.5 years ago by benformatics 4.0k

0

Entering edit mode

Hi benformatic,

thanks for pointing this out. This basically voids everything that I'm trying to do and in light of this information, I should look for a way to run differential expression of 'transcripts' instead of genes. I'm not sure how to do that though. I was hoping that my output would contain multiple entries of the same gene matched with all the transcript ids.

ADD REPLY • link 5.9 years ago by sandKings ▴ 40

0

Entering edit mode

I understand this is an old post, but want to point out for any current readers that Salmon is a transcript aware aligner that accomplishes this: https://salmon.readthedocs.io/en/latest/salmon.html

ADD REPLY • link 3.5 years ago by gatollefson • 0

score 2 · Answer 1 · 2018-09-18

I think the reason you're getting some NA values is because the gene IDs you are using (taken from rownames(counts)) include a version number at the end e.g. "ENSMUSG00000000001.4". Your biomaRt query then doesn't match these as your filter specifies the non-versioned IDs to search.

Here's a little example to demonstrate:

library(biomaRt)
mart = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

## use a few example IDs taken from the question
gene_anno_filtered <- c("ENSMUSG00000000001.4", "ENSMUSG00000000003.15", "ENSMUSG00000000028.14")

dat = getBM(
    values = gene_anno_filtered,
    filters = c("ensembl_gene_id"),
    attributes = c("ensembl_gene_id", "external_gene_name", "ensembl_transcript_id"),
    mart = mart
)

dat_with_version = getBM(
    values = gene_anno_filtered,
    filters = c("ensembl_gene_id_version"),
    attributes = c("ensembl_gene_id", "external_gene_name", "ensembl_transcript_id"),
    mart = mart
)

The two queries above only vary on the filter argument, but the first returns 0 results, and the second 3.

> dim(dat)
[1] 0 3
> dim(dat_with_version)
[1] 3 3

You can either use the second query, or remove the version numbers from your gene IDs. Personally I think it's better to stick with your original query and remove the version numbers, as unless you're using the most recent version number it will probably still get dropped during your query. You can do that in many ways, but I would use the stringr package and something like:

library(stringr)
gene_anno_filtered <- str_replace(rownames(counts),
                                  pattern = ".[0-9]+$",
                                  replacement = "")

score 0 · Answer 2 · 2018-09-18

0

Entering edit mode

5.9 years ago

1769mkc ★ 1.2k

There is an answer ,check this post How to add gene symbol to RNA-Seq data using R

ADD COMMENT • link 5.9 years ago by 1769mkc ★ 1.2k