Question

Adding gene names to ballgown object gexpr(bg)

6.5 years ago

maria.traka ▴ 20

Hi, I'm having an issue adding gene names to the gene expression table of a ballgown (bg) object. The texpr table has multiple entries in gene_id for the same MSTRG ID, which I guess is expected since they are different isoforms of the same gene. Unfortunately, for some genes the first of those texpr entries has "." as its gene name instead of the actual gene name, so my gene names end up containing lots of ".". How can I work around this? I am losing a lot of genes from all my downstream functional analysis. Can you help? Thanks, Maria

# Gene-level expression; row names are the gene IDs (e.g. MSTRG.x)
gene_expression_ESC <- gexpr(bg_ESC_89)
# match() returns only the FIRST transcript row for each gene ID
indicesG <- match(rownames(gene_expression_ESC), texpr(bg_ESC_89, 'all')$gene_id)
gene_names_F <- texpr(bg_ESC_89, 'all')$gene_name[indicesG]
gene_names_T <- texpr(bg_ESC_89, 'all')$t_name[indicesG]
gene_expression_ESC_N <- data.frame(geneNames = gene_names_F, ensIDs = gene_names_T, gene_expression_ESC)

RNA-Seq ballgown • 2.2k views

Are there any genes/transcripts in the reference GTF starting with "."? Validate the reference GTF first. If there are no issues with the GTF, you can filter out the entries starting with "." from the texpr table.
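Before filtering, it may help to check how many genes are affected only by row order and how many have no named transcript at all. The following is a minimal sketch on a made-up data frame standing in for the texpr table; the gene and transcript names are invented for illustration, and only the column names (gene_id, gene_name, t_name) come from the thread.

```r
# Toy stand-in for texpr(bg, 'all'); all values here are made up.
tx <- data.frame(
  gene_id   = c("MSTRG.1", "MSTRG.1", "MSTRG.2", "MSTRG.3"),
  gene_name = c(".", "GENEA", "GENEB", "."),
  t_name    = c("MSTRG.1.1", "ENST_A", "ENST_B", "MSTRG.3.1"),
  stringsAsFactors = FALSE
)

ids <- unique(tx$gene_id)

# match() returns the first row per gene ID, so MSTRG.1 resolves to ".".
first_hit <- tx$gene_name[match(ids, tx$gene_id)]

# Does each gene have at least one transcript with a real name?
has_name <- sapply(ids, function(g) any(tx$gene_name[tx$gene_id == g] != "."))

# "." only because an unnamed transcript is listed first (fixable by filtering)
recoverable <- ids[has_name & first_hit == "."]
# No named transcript at all (filtering leaves these without a name)
unnamed <- ids[!has_name]

print(recoverable)
print(unnamed)
```

If `recoverable` is long and `unnamed` is short, filtering out the "." rows before matching should restore most of the missing names.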

ADD REPLY • link 6.5 years ago by cpad0112 21k

I'm using the Ensembl Homo_sapiens.GRCh38.89.gtf downloaded from their FTP site, so it's not that. I suspect these are putative novel isoforms of known genes, and because they happen to be listed before the known transcripts, match() hits those first. I have now managed a workaround where, as you suggest, I remove the "." entries from the texpr table, but it seems very convoluted to me. Anyhow, here it is:

whole_tx_table_ESC <- texpr(bg_ESC_89, 'all')
A <- whole_tx_table_ESC[, c("gene_id", "gene_name", "t_name")]
Bi <- which(A[, 2] != ".")  # find the indices of rows whose gene name is not "."
B <- A[Bi, ]                # create a new data.frame with named transcripts only
indicesG <- match(rownames(gene_expression_ESC), B$gene_id)
GE <- data.frame(geneNames = B$gene_name[indicesG], ensIDs = B$gene_id[indicesG], ensTID = B$t_name[indicesG], gene_expression_ESC)
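The same idea can be written a little more compactly as a named lookup vector. This is only a sketch on made-up data: the `tx` data frame and `gene_expression` matrix below are hypothetical stand-ins for the texpr table and the gexpr output, and falling back to the gene ID for genes with no named transcript is an assumption about what is wanted, not something ballgown does for you.

```r
# Toy stand-ins for texpr(bg, 'all') and gexpr(bg); all values are made up.
tx <- data.frame(
  gene_id   = c("MSTRG.1", "MSTRG.1", "MSTRG.2", "MSTRG.3"),
  gene_name = c(".", "GENEA", "GENEB", "."),
  stringsAsFactors = FALSE
)
gene_expression <- matrix(
  c(1.5, 0.0, 3.2, 2.5, 0.4, 3.0), nrow = 3,
  dimnames = list(c("MSTRG.1", "MSTRG.2", "MSTRG.3"), c("FPKM.s1", "FPKM.s2"))
)

# Keep named transcripts only, then one row per gene ID
named <- tx[tx$gene_name != ".", ]
named <- named[!duplicated(named$gene_id), ]

# Named vector: gene ID -> gene name
lookup <- setNames(named$gene_name, named$gene_id)

# Look up every expression row; IDs without a name come back as NA
gene_names <- unname(lookup[rownames(gene_expression)])

# Assumed fallback: use the gene ID itself where no name exists
no_name <- is.na(gene_names)
gene_names[no_name] <- rownames(gene_expression)[no_name]

annotated <- data.frame(geneNames = gene_names, gene_expression)
print(annotated)
```

With a real object, `tx` would be the texpr table and `gene_expression` the gexpr matrix from the code above; genes that remain labelled by their MSTRG ID are the ones with no known gene name attached to any transcript.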

Has anyone else had the same problem? I have to say I bumped into this while looking for something completely different, and I can't think why it would be unique to my data.

ADD REPLY • link 6.5 years ago by maria.traka ▴ 20