Question: Gene names in Ballgown differential expression analysis
gravatar for Albert
2.7 years ago by
Dublin, Ireland
Albert140 wrote:

Hi there,

I just performed an RNAseq experiment, for which I used HISAT2 for the alingment, Stringtie for the assembly and the R package Ballgown for the Differential Expression (DE) analysis (protocol published here: ).

The default output of the DE analysis is a table which looks pretty standard, with a transcript ID, a fold change value, a p-value and a q-value. The transcript ID corresponds to either an Ensmbl ID or a MSTRG ID, which is basically a number that is generated during the assembly of the transcriptome. However, the gene name/symbol corresponding to each feature does not appear in the table.

The protocol I followed provides a command to add the gene names to the results table:

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

And this actually works when the DE analysis is performed at the transcript level, but not when it is performed at the gene level. The reason is basically that the table resulting from a gene level analysis has many less rows than the original reference which links every transcript ID to a gene name, and R does not like that.

I was wondering if anyone had the same issue before, and if there's some kind of straightforward solution for it. It could virtually be done manually, but it would be a nightmare to add ~20k gene names that way! :S

Thanks in advance for your help!


rna-seq ballgown • 7.2k views
ADD COMMENTlink modified 2.2 years ago by mcalleja60 • written 2.7 years ago by Albert140

I'm having this problem too and fixed the annotation the same way as Albert but still trying to figure out how to do everything at the gene level.

ADD REPLYlink written 2.7 years ago by marina.v.yurieva480

Hi, does anyone figured out how to add the gene names at the gene level?

ADD REPLYlink written 2.6 years ago by zhifeiluo12080
gravatar for mcalleja
2.2 years ago by
mcalleja60 wrote:


I want to contribute the answer that AlyssaFrazee gave on Ballgown's github to retrieve GenesName (it's not intuitive, but rather a good workaround) Add geneNames to result of stattest on feature gene:

indices <- match(results_genes$id, texpr(bg, 'all')$gene_id)
gene_names_for_result <- texpr(bg, 'all')$gene_name[indices]
results_genes <- data.frame(geneNames=gene_names_for_result, results_genes)
ADD COMMENTlink written 2.2 years ago by mcalleja60

play around these commands and you will find the answers.

ADD REPLYlink written 10 weeks ago by ranjanmanishblue0
gravatar for Zhilong Jia
2.4 years ago by
Zhilong Jia1.4k
Zhilong Jia1.4k wrote:

There are gene names (symbols) and gene ids in results_transcripts as shown. So you can map them to results_genes directly. Note some ids have more than one gene names.

# get the mapping relationship between gene symbols and gene ids
genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)
# join them by gene ids.
results_genes1 <- dplyr::left_join(results_genes, genename_ids, by=c("id"="geneIDs"))
ADD COMMENTlink written 2.4 years ago by Zhilong Jia1.4k
gravatar for jerryzzy11
2.7 years ago by
jerryzzy110 wrote:

I am having the same problem here. if you get the answer elsewhere, please let me know. Thank you! BTW, do you know how to get the gene names with the MSTRG.xxxx.x?

ADD COMMENTlink written 2.7 years ago by jerryzzy110


you can get a table with all the information in your ballgown object (bg) , doing:

full_table <- texpr(bg , 'all')

The table will include: transcript id, chromosome, strand, start positon, end position, number of exons, length, gene id, gene name, and the FPKM values for each sample.

Btw, I have noticed that sometimes the same gene id (MSTRG.xxxx) can correspond to more than one gene name, so don't fully rely on the gene_id as a unique identifier.

Hope this helps!

ADD REPLYlink written 2.7 years ago by Albert140
gravatar for shmaisrael
2.3 years ago by
shmaisrael0 wrote:

It doesn't really work. I did next:

bg_filt = subset(bg,"rowVars(texpr(bg)) >5",genomesubset=TRUE)

results_transcripts = stattest(bg_filt, feature="transcript", covariate="Treatment", adjustvars = c("Age"), getFC=TRUE, meas="FPKM")

When I run the proposed

genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)

I got next message: Error in eval(expr, envir, enclos) : comparison (2) is possible only for atomic and list types

ADD COMMENTlink written 2.3 years ago by shmaisrael0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1508 users visited in the last hour