Question: Gene names in Ballgown differential expression analysis
8
gravatar for Albert
2.4 years ago by
Albert140
Dublin, Ireland
Albert140 wrote:

Hi there,

I just performed an RNAseq experiment, for which I used HISAT2 for the alingment, Stringtie for the assembly and the R package Ballgown for the Differential Expression (DE) analysis (protocol published here: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html ).

The default output of the DE analysis is a table which looks pretty standard, with a transcript ID, a fold change value, a p-value and a q-value. The transcript ID corresponds to either an Ensmbl ID or a MSTRG ID, which is basically a number that is generated during the assembly of the transcriptome. However, the gene name/symbol corresponding to each feature does not appear in the table.

The protocol I followed provides a command to add the gene names to the results table:

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

And this actually works when the DE analysis is performed at the transcript level, but not when it is performed at the gene level. The reason is basically that the table resulting from a gene level analysis has many less rows than the original reference which links every transcript ID to a gene name, and R does not like that.

I was wondering if anyone had the same issue before, and if there's some kind of straightforward solution for it. It could virtually be done manually, but it would be a nightmare to add ~20k gene names that way! :S

Thanks in advance for your help!

Albert

rna-seq ballgown • 6.3k views
ADD COMMENTlink modified 23 months ago by mcalleja50 • written 2.4 years ago by Albert140

I'm having this problem too and fixed the annotation the same way as Albert but still trying to figure out how to do everything at the gene level.

ADD REPLYlink written 2.4 years ago by marina.v.yurieva480

Hi, does anyone figured out how to add the gene names at the gene level?

ADD REPLYlink written 2.2 years ago by zhifeiluo12080
5
gravatar for mcalleja
23 months ago by
mcalleja50
Barcelona
mcalleja50 wrote:

Hi!

I want to contribute the answer that AlyssaFrazee gave on Ballgown's github to retrieve GenesName (it's not intuitive, but rather a good workaround) Add geneNames to result of stattest on feature gene:

indices <- match(results_genes$id, texpr(bg, 'all')$gene_id)
gene_names_for_result <- texpr(bg, 'all')$gene_name[indices]
results_genes <- data.frame(geneNames=gene_names_for_result, results_genes)
ADD COMMENTlink written 23 months ago by mcalleja50
1
gravatar for Zhilong Jia
2.1 years ago by
Zhilong Jia1.4k
London
Zhilong Jia1.4k wrote:

There are gene names (symbols) and gene ids in results_transcripts as shown. So you can map them to results_genes directly. Note some ids have more than one gene names.

# get the mapping relationship between gene symbols and gene ids
genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)
# join them by gene ids.
results_genes1 <- dplyr::left_join(results_genes, genename_ids, by=c("id"="geneIDs"))
ADD COMMENTlink written 2.1 years ago by Zhilong Jia1.4k
0
gravatar for jerryzzy11
2.4 years ago by
jerryzzy110
jerryzzy110 wrote:

I am having the same problem here. if you get the answer elsewhere, please let me know. Thank you! BTW, do you know how to get the gene names with the MSTRG.xxxx.x?

ADD COMMENTlink written 2.4 years ago by jerryzzy110
1

Hi,

you can get a table with all the information in your ballgown object (bg) , doing:

full_table <- texpr(bg , 'all')

The table will include: transcript id, chromosome, strand, start positon, end position, number of exons, length, gene id, gene name, and the FPKM values for each sample.

Btw, I have noticed that sometimes the same gene id (MSTRG.xxxx) can correspond to more than one gene name, so don't fully rely on the gene_id as a unique identifier.

Hope this helps!

ADD REPLYlink written 2.4 years ago by Albert140
0
gravatar for shmaisrael
23 months ago by
shmaisrael0
shmaisrael0 wrote:

It doesn't really work. I did next:

bg_filt = subset(bg,"rowVars(texpr(bg)) >5",genomesubset=TRUE)

results_transcripts = stattest(bg_filt, feature="transcript", covariate="Treatment", adjustvars = c("Age"), getFC=TRUE, meas="FPKM")

When I run the proposed

genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)

I got next message: Error in eval(expr, envir, enclos) : comparison (2) is possible only for atomic and list types

ADD COMMENTlink written 23 months ago by shmaisrael0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1245 users visited in the last hour