Gene names in Ballgown differential expression analysis
4
11
Entering edit mode
7.5 years ago
Albert ▴ 180

Hi there,

I just performed an RNAseq experiment, for which I used HISAT2 for the alingment, Stringtie for the assembly and the R package Ballgown for the Differential Expression (DE) analysis (protocol published here: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html ).

The default output of the DE analysis is a table which looks pretty standard, with a transcript ID, a fold change value, a p-value and a q-value. The transcript ID corresponds to either an Ensmbl ID or a MSTRG ID, which is basically a number that is generated during the assembly of the transcriptome. However, the gene name/symbol corresponding to each feature does not appear in the table.

The protocol I followed provides a command to add the gene names to the results table:

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

And this actually works when the DE analysis is performed at the transcript level, but not when it is performed at the gene level. The reason is basically that the table resulting from a gene level analysis has many less rows than the original reference which links every transcript ID to a gene name, and R does not like that.

I was wondering if anyone had the same issue before, and if there's some kind of straightforward solution for it. It could virtually be done manually, but it would be a nightmare to add ~20k gene names that way! :S

Thanks in advance for your help!

Albert

RNA-Seq ballgown • 14k views
ADD COMMENT
0
Entering edit mode

I'm having this problem too and fixed the annotation the same way as Albert but still trying to figure out how to do everything at the gene level.

ADD REPLY
0
Entering edit mode

Hi, does anyone figured out how to add the gene names at the gene level?

ADD REPLY
9
Entering edit mode
7.0 years ago
mcalleja ▴ 110

Hi!

I want to contribute the answer that AlyssaFrazee gave on Ballgown's github to retrieve GenesName (it's not intuitive, but rather a good workaround) Add geneNames to result of stattest on feature gene:

indices <- match(results_genes$id, texpr(bg, 'all')$gene_id)
gene_names_for_result <- texpr(bg, 'all')$gene_name[indices]
results_genes <- data.frame(geneNames=gene_names_for_result, results_genes)
ADD COMMENT
0
Entering edit mode

play around these commands and you will find the answers.

ADD REPLY
2
Entering edit mode
7.2 years ago
Zhilong Jia ★ 2.2k

There are gene names (symbols) and gene ids in results_transcripts as shown. So you can map them to results_genes directly. Note some ids have more than one gene names.

# get the mapping relationship between gene symbols and gene ids
genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)
# join them by gene ids.
results_genes1 <- dplyr::left_join(results_genes, genename_ids, by=c("id"="geneIDs"))
ADD COMMENT
0
Entering edit mode
7.5 years ago
jerryzzy11 • 0

I am having the same problem here. if you get the answer elsewhere, please let me know. Thank you! BTW, do you know how to get the gene names with the MSTRG.xxxx.x?

ADD COMMENT
2
Entering edit mode

Hi,

you can get a table with all the information in your ballgown object (bg) , doing:

full_table <- texpr(bg , 'all')

The table will include: transcript id, chromosome, strand, start positon, end position, number of exons, length, gene id, gene name, and the FPKM values for each sample.

Btw, I have noticed that sometimes the same gene id (MSTRG.xxxx) can correspond to more than one gene name, so don't fully rely on the gene_id as a unique identifier.

Hope this helps!

ADD REPLY
0
Entering edit mode
7.0 years ago
shmaisrael • 0

It doesn't really work. I did next:

bg_filt = subset(bg,"rowVars(texpr(bg)) >5",genomesubset=TRUE)

results_transcripts = stattest(bg_filt, feature="transcript", covariate="Treatment", adjustvars = c("Age"), getFC=TRUE, meas="FPKM")

When I run the proposed

genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)

I got next message: Error in eval(expr, envir, enclos) : comparison (2) is possible only for atomic and list types

ADD COMMENT

Login before adding your answer.

Traffic: 1816 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6