Question: Convert HTSeq output ENSG ID to HGNC Gene Symbol and preserve counts
gravatar for robvanner
2.6 years ago by
robvanner0 wrote:

How can I convert the output from HTSeq count from ENSG IDs with counts to HGNC gene symbols with counts?

If I use Biomart online or in R (see code below) I lose the ensembl gene IDs that don't have a corresponding symbol or that collapse to a single symbol. I am starting with 57778 ensembl IDs and am returned 35699 gene symbols. This is a problem since the gene symbols are returned in a different order and without their corresponding counts, complicating further analysis. I would like to use the gene symbols and counts together for downstream pathway analysis following edgeR or DESeq2. Any guidance is appreciated.

MLL<- read.delim("/Path.txt", header=FALSE)
colnames(MLL)<- c("ENSEMBL_GENE_ID", "Counts")
human = useMart("ENSEMBL_MART_ENSEMBL", datatset="hsapiens_gene_ensembl")
results<- getBM(attributes=c("hgnc_symbol"), values=MLL$ENSEMBL_GENE_ID, mart=human)

Below is a summary of the problem: gene symbols are fewer in number and I am not sure how to link the counts to the symbols


1 ENSG00000000003      4 
2 ENSG00000000005      0 
3 ENSG00000000419    586 
4 ENSG00000000457    384

... row 57778



... row 35699

rna-seq biomart htseq • 3.1k views
ADD COMMENTlink modified 2.5 years ago by EagleEye6.1k • written 2.6 years ago by robvanner0
gravatar for jimmy_zeng
2.6 years ago by
University of Macau
jimmy_zeng90 wrote:

I forgot about the details for the function getBM

But I guess you can add more columns for the results by add more tags in attributes=c("hgnc_symbol")

just like attributes=c("hgnc_symbol","ENSEMBL_ID")

please find the correct column name for the ENSEMBL_ID , In fact I hate to use biomart and other package just because they define many strange rules I just can't remember them.

ADD COMMENTlink written 2.6 years ago by jimmy_zeng90
gravatar for vchris_ngs
2.6 years ago by
Seattle,WA, USA
vchris_ngs4.6k wrote:

So there is a difference between transcript level and gene level of estimating the abundances and then use differential expression. Check the package tximport. See this paper

You can always summarize the counts for genes rather than transcripts depends upon the starting annotation file you select. Summarize all the transcript count from your aligned files and then use tximport to get the gene level counts using the proper annotation transcript to gene (tx2gene) file that maps transcript to gene and reduce the matrix to gene level with tximport or you can supply the annotation in htseq itself. Check this link here.

Also see how you can select the annotation file from here. Carefully chose the annotation file.

Alternatively if you are familiar with Salmon/Sailfish, this function is implemented in newer version and you already have a tx2gene file just supply them and run the quantification and create the matrix for your samples and you can directly start from fastq files and it is pretty fast. This will give you gene level quantification.

ADD COMMENTlink written 2.6 years ago by vchris_ngs4.6k
gravatar for robvanner
2.5 years ago by
robvanner0 wrote:

Thanks for the suggestions. I was able to convert the ENSG IDs to gene symbols using the package more or less as described in section 4.1.3 of the EdgeR User's Guide (RNA seq of oral carcinomas vs. matched normal tissue).

ADD COMMENTlink written 2.5 years ago by robvanner0
gravatar for EagleEye
2.5 years ago by
EagleEye6.1k wrote:

1) Convert your GTF file into table format using following script, A: extract only geneID and gene symbol from GTF file

Sample outputs from above script:

A: Converting gtf format to bed format

2) Merge your HTseq output matrix with the converted GTF table using gene_id column (use R 'merge' function or UNIX 'join' command).

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by EagleEye6.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1082 users visited in the last hour