Question

Convert HTSeq output ENSG ID to HGNC Gene Symbol and preserve counts

1

Entering edit mode

7.8 years ago

robvanner ▴ 20

How can I convert the output from HTSeq count from ENSG IDs with counts to HGNC gene symbols with counts?

If I use Biomart online or in R (see code below) I lose the ensembl gene IDs that don't have a corresponding symbol or that collapse to a single symbol. I am starting with 57778 ensembl IDs and am returned 35699 gene symbols. This is a problem since the gene symbols are returned in a different order and without their corresponding counts, complicating further analysis. I would like to use the gene symbols and counts together for downstream pathway analysis following edgeR or DESeq2. Any guidance is appreciated.

MLL<- read.delim("/Path.txt", header=FALSE)
colnames(MLL)<- c("ENSEMBL_GENE_ID", "Counts")
human = useMart("ENSEMBL_MART_ENSEMBL", datatset="hsapiens_gene_ensembl")
results<- getBM(attributes=c("hgnc_symbol"), values=MLL$ENSEMBL_GENE_ID, mart=human)

Below is a summary of the problem: gene symbols are fewer in number and I am not sure how to link the counts to the symbols

MLL:

 ENSEMBL_GENE_ID COUNTS 
1 ENSG00000000003      4 
2 ENSG00000000005      0 
3 ENSG00000000419    586 
4 ENSG00000000457    384

... row 57778

results:

   hgnc_symbol
1 GENEA
2 GENEB
3 GENEC

... row 35699

HTSeq BiomaRt RNA-Seq • 8.3k views

ADD COMMENT • link updated 7.8 years ago by EagleEye 7.5k • written 7.8 years ago by robvanner ▴ 20

score 1 · Answer 1 · 2016-06-22

I forgot about the details for the function getBM

But I guess you can add more columns for the results by add more tags in attributes=c("hgnc_symbol")

just like attributes=c("hgnc_symbol","ENSEMBL_ID")

please find the correct column name for the ENSEMBL_ID , In fact I hate to use biomart and other package just because they define many strange rules I just can't remember them.

score 1 · Answer 2 · 2016-06-22

So there is a difference between transcript level and gene level of estimating the abundances and then use differential expression. Check the package tximport. See this paper

You can always summarize the counts for genes rather than transcripts depends upon the starting annotation file you select. Summarize all the transcript count from your aligned files and then use tximport to get the gene level counts using the proper annotation transcript to gene (tx2gene) file that maps transcript to gene and reduce the matrix to gene level with tximport or you can supply the annotation in htseq itself. Check this link here.

Also see how you can select the annotation file from here. Carefully chose the annotation file.

Alternatively if you are familiar with Salmon/Sailfish, this function is implemented in newer version and you already have a tx2gene file just supply them and run the quantification and create the matrix for your samples and you can directly start from fastq files and it is pretty fast. This will give you gene level quantification.

score 1 · Answer 3 · 2016-07-14

1

Entering edit mode

7.8 years ago

robvanner ▴ 20

Thanks for the suggestions. I was able to convert the ENSG IDs to gene symbols using the org.Hs.eg.db package more or less as described in section 4.1.3 of the EdgeR User's Guide (RNA seq of oral carcinomas vs. matched normal tissue).

ADD COMMENT • link 7.8 years ago by robvanner ▴ 20

score 0 · Answer 4 · 2016-07-14

1) Convert your GTF file into table format using following script, A: extract only geneID and gene symbol from GTF file

Sample outputs from above script:

A: Converting gtf format to bed format

2) Merge your HTseq output matrix with the converted GTF table using gene_id column (use R 'merge' function or UNIX 'join' command).