How can I convert htseq-count output from Ensembl gene IDs (ENSG) with counts to HGNC gene symbols with counts?
If I use BioMart online or in R (see code below), I lose the Ensembl gene IDs that don't have a corresponding symbol or that collapse to a single symbol. I start with 57778 Ensembl IDs and get back 35699 gene symbols. The symbols are also returned in a different order and without their corresponding counts, so I can't match them back to my data. I would like to use the gene symbols and counts together for downstream pathway analysis following edgeR or DESeq2. Any guidance is appreciated.
    library(biomaRt)
    # Read the htseq-count output (two columns, no header)
    MLL <- read.delim("/Path.txt", header=FALSE)
    colnames(MLL) <- c("ENSEMBL_GENE_ID", "Counts")
    # Connect to the human Ensembl gene dataset
    human <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
    results <- getBM(attributes=c("hgnc_symbol"), values=MLL$ENSEMBL_GENE_ID, mart=human)
Below is a summary of the problem: fewer gene symbols come back than IDs went in, and I am not sure how to link the counts to the symbols.
      ENSEMBL_GENE_ID  COUNTS
    1 ENSG00000000003       4
    2 ENSG00000000005       0
    3 ENSG00000000419     586
    4 ENSG00000000457     384
    ... row 57778
      hgnc_symbol
    1 GENEA
    2 GENEB
    3 GENEC
    ... row 35699
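One possible way to keep the counts attached is to ask BioMart for the Ensembl ID together with the symbol and then look each ID up in that mapping, so every original row survives and the order does not change. The sketch below is only an illustration under stated assumptions: the `mapping` table is a hand-made placeholder standing in for a `getBM()` result (the symbols are the dummy names from the summary above, not real annotations), and `label` is a hypothetical column name. When an ID maps to several symbols, `match()` keeps only the first one.

```r
# Toy stand-in for the htseq-count table (IDs and counts from the summary above).
counts <- data.frame(
  ENSEMBL_GENE_ID = c("ENSG00000000003", "ENSG00000000005",
                      "ENSG00000000419", "ENSG00000000457"),
  Counts = c(4, 0, 586, 384),
  stringsAsFactors = FALSE
)

# Placeholder mapping, shaped like a getBM() result that includes the ID column.
# With biomaRt and network access it could be requested roughly as:
#   getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#         filters = "ensembl_gene_id",
#         values = counts$ENSEMBL_GENE_ID, mart = human)
# One ID is left out on purpose to mimic a gene without an HGNC symbol.
mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000419", "ENSG00000000457"),
  hgnc_symbol = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# Look up each ID in the mapping; unmatched IDs get NA, row order is unchanged.
idx <- match(counts$ENSEMBL_GENE_ID, mapping$ensembl_gene_id)
counts$hgnc_symbol <- mapping$hgnc_symbol[idx]

# Fall back to the Ensembl ID when the symbol is missing or empty.
no_symbol <- is.na(counts$hgnc_symbol) | counts$hgnc_symbol == ""
counts$label <- ifelse(no_symbol, counts$ENSEMBL_GENE_ID, counts$hgnc_symbol)

print(counts)
```

Keeping the Ensembl ID as the row key through edgeR or DESeq2 and adding the symbol only as an annotation column may avoid having to decide how to merge counts for IDs that share a symbol.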