Help needed for Ensembl Gene ID conversion for RNA-seq data - biomart / EnsDb.Hsapiens.v86
3.6 years ago
yangjianhunt ▴ 10

Hello All,

I am new to the RNA-seq world and especially new to the bioinformatics side. We recently completed an RNA-seq experiment (total RNA) on human samples and used Illumina's DRAGEN RNA pipeline, which generated salmon gene count (.sf) output files. In these files, the gene IDs are Ensembl gene IDs with version numbers, as follows:

Name               Length  EffectiveLength  TPM     NumReads
ENSG00000223972.4  1483    1290.1           0.065   1.63
ENSG00000227232.4  1612    1415.02          10.139  281.06
ENSG00000243485.2  462     314.72           0       0
ENSG00000237613.2  889     720.64           0       0
ENSG00000268020.2  483     347.96           0       0
ENSG00000240361.1  940     774.95           0       0
ENSG00000186092.4  918     752.97           0       0
ENSG00000238009.2  2079    1905.46          1.007   37.61
ENSG00000239945.1  1319    1147.49          0.224   5.03

.... (there are a total of more than 50,000 ENSG numbers. )

I'd like to convert these versioned ENSG IDs to Ensembl stable gene IDs and gene symbols, and also get the gene description, gene type, and gene length, if possible.

So far, I've tried the Ensembl BioMart web interface. I was able to paste in all >50,000 ENSG IDs (as shown above); however, the output only has about 26,000 gene IDs. In addition, the order of the 26,000 genes is different from my input. Do you know why this happens? I was expecting a CSV table showing the input and output side by side in the same rows, but I don't see the input in the output file.

The output file is as follows:

Gene stable ID   Gene name  Transcript length (including UTRs and CDS)  Gene description                                                           Gene type
ENSG00000019995  ZRANB1     5695                                        zinc finger RANBP2-type containing 1 [Source:HGNC Symbol;Acc:HGNC:18224]  protein_coding
ENSG00000019995  ZRANB1     587                                         zinc finger RANBP2-type containing 1 [Source:HGNC Symbol;Acc:HGNC:18224]  protein_coding
ENSG00000039139  DNAH5      15633                                       dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000039139  DNAH5      760                                         dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000039139  DNAH5      676                                         dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000039139  DNAH5      2081                                        dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000053328  METTL24    1119                                        methyltransferase like 24 [Source:HGNC Symbol;Acc:HGNC:21566]             protein_coding
ENSG00000053328  METTL24    777                                         methyltransferase like 24 [Source:HGNC Symbol;Acc:HGNC:21566]             protein_coding

.... (there are a total of about 26000 rows)

So I then tried to install "biomaRt" in R via Bioconductor (BiocManager); however, I couldn't complete the installation due to some errors (this was on an Ubuntu computer). I then switched to a Windows computer and was able to install biomaRt in R, but it reports that the connection to the server is not good (I saw on the Ensembl news page that they are migrating servers?).

So then I installed "EnsDb.Hsapiens.v86" in R on the Windows computer. But due to my lack of knowledge right now, I couldn't figure out the code for converting the gene IDs (I have some knowledge of shell scripts and Python, and can understand code with annotations).

Could you point me to some resources, such as example code, for this kind of gene ID conversion using either biomaRt or EnsDb.Hsapiens.v86? (I did read the reference manual for EnsDb.Hsapiens.v86 but couldn't quickly figure out how to use a .fa file to input the query gene IDs.)

Thanks so much! & Sorry for the long post. Jian

RNA-Seq R bioconductor biomart • 4.0k views

Hi! I am having a similar issue; maybe someone could help me. My salmon gene quant output file also uses Ensembl gene IDs with version numbers, but when I try to match them against the mmusculus_gene_ensembl list using biomart, I get no matching genes. I believe the issue is the version numbers in my salmon output files, but I can't find a way to resolve it.

My salmon output file, with versioned gene IDs:

ENSMUSG00000064368.1          0.26113586           0.1586499
ENSMUSG00000064363.1          0.10649954           0.2782853
ENSMUSG00000064360.1         -0.04155152           0.7452338
ENSMUSG00000064358.1          0.01379491           0.2752508
ENSMUSG00000064357.1          0.03610713           0.3275805
ENSMUSG00000064356.3

vs the mmusculus_gene_ensembl

ENSMUSG00000064336      mt-Tf              MT      1              1
ENSMUSG00000064337    mt-Rnr1              MT      1             70
ENSMUSG00000064338      mt-Tv              MT      1           1025
ENSMUSG00000064339    mt-Rnr2              MT      1           1094
ENSMUSG00000064340     mt-Tl1              MT      1           2676
ENSMUSG00000064341     mt-Nd1  

I would really appreciate some help. thank you!


Remove the version numbers using directions here and it should work: Mapping Ensembl Gene IDs with dot suffix
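
A minimal sketch of that step in base R, assuming the versioned IDs have already been read into a character vector (the names `ids` and `ids_noversion` are hypothetical, and the example values are taken from the output shown above). The pattern removes a trailing dot followed by digits:

```r
# Hypothetical input: versioned Ensembl gene IDs as they appear in the salmon output
ids <- c("ENSMUSG00000064368.1", "ENSMUSG00000064363.1", "ENSMUSG00000064356.3")

# Drop the trailing ".<digits>" version suffix; IDs without a suffix are left unchanged
ids_noversion <- sub("\\.[0-9]+$", "", ids)

print(ids_noversion)
```

The unversioned IDs can then be matched against the gene stable ID column of the biomart table.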


thank you so much!

3.6 years ago
caggtaagtat ★ 1.9k

Hi,

Once the server maintenance is done, I would first download the information about all genes with settings like those in the figure below (BioMart for genome version GRCh37 as an example) and then gather the information by combining the two tables.

[Figure: biomart 37]

So in R it would then be something like:

# Row of biomart_table that corresponds to each row of old_table (NA if there is no match)
idx <- match(old_table$Gene_stable_ID_version, biomart_table$Gene_stable_ID_version)
old_table$gene_name <- biomart_table$gene_name[idx]
old_table$gene_type <- biomart_table$gene_type[idx]

Maybe the 50,000-gene input was too much? Either way, with this approach you will see which genes could not find their "partner" in the downloaded table.
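
As a rough illustration of how the unmatched genes show up, here is a small self-contained sketch in base R. The two data frames are toy stand-ins: the IDs come from the question, while the `gene_name` and `gene_type` values are placeholders rather than real annotations, and the column names simply follow the snippet above.

```r
# Toy stand-in for the salmon table (IDs and read counts from the question)
old_table <- data.frame(
  Gene_stable_ID_version = c("ENSG00000223972.4", "ENSG00000227232.4", "ENSG00000243485.2"),
  NumReads = c(1.63, 281.06, 0),
  stringsAsFactors = FALSE
)

# Toy stand-in for the downloaded biomart table, in a different row order;
# gene_name and gene_type are placeholder values, not real annotations
biomart_table <- data.frame(
  Gene_stable_ID_version = c("ENSG00000227232.4", "ENSG00000223972.4"),
  gene_name = c("geneB", "geneA"),
  gene_type = c("typeB", "typeA"),
  stringsAsFactors = FALSE
)

# match() returns, for each row of old_table, the matching row of biomart_table (NA if none)
idx <- match(old_table$Gene_stable_ID_version, biomart_table$Gene_stable_ID_version)
old_table$gene_name <- biomart_table$gene_name[idx]
old_table$gene_type <- biomart_table$gene_type[idx]

# IDs that found no partner in the downloaded table
unmatched <- old_table$Gene_stable_ID_version[is.na(idx)]

print(old_table)
print(unmatched)
```

Because `match()` indexes by the rows of `old_table`, the result keeps the input order, and any ID missing from the downloaded table ends up with `NA` in the new columns.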


Thanks very much! I've done some testing of the code you provided and ran into some issues. I'll do some more troubleshooting and will report back on how it works.

3.6 years ago

Hi yangjianhunt,

Install the Ensembl annotation package.

Next, try this:

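library(EnsDb.Hsapiens.v86)
edb = EnsDb.Hsapiens.v86
# Assuming your gene counts are in a data frame (counts) with genes as rows and samples as columns.
# EnsDb gene IDs carry no version suffix, so strip the ".N" from the row names before the lookup.
gene_ids = sub("\\.[0-9]+$", "", rownames(counts))
# column and keytype must be quoted strings
symbols = mapIds(x = edb, keys = gene_ids, column = "SYMBOL", keytype = "GENEID")

For any other ID type you may need, run columns(edb) to check which are available in the annotation package.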

Hope that helps!


Thank you! I'll test your code and report back on how it works. Jian
