Question: Help needed for Ensembl Gene ID conversion for RNA-seq data - biomart / EnsDb.Hsapiens.v86
yangjianhunt10 wrote:

Hello All,

I am new to the RNA-seq world and especially new to the bioinformatics side. We recently completed an RNA-seq experiment (total RNA) on human samples, and we used Illumina's DRAGEN RNA pipeline, which generated Salmon gene count (.sf) output files. In these files, the gene ID is in Ensembl gene ID format with version numbers, as follows:

Name                Length  EffectiveLength  TPM     NumReads
ENSG00000223972.4   1483    1290.1           0.065   1.63
ENSG00000227232.4   1612    1415.02          10.139  281.06
ENSG00000243485.2   462     314.72           0       0
ENSG00000237613.2   889     720.64           0       0
ENSG00000268020.2   483     347.96           0       0
ENSG00000240361.1   940     774.95           0       0
ENSG00000186092.4   918     752.97           0       0
ENSG00000238009.2   2079    1905.46          1.007   37.61
ENSG00000239945.1   1319    1147.49          0.224   5.03

... (there are more than 50,000 ENSG IDs in total)

I'd like to convert these versioned ENSG IDs to ENSG stable IDs and gene symbols, and also get the gene description, gene type, and gene length, if possible.

So far, I've tried the Ensembl BioMart web interface. I was able to paste in all >50,000 ENSG IDs (as shown above); however, the output has only about 26,000 gene IDs. In addition, the order of those 26,000 genes differs from my input. Do you know why this happens? I was expecting a CSV table with the input and output side by side in the same rows, but the input IDs do not appear in the output file.

The output file looks like this:

Gene stable ID   Gene name  Transcript length (including UTRs and CDS)  Gene description                                                          Gene type
ENSG00000019995  ZRANB1     5695                                        zinc finger RANBP2-type containing 1 [Source:HGNC Symbol;Acc:HGNC:18224]  protein_coding
ENSG00000019995  ZRANB1     587                                         zinc finger RANBP2-type containing 1 [Source:HGNC Symbol;Acc:HGNC:18224]  protein_coding
ENSG00000039139  DNAH5      15633                                       dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000039139  DNAH5      760                                         dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000039139  DNAH5      676                                         dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000039139  DNAH5      2081                                        dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]          protein_coding
ENSG00000053328  METTL24    1119                                        methyltransferase like 24 [Source:HGNC Symbol;Acc:HGNC:21566]             protein_coding
ENSG00000053328  METTL24    777                                         methyltransferase like 24 [Source:HGNC Symbol;Acc:HGNC:21566]             protein_coding

... (there are about 26,000 rows in total)

So I then tried to install "biomaRt" in R via Bioconductor (BiocManager); however, I couldn't complete the installation because of some errors (this was on an Ubuntu computer). I then switched to a Windows computer and was able to install biomaRt in R, but it reports that the connection to the server is not good (I saw in the Ensembl web news that they are migrating servers?).

So then I installed "EnsDb.Hsapiens.v86" in R on the Windows computer. But with my current lack of knowledge, I couldn't figure out the code for converting the gene IDs (I have some knowledge of shell scripts and Python, and can understand code with annotations).

Could you point me to some resources, such as example code, for this kind of gene ID conversion using either biomaRt or EnsDb.Hsapiens.v86? (I did read the reference manual for EnsDb.Hsapiens.v86 but couldn't quickly figure out how to use a .fa file to input the query gene IDs.)

Thanks so much! & Sorry for the long post. Jian

rna-seq biomart bioconductor R • 290 views
ADD COMMENTlink modified 4 months ago by sandeep.amberkar1850 • written 4 months ago by yangjianhunt10
caggtaagtat1.4k wrote:

Hi,

Once the server maintenance is done, I would first download the information for all genes with settings like those in the figure below (BioMart for genome version GRCh37 as an example), and then gather the information by combining the two tables.

[Figure: BioMart settings for GRCh37]

So in R it would then be something like:

# Look up each versioned gene ID from old_table in the BioMart table (NA = no match)
idx <- match(old_table$Gene_stable_ID_version, biomart_table$Gene_stable_ID_version)
old_table$gene_name <- biomart_table$gene_name[idx]
old_table$gene_type <- biomart_table$gene_type[idx]

Maybe the input of 50,000 genes was too much? Either way, this approach shows you which genes could not find their "partner" in the downloaded table.
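
For reference, here is a minimal, self-contained sketch of that table-combining step in base R. The two data frames are toy stand-ins: the versioned IDs are copied from the question, but the column names (Name, gene_id, gene_name, gene_type) and the annotation values are placeholders, so adjust them to whatever your real tables contain. It also strips the version suffix before matching, on the assumption that versions may differ between the annotation used by the pipeline and the BioMart release; note that an unversioned match does not guarantee the gene model is identical between releases.

```r
# Toy stand-in for the Salmon table; the IDs are copied from the question.
old_table <- data.frame(
  Name = c("ENSG00000223972.4", "ENSG00000227232.4", "ENSG00000243485.2"),
  NumReads = c(1.63, 281.06, 0),
  stringsAsFactors = FALSE
)

# Toy stand-in for the BioMart download. Column names and values are
# placeholders; one gene appears twice to mimic one-row-per-transcript output.
biomart_table <- data.frame(
  gene_id = c("ENSG00000227232", "ENSG00000223972", "ENSG00000223972"),
  gene_name = c("symbol_B", "symbol_A", "symbol_A"),
  gene_type = c("type_B", "type_A", "type_A"),
  stringsAsFactors = FALSE
)

# Drop the ".N" version suffix so IDs can match regardless of version.
old_table$gene_id <- sub("\\.[0-9]+$", "", old_table$Name)

# match() returns the first hit per input ID, in input order, and NA when absent.
idx <- match(old_table$gene_id, biomart_table$gene_id)
old_table$gene_name <- biomart_table$gene_name[idx]
old_table$gene_type <- biomart_table$gene_type[idx]

# IDs that found no partner in the downloaded table.
unmatched <- old_table$Name[is.na(idx)]
print(old_table)
print(unmatched)
```

Because match() keeps the input order and takes only the first hit, the result has exactly one row per input gene even when the download has several rows per gene (for example, one per transcript).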

ADD COMMENTlink written 4 months ago by caggtaagtat1.4k

Thanks very much! I've done some testing of the code you provided and ran into some issues. I'll do some more troubleshooting and will report back on how it works.

ADD REPLYlink written 4 months ago by yangjianhunt10
sandeep.amberkar1850 wrote:

Hi yangjianhunt,

Install the Ensembl annotation package (EnsDb.Hsapiens.v86, from Bioconductor).

Next, try this:

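library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
# Assuming your gene counts are in a data frame `counts` with genes as rows and samples as columns.
# EnsDb gene IDs carry no version suffix, so strip the trailing ".N" from the row names first.
gene_ids <- sub("\\.[0-9]+$", "", rownames(counts))
symbols <- mapIds(x = edb, keys = gene_ids, column = "SYMBOL", keytype = "GENEID")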

For any other ID type you may need, run columns(edb) to check which ones are available in the annotation package.
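
If you want several annotation columns at once, something along these lines may work. It is only a sketch: the two IDs are copied from the question, the variable names are made up for illustration, and the lookup runs only if EnsDb.Hsapiens.v86 is installed. It assumes that "GENEID", "SYMBOL" and "GENEBIOTYPE" are among the columns listed by columns(edb); check that on your installation first.

```r
# Versioned IDs copied from the question; EnsDb stores them without the suffix.
versioned_ids <- c("ENSG00000223972.4", "ENSG00000227232.4")
gene_ids <- sub("\\.[0-9]+$", "", versioned_ids)

# Run the lookup only if the annotation package is installed.
if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE)) {
  library(EnsDb.Hsapiens.v86)
  edb <- EnsDb.Hsapiens.v86
  # select() returns a data frame with the requested columns.
  annot <- select(edb, keys = gene_ids, keytype = "GENEID",
                  columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))
  # Reorder to the input order; IDs missing from release 86 become NA rows.
  annot <- annot[match(gene_ids, annot$GENEID), ]
  print(annot)
}
```

Reordering with match() keeps the annotation aligned row by row with your count table, so genes absent from this Ensembl release show up as NA instead of silently disappearing.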

Hope that helps!

ADD COMMENTlink modified 4 months ago • written 4 months ago by sandeep.amberkar1850

Thank you! I'll test your code and report back on how it works. Jian

ADD REPLYlink written 4 months ago by yangjianhunt10