Question: Dealing with Ensembl gene ID and Entrez ID duplicates
16 months ago by
2405592M100 wrote:

Hi Guys,

I have done a differential expression analysis on some RNA-seq data (reads were mapped to the genome) and I am satisfied with the results (checked briefly by looking for downregulation of my knockout gene and its associated interactome). I then tried to put together a spreadsheet with both my DESeq2 results and gene annotations from BioMart, so that people interested in this dataset could examine the results too.

This is what I did for my biomaRt annotation:

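library(biomaRt)

# The original snippet used `ensembl` before defining it; creating the mart first is assumed here.
ensembl <- useMart("ensembl")
ensembl <- useDataset("mmusculus_gene_ensembl", mart = ensembl)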

attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name", "description", "chromosome_name", "start_position", "end_position", "strand")

ourFilterType <- "ensembl_gene_id"

filterValues <- rownames(Day4_CONvsCRE)

full_mm10_annot_Day4_CONvsCRE <- getBM(attributes = attributeNames,
                                       filters = ourFilterType,
                                       values = filterValues,
                                       mart = ensembl)

I then merged the DESeq2 results with the annotations I wanted, with a few changes:

newcolnames <- c("GeneID", "Entrez", "Symbol", "Description", "Chr", "Start", "End", "Strand")
colnames(full_mm10_annot_Day4_CONvsCRE) <- newcolnames

Day4_CONvsCRE_table <- as.data.frame(Day4_CONvsCRE) %>%
  rownames_to_column("GeneID") %>%
  left_join(full_mm10_annot_Day4_CONvsCRE, "GeneID") %>%
  rename(log2FC = log2FoldChange, FDR = padj)

write_tsv(Day4_CONvsCRE_table, "/mnt/data/BMOHAMED/Total_RNAseq/MDM2kd_seq/all_samples/differential_expression/Day4_CONvsCRE_Annotated.txt")

However, when I inspected the number of rows I had:


I got

[1] 21323 8

and when I did:


I was expecting the number of rows to be the same, but I got 21263. I'm assuming the reason is that I either got multiple Entrez IDs for the same gene, or have duplicate Ensembl gene IDs, or both. How do I solve this problem? I wanted the Entrez IDs because my next step is GSEA and KEGG analysis, and from my (rudimentary) understanding, both require Entrez IDs. How do I overcome this many-to-one relationship problem?
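To narrow down which of those explanations applies, a check along the following lines might help. This is only a sketch on a made-up toy table: the `annot` data frame and its IDs are invented stand-ins for the `getBM()` result, using the renamed `GeneID` and `Entrez` columns.

```r
# Toy annotation table standing in for the getBM() result (all values are invented).
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c(101L, 102L, 201L, NA),
  stringsAsFactors = FALSE
)

# Rows versus distinct Ensembl IDs: a gap means some IDs occur more than once.
n_rows   <- nrow(annot)
n_unique <- length(unique(annot$GeneID))

# Count the distinct, non-missing Entrez IDs attached to each Ensembl ID.
entrez_per_gene <- tapply(annot$Entrez, annot$GeneID,
                          function(x) length(unique(x[!is.na(x)])))

# Ensembl IDs that map to more than one Entrez ID.
multi_mapped <- names(entrez_per_gene)[entrez_per_gene > 1]
```

If `n_rows` exceeds `n_unique` and `multi_mapped` is non-empty, the extra rows come from one Ensembl ID carrying several Entrez IDs rather than from the Ensembl IDs themselves being duplicated in the input.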

Thanks in advance!

rna-seq deseq2 annotation • 844 views

After further inspection, I do have duplicated Ensembl IDs, but this is because several Entrez IDs map to the same Ensembl ID. Should I concatenate the multiple Entrez IDs? Also, if I just kept one of the Entrez IDs and discarded the duplicates, would I lose data? I'm currently assuming that, since I mapped to the genome, it doesn't really matter for downstream analysis.
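For concreteness, the two options could look roughly like this. It is a sketch on invented toy data, not a recommendation: the `annot` table, its IDs, the `collapse_ids` helper and the ";" separator are all assumptions made for illustration.

```r
# Toy annotation table with one Ensembl ID mapped to two Entrez IDs (values are invented).
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c(101L, 102L, 201L, NA),
  Symbol = c("GeneA", "GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first row seen for each Ensembl ID.
annot_first <- annot[!duplicated(annot$GeneID), ]

# Option 2: concatenate all distinct Entrez IDs into one string per Ensembl ID.
# collapse_ids is a hypothetical helper, not part of any package.
collapse_ids <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}
entrez_collapsed <- tapply(annot$Entrez, annot$GeneID, collapse_ids)

# Start from the one-row-per-gene table and swap in the concatenated IDs.
annot_collapsed <- annot_first
annot_collapsed$Entrez <- as.vector(entrez_collapsed[annot_collapsed$GeneID])
```

Either way the table ends up with one row per Ensembl ID, so a later `left_join` with the DESeq2 results would not add rows. Option 1 silently drops the second Entrez ID, while option 2 keeps it but turns the column into text that would need splitting again before any tool that expects a single Entrez ID per gene.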

Thanks in advance
