Dealing with Ensembl gene ID and Entrez ID duplicates
4.7 years ago
2405592M ▴ 140

Hi Guys,

I have run a differential expression analysis on some RNA-seq data (reads were mapped to the genome) and I am satisfied with the results, which I checked briefly by confirming the downregulation of my knockout gene and its associated interactome. I then tried to put together a spreadsheet combining my DESeq2 results with gene annotations from BioMart, so that people interested in this dataset can examine the results too.

This is what I did for my BioMart annotation:

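library(biomaRt)

# Connect to the Ensembl BioMart first (assumes no mart object exists yet),
# then select the mouse gene dataset
ensembl <- useMart("ensembl")
ensembl <- useDataset("mmusculus_gene_ensembl", mart = ensembl)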

attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name", "description", "chromosome_name", "start_position", "end_position", "strand")

ourFilterType <- "ensembl_gene_id"

filterValues <- rownames(Day4_CONvsCRE)

full_mm10_annot_Day4_CONvsCRE <- getBM(attributes = attributeNames,
                                       filters = ourFilterType,
                                       values = filterValues,
                                       mart = ensembl)

I then merged the DESeq2 results with the annotations I wanted, after renaming the annotation columns:

newcolnames <- c("GeneID", "Entrez", "Symbol", "Description", "Chr", "Start", "End", "Strand")
colnames(full_mm10_annot_Day4_CONvsCRE) <- newcolnames

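library(dplyr)   # %>%, left_join(), rename()
library(tibble)  # rownames_to_column()
library(readr)   # write_tsv()

# Assumes Day4_CONvsCRE is the DESeq2 results object with Ensembl IDs as row names
Day4_CONvsCRE_table <- as.data.frame(Day4_CONvsCRE) %>%
  rownames_to_column("GeneID") %>%
  left_join(full_mm10_annot_Day4_CONvsCRE, by = "GeneID") %>%
  rename(log2FC = log2FoldChange, FDR = padj)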

write_tsv(Day4_CONvsCRE_table, "/mnt/data/BMOHAMED/Total_RNAseq/MDM2kd_seq/all_samples/differential_expression/Day4_CONvsCRE_Annotated.txt")

However, when I inspected the number of rows I had:


I got

[1] 21323 8

and when I did:


I was expecting the number of rows to be the same, but I got 21263. I assume this is because I either got multiple Entrez IDs for the same gene, or have duplicate Ensembl gene IDs, or both. How do I solve this problem? I want the Entrez IDs because my next step is GSEA and KEGG analysis and, from my (rudimentary) understanding, both require Entrez IDs. How do I overcome this many-to-one relationship problem?
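To narrow down which of these cases applies, a check along the following lines could help. This is only a minimal sketch on a toy table: the data frame `annot` and its IDs are made up for illustration and stand in for the renamed getBM() result, so the column names GeneID and Entrez are the only things taken from the code above.

```r
# Toy annotation table standing in for the getBM() result; all IDs are made up.
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c(101L, 202L, 203L, NA),
  stringsAsFactors = FALSE
)

# Compare the number of rows with the number of distinct Ensembl IDs.
n_rows <- nrow(annot)
n_unique <- length(unique(annot$GeneID))

# Ensembl IDs that appear on more than one row.
dup_ids <- unique(annot$GeneID[duplicated(annot$GeneID)])

# All rows for those IDs, to see which Entrez IDs each one maps to.
dup_rows <- annot[annot$GeneID %in% dup_ids, ]

# Ensembl IDs that have no Entrez ID at all.
no_entrez <- annot$GeneID[is.na(annot$Entrez)]

print(c(rows = n_rows, unique_ensembl = n_unique))
print(dup_rows)
```

If the duplicated rows differ only in the Entrez column, the extra rows come from one Ensembl ID mapping to several Entrez IDs rather than from repeated Ensembl IDs in the input.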

Thanks in advance!

RNA-Seq DESeq2 annotation
After further inspection, I do have Ensembl duplicates, but this is because I have multiple Entrez IDs for the same Ensembl ID. Should I concatenate the multiple Entrez IDs? Also, if I just accept one of the Entrez IDs and discard the duplicates, would I lose data? I'm currently under the assumption that, since I mapped to the genome, it doesn't really matter for downstream analysis.
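For reference, the two options being weighed here (keeping one Entrez ID per Ensembl ID versus concatenating them) can be sketched as below. This is an illustration on a made-up toy table, not a recommendation: `annot`, its IDs, and the helper `collapse_ids` are hypothetical, and the ";" separator is an arbitrary choice.

```r
# Toy annotation table; all IDs and symbols are made up for illustration.
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c(101L, 202L, 203L, NA),
  Symbol = c("GeneA", "GeneB", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first row seen for each Ensembl ID.
annot_first <- annot[!duplicated(annot$GeneID), ]

# Option 2: one row per Ensembl ID, with all Entrez IDs joined into one string.
# collapse_ids is a hypothetical helper: it drops NAs and duplicates, then pastes.
collapse_ids <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}
entrez_by_gene <- split(annot$Entrez, annot$GeneID)
entrez_collapsed <- vapply(entrez_by_gene, collapse_ids, character(1))

annot_concat <- annot[!duplicated(annot$GeneID), ]
annot_concat$Entrez <- unname(entrez_collapsed[annot_concat$GeneID])

print(annot_first)
print(annot_concat)
```

Either table has one row per Ensembl ID, so a left_join() against it no longer adds rows to the DESeq2 results. The concatenated form is convenient for a spreadsheet, but enrichment tools generally expect a single ID per entry, so for that step a table with one Entrez ID per row is likely the safer input.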

Thanks in advance


