Dealing with Ensembl gene ID and Entrez ID duplicates
4.5 years ago
2405592M ▴ 140

Hi Guys,

I have run a differential expression analysis on some RNA-seq data (reads were mapped to the genome) and I am satisfied with the results (checked briefly by confirming the downregulation of my knockout gene and its associated interactome). I then tried to put together a spreadsheet combining my DESeq2 results with gene annotations from BioMart, so that people interested in this dataset can examine the results too.

This is what I did for my biomaRt annotation:

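library(biomaRt)

# Connect to the Ensembl BioMart, then select the mouse gene dataset
ensembl <- useMart("ensembl")
ensembl <- useDataset("mmusculus_gene_ensembl", mart = ensembl)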

attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name", "description", "chromosome_name", "start_position", "end_position", "strand")

ourFilterType <- "ensembl_gene_id"

filterValues <- rownames(Day4_CONvsCRE)

full_mm10_annot_Day4_CONvsCRE <- getBM(attributes = attributeNames,
                                       filters = ourFilterType,
                                       values = filterValues,
                                       mart = ensembl)

I then merged the DESeq2 results with the annotations, after renaming a few columns:

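library(dplyr)   # %>%, left_join(), rename()
library(tibble)  # rownames_to_column()
library(readr)   # write_tsv()

# Give the annotation columns shorter names
newcolnames <- c("GeneID", "Entrez", "Symbol", "Description", "Chr", "Start", "End", "Strand")
colnames(full_mm10_annot_Day4_CONvsCRE) <- newcolnames

# Join each DESeq2 result row to its annotation by Ensembl gene ID
Day4_CONvsCRE_table <- as.data.frame(Day4_CONvsCRE) %>%
  rownames_to_column("GeneID") %>%
  left_join(full_mm10_annot_Day4_CONvsCRE, by = "GeneID") %>%
  dplyr::rename(log2FC = log2FoldChange, FDR = padj)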

write_tsv(Day4_CONvsCRE_table, "/mnt/data/BMOHAMED/Total_RNAseq/MDM2kd_seq/all_samples/differential_expression/Day4_CONvsCRE_Annotated.txt")

However, when I checked the number of rows in the annotation table:

dim(full_mm10_annot_Day4_CONvsCRE)

I got

[1] 21323 8

and when I did:

length(unique(full_mm10_annot_Day4_CONvsCRE$GeneID))

I was expecting the number of unique gene IDs to equal the number of rows, but I got 21263. I'm assuming the reason is that I either got multiple Entrez IDs for the same gene, or have duplicate Ensembl gene IDs, or both. How do I solve this problem? I want to keep the Entrez IDs because my next step is to run GSEA and a KEGG analysis, and from my (rudimentary) understanding, both require Entrez IDs. How do I overcome this many-to-one relationship problem?

Thanks in advance!
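
The gap between 21323 rows and 21263 unique IDs means 60 rows repeat an Ensembl ID that already appears elsewhere in the table. A minimal sketch of how those rows could be listed is given here. It uses a small toy data frame called `annot` in place of the real `getBM()` result, and the gene IDs and Entrez values in it are made up purely for illustration.

```r
library(dplyr)

# Toy stand-in for the getBM() result; these IDs are invented for illustration.
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c(101L, 102L, 201L, NA),
  stringsAsFactors = FALSE
)

# For each Ensembl ID, count its rows and its distinct Entrez IDs,
# then keep only the Ensembl IDs that occur on more than one row.
dup_summary <- annot %>%
  group_by(GeneID) %>%
  summarise(n_rows = n(), n_entrez = n_distinct(Entrez), .groups = "drop") %>%
  filter(n_rows > 1)

print(dup_summary)
```

If `n_entrez` is greater than 1 for a repeated ID, the repetition comes from several Entrez IDs being attached to one Ensembl ID rather than from identical rows.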

RNA-Seq DESeq2 annotation • 2.0k views

After further inspection, I do have duplicated Ensembl IDs, but this is because some Ensembl IDs map to multiple Entrez IDs. Should I concatenate the multiple Entrez IDs? Also, if I just kept one of the Entrez IDs and discarded the duplicates, would I lose data? I'm currently under the assumption that since I mapped to the genome, it doesn't really matter for downstream analysis.

Thanks in advance
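
For reference, the two options raised in this comment could be sketched roughly as follows. This is only an illustration on a toy data frame (`annot`, with invented IDs), not a recommendation of either approach. Which one is appropriate depends on what the downstream tool expects, and keeping the first Entrez ID is an arbitrary choice. Note that `left_join()` returns one output row per matching annotation row, so any Ensembl ID that appears twice in the annotation would also duplicate its DESeq2 row in the merged table.

```r
library(dplyr)

# Toy annotation table; the IDs and symbols are invented for illustration.
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B"),
  Entrez = c(101L, 102L, 201L),
  Symbol = c("GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Option 1: one row per Ensembl ID, with all Entrez IDs joined by ";".
annot_collapsed <- annot %>%
  group_by(GeneID, Symbol) %>%
  summarise(Entrez = paste(unique(Entrez), collapse = ";"), .groups = "drop")

# Option 2: one row per Ensembl ID, keeping only the first Entrez ID seen.
annot_first <- annot %>%
  distinct(GeneID, .keep_all = TRUE)

print(annot_collapsed)
print(annot_first)
```

Either table has exactly one row per Ensembl ID, so joining it to the DESeq2 results would no longer add rows.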
