I have done some differential expression analysis on some RNA-seq data (counts were mapped to the genome) and I am satisfied with the results (examined briefly by looking for the downregulation of my knockout gene and its associated interactome). I then attempted to put together a spreadsheet with both my DESeq2 data and gene annotations from BioMart so people interested in this dataset could examine the results too.
This is what I did for my BioMart annotation:
    ensembl = useDataset("mmusculus_gene_ensembl", mart = ensembl)
    attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name",
                        "description", "chromosome_name", "start_position",
                        "end_position", "strand")
    ourFilterType <- "ensembl_gene_id"
    filterValues <- rownames(Day4_CONvsCRE)
    full_mm10_annot_Day4_CONvsCRE <- getBM(attributes = attributeNames,
                                           filters = ourFilterType,
                                           values = filterValues,
                                           mart = ensembl)
I then merged the DESeq2 results with the annotations I wanted, renaming a few columns along the way:
    newcolnames <- c("GeneID", "Entrez", "Symbol", "Description",
                     "Chr", "Start", "End", "Strand")
    colnames(full_mm10_annot_Day4_CONvsCRE) <- newcolnames
    Day4_CONvsCRE_table <- as.data.frame(Day4_CONvsCRE) %>%
      rownames_to_column("GeneID") %>%
      left_join(full_mm10_annot_Day4_CONvsCRE, "GeneID") %>%
      rename(log2FC = log2FoldChange, FDR = padj)
    write_tsv(Day4_CONvsCRE_table, "/mnt/data/BMOHAMED/Total_RNAseq/MDM2kd_seq/all_samples/differential_expression/Day4_CONvsCRE_Annotated.txt")
However, when I inspected the number of rows I had:
 21323 8
and when I did:
I was expecting the number of rows to be the same, but I got 21263. I'm assuming the reason is that I either got multiple Entrez IDs for the same gene, or have duplicate Ensembl gene IDs, or both. How do I solve this problem? I wanted to have the Entrez IDs because my next step is to run GSEA and a KEGG analysis, and from my (rudimentary) understanding, both require Entrez IDs. How do I overcome this many-to-one relationship problem?
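To make the suspicion concrete, here is a minimal base-R sketch of how the mismatch could be diagnosed. It uses a made-up toy annotation table (the IDs and the names `annot`, `result_ids`, `multi_mapped`, `annot_unique` and `missing_ids` are hypothetical stand-ins, not my real data), and keeping only the first row per Ensembl ID is just one possible, lossy way to force a one-to-one table:

```r
# Toy stand-in for the getBM() output (hypothetical IDs, not real data)
annot <- data.frame(
  GeneID = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"),
  Entrez = c(101L, 102L, 201L, NA),
  stringsAsFactors = FALSE
)

# Toy stand-in for rownames() of the DESeq2 results
result_ids <- c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C", "ENSMUSG_D")

# Ensembl IDs that occur on more than one annotation row
id_counts <- table(annot$GeneID)
multi_mapped <- names(id_counts[id_counts > 1])

# The rows involved in those one-to-many mappings
dup_rows <- annot[annot$GeneID %in% multi_mapped, ]

# One possible (lossy) fix: keep only the first row per Ensembl ID
annot_unique <- annot[!duplicated(annot$GeneID), ]

# Ensembl IDs in the results for which no annotation row came back
missing_ids <- setdiff(result_ids, annot$GeneID)
```

On the real objects, the same checks would be run with `full_mm10_annot_Day4_CONvsCRE` in place of `annot` and `rownames(Day4_CONvsCRE)` in place of `result_ids`.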
Thanks in advance!