Question: What is the gold standard for the one-to-many relationships in BioMart retrieved annotations and adding them to a DESeq2 results table
0
gravatar for 2405592M
9 months ago by
2405592M60
2405592M60 wrote:

Hey guys,

I've done differential expression analysis on an RNA-seq experiment using DESeq2 and I now want to add my BioMart annotations to my DESeq2 results table. I've selected my BioMart database and generated an annotation using:

ensembl <- useMart("ensembl")

datasets <- listDatasets(ensembl)
head(datasets)

ensembl = useDataset("mmusculus_gene_ensembl", mart = ensembl)

attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name", "description", "transcript_biotype", "chromosome_name", "start_position", "end_position", "strand", "transcript_length")

ourFilterType <- "ensembl_gene_id"

filterValues <- rownames(results_table)

full_mm10_annot <- getBM(attributes = attributeNames,
                         filters = ourFilterType,
                         values = filterValues,
                         mart = ensembl)

I then filtered out all rows containing duplicated ensembl ID's:

duplications_bioMart <- full_mm10_annot %>%
  add_count(ensembl_gene_id) %>%
  filter(n>1)

From what I can tell, the ensembl duplications seem to be coming from transcript_biotype.

head(duplications_bioMart)

A tibble: 6 x 11 ensembl_gene_id entrezgene_id external_gene_n… description transcript_biot… chromosome_name start_position

end_position strand <chr> <int> <chr>
<chr> <chr> <chr> <int>
<int> <int> 1 ENSMUSG0000002… 170755 Sgk3
serum/gluc… protein_coding 1 9798107
9900845 1 2 ENSMUSG0000002… 170755 Sgk3
serum/gluc… nonsense_mediat… 1 9798107
9900845 1 3 ENSMUSG0000002… 170755 Sgk3
serum/gluc… protein_coding 1 9798107
9900845 1 4 ENSMUSG0000002… 170755 Sgk3
serum/gluc… retained_intron 1 9798107
9900845 1 5 ENSMUSG0000002… 170755 Sgk3
serum/gluc… nonsense_mediat… 1 9798107
9900845 1 6 ENSMUSG0000002… 170755 Sgk3
serum/gluc… nonsense_mediat… 1 9798107
9900845 1

… with 2 more variables: transcript_length <int>, n <int>

what is the gold standard in RNA-seq analysis for dealing with these ensembl ID duplicates. My first thought was to take the protein coding transcript_biotypes only but I'm not sure if I'll be losing information (or gaining unecessary information). What does one do in these kinds of scenarios?

Library prep: Illumina Tru-seq, polyA selection Cheers!

ADD COMMENTlink modified 9 months ago by Mike Smith1.4k • written 9 months ago by 2405592M60
1
gravatar for Mike Smith
9 months ago by
Mike Smith1.4k
EMBL Heidelberg / de.NBI
Mike Smith1.4k wrote:

The reason you get multiple results to your biomaRt query is that Ensembl BioMart is transcript-centric. If you also asked for the ensembl_transcript_id attribute you'd see this was different for each row.

As far as filtering goes, this surely depends upon that data you already have i.e. are your counts & differential expression results based on genes or transcripts, and what you're trying to achieve with the biomaRt annotation? If these are gene based, then it doesn't make sense to try and select specific transcripts based on biotype or anything else, since you don't know which specific transcripts are present. You can just work with the gene-level annotation. Alternatively, if you want to work at the transcript level then you probably need to look more closely at your data, and possibly explore tools like DEXSeq to get a handle on differential transcript abundance.

ADD COMMENTlink written 9 months ago by Mike Smith1.4k

Hi Mike,

Thank you for your reply. My counts and differential expression are both gene based and yes, indeed the specific transcripts based on biotype etc are not needed in my analysis. I simply want to annotate my dataset and add entrez IDs so I can do GSEA/KEGG which from my understanding, need enterez IDs.

In summary, I've changed my filters and now use the following:

attributeNames <- c("ensembl_gene_id", "entrezgene_id", "external_gene_name", "description", "chromosome_name", "start_position", "end_position", "strand")

ourFilterType <- "ensembl_gene_id"

filterValues <- rownames(Day4_CONvsCRE)

full_mm10_annot_Day4_CONvsCRE <- getBM(attributes = attributeNames,
                                       filters = ourFilterType,
                                       values = filterValues,
                                       mart = ensembl)

I then proceeded to make a spreadsheet for those interested in this dataset, combining the differential expression analysis data and gene annotations etc:

newcolnames <- c("GeneID", "Entrez", "Symbol", "Description", "Chr", "Start", "End", "Strand")
colnames(full_mm10_annot_Day4_CONvsCRE) <- newcolnames

Day4_CONvsCRE_table <- as.data.frame(Day4_CONvsCRE) %>%
  rownames_to_column("GeneID") %>%
  left_join(full_mm10_annot_Day4_CONvsCRE, "GeneID") %>%
  rename(log2FC = log2FoldChange, FDR = padj)

write_tsv(Day4_CONvsCRE_table, "/mnt/data/BMOHAMED/Total_RNAseq/MDM2kd_seq/all_samples/differential_expression/Day4_CONvsCRE_Annotated.txt")

However, at this point, I'm expecting my ensembl_gene_IDs to be completely unique. When I do:

dim(full_mm10_annot_Day4_CONvsCRE)

I get 21323:8. But when I do:

length(unique(full_mm10_annot_Day4_CONvsCRE$GeneID))

I was expecting to get 21323:8, but I actually get 21263:8 suggesting that there are duplicates. I think I'm getting multiple Enterez IDs for the same gene but I'm not entirely sure if that's the case. Whats the best way to filter my results? Should I concatenate multiple enterez IDs or should I just use one enterez ID per gene?

ADD REPLYlink written 7 months ago by 2405592M60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 819 users visited in the last hour