
Question

Why do I still have missing values in my gene list, even after cooksCutoff and independentFiltering have been set to FALSE ?


3.9 years ago

Sukrit • 0

Hi,

I recently started using the DESeq2 package for RNA-seq analysis. I noticed that many genes in my list had been assigned a missing value (NA). On reading further, I found that this can be avoided by setting cooksCutoff = FALSE and independentFiltering = FALSE.

However, even after that, almost 3000 genes in my gene list still have an NA value. Is there any other way to prevent the missing values from occurring at all? If not, how should I proceed?

DESEQ2 missing symbol gene values • 3.3k views

ADD COMMENT • link 3.9 years ago by Sukrit • 0


What is the baseMean for these genes? Please paste a few lines from the results table.

ADD REPLY • link 3.9 years ago by Kevin Blighe 89k


Hi Kevin,

I am attaching a snippet of the table below for your reference. The baseMean is not zero, and genes with a low baseMean still have a gene symbol assigned to them while those with a higher baseMean are not assigned a gene symbol.

[Image: screenshot of the results table]

ADD REPLY • link 3.9 years ago by Sukrit • 0


Hi, all of those genes shown in your screenshot have p-values. Can you show the entries for the genes that have NA p-values?

Some of these baseMeans are very low, and a lot of these should be filtered out, in my opinion. However, if this is a knock-out experiment, then, technically, one could expect a low baseMean, depending on the sample size per condition.

ADD REPLY • link 3.9 years ago by Kevin Blighe 89k


Hi Kevin,

Yes, you guessed right, this is a knockout experiment for the APEX1 gene. I do not have any genes with NA p-values. The NA values appear only in the gene list, not in any of the statistical columns such as baseMean or log2FoldChange.

I would like to prevent NA values appearing for the gene names. I do not have NA values in any other part of the table, except the "symbol" column.
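A quick way to confirm where the missing values sit is to count the NA entries per column. The following is a minimal sketch in base R; `res_df` and all of its values are invented stand-ins for a results table that has been converted to a data frame and given a `symbol` column.

```r
# Toy stand-in for a results table with an added symbol column.
# Every value below is invented for illustration.
res_df <- data.frame(
  baseMean       = c(120.4, 3.1, 58.9, 410.2),
  log2FoldChange = c(-1.8, 0.4, 2.3, -0.9),
  pvalue         = c(1e-12, 0.52, 3e-9, 0.04),
  padj           = c(4e-11, 0.71, 6e-8, 0.09),
  symbol         = c("GENE1", NA, NA, "GENE4"),
  row.names      = c("gene1", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# Number of NA entries in each column
na_per_column <- colSums(is.na(res_df))
print(na_per_column)
```

If only the `symbol` entry is non-zero, the statistics are complete and the gaps come from the annotation step rather than from the testing step.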

ADD REPLY • link 3.9 years ago by Sukrit • 0

score 0 · Answer 1 · 2021-08-04


3.9 years ago

ATpoint 88k

Three possible reasons described in the vignette:

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pvaluesNA

Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons:

1) If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.

2) If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance. Customization of this outlier filtering and description of functionality for replacement of outlier counts and refitting is described below.

3) If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below.

Since you have excluded 2) and 3), it is probably 1): an all-zero gene.
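The three cases can be told apart from the pattern of NA values in each row. The sketch below uses base R on an invented table; `res_df`, the gene names, and the numbers are hypothetical, and the classification is only a heuristic reading of the vignette text quoted above, not a DESeq2 function.

```r
# Toy results table; all values are invented for illustration.
res_df <- data.frame(
  baseMean  = c(0, 250.3, 1.2, 87.6),
  pvalue    = c(NA, NA, 0.40, 0.001),
  padj      = c(NA, NA, NA, 0.01),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Assign each row to one of the three cases described in the vignette:
#   baseMean of zero           -> all samples have zero counts
#   pvalue is NA (baseMean >0) -> count outlier flagged by Cook's distance
#   only padj is NA            -> removed by independent filtering
na_reason <- with(res_df, ifelse(
  baseMean == 0, "all-zero counts",
  ifelse(is.na(pvalue), "count outlier",
         ifelse(is.na(padj), "independent filtering", "none"))
))
names(na_reason) <- rownames(res_df)

print(table(na_reason))
```

With cooksCutoff and independentFiltering both set to FALSE, one would expect only the "all-zero counts" and "none" categories to remain.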

ADD COMMENT • link 3.9 years ago by ATpoint 88k


Hi,

Thank you for your response. I am facing a different issue here. The padj and pvalue columns do not contain NA, but the gene symbol does. This prevents me from investigating those genes further, especially the ones with a very low pvalue and padj. I have attached an image showing this, where you can see padj values in the range of e-9 to e-13.

[Image: screenshot of the results table]

ADD REPLY • link 3.9 years ago by Sukrit • 0


Please show the code for adding that column. How did you identify those symbols?

ADD REPLY • link 3.9 years ago by ATpoint 88k


Hi,

Please find the code for the addition of the gene symbols.

# Use the mapIds function to add individual columns to the results table.
# The row names of the results table are supplied as keys, with keytype = "ENSEMBL".
# The column argument tells mapIds which information we want, and
# multiVals tells it what to do if there are multiple possible values.
library(AnnotationDbi)
library(org.Hs.eg.db)

# Keep the first 15 characters of each row name (drops the Ensembl version suffix)
ens.str_filtered <- substr(rownames(DEresults_filtered), 1, 15)

DEresults_filtered$symbol <- mapIds(org.Hs.eg.db, keys = ens.str_filtered, column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
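To see how many IDs fail to map, it may help to count the NA entries right after the lookup. The sketch below is base R only: the IDs, the symbols, and the named vector `lookup` are invented placeholders standing in for the annotation database, so it illustrates the check rather than the real mapping. It also strips the version suffix with `sub()` instead of `substr(..., 1, 15)`, which avoids assuming a fixed ID length.

```r
# Versioned Ensembl-style IDs; all IDs and symbols here are invented placeholders
ids <- c("ENSG00000000011.5", "ENSG00000000022.12", "ENSG00000000033.1")

# Drop the version suffix (everything from the first dot onward)
ens <- sub("\\..*$", "", ids)

# Hypothetical lookup table standing in for an annotation database
lookup <- c(ENSG00000000011 = "GENE1", ENSG00000000033 = "GENE3")

# Indexing a named vector by name returns NA for IDs that have no entry
symbol <- unname(lookup[ens])

# IDs without a symbol
unmapped <- ens[is.na(symbol)]
cat(length(unmapped), "of", length(ens), "IDs have no symbol\n")
print(unmapped)
```

Looking up a handful of the unmapped IDs by hand is usually enough to tell whether they are genuinely unnamed or whether the keys were malformed.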

ADD REPLY • link 3.9 years ago by Sukrit • 0

2


I see. Some genes do not have HGNC gene names, simple as that. Welcome to the beautiful messy world of annotations :)
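One common workaround, sketched here under the assumption that a label is needed for every row, is to fall back to the Ensembl ID wherever the symbol is missing. The vectors below are invented placeholders, not real annotations.

```r
# Invented placeholder IDs and symbols; NA marks a gene without a symbol
ens    <- c("ENSG00000000011", "ENSG00000000022", "ENSG00000000033")
symbol <- c("GENE1", NA, "GENE3")

# Use the symbol where one exists, otherwise keep the Ensembl ID
label <- ifelse(is.na(symbol), ens, symbol)
print(label)
```

This keeps unnamed genes, such as pseudogenes or antisense transcripts, identifiable in tables and plots.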

ADD REPLY • link 3.9 years ago by ATpoint 88k


This was my thought too. A few of them are pseudogenes but a lot are producing anti-sense RNA. However, thank you so much again for all the help.

ADD REPLY • link 3.9 years ago by Sukrit • 0