Question

How to replace row names in DESeq2 rlogTransformation matrix with actual gene name info present on another sheet?

0

Entering edit mode

2.9 years ago

jbnrodriguez ▴ 30

I'm new to R and DESeq2 and I'm trying to run differential expression as below

library(DESeq2)
count_file_names <- grep("counts",list.files("HTSeq_counts"),value=T)
host_type < c("Damaged","Control")
sample_information <-data.frame(sampleName = count_file_names, fileName = count_file_names, condition = host_type)
DESeq_data <- DESeqDataSetFromHTSeqCount(sampleTable = sample_information, directory = "HTSeq_counts", design = ~condition)
colData(DESeq_data)$condition <- factor(colData(DESeq_data)$condition,levels = c('Damaged','Control'))
rld <- rlogTransformation(DESeq_data, blind=T)

When I look at the rld matrix, it has got row names that aren't annotated and hence, not directly useful; note- I intend to generate plots based on this matrix in downstream steps and so, the corresponding actual gene names are required

print(rld)
class: DESeqTransform 
dim: 33219 38 
metadata(1): version
assays(1): ''
rownames(33219): g100.t1 g1000.t1 ... g9998.t1 g9999.t1
rowData names(7): baseMean baseVar ... dispFit rlogIntercept
colnames(38): Damaged_R1.counts Damaged_R2.counts ...
  Control_R4.counts Control_R5.counts
colData names(2): condition sizeFactor

I have a separate "Annotations.csv" which has the gene name information

gene_names <- read.csv("Annotations.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
head(gene_names)
    gene_id   name                                             product
1 g38227.t1   Stk10                Serine/threonine-protein kinase 10  
2 g38227.t2   Stk10                Serine/threonine-protein kinase 10  
3 g1000.t1    Ccnh                                          Cyclin-H  
4 g46237.t1 fam136a                                   Protein FAM136A  
5 g100.t1     H2B                                       Histone H2B  
6 g40390.t1   STAC2 SH3 and cysteine-rich domain-containing protein 2

and I would like to replace rownames in rld matrix according with the correct corresponding gene name as specified in gene_names. For example, the rld matrix rownames(33219) should appear as names "H2B CCnh ... and so on" instead of "g100.t1 g1000.t1 ... and so on". I tried the below R code

new.rld  <- cbind(name=gene_names$name[ match(rownames(rld), gene_names$gene_id) ], rld)

but it gave me the this error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘bindCOLS’ for signature ‘"character"’

Any help in generating correctly the new.rld matrix would be very much appreciated!

DESeq2 R • 2.2k views

ADD COMMENT • link updated 2.9 years ago by Michael 54k • written 2.9 years ago by jbnrodriguez ▴ 30

score 1 · Answer 1 · 2021-08-29

1

Entering edit mode

2.9 years ago

Michael 54k

For a DE analysis I'd try to stick to this simple workflow that I am using myself. I then add the gene annotations, that is whatever is in gene_ids), in the end. The annotated matrix will then be in the data.frame merged.res.simple which you can run View, write.table, etc. on

    # read your data as you did before
DESeq_data <- DESeq2::DESeq(DESeq_data, parallel= TRUE)
res.simple <- results(DESeq_data, contrast = c("condition", "Damaged", "Control"), parallel= TRUE) # adjust this to meet your needs
res.simple <- cbind(row.names(res.simple), as.data.frame(res.simple))
merged.res.simple <- as.data.frame(merge(res.simple, gene_ids, by=1))

 # same should work for the log transformed data
 rld <- cbind(row.names(rld), as.data.frame(rld))
 new.rld <- as.data.frame(merge(rld,gene_ids, by=1)) # possibly add or remove some as.data.frame here and there

Remark: Some of your gene names are duplicated, I guess you have multiple transcripts of the same gene here. You need to check if that DE analysis is still valid because DESeq is for gene-level analysis, not for isoform level analysis. In any case, you cannot simply exchange the row names with the gene names, because they contain duplicates which is not allowed for row names.

If these g1.t01, g2.t02 etc. in fact represent different transcript isoforms, I would re-quantify, telling HTSeq to summarize the counts at the gene level.

ADD COMMENT • link 2.9 years ago by Michael 54k

0

Entering edit mode

Thanks a zillion Michael for pointing to me about the multiple transcript issue. I went for the transcript option instead of the standard summarization at the gene level due to the inability to relate the gene ids with the gene names on the original gtf file. Below are the first few lines of the gtf file that I'm using. I was wondering if by any chance you would have any ideas on addressing this? Apologies for asking too much here; I seem to have hit a brick wall here :-( Thanks again for sharing your super-important insight earlier

##gtf-version 3
Contig0 maker   gene    61140   67521   0.69    +   .   gene_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0"; ID "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0"; Name "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0";
Contig0 maker   transcript  61140   67521   0.69    +   .   gene_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0"; transcript_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1"; ID "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1"; Name "g11650.t1"; Parent "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0"; _AED "0.25"; _QI "0|0|0|0.71|1|1|7|0|304"; _eAED "0.25"; original_biotype "mrna";
Contig0 maker   exon    61140   61197   .   +   .   gene_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0"; transcript_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1"; ID "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1:exon:10"; Parent "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1";
Contig0 maker   exon    62283   62303   .   +   .   gene_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0"; transcript_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1"; ID "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1:exon:11"; Parent "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1";
Contig0 maker   exon    63801   63963   .   +   .   gene_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0"; transcript_id "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1"; ID "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1:exon:12"; Parent "maker-Contig0-pred_gff_AUGUSTUS-gene-0.0-mRNA-1";

ADD REPLY • link 2.9 years ago by jbnrodriguez ▴ 30

1

Entering edit mode

This is resolved (no more hitting on the wall!). Went for the featureCounts pipeline (since it seems to be better anyways for my paired-end read data)

featureCounts -C -T 8 -p -t exon -g gene_id -a File.gtf -o sample.counts sample.bam

ADD REPLY • link 2.9 years ago by jbnrodriguez ▴ 30

0

Entering edit mode

Great to hear, I am using featureCounts myself for ease of use and speed. Choice of either program should not affect the outcome of DE analysis and this should now be straight-forward.

ADD REPLY • link 2.9 years ago by Michael 54k

score 1 · Answer 2 · 2021-08-29

I devised a work-around solution for this as below:

x <- rownames(rld)
write.csv(x, file="rownames.csv")
library(dplyr)
bio1=read.csv('Annotations.csv')
bio2=read.csv('rownames.csv')
df <- left_join(bio2, bio1, by='gene_id')
write.csv(df,"mod_rownames.csv")
gene_names <- read.csv("mod_rownames.csv")
rownames(rld)=gene_names$name