Question

all(rownames(cts) %in% txdf$TXNAME) is FALSE in DTU Analysis in R

0

Entering edit mode

5 months ago

dylannicoembros • 0

Good afternoon,

I am trying to do a DTU analysis for my research, but I am kinda new to this stuff and I have some problems. In particular on point 5). I am following the workflow of Bioconductor vignette rnaseqDTU

and my pipeline is this: 1) read salmon quants

   ## List all directories containing data  
   samples <- list.files(path = "../data/quants", full.names = T)
   ## Obtain a vector of all filenames including the path
   files <- file.path(samples, "quant.sf")
   ## Since all quant files have the same name it is useful to have names for each element
   names(files) <- str_replace(samples, "../data/quants/", "") %>% 
     str_replace(".salmon", "")

2) Then, create sampleTable from an excel file containing clinical info. It is structured like the following:

   1  sex   age    condition
   2   M     20           A
   3   M     30           B

3) Tximeta imoprt counts with metadata

    # Import counts
    txi <- tximport(files, type="salmon", txOut=TRUE,
            countsFromAbundance="scaledTPM")

    rownames(txi$counts) <- gsub("\\..*", "", rownames(txi$counts))
    cts <- txi$counts
    cts <- cts[rowSums(cts) > 0,]
    head(cts)

4) Trascript to geneID

    gtf <- "../data/Homo_sapiens.GRCh38.110 (1).gtf"
    txdb.filename <- "../data/Homo_sapiens.GRCh38.110.sqlite"
    txdb <- makeTxDbFromGFF(gtf)
    saveDb(txdb, txdb.filename)

    # Load and add geneid
    txdb <- loadDb(txdb.filename)
    txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
    tab <- table(txdf$GENEID)
    txdf$ntx <- tab[match(txdf$GENEID, names(tab))]

    head(txdf)

5) Check consistency --> ERROR

   all(rownames(cts) %in% txdf$TXNAME) # FALSE

My cts object is this way and has length equal to 21436, instead the txdf$TXNAME has length 252894

                     Sample1  Sample2   Sample3                  
   ENST00000000003    
   ENST00000000004   
   ENST00000000005

A more detailed image of the cts object: ![cts][2]

cts object

Probably I am missing something and I cannot understand what.

r cts bioconductor dtu deseq2 • 629 views

ADD COMMENT • link 5 months ago by dylannicoembros • 0

score 2 · Accepted Answer · 2023-11-03

2

Entering edit mode

5 months ago

rpolicastro 13k

cts has gene IDs for the rownames (because you aggregated the transcript counts to the gene level) and you are comparing it to txdf$TXNAME which is transcript names. If you compare it to txdf$GENEID instead it would probably work.

For DTU analysis just make sure you are keeping track of whether your count matrix is at the gene or transcript level, since ultimately you'll need it at the transcript level for DTU.

ADD COMMENT • link 5 months ago by rpolicastro 13k

0

Entering edit mode

Thank you, you were right. I updated my code but the all() keeps returning FALSE. I'll update the question above

ADD REPLY • link 5 months ago by dylannicoembros • 0

0

Entering edit mode

What transcript file did you use for the salmon quant? To maintain consistency with downstream analysis I'll usually use gffread to generate the transcript file using the same GTF (and the associated genome assembly) I use later.

ADD REPLY • link 5 months ago by rpolicastro 13k

0

Entering edit mode

I used the following: Homo_sapiens.GRCh38.110 from ensembl (I used the same in my previous DGE analysis).The link is this one ensembl gtf's

ADD REPLY • link 5 months ago by dylannicoembros • 0