Question

error in TXIMPORT command for RSEM

1

Entering edit mode

3.9 years ago

StartR ▴ 30

Hi, I have downloaded the HT seq counts data (gene expression data) from TCGA for BRCA. I have files in *.rsem.genes.results format

when I run command:

txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)

It is giving me this error:

reading in files with read_tsv
1 Error in computeRsemGeneLevel(files, importer, geneIdCol, abundanceCol,  : 
  all(c(geneIdCol, abundanceCol, lengthCol) %in% names(raw)) is not TRUE
In addition: Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two

when I read one file through read_tsv, it looks like:

> tmp <- read_tsv(files[1])
Parsed with column specification:
cols(
  gene_id = col_character(),
  raw_count = col_double(),
  scaled_estimate = col_double(),
  transcript_id = col_character()
)
> tmp
# A tibble: 20,531 x 4
   gene_id     raw_count scaled_estimate transcript_id                              
   <chr>           <dbl>           <dbl> <chr>                                      
 1 ?|100130426       0      0            uc011lsn.1                                 
 2 ?|100133144      65.5    0.00000178   uc010unu.1,uc010uoa.1                      
 3 ?|100134869      32.5    0.000000639  uc002bgz.2,uc002bic.2                      
 4 ?|10357         504.     0.0000316    uc010zzl.1                                 
 5 ?|10431        5298      0.000150     uc001jiu.2,uc010qhg.1                      
 6 ?|136542          0      0            uc011krn.1                                 
 7 ?|155060        564      0.00000604   uc003wfr.3,uc003wft.3,uc003wfu.2,uc011kup.1
 8 ?|26823           1      0.0000000788 uc011mlh.1                                 
 9 ?|280660          0      0            uc010nib.1                                 
10 ?|317712          0      0            uc010ihw.1                                 
# ... with 20,521 more rows

Is there a way tximport will deal with this kind of TCGA data?

tximport rsem • 3.3k views

ADD COMMENT • link updated 3.9 years ago by Kevin Blighe 87k • written 3.9 years ago by StartR ▴ 30

Kevin Blighe · Answer 1 · 2020-05-26

1

Entering edit mode

3.9 years ago

Asaf 10k

This is clearly not RSEM output, it doesn't contain the length column, I think this is what's throwing the error. How comes the gene_id looks like it? Is it "?" in the input files?

ADD COMMENT • link 3.9 years ago by Asaf 10k

0

Entering edit mode

Indeed, tximport looks for these columns: https://github.com/mikelove/tximport/blob/master/R/tximport.R#L394-L406

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

yes, I agree, but I got to process this data I got from TCGA - BRCA (HTseq counts) project, to make a matrix of TPM.

there are genes in the tmp obj starting from row 30 :

> tmp[30:40,]
# A tibble: 11 x 4
   gene_id    raw_count scaled_estimate transcript_id                                       
   <chr>          <dbl>           <dbl> <chr>                                               
 1 A1BG|1          356.    0.00000549   uc002qsd.3,uc002qsf.1                               
 2 A1CF|29974        0     0            uc001jjh.2,uc001jji.2,uc001jjj.2,uc001jjk.1,uc009xo~
 3 A2BP1|547~       10     0.0000000723 uc002cyr.1,uc002cys.2,uc002cyt.2,uc002cyv.1,uc002cy~
 4 A2LD1|877~      276.    0.00000495   uc001voq.1,uc001vor.2,uc001vos.2                    
 5 A2ML1|144~      705     0.00000468   uc001quz.3,uc001qva.1,uc001qvb.1,uc010sgm.1         
 6 A2M|2         29840.    0.000283     uc001qvj.1,uc001qvk.1,uc009zgk.1                    
 7 A4GALT|53~      470     0.00000680   uc003bdb.2,uc010gzd.2                               
 8 A4GNT|511~        5     0.0000000874 uc003ers.2                                          
 9 AAA1|4047~        0     0            uc003tdz.2,uc003teb.1,uc003tek.3,uc010kwo.1,uc010kw~
10 AAAS|8086      2439     0.0000410    uc001scr.3,uc001scs.3                               
11 AACSL|729~        0     0            uc003mjk.2,uc011dgk.1,uc011dgl.1

Can we make a matrix from this kind of data through tximport? I need a TPM matrix from TCGA BRCA data set, and I can get it if I will get txi$abundance ..

ADD REPLY • link updated 3.9 years ago by Kevin Blighe 87k • written 3.9 years ago by StartR ▴ 30

0

Entering edit mode

...but, wait a second, you have HT-seq counts data, so, why were you trying to import it as RSEM? - they are different.

If you have HT-seq counts, then import them to DESeq2 via DESeqDataSetFromMatrix(), or take a look here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseq

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi, thanks, it makes sense, but I do not want to detect differentially expressed genes - I want to convert a matrix with counts (either raw or normalized) to TPM matrix, So count matrix that contains all samples in columns and all genes in rows can be converted into TPM matrix?

ADD REPLY • link 3.9 years ago by StartR ▴ 30

0

Entering edit mode

Before we become completely confused here, which data have you downloaded? It has been implied that you have both HT-seq and RSEM data, but these are produced from different methods. Only the HT-seq files will contain a raw count, whereas RSEM files will contain an estimated count. If you literally point me to a file that you downloaded, I will be in a better position to assist.

Generally, though, to calculate TPM, you need to know the gene CDS length, which you currently do not have; however, it can be obtained in different ways.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

yes, i am sorry and thank you for being patient with me! this data is also new for me, so I followed these steps to download the data from TCGA using ""TCGAbiolinks" package in R, although I have download the whole data set, I am making it simpler here, to download for a few samples, so that it will be easier for you to see this data and get a grip:

library(TCGAbiolinks)

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples, 
                  legacy = TRUE)

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query = query, directory = 'BRCA_test', method = 'api')

If you will run these commands, you will see the files are created in dir "BRCA_test", each folder contains files with extension *.rsem.genes.results

The file looks like the one I mentioned in my original query, containing gene_id, raw_counts, scaled_estimates, transcript_id.

ADD REPLY • link updated 3.9 years ago by Kevin Blighe 87k • written 3.9 years ago by StartR ▴ 30

0

Entering edit mode

Now, I am able to get a dds object from DESeq2 package for this data using:

dds <- DESeqDataSetFromMatrix(countData = BRCA_mat,
                              colData = sampleData,
                                   design = ~ condition)

I used "raw_counts" to generate "BRCA_mat"

Now my question is from raw count matrix, can I get TPM matrix.

I am aware that I will need featureLength, meanFragmentLength to calculate TPM - but given the data I get from TCGA, I do not have this data on length.

So is it possible to get TPM or even FPKM matrix from raw count martix?

Even if I will get FPKM, I will convert it to TPM. Thanks.

ADD REPLY • link updated 3.9 years ago by Kevin Blighe 87k • written 3.9 years ago by StartR ▴ 30

0

Entering edit mode

I see - thanks. You can calculate TPM from the BRCA_mat object; however, you will still require the gene lengths. The TPM calculation is elaborated ere: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained.

DESeq2 does not produce TPM data. However, you could easily use DESeq2 to produce variance-stabilised or regularised log data, which are 'better' than TPM, in my opinion.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

And even if i do this: dds <- estimateSizeFactors(dds) counts(dds, normalized=TRUE)

Will these normalized counts be equivalent to TPM?

I think not, This is just dividing each column of

counts(dds) by sizeFactors(dds)

but it will not normalize for gene length,

so I want TPM matrix from raw counts matrix. Please help. Thanks in advance.

ADD REPLY • link 3.9 years ago by StartR ▴ 30