error in TXIMPORT command for RSEM
1
0
Entering edit mode
17 months ago
StartR ▴ 20

Hi, I have downloaded the HT seq counts data (gene expression data) from TCGA for BRCA. I have files in *.rsem.genes.results format

when I run command:

txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)


It is giving me this error:

reading in files with read_tsv
1 Error in computeRsemGeneLevel(files, importer, geneIdCol, abundanceCol,  :
all(c(geneIdCol, abundanceCol, lengthCol) %in% names(raw)) is not TRUE
Unnamed col_types should have the same length as col_names. Using smaller of the two


> tmp <- read_tsv(files[1])
Parsed with column specification:
cols(
gene_id = col_character(),
raw_count = col_double(),
scaled_estimate = col_double(),
transcript_id = col_character()
)
> tmp
# A tibble: 20,531 x 4
gene_id     raw_count scaled_estimate transcript_id
<chr>           <dbl>           <dbl> <chr>
1 ?|100130426       0      0            uc011lsn.1
2 ?|100133144      65.5    0.00000178   uc010unu.1,uc010uoa.1
3 ?|100134869      32.5    0.000000639  uc002bgz.2,uc002bic.2
4 ?|10357         504.     0.0000316    uc010zzl.1
5 ?|10431        5298      0.000150     uc001jiu.2,uc010qhg.1
6 ?|136542          0      0            uc011krn.1
7 ?|155060        564      0.00000604   uc003wfr.3,uc003wft.3,uc003wfu.2,uc011kup.1
8 ?|26823           1      0.0000000788 uc011mlh.1
9 ?|280660          0      0            uc010nib.1
10 ?|317712          0      0            uc010ihw.1
# ... with 20,521 more rows


Is there a way tximport will deal with this kind of TCGA data?

tximport rsem • 1.3k views
1
Entering edit mode
17 months ago
Asaf 8.6k

This is clearly not RSEM output, it doesn't contain the length column, I think this is what's throwing the error. How comes the gene_id looks like it? Is it "?" in the input files?

0
Entering edit mode
0
Entering edit mode

yes, I agree, but I got to process this data I got from TCGA - BRCA (HTseq counts) project, to make a matrix of TPM.

there are genes in the tmp obj starting from row 30 :

> tmp[30:40,]
# A tibble: 11 x 4
gene_id    raw_count scaled_estimate transcript_id
<chr>          <dbl>           <dbl> <chr>
1 A1BG|1          356.    0.00000549   uc002qsd.3,uc002qsf.1
2 A1CF|29974        0     0            uc001jjh.2,uc001jji.2,uc001jjj.2,uc001jjk.1,uc009xo~
3 A2BP1|547~       10     0.0000000723 uc002cyr.1,uc002cys.2,uc002cyt.2,uc002cyv.1,uc002cy~
4 A2LD1|877~      276.    0.00000495   uc001voq.1,uc001vor.2,uc001vos.2
5 A2ML1|144~      705     0.00000468   uc001quz.3,uc001qva.1,uc001qvb.1,uc010sgm.1
6 A2M|2         29840.    0.000283     uc001qvj.1,uc001qvk.1,uc009zgk.1
7 A4GALT|53~      470     0.00000680   uc003bdb.2,uc010gzd.2
8 A4GNT|511~        5     0.0000000874 uc003ers.2
9 AAA1|4047~        0     0            uc003tdz.2,uc003teb.1,uc003tek.3,uc010kwo.1,uc010kw~
10 AAAS|8086      2439     0.0000410    uc001scr.3,uc001scs.3
11 AACSL|729~        0     0            uc003mjk.2,uc011dgk.1,uc011dgl.1


Can we make a matrix from this kind of data through tximport? I need a TPM matrix from TCGA BRCA data set, and I can get it if I will get txi\$abundance ..

0
Entering edit mode

...but, wait a second, you have HT-seq counts data, so, why were you trying to import it as RSEM? - they are different.

If you have HT-seq counts, then import them to DESeq2 via DESeqDataSetFromMatrix(), or take a look here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseq

0
Entering edit mode

Hi, thanks, it makes sense, but I do not want to detect differentially expressed genes - I want to convert a matrix with counts (either raw or normalized) to TPM matrix, So count matrix that contains all samples in columns and all genes in rows can be converted into TPM matrix?

0
Entering edit mode

Before we become completely confused here, which data have you downloaded? It has been implied that you have both HT-seq and RSEM data, but these are produced from different methods. Only the HT-seq files will contain a raw count, whereas RSEM files will contain an estimated count. If you literally point me to a file that you downloaded, I will be in a better position to assist.

Generally, though, to calculate TPM, you need to know the gene CDS length, which you currently do not have; however, it can be obtained in different ways.

0
Entering edit mode

yes, i am sorry and thank you for being patient with me! this data is also new for me, so I followed these steps to download the data from TCGA using ""TCGAbiolinks" package in R, although I have download the whole data set, I am making it simpler here, to download for a few samples, so that it will be easier for you to see this data and get a grip:

library(TCGAbiolinks)

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
"TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
"TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
"TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
"TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode
query <- GDCquery(project = "TCGA-BRCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
experimental.strategy = "RNA-Seq",
platform = "Illumina HiSeq",
file.type = "results",
barcode = listSamples,
legacy = TRUE)



If you will run these commands, you will see the files are created in dir "BRCA_test", each folder contains files with extension *.rsem.genes.results

The file looks like the one I mentioned in my original query, containing gene_id, raw_counts, scaled_estimates, transcript_id.

0
Entering edit mode

Now, I am able to get a dds object from DESeq2 package for this data using:

dds <- DESeqDataSetFromMatrix(countData = BRCA_mat,
colData = sampleData,
design = ~ condition)


I used "raw_counts" to generate "BRCA_mat"

Now my question is from raw count matrix, can I get TPM matrix.

I am aware that I will need featureLength, meanFragmentLength to calculate TPM - but given the data I get from TCGA, I do not have this data on length.

So is it possible to get TPM or even FPKM matrix from raw count martix?

Even if I will get FPKM, I will convert it to TPM. Thanks.

0
Entering edit mode

I see - thanks. You can calculate TPM from the BRCA_mat object; however, you will still require the gene lengths. The TPM calculation is elaborated ere: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained.

DESeq2 does not produce TPM data. However, you could easily use DESeq2 to produce variance-stabilised or regularised log data, which are 'better' than TPM, in my opinion.

0
Entering edit mode

And even if i do this: dds <- estimateSizeFactors(dds) counts(dds, normalized=TRUE)

Will these normalized counts be equivalent to TPM?

I think not, This is just dividing each column of

counts(dds) by sizeFactors(dds)

but it will not normalize for gene length,