TCGA STAR - Counts, where are estimated counts?
20 months ago
beketamas ▴ 10

I would like to extract the estimated counts for TCGA samples from the transcriptome tables. Previously they were available in the HTSeq-Counts files, but GDC recently updated the file format to STAR-Counts. In the HTSeq-Counts version, the raw_data column stored those records, but now I'm struggling to find out where they've put them. None of the available columns seems to store that information anymore.

If it's really not available here, does someone know which file stores it, or how to calculate it? (I guess I would need gene length information as well.)

STAR file format: rna_seq.augmented_star_gene_counts.tsv

TCGA HTSeq TPM STAR estimated_counts • 2.7k views

In this case it seems that an unstranded library preparation method was used (since the unstranded column has higher counts than the stranded ones).

You should consider unstranded counts as raw counts. Just to be sure, please check the kit that was used.


Thank you!

20 months ago
Zhenyu Zhang ★ 1.2k

As mentioned by Hyper_Odin, "unstranded" is equivalent to the previous HTSeq raw count.
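To make that concrete, here is a minimal sketch of pulling the unstranded column out of one rna_seq.augmented_star_gene_counts.tsv file using only the Python standard library. It assumes the layout I have seen in these files: a leading `#` comment line, a header row containing `gene_id` and `unstranded`, and a few STAR summary rows whose IDs start with `N_`. The function name and the example rows are made up for illustration, so please check them against your own download.

```python
import csv
import io


def read_unstranded_counts(handle):
    """Return {gene_id: unstranded count} from an open STAR-Counts TSV."""
    # Skip comment lines (assumed to start with "#") that precede the header.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    counts = {}
    for row in reader:
        # Rows such as N_unmapped or N_noFeature are summary lines, not genes.
        if row["gene_id"].startswith("N_"):
            continue
        counts[row["gene_id"]] = int(row["unstranded"])
    return counts


# Made-up rows in the assumed layout; the IDs and values are illustrative only.
example_file = io.StringIO(
    "# gene-model: example\n"
    "gene_id\tgene_name\tgene_type\tunstranded\tstranded_first\t"
    "stranded_second\ttpm_unstranded\tfpkm_unstranded\tfpkm_uq_unstranded\n"
    "N_unmapped\t\t\t100\t100\t100\t\t\t\n"
    "GENE_A.1\tGeneA\tprotein_coding\t120\t60\t62\t1.5\t0.8\t0.9\n"
    "GENE_B.2\tGeneB\tlncRNA\t0\t0\t0\t0.0\t0.0\t0.0\n"
)
example_counts = read_unstranded_counts(example_file)
```

For a real file you would pass `open(path, newline="")` instead of the in-memory example.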

The counts from the other two counting modes, which take strandedness into account (stranded_first and stranded_second), are also available in the file. If those two columns have very unbalanced counts, it's normally a good indication that the library is stranded.
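A rough way to turn that balance check into code is sketched below. The inputs are the per-sample totals of the stranded_first and stranded_second columns (summed over gene rows only). The function names and the 0.8 cutoff are my own assumptions for illustration, not a GDC rule, so treat the result as a hint rather than a verdict.

```python
def strand_balance(first_total, second_total):
    """Fraction of the stranded counts that fall in the stranded_first column."""
    total = first_total + second_total
    if total == 0:
        raise ValueError("no stranded counts to compare")
    return first_total / total


def looks_stranded(first_total, second_total, threshold=0.8):
    """Heuristic: True when one stranded column clearly dominates the other.

    The threshold is an arbitrary illustrative choice, not an official cutoff.
    """
    fraction = strand_balance(first_total, second_total)
    return fraction >= threshold or fraction <= 1 - threshold


# Illustrative totals: one sample with a dominant column, one roughly balanced.
dominant = looks_stranded(950, 50)
balanced = looks_stranded(520, 480)
```

A fraction near 0.5 suggests an unstranded library, while a fraction near 0 or 1 suggests a stranded one (which of the two tells you the orientation).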

Which one to use is a tricky question. Please remember that counts from different strandedness modes are not comparable to each other. If you know the library preparation is stranded, and you are only interested in a project that has been uniformly processed, the stranded counts will give you more power. Otherwise, if you are not sure about strandedness, or if you are doing an analysis across multiple projects, I would suggest using unstranded even if you know some data are stranded.

There is a related question about why GDC doesn't tell users the strandedness explicitly to reduce confusion.

  1. In the early days, GDC did a survey with multiple sequencing centers and data submitters, and found that even some of these professionals could not report the strandedness correctly.
  2. A stranded kit does not guarantee a stranded library, as the kit can also be used to generate an unstranded library (such as in TCGA).
  3. As described above, the unstranded count is still the safest choice.

Thank you for the reply, it is super helpful! The unstranded column is what I need. One question still remains for me: previously, in the HTSeq format, raw_counts were floats (as they were normalized), not integers like here in the unstranded column. I would like to calculate fold changes with DESeq2, and I'm not sure whether this unstranded column is enough for the tool, or whether I should normalize it, or at least provide the gene lengths (which I don't have right now).


All previous GDC HTSeq Count data are integers, not floats. You might have downloaded the separate FPKM and UQ-FPKM normalized HTSeq count files, which have float normalized values. In the new STAR Count file, these normalized values are included in the same file in different columns.

And please don't use float numbers for DESeq2. Packages such as DESeq2 explicitly expect counts to be integer and un-normalized.
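If you assemble the per-sample unstranded counts yourself before handing them to DESeq2, a small sanity check like the sketch below can catch normalized (float) values early. It is plain Python with hypothetical names (`build_count_matrix`, the sample and gene IDs), and it only builds a genes-by-samples table of integers; it does not call DESeq2 itself.

```python
def build_count_matrix(per_sample_counts):
    """Combine {sample_id: {gene_id: count}} into (gene_ids, sample_ids, rows).

    Raises ValueError if any value is not a non-negative integer, which is
    what a raw, un-normalized count should be.
    """
    if not per_sample_counts:
        raise ValueError("no samples given")
    sample_ids = sorted(per_sample_counts)
    # Keep only the genes that are present in every sample.
    shared = set.intersection(*(set(c) for c in per_sample_counts.values()))
    gene_ids = sorted(shared)
    rows = []
    for gene in gene_ids:
        row = []
        for sample in sample_ids:
            value = per_sample_counts[sample][gene]
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or value < 0:
                raise ValueError(
                    f"{sample}/{gene}: expected a raw integer count, got {value!r}"
                )
            row.append(value)
        rows.append(row)
    return gene_ids, sample_ids, rows


# Made-up sample and gene IDs, for illustration only.
genes, samples, matrix = build_count_matrix(
    {
        "sample_2": {"GENE_A.1": 98, "GENE_B.2": 3},
        "sample_1": {"GENE_A.1": 120, "GENE_B.2": 0},
    }
)
```

Each row of `matrix` is one gene and each column one sample, in the order given by `genes` and `samples`, which is the orientation DESeq2 expects for its count matrix.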
