Question

TCGA STAR - Counts, where are estimated counts?

1

20 months ago

beketamas ▴ 10

I would like to extract the estimated counts for TCGA samples from the transcriptome tables. Previously they were available in the HTSeq-Counts files, but GDC recently updated the file format to STAR-Counts. In the HTSeq-Counts version, the raw_data column stored those records, but now I'm struggling to find out where they've put them. None of the available columns seems to store that information anymore.

If it's really not available here, does anyone know which file stores it, or how to calculate it? (I guess I would need gene length information as well.)

STAR file format: rna_seq.augmented_star_gene_counts.tsv

TCGA HTSeq TPM STAR estimated_counts • 2.7k views

ADD COMMENT • link updated 18 months ago by Zhenyu Zhang ★ 1.2k • written 20 months ago by beketamas ▴ 10

1

In this case it seems an unstranded library preparation method was used, since the unstranded column has higher counts than the stranded ones.

You should consider unstranded counts as raw counts. Just to be sure, please check the kit that was used.

ADD REPLY • link 20 months ago by Hyper_Odin ▴ 310

0

Thank you!

ADD REPLY • link 20 months ago by beketamas ▴ 10

score 2 · Answer 1 · 2022-08-28

As mentioned by Hyper_Odin, "unstranded" is equivalent to the previous HTSeq raw count.
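
To make this concrete, here is a minimal sketch of pulling the unstranded column out of one STAR-Counts file using only the Python standard library. The `unstranded` column name comes from the file itself; the `gene_id` column name, the leading `#` comment line, and the `N_...` summary rows at the top are assumptions about the file layout, so check them against your own download. The function name `read_unstranded_counts` is made up for illustration.

```python
import csv


def read_unstranded_counts(path):
    """Return {gene_id: unstranded count} from one STAR-Counts TSV file."""
    counts = {}
    with open(path, newline="") as handle:
        # Assumption: lines starting with "#" are comments placed before
        # the header row, so they are skipped.
        rows = (line for line in handle if not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            # Assumption: rows whose gene_id starts with "N_" are mapping
            # summaries (e.g. N_unmapped), not genes, so they are skipped.
            if row["gene_id"].startswith("N_"):
                continue
            counts[row["gene_id"]] = int(row["unstranded"])
    return counts
```

Calling `read_unstranded_counts("sample.rna_seq.augmented_star_gene_counts.tsv")` (a hypothetical file name) should give one integer count per gene, which can then be used the same way as the old HTSeq raw counts.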

The other two counting modes, which take strandedness into account, are also available in the file. If those two (stranded_first and stranded_second) have very unbalanced counts, it's normally a good indication that the library is stranded.
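
As a rough way to check that balance, the sketch below sums the two stranded columns over all gene rows and reports the share that falls in stranded_first. It is only an illustration: the function name `stranded_first_fraction` is made up, and the `gene_id` column, the `#` comment line, and the `N_...` summary rows are assumptions about the file layout. No cutoff is defined here, so how far from 0.5 counts as "very unbalanced" is left to your judgment.

```python
import csv


def stranded_first_fraction(path):
    """Return the share of stranded counts that fall in stranded_first."""
    first_total = 0
    second_total = 0
    with open(path, newline="") as handle:
        # Assumption: "#" comment lines precede the header row.
        rows = (line for line in handle if not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            # Assumption: "N_..." rows are mapping summaries, not genes.
            if row["gene_id"].startswith("N_"):
                continue
            first_total += int(row["stranded_first"])
            second_total += int(row["stranded_second"])
    total = first_total + second_total
    if total == 0:
        raise ValueError("no stranded counts found in " + path)
    return first_total / total
```

A value near 0.5 means the two stranded columns are balanced, which is what an unstranded library would produce; a value close to 0 or 1 means one column dominates, which points to a stranded library.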

Which one to use is a tricky question. Please remember that counts from different strandedness modes are not comparable to each other. If you know the library preparation is stranded, and you are only interested in a single project that has been uniformly processed, the stranded counts will give you more power. Otherwise, if you are not sure about strandedness, or if you are doing an analysis across multiple projects, I suggest using the unstranded counts even if you know some of the data are stranded.

There is a related question about why GDC didn't tell users the strandedness explicitly to reduce confusion.

In the early days, GDC surveyed multiple sequencing centers and data submitters, and found that even some of these professionals could not report the strandedness correctly.
A stranded kit does not guarantee a stranded library, as the kit can also be used to generate an unstranded library (as in TCGA).
As described above, the unstranded count is still the safest choice.