Question

TCGA library type information

1

2.9 years ago

komal.rathi ★ 4.1k

Hi,

Where can I find the library type (poly-A or ribo-depleted) for each RNA-seq sample/study in TCGA? I have tried looking through various papers and GDC portal and couldn't find an exact answer.

Thanks!

tcga • 2.2k views

updated 18 months ago by Zhenyu Zhang ★ 1.2k • written 2.9 years ago by komal.rathi ★ 4.1k

1

I just noticed that you were asking about TCGA specifically, instead of GDC in general. It's pretty straightforward in TCGA.

If you understand the TCGA barcode (https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/), you can simply read the 20th character of the aliquot barcode (aliquot.submitter_id in GDC), which is the analyte code: "R" means polyA and "T" means ribo-depletion (https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/portion-analyte-codes).
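As a rough illustration of the barcode approach, here is a minimal Python sketch. The function name and the lookup table are my own (hypothetical), the "R"/"T" interpretation is the one given in this reply, and the example barcodes are made up by changing the analyte letter of the sample barcode in the GDC documentation.

```python
# Interpretation of the analyte code as described in the reply above.
# Any other letter is reported as "unknown" rather than guessed.
LIBRARY_TYPE_BY_ANALYTE = {
    "R": "polyA",
    "T": "ribo-depletion",
}


def library_type_from_barcode(aliquot_barcode: str) -> str:
    """Infer the library type from the 20th character of a TCGA aliquot barcode."""
    if len(aliquot_barcode) < 20:
        raise ValueError(f"barcode too short to contain an analyte code: {aliquot_barcode!r}")
    analyte_code = aliquot_barcode[19]  # 20th character, zero-based index 19
    return LIBRARY_TYPE_BY_ANALYTE.get(analyte_code, "unknown")


# Illustrative barcodes only (not real aliquots):
for barcode in ("TCGA-02-0001-01C-01R-0182-01", "TCGA-02-0001-01C-01T-0182-01"):
    print(barcode, library_type_from_barcode(barcode))
```

Applied to a column of aliquot.submitter_id values, this gives one library type per aliquot without touching the controlled-access files.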

18 months ago by Zhenyu Zhang ★ 1.2k

0

+1 for this. Downloading the metadata with TCGAbiolinks and searching for "TotalRNASeqV2" might yield some results. I will try it later and let you know.

2.9 years ago by Barry Digby ★ 1.3k

0

Any luck, Barry?

2.9 years ago by Kevin Blighe 87k

1

Nothing, unfortunately. GDCquery() only uses 'Illumina' for the platform argument, despite the man pages describing a wide array of options. I double-checked and downloaded all GDCqueries for each project beginning with TCGA and yep, Illumina is the only level in the platform column.

Attempting to filter by experimental.strategy = "Total RNA-Seq" returns NULL objects for all 'TCGA-' projects too.

In short, I don't think TCGAbiolinks is a valid option.

2.9 years ago by Barry Digby ★ 1.3k

0

Perhaps ask the TCGA / GDC directly, @komal.rathi

2.9 years ago by Kevin Blighe 87k

score 1 · Answer 1 · 2021-06-01

1

2.9 years ago

Zhenyu Zhang ★ 1.2k

If you add a custom filter on read_group.library_selection in the GDC portal, you will see these values:

Poly-T Enrichment
rRNA Depletion

2.9 years ago by Zhenyu Zhang ★ 1.2k

score 0 · Answer 2 · 2022-08-30

The library prep information, if present, is found in read-group fields associated with the aligned read files (BAM files). These are controlled-access files, but their metadata is freely available and can be linked to the read counts per gene (the gene expression matrix), which is probably what you have from your GDC searches.

So your path to obtain the information you need is:

gene expression matrix -> bam file -> read group -> library prep info

All the fields associated with files in GDC are here: https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields

The following fields are sometimes populated for BAM files, if the submitter provided them, but are empty for the gene expression files:

analysis.metadata.read_groups.library_preparation_kit_catalog_number
analysis.metadata.read_groups.library_preparation_kit_name
analysis.metadata.read_groups.library_preparation_kit_vendor
analysis.metadata.read_groups.library_preparation_kit_version
analysis.metadata.read_groups.library_selection
analysis.metadata.read_groups.library_strand
analysis.metadata.read_groups.library_strategy

For example, let's say you want library prep info from the gene expression file 39f97389-0f71-4942-a7f9-b2df25a8365d.rna_seq.augmented_star_gene_counts.tsv

Search for that file in the GDC portal; it will take you to a page about that file. Scroll down to "Analysis", where there is an entry "source files" with the number "1" next to it, and click on that.

Now you should be on the page for the associated "sequencing reads" file (8af7bef7-0923-4431-b7e0-9cecbb7579fa.rna_seq.transcriptome.gdc_realn.bam, a bam file). Under 'read groups' there is some information on the read lengths etc. However, it's still not enough detail.

Let's use the API with curl: make a plain text file called "payload" with the following JSON:

{
    "filters":{
        "op":"=",
        "content":{
            "field":"files.file_name",
            "value":"8af7bef7-0923-4431-b7e0-9cecbb7579fa.rna_seq.transcriptome.gdc_realn.bam"
        }
    },
    "format":"tsv",
    "fields":"analysis.metadata.read_groups.read_group_id,analysis.metadata.read_groups.library_selection,analysis.metadata.read_groups.library_strategy,analysis.metadata.read_groups.library_preparation_kit_name",
    "size":"100"
}

Change the bam name in the filter to any other bam file you might want to query, and change the fields to any other valid field names for the /files endpoint, as listed in the user guide's appendix.

Then use this payload to query the files endpoint at GDC:

curl --request POST --header "Content-Type: application/json" --data @payload 'https://api.gdc.cancer.gov/files' > response.txt

The 'response.txt' file in this example should contain:

analysis.metadata.read_groups.0.library_preparation_kit_name    analysis.metadata.read_groups.0.library_selection   analysis.metadata.read_groups.0.library_strategy    analysis.metadata.read_groups.0.read_group_id   id
TruSeq Stranded Total RNA Library Prep Kit with Ribo-Zero Gold  rRNA Depletion  RNA-Seq 7f44fbe0-a4ef-4765-a47b-4869195559ce    80ca4e7a-e74f-4db0-a534-14d431537aa9
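If you would rather script this than keep a payload file around, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: build_payload and query_files_endpoint are names I made up, and the filter, fields, and endpoint are simply copied from the curl example above.

```python
import json
import urllib.request

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Same fields as in the payload file above.
READ_GROUP_FIELDS = [
    "analysis.metadata.read_groups.read_group_id",
    "analysis.metadata.read_groups.library_selection",
    "analysis.metadata.read_groups.library_strategy",
    "analysis.metadata.read_groups.library_preparation_kit_name",
]


def build_payload(bam_file_name, fields=READ_GROUP_FIELDS, size=100):
    """Build the JSON body used in the curl example for one bam file name."""
    return {
        "filters": {
            "op": "=",
            "content": {"field": "files.file_name", "value": bam_file_name},
        },
        "format": "tsv",
        "fields": ",".join(fields),
        "size": str(size),
    }


def query_files_endpoint(payload, timeout=60):
    """POST the payload to the files endpoint and return the TSV response as text."""
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (needs network access, so it is left commented out):
# payload = build_payload(
#     "8af7bef7-0923-4431-b7e0-9cecbb7579fa.rna_seq.transcriptome.gdc_realn.bam"
# )
# print(query_files_endpoint(payload))
```

Keeping the payload construction separate from the network call makes it easy to loop over many bam file names and to inspect the JSON before sending it.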

Warning: you could get a large file with many mostly empty columns if one of the results has multiple entries for a field. This is because the result is coerced into a TSV table ("format":"tsv"). The entry in your response with the most read groups determines the number of columns; all other entries are padded with empty strings in those columns. You can spot the columns that can proliferate because they contain an index starting at zero (e.g. "read_groups.0.").
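One way to cope with the padded columns is to fold the wide table back into one record per read group. The sketch below assumes the column naming seen in the example response (analysis.metadata.read_groups.N.field plus an id column); the function name is hypothetical and I have only checked it against that layout.

```python
import csv
import io
import re
from collections import defaultdict

# Matches columns such as "analysis.metadata.read_groups.0.library_selection".
READ_GROUP_COLUMN = re.compile(r"^analysis\.metadata\.read_groups\.(\d+)\.(.+)$")


def read_groups_from_tsv(tsv_text):
    """Return one dict per (file, read group), skipping empty padding cells."""
    records = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        groups = defaultdict(dict)
        for column, value in row.items():
            match = READ_GROUP_COLUMN.match(column or "")
            if match and value:  # empty strings are the padding described above
                groups[int(match.group(1))][match.group(2)] = value
        for index in sorted(groups):
            records.append(
                {"file_id": row.get("id", ""), "read_group_index": index, **groups[index]}
            )
    return records


# Usage, assuming the curl output was saved as response.txt:
# with open("response.txt", encoding="utf-8") as handle:
#     for record in read_groups_from_tsv(handle.read()):
#         print(record["file_id"], record.get("library_selection"))
```

Files with fewer read groups than the widest entry simply yield fewer records, so the empty cells never reach your downstream table.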