The library prep information, if present, is found in read-group fields associated with the aligned read files (BAM files). These are controlled-access files, but their metadata is freely available and can be linked to the read counts per gene (the gene expression matrix), which is probably what you have from your GDC searches.
So your path to obtain the information you need is:
gene expression matrix -> bam file -> read group -> library prep info
All the fields associated with files in GDC are here: https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields
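Read-group fields under analysis.metadata.read_groups, such as library_preparation_kit_name, library_selection and library_strategy, are sometimes available for bam files if the submitter added them, but they are empty for the gene expression files.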
For example, let's say you want library prep info for a particular gene expression file.
Search for that file in the GDC portal; this takes you to a page all about that file. Scroll down to "Analysis", where there is an entry "source files" with the number "1" next to it, and click on that.
Now you should be on the page for the associated "sequencing reads" file (8af7bef7-0923-4431-b7e0-9cecbb7579fa.rna_seq.transcriptome.gdc_realn.bam, a bam file). Under 'read groups' there is some information on the read lengths etc. However, it's still not enough detail.
Let's use the API with curl: make a plain text file called "payload" with the following JSON:
Change the bam name in the filter to any other bam file you might want to query, and change the fields to any other valid field names for the /files endpoint, as listed in the user guide's appendix.
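As a rough sketch, a payload of this kind could be generated with a few lines of Python. The field names are taken from the column headers of the example response further down, and the bam name is the one from the portal page above. The filter on `file_name`, the `size` value and the helper name `build_payload` are my assumptions for illustration, so check them against the user guide before relying on them.

```python
import json

# File name from the example above; swap in any other bam file name.
BAM_NAME = "8af7bef7-0923-4431-b7e0-9cecbb7579fa.rna_seq.transcriptome.gdc_realn.bam"

# Field names inferred from the column headers of the example response.
FIELDS = [
    "analysis.metadata.read_groups.library_preparation_kit_name",
    "analysis.metadata.read_groups.library_selection",
    "analysis.metadata.read_groups.library_strategy",
    "analysis.metadata.read_groups.read_group_id",
]


def build_payload(bam_name, fields, size=10):
    """Return a /files query that matches one file name and asks for TSV.

    Assumption: filtering on "file_name" with the "in" operator; the size
    value is an arbitrary small page size.
    """
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_name", "value": [bam_name]},
        },
        "format": "tsv",
        "fields": ",".join(fields),
        "size": str(size),
    }


# Print the JSON so it can be redirected into the "payload" file.
print(json.dumps(build_payload(BAM_NAME, FIELDS), indent=2))
```

Saved as, say, make_payload.py (a hypothetical name), running `python make_payload.py > payload` would produce the file that curl reads with `--data @payload`.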
Then use this payload to query the files endpoint at gdc:
curl --request POST --header "Content-Type: application/json" --data @payload 'https://api.gdc.cancer.gov/files' > response.txt
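If you would rather stay in Python than shell out to curl, the same POST can be sketched with the standard library. The endpoint URL and the Content-Type header are the ones from the curl command; the function names and default file names are just placeholders I chose to mirror it.

```python
import urllib.request

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_request(payload_bytes):
    """Build (but do not send) a POST request carrying the JSON payload."""
    return urllib.request.Request(
        FILES_ENDPOINT,
        data=payload_bytes,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_to_file(payload_path="payload", out_path="response.txt"):
    """Send the payload file to the endpoint and save the response body."""
    with open(payload_path, "rb") as handle:
        request = build_request(handle.read())
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Calling `fetch_to_file()` with the payload file in the working directory should be roughly equivalent to the curl line; nothing is sent until you call it.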
The 'response.txt' file in this example should contain a tab-separated header line followed by one row:
analysis.metadata.read_groups.0.library_preparation_kit_name	analysis.metadata.read_groups.0.library_selection	analysis.metadata.read_groups.0.library_strategy	analysis.metadata.read_groups.0.read_group_id	id
TruSeq Stranded Total RNA Library Prep Kit with Ribo-Zero Gold	rRNA Depletion	RNA-Seq	7f44fbe0-a4ef-4765-a47b-4869195559ce	80ca4e7a-e74f-4db0-a534-14d431537aa9
Warning: you could get a large file with lots of mostly empty columns if any result has multiple entries for a field. This is because we are coercing the result into a tsv table ("format":"tsv"). Whichever file in your response has the most read groups defines the number of columns; every other row is padded with empty strings in those columns. You can spot the columns that can potentially proliferate because their names contain an index (e.g. "read_groups.0.").