Efficient Operations on GDC/TCGA data
2.2 years ago

Hi all,

I'm a bit of a novice and was hoping someone might suggest more efficient ways of working with GDC/TCGA data. I regularly want to select cases based on some column in one of the many downloadable clinical files and then use those groups for analysis back in the count data. My typical approach has been to grep for matches of, say, "Primary Tumor", take the associated case IDs, and then jump into R to find those IDs in case.submitter_id, but this feels like a very slow way of doing it. When I instead try to do it all in R (creating a new data frame from the two columns of interest, finding the unique case IDs, and matching them against the count case IDs), I'm no faster. I'm sure there's a better way to do this, but I haven't had a breakthrough in figuring it out :|
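For reference, the grep-then-match round trip described above can usually be collapsed into a single vectorised selection in pandas. This is a minimal sketch with toy data; the column names `case_submitter_id` and `sample_type` are assumptions and will need adjusting to match the actual clinical file.

```python
import pandas as pd

# Toy stand-ins for the real files: a clinical table with one row per sample,
# and a count matrix whose columns are case submitter IDs. The column names
# here (case_submitter_id, sample_type) are assumptions -- adapt to your data.
clinical = pd.DataFrame({
    "case_submitter_id": ["TCGA-01", "TCGA-02", "TCGA-03", "TCGA-04"],
    "sample_type": ["Primary Tumor", "Solid Tissue Normal",
                    "Primary Tumor", "Primary Tumor"],
})
counts = pd.DataFrame(
    [[10, 5, 7, 2], [3, 8, 1, 9]],
    index=["GENE1", "GENE2"],
    columns=["TCGA-01", "TCGA-02", "TCGA-03", "TCGA-04"],
)

# One boolean-mask selection replaces grep + manual ID matching:
tumor_ids = clinical.loc[clinical["sample_type"] == "Primary Tumor",
                         "case_submitter_id"].unique()

# Subset the count matrix to just those cases:
tumor_counts = counts[tumor_ids]
print(tumor_counts.columns.tolist())
```

Because the selection and the subsetting happen on in-memory data frames, there is no need to round-trip case IDs through the shell.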

GDC TCGAbiolinks python R

Are you asking about how to filter cases or files by "Primary Tumor"? Go to the repository at https://portal.gdc.cancer.gov/repository?facetTab=files, switch the left panel from the file filter to the case filter, click "Add a Case/Biospecimen Filter", and search for tissue_type.
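The same filter can also be built programmatically against the GDC REST API (https://api.gdc.cancer.gov). Below is a hedged sketch that only constructs the request parameters; the field name "cases.samples.tissue_type" and the value "Tumor" are assumptions based on the portal facet mentioned above, so check them against the API's field listing before sending a real request.

```python
import json

# GDC API filters are nested JSON objects with an operator and content.
# The field and value below are assumptions -- verify against the GDC
# API's documented field list for your data type.
filters = {
    "op": "in",
    "content": {
        "field": "cases.samples.tissue_type",
        "value": ["Tumor"],
    },
}

# Parameters for a hypothetical query to the /cases endpoint; the filters
# object must be serialised to a JSON string before being sent.
params = {
    "filters": json.dumps(filters),
    "fields": "submitter_id",
    "format": "JSON",
    "size": "100",
}

# These params could then be passed to an HTTP GET against
# https://api.gdc.cancer.gov/cases; the network call is omitted here.
print(params["filters"])
```

Querying the API this way returns case submitter IDs directly, which avoids downloading and grepping clinical files by hand.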


My apologies for the poorly formed question; in hindsight, I should have stepped away from the computer before writing it in such a venting style.

On reflection, what I'm really asking is this: for those of you with more experience who access GDC/TCGA for your analyses, do you find you are most productive staying in bash until you need to switch to R/Python for, say, plotting, or do you have a better workflow importing directly into R/Python with one of the packages available in those languages? For context, I'm a beginner who has tried both approaches and struggled with each, so I was hoping to learn how the experts prefer to access that database and proceed with analysis. Thank you for your reply!


Unless you know exactly what your workflow should be, I would discourage you from using too many bash scripts in an investigative analysis. The issue is that you will end up with many different scripts, and even multiple versions of the same script. It will be very difficult for you to track your workflow and ensure reproducibility.

Just my 2 cents: I would suggest moving to R/Python as early in your analysis as possible; limit your bash usage to situations where it is absolutely needed, or to steps that will greatly reduce your coding effort; and use GitHub for version control.




