Hi, can anyone guide me on how to retrieve drug-resistant and drug-sensitive patient data from the TCGA database? Many thanks in advance.
To get you started, please take a look at my previous answers:
You should be able to infer (from the available clinical data) the patients who relapsed while taking therapeutic agents.
Kevin
Hi @Chaimaa
I used both of the columns for categorisation. Cases with 'new_tumor_event_indicator' as 'YES' and 'tumor_status' as 'WITH TUMOR' were grouped as 'resistant'. The sensitive cases correspond to the absence of a new tumor event together with a tumor-free status. In R, you can select the rows that meet specific criteria in a particular column. After getting the barcodes for each group, you can use TCGAbiolinks for downloading and analysing the data, or you can use edgeR for differential expression analysis. Hope this helps!
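A minimal sketch of that grouping in base R, assuming the follow-up table has already been read into a data frame. The column names 'new_tumor_event_indicator' and 'tumor_status' are taken from the post above; 'bcr_patient_barcode' and all of the rows are invented for illustration, so check them against your own file.

```r
# Toy follow-up table. The barcodes and values are hypothetical;
# real TCGA files may use slightly different column names.
follow_up <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0002",
                          "TCGA-AA-0003", "TCGA-AA-0004"),
  new_tumor_event_indicator = c("YES", "NO", "NO", "YES"),
  tumor_status = c("WITH TUMOR", "TUMOR FREE", "WITH TUMOR", "TUMOR FREE"),
  stringsAsFactors = FALSE
)

# Resistant: new tumor event AND still with tumor.
resistant <- follow_up$new_tumor_event_indicator == "YES" &
  follow_up$tumor_status == "WITH TUMOR"

# Sensitive: no new tumor event AND tumor free.
sensitive <- follow_up$new_tumor_event_indicator == "NO" &
  follow_up$tumor_status == "TUMOR FREE"

# Cases that match neither rule stay NA rather than being forced into a group.
follow_up$response <- NA_character_
follow_up$response[resistant] <- "resistant"
follow_up$response[sensitive] <- "sensitive"

resistant_barcodes <- follow_up$bcr_patient_barcode[resistant]
sensitive_barcodes <- follow_up$bcr_patient_barcode[sensitive]
```

Leaving the ambiguous cases as NA is a deliberate choice in this sketch: a patient with no new tumor event who is still 'WITH TUMOR' does not clearly belong to either group.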
Kevin is right, but I think you need to be careful to make sure that you understand the metadata.
For example, I think the clinical data usually comes from after the first bio-specimen collection. So, if you were looking for pre-existing changes (before treatment), then that could be an issue. In other words, you might need to think carefully about whether gene expression changes are pre-treatment or post-treatment (and whether patients have had multiple drugs and/or multiple follow-ups, etc.).
Thank you Charles for the additional guidance.
Thanks for the guidance, Kevin. I did get the drug information from the clinical_drug file. Information on disease relapse was found in the clinical_follow_up file. How can I combine the information and get the barcodes of the patients who relapsed after treatment with a particular kind of drug?
You should be able to connect the different clinical data spreadsheets via the UUID and Barcode (?)
I merged the clinical_drug and follow_up files using the merge function.
Hopefully I can now get the corresponding barcodes and do a differential expression analysis with the gene expression data? Many thanks!
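For anyone following along, the merge step could look roughly like this in base R. This is only a sketch on invented rows: the column names 'bcr_patient_barcode', 'drug_name' and 'new_tumor_event', the barcodes and the drug names are all assumptions for illustration, not the exact headers of the TCGA files.

```r
# Toy drug table: one row per drug per patient (hypothetical values).
clinical_drug <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0001",
                          "TCGA-AA-0002", "TCGA-AA-0003"),
  drug_name = c("Fluorouracil", "Oxaliplatin", "Fluorouracil", "Cisplatin"),
  stringsAsFactors = FALSE
)

# Toy follow-up table: one row per patient (hypothetical values).
follow_up <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
  new_tumor_event = c("YES", "NO", "YES"),
  stringsAsFactors = FALSE
)

# Inner join on the patient barcode; patients with several drugs
# appear once per drug in the merged table.
merged <- merge(clinical_drug, follow_up, by = "bcr_patient_barcode")

# Restrict to one drug, then pull the barcodes of patients who relapsed.
on_drug <- merged[merged$drug_name == "Fluorouracil", ]
relapsed_barcodes <- unique(
  on_drug$bcr_patient_barcode[on_drug$new_tumor_event == "YES"]
)
```

Real drug-name columns are free text, so the same drug may be spelled several ways; it is worth tabulating the names before filtering on a single string.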
Yes, that looks good. Can you match this to the expression data? The expression data should have the same barcode (?).
Yes, I extracted the barcodes for a particular drug and grouped them as 'resistant' and 'sensitive' based on new_tumor_event. There are multiple follow_up_barcodes and multiple drugs for the same patient barcode. There are also two or three FPKM files for the same barcode when I explored the expression data in the GDC portal.
Can you please guide me on how to select the appropriate barcodes and FPKM files? I hope the FPKM files can be further analysed for differential expression using edgeR (?).
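One possible first step, assuming the duplicates come from several samples or aliquots per patient, is to parse the barcodes. In a full TCGA aliquot barcode, the first 12 characters identify the patient and characters 14 to 15 hold the sample type code ('01' is primary solid tumor, '11' is solid tissue normal). The barcodes below are invented, and keeping the first aliquot per patient is an arbitrary choice made only for this sketch.

```r
# Hypothetical aliquot barcodes: patient 0001 has a tumor and a normal
# sample, patient 0002 has two aliquots of the same tumor sample.
aliquots <- c(
  "TCGA-AA-0001-01A-11R-1234-07",
  "TCGA-AA-0001-11A-11R-1234-07",
  "TCGA-AA-0002-01A-11R-1234-07",
  "TCGA-AA-0002-01A-12R-5678-07"
)

patient     <- substr(aliquots, 1, 12)   # e.g. "TCGA-AA-0001"
sample_type <- substr(aliquots, 14, 15)  # e.g. "01" or "11"

# Keep primary solid tumor samples only.
primary <- aliquots[sample_type == "01"]

# Keep one aliquot per patient (here simply the first one listed).
one_per_patient <- primary[!duplicated(substr(primary, 1, 12))]
```

Whatever rule you use to pick among duplicate aliquots, apply it identically to both groups and record it, so that the selection itself cannot create a difference between them.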
Please try to obtain the HTSeq raw count files, and then normalise those in edgeR. You cannot use FPKM expression values for differential expression analysis in edgeR, because it expects raw counts.
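As a rough sketch of what that normalisation can look like, the following uses a simulated count matrix in place of real HTSeq counts. The matrix, the sample names and the group labels are all invented; only the edgeR calls (DGEList, filterByExpr, calcNormFactors, cpm) are the package's own functions.

```r
library(edgeR)

# Simulated raw counts: 200 genes x 6 samples (stand-in for HTSeq counts).
set.seed(1)
counts <- matrix(
  rnbinom(200 * 6, mu = 100, size = 10),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:6))
)
group <- factor(rep(c("resistant", "sensitive"), each = 3))

# Build the edgeR object from raw counts.
y <- DGEList(counts = counts, group = group)

# Drop genes with too few reads to be informative.
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalisation factors.
y <- calcNormFactors(y)

# Log2 counts per million, suitable for MDS or PCA plots.
logcpm <- cpm(y, log = TRUE)
```

The log CPM matrix is for exploration and plotting; the differential expression tests themselves are run on the DGEList object, not on the log CPM values.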
Ok. Thank you Kevin.
The raw count files were fed into edgeR and normalised. The two groups are not well separated in the MDS plot. Should I remove the cases that are not well separated and then proceed? Many thanks in advance.
No, you should not remove them without major justification. Can you generate a PCA bi-plot for PC1 versus 2?
PCA was conducted on the raw counts matrix (matrix_5FU) with 14 different cases.
-- removing the lower 10% of variables based on variance
Warning messages: 1: Removed 2 rows containing missing values (geom_path). 2: Removed 2 rows containing missing values (geom_point).
Okay, that sample on the right is definitely an outlier by PCA. What was your input to the pca() function, though? If you are using edgeR, it should be the log CPM expression values.
Sorry, it was the raw count matrix which was given as input. I am attaching the scree plot and biplot of the PCA which was done on log CPM expression values.
Can you please guide me on the interpretation of the newly created biplot? I wasn't able to label the cases in the plot as well. Thanks!
It looks like you are using PCAtools, so you can set labels via the lab parameter.
I did not reply to your earlier comment because I am giving you the opportunity to make your own interpretation. One could argue that the sample on the right is an outlier that may affect your statistical interpretations; however, for now, I would not remove anything from the dataset.
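A small sketch of labelling the cases with the lab parameter, assuming PCAtools is installed. The expression matrix here is random noise standing in for log CPM values, and the case names and group labels are invented; the row names of the metadata must match the column names of the matrix.

```r
library(PCAtools)

# Random stand-in for a log CPM matrix: 500 genes x 8 cases.
set.seed(1)
mat <- matrix(
  rnorm(500 * 8),
  nrow = 500,
  dimnames = list(paste0("gene", 1:500), paste0("case", 1:8))
)

# Metadata rows must be named exactly like the matrix columns.
meta <- data.frame(
  group = rep(c("resistant", "sensitive"), each = 4),
  row.names = colnames(mat)
)

# removeVar = 0.1 drops the 10% of genes with the lowest variance.
p <- pca(mat, metadata = meta, removeVar = 0.1)

# Label each point with its case name and colour by group.
plt <- biplot(p, x = "PC1", y = "PC2",
              lab = rownames(p$metadata), colby = "group")
```

With the points labelled, it becomes possible to say which case the apparent outlier is and then check its clinical record before deciding anything.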
I didn't want to remove the samples, but needed an expert opinion! Many thanks for your time, Kevin.
Is there any command to fetch raw HTSeq counts from TCGA for a large number of patient barcodes?
fawazfebin : Please use these directions to post images. How to add images to a Biostars post
Sure. Please excuse me for the inconvenience caused.
@fawazfebin, could you please tell me how you grouped the drugs into 'resistant' and 'sensitive', based on which information and from which data? I have downloaded the follow-up dataset from GDAC but I can't find this information.
Under the 'Clinical' files you can find different .txt files which give you information about the drug used (the clinical_drug .txt file) and disease recurrence (the follow_up .txt file). The 'new_tumor_event' cases can be grouped as 'resistant', and the cases without a 'new_tumor_event' that are 'tumor_free' can be grouped as 'sensitive'.
@fawazfebin, hi, I'm confused. Do you mean the information in the column "new_tumor_event_after_initial_treatment", or the one which has "WITH TUMOR" and "TUMOR FREE", as shown in the figure below? And after you group them into those 2 groups, how do you label them for further analysis: as 0 and 1, or as categorical variables? Say I want to use lasso or logistic regression to find the features that relate this clinical information to the gene expression data. I appreciate your help!