Question

Drug resistant vs. drug sensitive data retrieval from TCGA


5.0 years ago

fawazfebin ▴ 100

Hi, can anyone guide me on how to retrieve drug-resistant and drug-sensitive patient data from the TCGA database? Many thanks in advance.

data retrieval TCGA drug resistance • 3.5k views

ADD COMMENT • link 4.6 years ago by fawazfebin ▴ 100

GenoMax · Answer 1 · 2019-11-08


5.0 years ago

Kevin Blighe 88k

To get you started, please take a look at my previous answers:

You should be able to infer (from the available clinical data) the patients who relapsed while taking therapeutic agents.

Kevin

ADD COMMENT • link 5.0 years ago by Kevin Blighe 88k


Kevin is right, but I think you need to make sure that you understand the metadata.

For example, I think the clinical data usually dates from after the first bio-specimen collection. So, if you were looking for pre-existing changes (before treatment), then that could be an issue. In other words, you might need to think carefully about whether gene expression changes are pre-treatment or post-treatment (and whether patients have had multiple drugs and/or multiple follow-ups, etc.).

ADD REPLY • link 5.0 years ago by Charles Warden 8.3k


Thank you Charles for the additional guidance.

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


Thanks for the guidance, Kevin. I got the drug information from the clinical_drug file, and the information on disease relapse from the clinical_follow_up file. How can I combine the two and get the barcodes of the patients who relapsed after treatment with a particular drug?

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


You should be able to connect the different clinical data spreadsheets via the UUID and Barcode (?)

ADD REPLY • link 4.9 years ago by Kevin Blighe 88k


I merged the clinical_drug and follow_up file using the merge function.

merged_data <- merge(follow_up,clinical_drug,by = "bcr_patient_barcode")

Hopefully I can get the corresponding barcodes and do a differential expression analysis with gene expression data? Great thanks!

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100
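
One possible way to go from the merged table to a list of patient barcodes is sketched below. This is a minimal illustration on invented toy data, not the code used in the thread: only `bcr_patient_barcode` is a column name taken from the posts, while `drug_name`, `new_tumor_event` and all values are assumptions made for the example.

```r
# Toy stand-ins for the clinical_drug and follow_up tables (invented values).
clinical_drug <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
  drug_name = c("Fluorouracil", "Oxaliplatin", "Fluorouracil", "Oxaliplatin"),  # assumed column
  stringsAsFactors = FALSE
)
follow_up <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
  new_tumor_event = c("NO", "YES", "NO", "YES"),  # assumed column
  stringsAsFactors = FALSE
)

# merge() pairs every follow-up row with every drug row of the same patient,
# so a patient with 2 follow-ups and 2 drugs contributes 4 rows.
merged_data <- merge(follow_up, clinical_drug, by = "bcr_patient_barcode")

# Keep the drug of interest, then collapse to one value per patient:
# here a patient counts as relapsed if any follow-up records a new tumor event.
drug_rows <- merged_data[merged_data$drug_name == "Fluorouracil", ]
relapsed <- tapply(drug_rows$new_tumor_event == "YES",
                   drug_rows$bcr_patient_barcode, any)
relapsed_barcodes <- names(relapsed)[as.vector(relapsed)]
print(relapsed_barcodes)
```

Note that this sketch does not check whether the new tumor event occurred after the drug was given; that would need the date columns of the real files.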


Yes, that looks good. Can you match this to the expression data? The expression data should have the same barcode (?).

ADD REPLY • link 4.9 years ago by Kevin Blighe 88k
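
Matching on barcode can be sketched as follows, assuming the expression matrix has full TCGA sample barcodes as column names (files downloaded from the GDC portal are named differently and would first need mapping to barcodes). The patient barcode is the first 12 characters of the longer barcode. The matrix and the barcodes below are invented for illustration.

```r
# Invented aliquot-level barcodes standing in for expression matrix columns.
counts <- matrix(1:8, nrow = 2,
                 dimnames = list(c("geneA", "geneB"),
                                 c("TCGA-AA-0001-01A-11R-A000-07",
                                   "TCGA-AA-0002-01A-11R-A000-07",
                                   "TCGA-AA-0003-11A-11R-A000-07",
                                   "TCGA-AA-0004-01A-11R-A000-07")))

# Patient barcodes obtained from the clinical tables (hypothetical).
patients_of_interest <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003")

# The first 12 characters of a sample barcode identify the patient.
column_patient <- substr(colnames(counts), 1, 12)

# Keep only the expression columns that belong to the selected patients.
matched <- counts[, column_patient %in% patients_of_interest, drop = FALSE]
print(colnames(matched))
```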


Yes, I extracted the barcodes for a particular drug and grouped them as 'resistant' and 'sensitive' based on new_tumour_event. There are multiple follow_up_barcodes and multiple drugs for the same patient barcode. There are also two or three FPKM files for the same barcode when I look for expression data in the GDC portal.

Can you please guide me on how to select the appropriate barcodes and FPKM files? I hope the FPKM files can be further analyzed for differential expression using edgeR (?).

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100
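
Several files per patient often reflect different sample types or aliquots. The sketch below shows one way to narrow them down, assuming full TCGA sample barcodes are available for each file: characters 14 and 15 hold the sample type code (01 for primary solid tumor, 11 for solid tissue normal). Keeping only primary tumors is a choice made for this example, not a rule from the thread, and the barcodes are invented.

```r
# Invented sample barcodes: patient 0001 has tumor + normal,
# patient 0002 has two tumor vials.
sample_barcodes <- c("TCGA-AA-0001-01A-11R-A000-07",
                     "TCGA-AA-0001-11A-11R-A000-07",
                     "TCGA-AA-0002-01A-11R-A000-07",
                     "TCGA-AA-0002-01B-11R-A000-07")

# Characters 14-15 are the sample type code (01 = primary solid tumor).
sample_type <- substr(sample_barcodes, 14, 15)
tumor <- sample_barcodes[sample_type == "01"]

# Patients that still have more than one tumor sample need a manual decision.
tumor_patient <- substr(tumor, 1, 12)
per_patient <- table(tumor_patient)
needs_review <- names(per_patient)[as.vector(per_patient > 1)]

print(tumor)
print(needs_review)
```

Which of the remaining duplicates to keep (or whether to combine them) depends on the study design, so the sketch only flags them.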


Please try to obtain the HTSeq raw count files, and then normalise those in edgeR. You cannot use FPKM expression values for differential expression analysis in edgeR, which expects raw counts.

ADD REPLY • link 4.9 years ago by Kevin Blighe 88k


Ok. Thank you Kevin.

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


The raw count files were fed into edgeR and normalised. The two groups are not well separated in the MDS plot. Should I remove the cases that are not separated and then proceed? Many thanks in advance.

[Image: MDS plot]

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


No, you should not remove them without major justification. Can you generate a PCA bi-plot for PC1 versus 2?

ADD REPLY • link 4.9 years ago by Kevin Blighe 88k


PCA was conducted on the raw counts matrix (matrix_5FU) with 14 different cases.

p.5FU <- pca(matrix_5FU[,2:14], removeVar = 0.1)

-- removing the lower 10% of variables based on variance

screeplot(p.5FU)

Warning messages: 1: Removed 2 rows containing missing values (geom_path). 2: Removed 2 rows containing missing values (geom_point).

[Image: scree plot]

[Image: biplot]

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


Okay, that sample on the right is definitely an outlier by PCA. What was your input to the pca() function, though? If you are using edgeR, it should be the log CPM expression values.

ADD REPLY • link 4.9 years ago by Kevin Blighe 88k
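
The difference between running PCA on raw counts and on log CPM values can be sketched in base R. This is a simplified illustration on simulated counts: it uses a plain pseudocount of 1, which is not the same as the library-size-scaled prior count used by edgeR's `cpm(..., log = TRUE)`, and it uses `prcomp()` rather than PCAtools.

```r
set.seed(1)

# Simulated raw counts: 200 genes x 6 samples (invented data).
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("sample", 1:6)))

# Simplified log2 counts-per-million with a pseudocount of 1.
# (edgeR's cpm(..., log = TRUE) uses a scaled prior count instead.)
lib_size <- colSums(counts)
log_cpm <- log2(t(t(counts) / lib_size) * 1e6 + 1)

# prcomp() expects samples in rows, so transpose the genes x samples matrix.
pca_fit <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component.
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
print(round(var_explained[1:2], 3))

# Sample scores on PC1 and PC2; row names carry the sample labels.
print(pca_fit$x[, 1:2])
```

The row names of `pca_fit$x` can be passed to `text()` after `plot(pca_fit$x[, 1:2])` to label each case on the plot.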


Sorry, the raw count matrix was given as input. I am attaching the screeplot and biplot of the PCA done on log CPM expression values.

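> library(edgeR)
> library(org.Hs.eg.db)
> library(PCAtools)

> matrix_5FU <- read.delim('5FU.csv', sep = ',', header = TRUE)

> Group <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)

> gns5FU <- select(org.Hs.eg.db, keys = rownames(matrix_5FU), columns = c("SYMBOL","GENENAME"), keytype = "ENTREZID")
'select()' returned 1:1 mapping between keys and columns

> y.5FU <- DGEList(counts = matrix_5FU[,2:14], genes = gns5FU, group = Group)

> CPM.5FU.log <- cpm(y.5FU, log = TRUE)

> # the pca() call was not shown in the post; presumably the same call as before, on the log CPM values
> p.5FU.log <- pca(CPM.5FU.log, removeVar = 0.1)

> screeplot(p.5FU.log)

> biplot(p.5FU.log)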

[Image: scree plot]

[Image: biplot]

ADD REPLY • link updated 4.9 years ago by GenoMax 147k • written 4.9 years ago by fawazfebin ▴ 100


Can you please guide me on the interpretation of the newly created biplot? I wasn't able to label the cases in the plot as well. Thanks!

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


It looks like you are using PCAtools, so you can set labels via the lab parameter of biplot().

I did not reply to your earlier comment because I am giving you the opportunity to make your own interpretation. One could argue that the sample on the right is an outlier that may affect your statistical interpretations; however, for now, I would not remove anything from the dataset.

ADD REPLY • link 4.9 years ago by Kevin Blighe 88k


I didn't want to remove the samples, but needed an expert opinion! Many thanks for your time, Kevin.

Is there a command to fetch raw HTSeq counts from TCGA for a large number of patient barcodes?

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


fawazfebin: Please use these directions to post images: How to add images to a Biostars post

ADD REPLY • link 4.9 years ago by GenoMax 147k


Sure. Please excuse me for the inconvenience caused.

ADD REPLY • link 4.9 years ago by fawazfebin ▴ 100


@fawazfebin, could you please tell me how you made the 'resistant' and 'sensitive' groups, based on which information and from which data? I have downloaded the follow-up dataset from GDAC but I can't find this information.

ADD REPLY • link 4.7 years ago by Chaimaa ▴ 260


Under the 'Clinical' files you can find different .txt files that give you information about the drug used (the clinical_drug .txt file) and disease recurrence (the follow_up .txt file). Cases with a 'new_tumor_event' can be grouped as 'resistant', and cases without a 'new_tumor_event' that are 'tumor_free' can be grouped as 'sensitive'.

ADD REPLY • link 4.6 years ago by fawazfebin ▴ 100


@fawazfebin, hi, I'm confused: do you mean the information in the column "new_tumor_event_after_initial_treatment", or the one that has "WITH TUMOR" and "TUMOR FREE", as shown in the figure below? And after you group them into those two groups, how do you label them for further analysis: as 0 and 1, or as categorical variables? Say I want to use lasso or logistic regression to find the features relating this clinical information to gene expression data. I appreciate your help!

ADD REPLY • link 4.6 years ago by Chaimaa ▴ 260

score 1 · Answer 2 · 2020-03-26

Hi @Chaimaa

I used both of the columns for categorisation. Cases with 'new_tumor_event_indicator' as 'YES' and 'tumor_status' as 'WITH TUMOR' were grouped as 'resistant'. The sensitive cases correspond to the absence of a new tumor event and a tumor-free status. R commands are available to select rows with specific criteria in a particular column. After getting the barcodes for each group, you can use TCGAbiolinks for downloading and analysing the data. Alternatively, you can use edgeR for differential expression analysis. Hope this helps!
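
The categorisation described above, together with the 0/1 labelling asked about earlier, might look roughly like this in base R. It is a sketch on an invented toy table: the column names follow the ones quoted in this answer, and the real follow-up file may name or encode them differently (including missing or placeholder values, which would need handling first).

```r
# Toy follow-up table (invented values; column names as quoted in the answer).
follow_up <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003", "TCGA-AA-0004"),
  new_tumor_event_indicator = c("YES", "NO", "NO", "YES"),
  tumor_status = c("WITH TUMOR", "TUMOR FREE", "WITH TUMOR", "TUMOR FREE"),
  stringsAsFactors = FALSE
)

# Both columns must agree for a case to be assigned to a group.
resistant <- follow_up$new_tumor_event_indicator == "YES" &
  follow_up$tumor_status == "WITH TUMOR"
sensitive <- follow_up$new_tumor_event_indicator == "NO" &
  follow_up$tumor_status == "TUMOR FREE"

# Cases meeting neither rule stay NA rather than being forced into a group.
follow_up$response <- NA_character_
follow_up$response[resistant] <- "resistant"
follow_up$response[sensitive] <- "sensitive"

# Keep the labelled cases; code the outcome as a factor and as 0/1.
labelled <- follow_up[!is.na(follow_up$response), ]
labelled$response <- factor(labelled$response, levels = c("sensitive", "resistant"))
labelled$response01 <- as.integer(labelled$response == "resistant")
print(labelled[, c("bcr_patient_barcode", "response", "response01")])
```

A two-level factor like `response` can be used directly as the outcome in `glm(..., family = binomial)`; the `response01` column is the equivalent numeric coding.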