Question: How to deal with the abnormal samples from TCGA found in a PCA?
gravatar for lykosherdshep
11 months ago by
lykosherdshep90 wrote:

I have downloaded a dataset with 551 samples through TCGAbiolinks in R, and I prepared to do diffrentially expressed analysis, howerver, when I did data data quality assessment by sample distance clustering and PCA, there seemed some samples are abnormal, what should I do to them? Here is the picture. pca with data variance stabilized by DESeq2 function "vst"

heatmap of sample distance

rna-seq next-gen R • 556 views
ADD COMMENTlink modified 11 months ago • written 11 months ago by lykosherdshep90
gravatar for Kevin Blighe
11 months ago by
Kevin Blighe61k
Kevin Blighe61k wrote:

There is no issue, depending on your perspective. The NT samples are from normal tissue. The TP samples are from tumour tissue. It is well documented that biopsies taken from 'normal' / stromal tissue surrounding the primary tumour can resemble the tumour itself. This is where we enter the debate of when cancer begins. A geneticist may say that it begins in the very first cell that accumulates a point mutation in, e.g., an oncogene, but this still may occur many years before conventional diagnostic / screening methods can detect a palpable tumour.

The other possibility is that there is a sample mix-up, either in your code, the inner code of TCGAbiolinks, and / or at the TCGA source.


ADD COMMENTlink written 11 months ago by Kevin Blighe61k

Hi Kevin,

I wanted to add on a quick side question to this thread (I don't think it warrants a new post, still discussing sample variation present in TCGA-PRAD DE analysis)

When creating the design for DESeq2, the simplest way to do it is design = ~ Condition where condition has tumor/normal information. I was under the impression it is essential to account for patient variation in the dataset to maximise the Tumor vs. Normal comparison, thus making a metadata file like so:

Patient Condition
1   Tumor
2   Tumor
3   Normal
4   Tumor
5   Normal

And using design = ~ Patient + Condition for DESeq. Have I got the right idea here or is the simpler design preferable.



ADD REPLYlink written 1 day ago by Barry Digby400

Yes, that is indeed a good recommendation.

ADD REPLYlink written 1 day ago by Kevin Blighe61k

Thanks for your quickly reply , dose it mean that normally some normal samples should surrounded by tumour samples in PCA and their distance may not cluster in a heatmap? and I'll check if there are some same samples to confirm no mix-up samples from TCGA. here is my code. Thanks you to remove my doubts, I have been Googling this problem for one day.

counts_query <- GDCquery(project = "TCGA-PRAD",
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - Counts")
GDCdownload(counts_query, files.per.chunk = 8)
counts_se <- GDCprepare(counts_query, save = TRUE,
                        save.filename = "counts_prostate.rda", 
                        summarizedExperiment = TRUE)
counts_se$shortLetterCode <- factor(counts_se$shortLetterCode) 
counts_dds <- DESeqDataSet(counts_se, design = ~shortLetterCode)
counts_dds <- counts_dds[,counts_dds$shortLetterCode != "TM"]
counts_dds$shortLetterCode <- droplevels(counts_dds$shortLetterCode)
counts_dds$shortLetterCode <- relevel(counts_dds$shortLetterCode, ref = "NT")
counts_dds <- counts_dds[rowSums(counts(counts_dds)) > 1,]
vst_counts <- vst(counts_dds, blind = F)
CairoPNG(filename = "vst_meansdplot.png")
meanSdPlot(assay(vst_counts), ranks = T)
vst_dist <- dist(t(assay(vst_counts)))
dist_matrix <- as.matrix(vst_dist)
annotation <- data.frame(vst_counts$shortLetterCode)
colnames(dist_matrix) <- rownames(annotation)
rownames(dist_matrix) <- rownames(annotation)
CairoSVG(file = "sample_distance_heatmap.svg")
         clustering_distance_cols = vst_dist,
         clustering_distance_rows = vst_dist,
         show_colnames = F,
         show_rownames = F,
         annotation_col = annotation,
         annotation_row = annotation,
         annotation_names_col = F,
         annotation_names_row = F,
         main = "sample-distance")
CairoPNG(filename = "vst_pca.png")
DESeq2::plotPCA(vst_counts, intgroup = "shortLetterCode")
ADD REPLYlink written 11 months ago by lykosherdshep90

It makes sense to me. For one, PCA is looking at co-variation across your samples. Your heatmap, on the other hand, is based on a distance matrix that is calculated by Euclidean distances between your samples.

Looking again at your PCA bi-plot, the percent (%) variation explained by PC1 is relatively low, which indicates that there is no great separation across your samples. This is then confirmed in the heatmap.

I am not too familiar with the PRAD TCGA dataset, though. Is a lack of heterogeneity expected for Prostate Adenocarcinoma?

ADD REPLYlink written 11 months ago by Kevin Blighe61k

Yes, and I don't know how to deal with these samples.

ADD REPLYlink written 11 months ago by lykosherdshep90

What is the aim of your study?

ADD REPLYlink written 11 months ago by Kevin Blighe61k

To find the differentially expressed gene in prostate cancer, following by GO , KEGG analysis.Sorry for reply so late, the site biostar restrict the number of the newer post.

ADD REPLYlink written 11 months ago by lykosherdshep90

I see, well, you can conduct the analysis and then see what you find. Judging by the PCA bi-plot, the list of differentially expressed genes will not be large.

ADD REPLYlink written 11 months ago by Kevin Blighe61k

I did it, the code and results are below:

counts_de <- DESeq(counts_dds)
CairoPNG(filename = "de_despest.png")
counts_res <- results(counts_de, 
                      contrast = c("shortLetterCode", "TP", "NT"), 
                      alpha = 0.05)


out of 54091 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 12299, 23%
LFC < 0 (down)     : 8661, 16%
outliers [1]       : 0, 0%
low counts [2]     : 15731, 29%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Can the abnormal samples be a problem for my workflow? whether is it necessary for me to omit them? I have read some tutorial, they all don't refer to the solusion of the samples with a lack of heterogeneity.

ADD REPLYlink modified 11 months ago by Kevin Blighe61k • written 11 months ago by lykosherdshep90

From what I can see, for RNA-seq studies in cancer, unless you have a true normal sample group, then perhaps it is better to just look at the tumour samples. You can perform differential expression comparisons between the tumour sub-type, for example (I am not sure of the histopathological sub-types of prostate cancer); you can also retain highly expressed / not expressed genes based on Z-score cut-offs (this is how cBioPortal does it).

Basically, the TCGA never performed RNA-seq on a large number of normal samples, and the 'normals' that they used are, as you can see, mostly not normal at all. Some of the cancers do have true normal cells derived from, I believe, blood lymphocytes, but these are in a severe minority.

So, comparisons in terms of dififerntial expression between tumour and normal are not suitable for different reasons.

ADD REPLYlink written 11 months ago by Kevin Blighe61k

I'm a bit confused about the control samples selection between a true normal sample with tissue from a no-cancer individual and a normal tissue from a cancer people for RNA-seq, I can understand it is a better choice to chose a true normal sample from what you said, but in my opinion, it may be good choice to select a control sample with tissue from a cancer, because I think a matched group need control some condition which aren't experiment design. Can you tell me some reasons for the choice of a true normal sample.

I have just now read a post(TCGA: Does TCGA cancer studies have mRNA expression data for Control/Normal samples? ) where I can find there are both true normal samples(from people without cancer) and a normal sample from normal tissue(individuals have cancer) in TCGA with a different barcode(Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29 So 0 are tumors and 1 are normals e.g, 01 are primary tumours). So it can be a method if I can filter these samples, can it?

I have also find a paper( where it refers to a database GTEx to be a reference normal tissue sample, can it be a choice?

ADD REPLYlink written 11 months ago by lykosherdshep90

Well, yes, you can obtain the true 'blood' normals, if you wish; however, due to the way that you are running it, i.e., with TCGAbiolinks, you are restricting yourself to the data that TCGAbiolinks uses. I always obtain the data myself from the GDC Data Portal (or the GDC Legacy Archive), although, doing this requires some patience to first learn how to obtain the data.

For further information on sample types and their codes, Ill direct you to 2 resources:

Regarding GTEx, I see no reason why you cannot use it for examining 'normal' expression in your tissue of interest.

ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe61k

Thanks for you to tell me these, I am still confused, I still don’t know the necessity of choosing the true normal samples, Could you tell me some reasons for it?

ADD REPLYlink written 11 months ago by lykosherdshep90

Well, would it make sense to conduct a differential expression comparison between, for example, 200 tumour samples and just 3 [true] normal samples? If you filter for matched tumour-normal pairs (normal of any type), then you may have just a 2 vs 2 comparison. The dataset with the largest tumour-normal pairs is BRCA (breast cancer), with ~112 tumour-normal pairs, if I recall. For PRAD, I think that the number is either 0 or just 2. You can match these based on the TCGA barcode.

I do not know the exact task that you have been given, but I imagine that the direction is coming from your supervisor (?) - what, exactly, have you been asked to do? As an analyst, you are also capable of making decisions.

In a nutshell, the use of normal samples in TCGA RNA-sequencing was never a priority of the TCGA consortium. Normal samples for DNA-based analysis was [the priority]. In certain analyses in the published literature for TCGA data, normal RNA-seq samples were never even used. Instead, cluster-based approaches using just the tumour samples were used.

ADD REPLYlink written 11 months ago by Kevin Blighe61k

Thanks very much, I read what you have said again, and I think I have understood what you explain. Thank you for removing my confussion.

ADD REPLYlink written 11 months ago by lykosherdshep90

Thanks for your patient explanation, can I have this understanding that it's the number of the normal samples in TCGA database that mainly restrict the use to differential expression gene in TCGA and the quality of the normal samples? is it right?

I'm a beginner in bioinformatics, and I'm also a student in medical. I am just intersted in prostate cancer, and so I keep it my studing pathway in learning bioinformatics that find the hub gene in prostate cancer. Maybe your advice of the tumour subtype will also be a better method for me.

ADD REPLYlink written 11 months ago by lykosherdshep90

I also check the batch effects in , It seems there is no problem.

ADD REPLYlink written 11 months ago by lykosherdshep90
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 751 users visited in the last hour