Question

How to perform transcriptomic clustering to find bad samples?

0

Entering edit mode

4.2 years ago

c_u ▴ 520

Hi,

I have about a dozen RNA-seq samples from a human tissue (DRG), which during RNA extraction, were enriched for one of the cell types found in that tissue (neurons). In other words, all samples should ideally be from neurons. But I have the suspicion that some of the samples are bad, in that they either have heavy contamination from other cell types OR the library preparation was somehow compromised.

To find the samples that are indeed coming from neurons (irrespective of the phenotypic difference that I am studying, which in this case is pain), I am thinking of doing clustering of all the samples based on their coding transcriptome i.e. the TPM values of all the coding genes in each sample. I hope to be able to find a big cluster that would represent samples that are actually coming from neurons, and then some samples to be outliers indicating undesirable samples.

Is there a known tool/package that could help me do this?

NOTE - I don't want to use the phenotype information (like WT/experimental) which DESeq2 uses for its analysis. I just want to cluster them based on their overall individual gene expression

My whole code is here -

counts <- featureCounts(nthreads=3, isGTFAnnotationFile=TRUE, annot.ext="/Volumes/bam/DRG/annotations/Homo_sapiens.GRCh38.95.gtf", files=c('40T4L.fastqAligned.sortedByCoord.out.bam', '41T7R.fastqAligned.sortedByCoord.out.bam', '42T7L.fastqAligned.sortedByCoord.out.bam', '42T7R.fastqAligned.sortedByCoord.out.bam', '44T10R.fastqAligned.sortedByCoord.out.bam', '44T11L.fastqAligned.sortedByCoord.out.bam', '44T11R.fastqAligned.sortedByCoord.out.bam', '45T10L.fastqAligned.sortedByCoord.out.bam', '45T11R.fastqAligned.sortedByCoord.out.bam', '46T8L.fastqAligned.sortedByCoord.out.bam', '46T8R.fastqAligned.sortedByCoord.out.bam', '47T7L.fastqAligned.sortedByCoord.out.bam', '47T7R.fastqAligned.sortedByCoord.out.bam'))$counts
sampleTable <- data.frame(condition = factor(c('P', 'P', 'P', 'P', 'NP', 'NP', 'NP', 'NP', 'P', 'P', 'P', 'P', 'P')))
coldata <- sampleTable
deseqdata <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition)
dds <- DESeq(deseqdata)
vsd <- vst(dds)
plotPCA(vsd, ntop=1000) + geom_text(aes(label=name),vjust=2,check_overlap = TRUE,size = 4)

RNA-Seq transcriptome • 1.2k views

ADD COMMENT • link 4.2 years ago by c_u ▴ 520

score 3 · Accepted Answer · 2020-02-11

3

Entering edit mode

4.2 years ago

ATpoint 82k

Check plotPCA from DESeq2, here is the manual. This is an easy way to get a feel for your data while identifying odd samples. A heatmap of sample-to-sample distance would be a second option, again see the DESeq2 section for it. Be sure to create the DESeq2 object from raw counts, not TPM.

ADD COMMENT • link 4.2 years ago by ATpoint 82k

0

Entering edit mode

Thanks a lot! I am trying to implement plotPCA, and I have used the code here to load my BAMs to FeatureCounts and transfer that to DESeq2. Now, to use plotPCA, should I run DESeq2 on the samples first? Because the link you shared seems to apply plotPCA to data on which vst has been applied, and that they do after running DESeq on dds. I ask this because I was not really planning to do any DE analysis right now, just wanted to see how the samples are on the transcriptomic level (and if there are outliers)

ADD REPLY • link 4.2 years ago by c_u ▴ 520

0

Entering edit mode

OK, so I ran DESeq on the samples and then used plotPCA. My question, however, remains that is there a way that can cluster the samples just based on the individual gene expression values, not relying on the phenotype. Just changing the condition vector for DESeq2 from the phenotype classes to the filenames didn't work (it gave the error the design matrix has the same number of samples and coefficients to fit so estimation of dispersion is not possible.). So, is there a way to cluster the samples based on their individual expression levels, and not based on their phenotype?

ADD REPLY • link 4.2 years ago by c_u ▴ 520

1

Entering edit mode

there a way that can cluster the samples just based on the individual gene expression values, not relying on the phenotype

clustering in plotPCA is already based on normalized expression values of the ntop genes with the highest variance. Phenotype information are only used to label your samples

ADD REPLY • link 4.2 years ago by andres.firrincieli 3.6k

1

Entering edit mode

PCA is based on the gene expression, so is the heatmap I linked. The phenotype is simply for coloring the data points. Please show all code you used if you want to debug this error.

ADD REPLY • link 4.2 years ago by ATpoint 82k

0

Entering edit mode

Thanks! I have added my whole code to the question now

ADD REPLY • link 4.2 years ago by c_u ▴ 520

0

Entering edit mode

You can skip DESeq() it is not necessary prior to running vst.

plotPCA(vsd, ntop=1000, "condition") should do it plus any optional ggplot parameters you like.

ADD REPLY • link 4.2 years ago by ATpoint 82k

0

Entering edit mode

Thanks I tried it without running DESeq() and was able to get the results. Interestingly, as you said, the results are exactly the same!

ADD REPLY • link 4.2 years ago by c_u ▴ 520