Question

Identify mislabeled sample from RNAseq profile

1

4.7 years ago

Ram 43k

Hello,

We have RNAseq data on a sample from a few years ago that looked odd (possibly mislabeled), so we re-sequenced a bunch of candidates that could possibly have been that sample.

Is there any way to find out which of the re-sequenced samples the original sample is closest to? I could run PCA but these re-sequenced candidates and the original would show some sort of batch effect and that might interfere with how the samples stratify on a PCA plot. Is there any way I could account for that?

I'd appreciate any pointers. Thank you!

rnaseq • 1.1k views

ADD COMMENT • link 4.7 years ago by Ram 43k

1

Maybe try correlation (Pearson) analysis between your old mislabeled sample and the new ones?
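A minimal sketch of that suggestion in base R, on simulated counts. The sample names (`cand1` to `cand4`), the simulated data, and the choice of a `log2(x + 1)` transform are assumptions for illustration, not something stated in the thread.

```r
# Simulated counts: 2000 genes x 4 candidate samples (all names hypothetical).
set.seed(1)
n_genes <- 2000
gene_mean <- rexp(n_genes, rate = 1 / 200)            # shared gene-level expression
sample_fc <- matrix(2^rnorm(n_genes * 4, sd = 0.5),   # sample-specific deviation
                    nrow = n_genes,
                    dimnames = list(paste0("gene", seq_len(n_genes)),
                                    paste0("cand", 1:4)))
expected <- gene_mean * sample_fc
candidates <- matrix(rpois(length(expected), lambda = expected),
                     nrow = n_genes, dimnames = dimnames(expected))

# Pretend the original library came from the same source as cand3,
# sequenced separately (independent count noise).
original <- rpois(n_genes, lambda = expected[, "cand3"])

# Log-transform so a few highly expressed genes do not dominate the correlation.
log_cand <- log2(candidates + 1)
log_orig <- log2(original + 1)

# Correlate the original sample against every candidate column.
pearson  <- cor(log_orig, log_cand, method = "pearson")[1, ]
spearman <- cor(log_orig, log_cand, method = "spearman")[1, ]

ranking <- sort(pearson, decreasing = TRUE)
print(round(ranking, 3))
best_match <- names(ranking)[1]
```

With real data, `candidates` would be the count matrix of the re-sequenced samples and `original` the counts of the old sample over the same genes. Unrelated samples of the same tissue often correlate highly too, so the ranking matters more than the absolute values.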

ADD REPLY • link 4.7 years ago by Benn 8.3k

0

Thank you, Benn! I'll run a Pearson correlation first thing.

ADD REPLY • link 4.7 years ago by Ram 43k

1

You can call SNPs on all samples and compare them. Ideally you would have an independent set (plates) that is run just to check SNPs, but this would be a good substitute.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

How would I compare SNPs? Compare pairwise conservation stats?

ADD REPLY • link 4.7 years ago by Ram 43k

score 2 · Accepted Answer · 2019-07-30

2

4.7 years ago

ATpoint 81k

You can try the plotPCA function from DESeq2 after transforming the data with vst, which has a blind argument that can be set to TRUE or FALSE. With blind = FALSE, the transformation takes the design formula into account, so a batch term in the design is respected. Pearson correlation, as suggested, might make sense in combination with hierarchical clustering to visualize the results (see the corrplot package). Another PCA-like approach is multidimensional scaling (the plotMDS function, available through edgeR). In any case, the individual approaches should at least in part confirm each other before you make a decision.
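A rough sketch of the vst and plotPCA step, using the simulated dataset that ships with DESeq2. The `batch` column, its levels, and the design `~ batch + condition` are made up for illustration; a real analysis would use its own sample annotation.

```r
library(DESeq2)

set.seed(42)
# Simulated data from DESeq2; 5000 genes so vst() has enough rows for its
# default subsampling. The 'batch' column is a hypothetical annotation.
dds <- makeExampleDESeqDataSet(n = 5000, m = 6)
dds$batch <- factor(rep(c("old", "new"), times = 3))
design(dds) <- ~ batch + condition

# blind = TRUE ignores the design; blind = FALSE uses the design when
# estimating the dispersions that drive the transformation.
vsd_blind  <- vst(dds, blind = TRUE)
vsd_design <- vst(dds, blind = FALSE)

# returnData = TRUE returns the PC coordinates instead of drawing the plot.
pca_df <- plotPCA(vsd_design, intgroup = c("condition", "batch"),
                  returnData = TRUE)
print(pca_df)
```

Note that neither setting subtracts the batch effect from the transformed values, so samples can still separate by batch on the plot. Also, if the original sample is the only member of its batch, batch and sample identity are confounded, and the design cannot tell them apart.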

Apart from those dimensionality-reduction techniques, it might be possible to perform variant calling (even though this is difficult from RNA-seq) and see whether some characteristic SNPs or patterns of SNPs give valuable information. You could take variants that are common in humans (e.g. find some with high MAF in the hg38 VCF from the latest dbSNP release, which contains 1KG and TOPMED MAFs) and then see whether, for those variants, any of the resequenced samples is similar to the odd one.
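One way the SNP comparison could look once genotypes are in hand is a simple concordance score per candidate. This is a sketch on simulated genotypes, assuming calls coded as 0/1/2 alternate-allele counts with NA for uncovered sites; the sample names, error rate, and missingness are all invented for illustration.

```r
set.seed(7)
n_snps <- 500
maf <- runif(n_snps, min = 0.2, max = 0.5)   # common variants only

# Hypothetical genotypes for 4 unrelated candidate individuals (SNPs x samples).
geno <- sapply(1:4, function(i) rbinom(n_snps, size = 2, prob = maf))
colnames(geno) <- paste0("cand", 1:4)

# Simulate the original sample as cand2 with 25 miscalled sites and
# 150 sites without coverage (RNA-seq only covers expressed regions).
original <- geno[, "cand2"]
miscalled <- sample(n_snps, size = 25)
original[miscalled] <- (original[miscalled] + 1) %% 3   # force a different call
original[sample(n_snps, size = 150)] <- NA

# Fraction of identical genotype calls at sites called in both samples.
concordance <- apply(geno, 2, function(g) {
  called <- !is.na(original) & !is.na(g)
  mean(original[called] == g[called])
})
n_compared <- apply(geno, 2, function(g) sum(!is.na(original) & !is.na(g)))

print(round(concordance, 3))
print(n_compared)
```

The same individual should stand out with concordance near 1, while unrelated individuals sit far lower. With RNA-seq calls, low coverage and allele-specific expression can turn true heterozygotes into apparent homozygotes, so it is worth checking `n_compared` and filtering on depth before trusting the score.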

ADD COMMENT • link 4.7 years ago by ATpoint 81k

0

How would I use corrplot::corrplot() (or would I use corrplot::cor.mtest()) for a matrix of 4 samples by 20,000 genes?

ADD REPLY • link 4.7 years ago by Ram 43k

1

You can also make a simple correlation plot without the corrplot package, RamRS:

library(gplots)

# 'mat' is your numeric expression matrix (genes in rows, samples in columns)
Pearson <- cor(mat)

heatmap.2(Pearson, key = TRUE, col = bluered, density.info = "none", scale = "none", trace = "none")

ADD REPLY • link 4.7 years ago by Benn 8.3k

1

Using an example dataset from DESeq2 it would look like:

library(DESeq2)
library(corrplot)

# Note: recent corrplot releases appear to have renamed cl.lim to col.lim
corrplot(corr = cor(assay(makeExampleDESeqDataSet(betaSD = 2))),
         method = "color", order = "hclust", addCoef.col = "white",
         cl.lim = c(0, 1), number.cex = 0.75, tl.col = "black")

Essentially, corrplot simply wraps some convenience around cor(), nothing more. It does hclust, so similar samples will be grouped together. heatmap.2 is of course also totally fine; the cor() input is the same.
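The grouping step can also be reproduced in base R by clustering on a correlation-based distance. This is a sketch of the idea on simulated log-expression values; using `1 - correlation` as the distance and complete linkage are assumptions here, and the thread does not state what corrplot does internally.

```r
set.seed(3)
n_genes <- 1000

# Two hypothetical source profiles, each measured twice with noise.
src_a <- rnorm(n_genes, mean = 8, sd = 2)
src_b <- rnorm(n_genes, mean = 8, sd = 2)
log_expr <- cbind(
  a1 = src_a + rnorm(n_genes, sd = 0.5),
  a2 = src_a + rnorm(n_genes, sd = 0.5),
  b1 = src_b + rnorm(n_genes, sd = 0.5),
  b2 = src_b + rnorm(n_genes, sd = 0.5)
)

# Sample-by-sample Pearson correlation, turned into a distance.
corr <- cor(log_expr)
hc <- hclust(as.dist(1 - corr), method = "complete")

# Cut the tree into two groups and reorder the matrix by the dendrogram.
clusters <- cutree(hc, k = 2)
reordered <- corr[hc$order, hc$order]

print(clusters)
print(round(reordered, 2))
```

Calling `plot(hc)` draws the dendrogram, which can be easier to read than a heatmap when there are only a handful of samples.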

ADD REPLY • link 4.7 years ago by ATpoint 81k

0

That's pretty much the result of a bunch of cors that I ran. Thank you - the correlation between this unknown sample and the rest is abysmal (<0.35 all). PCA throws it so far away the others are not even in the same zip code (not that the others cluster together either but they're relatively closer to one another).

I guess I'm going to have to talk to more people here to find a way. That, or call SNPs on RNAseq data.

ADD REPLY • link 4.7 years ago by Ram 43k