Question

Variation in level 5 LINCS gene perturbation Tx data and best practices for processing it


22 months ago

Yep ▴ 20

I am trying to use level 5 gene KO/KD/OE and compound perturbation data (moderated z-scores for the 978 landmark genes) from LINCS (source: clue.io or SigCom LINCS) to construct aligned features for genes and drugs, i.e. the z-scores of expression changes in the landmark genes when a gene is knocked out, knocked down, or over-expressed, or when a drug is applied.

The problem is that replicate measurements of the same gene perturbation in one cell line, i.e. measurements with the same (gene, cell_line, treatment_time) tuple that differ only in the det_well variable in the metadata, e.g. OFL001_HA1E_96H:P10 and OFL002_HA1E_96H:O11, can produce markedly different transcriptomic signatures that do not correlate with each other (Spearman correlation). I imagine the same happens for drugs (compound perturbations), but fortunately the variation there is due to dosage, which gene perturbations do not have. The cause of the variation looks purely experimental, but I cannot find any guidelines on how to handle it.
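To make the replicate check concrete, here is a minimal sketch of how the pairwise Spearman correlations within one (gene, cell_line, treatment_time) group could be computed. The values are synthetic, and the signature IDs (`rep_1` etc.), the gene name `GENE_X`, and the function names are hypothetical placeholders rather than anything from the LINCS files.

```python
import itertools
import numpy as np


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank inside each tie group
    mean_rank = last_rank - (counts - 1) / 2.0  # average rank of each tie group
    return mean_rank[inverse]


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def replicate_correlations(signatures, group_of):
    """Correlate every pair of signatures that share the same group key.

    signatures: dict mapping signature ID -> 1-D array of z-scores
    group_of:   dict mapping signature ID -> (gene, cell_line, treatment_time)
    """
    ids_by_group = {}
    for sig_id, key in group_of.items():
        ids_by_group.setdefault(key, []).append(sig_id)

    result = {}
    for key, ids in ids_by_group.items():
        for a, b in itertools.combinations(sorted(ids), 2):
            result[(key, a, b)] = spearman(signatures[a], signatures[b])
    return result


# Synthetic stand-in for three level 5 signatures over 978 landmark genes.
rng = np.random.default_rng(0)
shared = rng.normal(size=978)
signatures = {
    "rep_1": shared + 0.3 * rng.normal(size=978),  # concordant replicate
    "rep_2": shared + 0.3 * rng.normal(size=978),  # concordant replicate
    "rep_3": rng.normal(size=978),                 # deliberately unrelated
}
group_of = {sig_id: ("GENE_X", "HA1E", "96H") for sig_id in signatures}

result = replicate_correlations(signatures, group_of)
for (key, a, b), rho in result.items():
    print(key, a, b, round(rho, 3))
```

With real data, the dictionary values would be the rows of the level 5 matrix and the group keys would come from the signature metadata.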

I'm also wondering how best to combine the signatures that differ in cell line and well for each gene/compound. I don't care about cell lines; I just want a general, cell-line-independent signature for each gene that aligns well with the corresponding signatures for drugs (because few other datasets have cell-line specificity). I checked the OE data and found that there are at most 3000 gene OE signatures per cell line. This is a headache because the variation in signatures across cell lines is, of course, large.
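One option I am considering for the combination step is a correlation-weighted average, loosely in the spirit of the moderated z-score weighting: each signature is weighted by its mean Spearman correlation with the other signatures in the group, so outlying signatures contribute less. This is only a sketch on synthetic data under my own assumptions (the function names and the `min_weight` floor of 0.01 are hypothetical choices), not the official LINCS pipeline.

```python
import numpy as np


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    return (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]


def weighted_consensus(matrix, min_weight=0.01):
    """Collapse several signatures into one by correlation-weighted averaging.

    matrix: array of shape (n_signatures, n_genes)
    Returns (consensus_signature, weights).
    """
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.shape[0]
    if n == 1:
        return matrix[0].copy(), np.array([1.0])

    # Spearman correlation matrix = Pearson correlation of row-wise ranks.
    ranks = np.vstack([average_ranks(row) for row in matrix])
    corr = np.corrcoef(ranks)
    np.fill_diagonal(corr, 0.0)  # ignore self-correlation

    # Mean correlation with the other signatures, floored so that no weight
    # becomes zero or negative, then normalised to sum to 1.
    raw = np.clip(corr.sum(axis=1) / (n - 1), min_weight, None)
    weights = raw / raw.sum()
    return weights @ matrix, weights


# Synthetic example: two concordant signatures and one unrelated one.
rng = np.random.default_rng(1)
shared = rng.normal(size=978)
matrix = np.vstack([
    shared + 0.3 * rng.normal(size=978),
    shared + 0.3 * rng.normal(size=978),
    rng.normal(size=978),
])
consensus, weights = weighted_consensus(matrix)
print(np.round(weights, 3))
```

The same function could be applied first within a cell line and then across cell lines, but averaging across cell lines will blur any cell-line-specific effects, which is exactly the trade-off I am unsure about.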

EDIT:

I would greatly appreciate it if anyone could point me to papers that use the data in this way!
For now, I think cell lines and time points should still be kept separate, although the data become sparse that way.

Please see the attached screenshot for an overview of the metadata that describes the experimental configurations (already grouped by gene and cell line). As a side note, there are also groups in which more than one signature has overlapping 1's in is_hiq, is_exemplar_sig, and is_ncs_sig, and groups in which no signature does, so we can't decide from those columns alone.
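The flag problem can be tallied with a short script like the one below, which counts, per (gene, cell line) group, how many signatures have all three flags set to 1. The records are made up for illustration, and the `gene` and `cell_line` keys are hypothetical stand-ins for whatever the metadata columns are actually called; only the three flag names come from the metadata.

```python
from collections import Counter, defaultdict

FLAGS = ("is_hiq", "is_exemplar_sig", "is_ncs_sig")


def classify_groups(records):
    """Label each (gene, cell_line) group by how many rows have all flags set."""
    fully_flagged = defaultdict(int)
    for row in records:
        key = (row["gene"], row["cell_line"])
        fully_flagged[key] += int(all(row[flag] == 1 for flag in FLAGS))

    labels = {}
    for key, n in fully_flagged.items():
        labels[key] = "none" if n == 0 else "unique" if n == 1 else "multiple"
    return labels


# Made-up metadata rows, one per signature.
records = [
    {"gene": "GENE_A", "cell_line": "HA1E", "is_hiq": 1, "is_exemplar_sig": 1, "is_ncs_sig": 1},
    {"gene": "GENE_A", "cell_line": "HA1E", "is_hiq": 1, "is_exemplar_sig": 1, "is_ncs_sig": 1},
    {"gene": "GENE_B", "cell_line": "HA1E", "is_hiq": 1, "is_exemplar_sig": 0, "is_ncs_sig": 1},
    {"gene": "GENE_C", "cell_line": "HA1E", "is_hiq": 1, "is_exemplar_sig": 1, "is_ncs_sig": 1},
    {"gene": "GENE_C", "cell_line": "HA1E", "is_hiq": 0, "is_exemplar_sig": 0, "is_ncs_sig": 0},
]

labels = classify_groups(records)
summary = Counter(labels.values())
print(labels)
print(summary)
```

Only the groups labelled "unique" can be resolved from the flags alone; the others need a different rule, such as a correlation check between the candidate signatures.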

cmap L1000 perturbation transcriptomics LINCS • 597 views
