Question: Doing differential gene expression without knowing sample classes
1
5 weeks ago by
F3.4k
Iran
F3.4k wrote:

Hi

I have been given a large RNA-seq data set. One sample looks like this:

                    readcounts_union  readcounts_intersectionNotEmpt  genelength  R/FPKM_union  R/FPKM_intersectionNotEmpt
ENSG00000258486.2   1151554           1151554                         597         79153.32269   78738.12898
ENSG00000265150.1   1089307           1089307                         297         150505.7244   149716.2562
ENSG00000202198.1   996127            996128                          331         123494.0095   122846.3529

I also have a case ID for each sample, like:

OC/SH/061g/159  SLX-14829.D709-D505
OC/AH/183   SLX-14880.D703-D506
OC/AH/143   SLX-14880.D704-D506

BUT I don't know what these IDs mean (which sample is normal, which is tumor), and there is no one I can ask.

I have to reduce the features in the RNA-seq data and extract the most informative genes for integration with proteomics. In such cases people usually do differential expression, but I don't know the class of each sample, so I can't set up DESeq2 or edgeR.

So, if you were me, how would you deal with this data? How would you extract the most informative features? Is it possible to do this at all without knowing the sample identities?

Thank you for any ideas

edger rna-seq deseq2 • 176 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by F3.4k
6

I'd reject the data.

ADD REPLYlink written 5 weeks ago by russhh4.6k
2

Agree with russhh, in principle.

I am asking myself the following:

  1. from where did F obtain this data?
  2. why is there no information on sample grouping?

If, genuinely, nobody knows the sample groups, then do the PCA bi-plot, as implied by Genomax, and send that back to whoever it is with whom you are working. If you want, also check the component loadings along PC1 and PC2 so that you can see which genes are the main source of variation along these [principal components]. Through this process, you may actually infer the sample groupings.
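
As a rough illustration of the PCA-and-loadings idea above, here is a minimal sketch in Python with NumPy. It is not code from this thread: the toy count matrix, the gene names and the helper name `pca_samples` are all made up for illustration, and real data would normally be filtered and normalised first (for example with a variance-stabilising transform rather than a plain log2).

```python
import numpy as np

def pca_samples(counts, n_components=2):
    """PCA over samples for a genes x samples matrix of raw read counts."""
    logged = np.log2(counts + 1.0)                           # compress the dynamic range
    centered = logged - logged.mean(axis=1, keepdims=True)   # center each gene across samples
    # SVD of the samples x genes matrix: rows of vt are the principal axes
    u, s, vt = np.linalg.svd(centered.T, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]          # sample coordinates (for the plot)
    loadings = vt[:n_components].T                           # per-gene weight on each PC
    explained = (s ** 2) / np.sum(s ** 2)                    # fraction of variance per PC
    return scores, loadings, explained[:n_components]

# Hypothetical toy data: 6 genes x 6 samples, with two hidden groups of 3 samples.
gene_ids = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
counts = np.array([
    [1000, 1100, 950,  10,   12,   9],    # high in samples 0-2 only
    [8,    10,   12,   900,  1000, 1100], # high in samples 3-5 only
    [500,  520,  480,  510,  490,  505],
    [50,   55,   45,   52,   48,   51],
    [200,  210,  190,  205,  195,  202],
    [5,    6,    4,    5,    6,    5],
], dtype=float)

scores, loadings, explained = pca_samples(counts)

# Genes with the largest absolute loading on PC1 drive the main axis of variation.
top = np.argsort(-np.abs(loadings[:, 0]))[:2]
print("PC1 scores per sample:", np.round(scores[:, 0], 2))
print("Top PC1 genes:", [gene_ids[i] for i in top])
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the sample plot; the genes in `top` are candidates for what separates any groups that show up, which can then be checked against known tumor or tissue markers.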

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe47k

The problem is that the collaborator (the data owner) replies very slowly; I have already been waiting a month for an answer. That is why I should either extract informative features from this unlabelled RNA-seq data or find another RNA-seq data set on the internet that provides differentially expressed genes between carcinoma and matched normal samples.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by F3.4k

So, the collaborator is the one who is disorganised and who messed up.

ADD REPLYlink written 5 weeks ago by Kevin Blighe47k
2

Um, can't you ask the person who gave you the data what the IDs are?

ADD REPLYlink written 5 weeks ago by grant.hovhannisyan1.7k
4
5 weeks ago by
Belgium
WouterDeCoster40k wrote:

Either use a clustering-based approach (unbiased) to separate samples into biological groups (or into technical batch-effect groups), or use some biological evidence (biased), e.g. expression of a marker gene, a tumor suppressor gene, ...
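
The two routes above can be sketched side by side. This is only an illustration in Python with NumPy, under loud assumptions: the toy counts are invented, `two_means` is a hand-rolled 2-cluster k-means (one of many possible clustering methods, chosen here for brevity), and treating row 0 as a "marker gene" is purely hypothetical.

```python
import numpy as np

def two_means(profiles, n_iter=50):
    """Split the rows of `profiles` (samples x genes) into two clusters with k-means."""
    # Deterministic start: sample 0 and the sample farthest from it.
    d0 = np.linalg.norm(profiles - profiles[0], axis=1)
    centroids = np.stack([profiles[0], profiles[np.argmax(d0)]]).astype(float)
    labels = np.full(len(profiles), -1)
    for _ in range(n_iter):
        # Distance from every sample to both centroids, then nearest-centroid assignment.
        dists = np.linalg.norm(profiles[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # assignments stopped changing
        labels = new_labels
        for k in range(2):
            if np.any(labels == k):                 # keep the old centroid if a cluster is empty
                centroids[k] = profiles[labels == k].mean(axis=0)
    return labels

def split_by_marker(marker_log_expr):
    """Label samples 1 if the marker is above its median across samples, else 0."""
    return (marker_log_expr > np.median(marker_log_expr)).astype(int)

# Hypothetical toy data: 6 genes x 6 samples (row 0 plays the role of a marker gene).
counts = np.array([
    [1000, 1100, 950,  10,   12,   9],
    [8,    10,   12,   900,  1000, 1100],
    [500,  520,  480,  510,  490,  505],
    [50,   55,   45,   52,   48,   51],
    [200,  210,  190,  205,  195,  202],
    [5,    6,    4,    5,    6,    5],
], dtype=float)

log_expr = np.log2(counts + 1.0)
cluster_labels = two_means(log_expr.T)         # unbiased: uses all genes
marker_labels = split_by_marker(log_expr[0])   # biased: uses one chosen gene

# Cluster numbering is arbitrary, so compare the partitions rather than the raw labels.
same = np.array_equal(cluster_labels, marker_labels)
flipped = np.array_equal(cluster_labels, 1 - marker_labels)
print("clusters:", cluster_labels, "marker split:", marker_labels, "agree:", same or flipped)
```

If the clustering and the marker-based split agree, that is some evidence the groups are biological; if they disagree, the clusters may reflect batch or other technical structure instead.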

ADD COMMENTlink written 5 weeks ago by WouterDeCoster40k
3
5 weeks ago by
genomax70k
United States
genomax70k wrote:

Why should this be any different from how you would do a usual DE analysis? What you need to know is which samples are replicates (if any) and how they are to be grouped (unless you have just one of everything, which would be difficult to deal with).

Start with a PCA-type analysis to see if you can identify groups.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by genomax70k