I have been given a large set of RNA-seq data. One sample looks like this:
                    readcounts_union  readcounts_intersectionNotEmpt  genelength  R/FPKM_union  R/FPKM_intersectionNotEmpt
ENSG00000258486.2   1151554           1151554                         597         79153.32269   78738.12898
ENSG00000265150.1   1089307           1089307                         297         150505.7244   149716.2562
ENSG00000202198.1   996127            996128                          331         123494.0095   122846.3529
I also have a case ID for each sample, like this:
OC/SH/061g/159   SLX-14829.D709-D505
OC/AH/183        SLX-14880.D703-D506
OC/AH/143        SLX-14880.D704-D506
But I don't know what these IDs mean or which samples are normal and which are tumor, and there is no one I can ask.
I need to reduce the number of features in the RNA-seq data and extract the most informative genes for integration with proteomics. In such cases people usually run a differential expression analysis, but since I don't know the sample classes I can't set up a comparison in DESeq2 or edgeR.
So, if you were me, how would you deal with this data? How would you extract the most informative features? Is this possible at all without knowing the sample identities?
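To make the question concrete: one label-free option that might apply here is to rank genes by how much they vary across samples and keep the top ones. Below is a minimal Python sketch of that idea, not a recommendation. It assumes the per-sample readcounts_union columns have already been combined into a genes x samples matrix; the function name `top_variable_genes`, the toy matrix, the gene names and the cutoff `n_top` are all made up for illustration.

```python
import numpy as np

def top_variable_genes(counts, gene_ids, n_top=2):
    """Return the n_top genes with the highest variance of log2(CPM + 1).

    counts   : genes x samples matrix of raw read counts (assumed layout)
    gene_ids : one identifier per row of counts
    No sample labels (tumor/normal) are used anywhere.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)               # total reads per sample
    cpm = counts / lib_sizes * 1e6               # counts per million
    log_cpm = np.log2(cpm + 1.0)                 # dampen the effect of huge counts
    variances = log_cpm.var(axis=1, ddof=1)      # per-gene variance across samples
    order = np.argsort(variances)[::-1][:n_top]  # indices of the most variable genes
    return [gene_ids[i] for i in order]

# Hypothetical toy data: 4 genes x 4 samples, invented for this example only.
toy_genes = ["geneA", "geneB", "geneC", "geneD"]
toy_counts = [
    [1000, 1000, 1000, 1000],  # roughly constant across samples
    [10, 500, 12, 480],        # changes a lot between samples
    [500, 10, 480, 12],        # changes a lot between samples
    [50, 52, 49, 51],          # roughly constant across samples
]

selected = top_variable_genes(toy_counts, toy_genes, n_top=2)
print(sorted(selected))
```

A filter like this only finds genes that differ between samples; it cannot say whether the difference is tumor versus normal, batch, or something else.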
Thank you for any ideas.