Question

Example of processed RNASeq datasets

5.9 years ago

Dirk ▴ 100

I've found a number of informative tutorials on pipelines for processing RNASeq data, but can't seem to find any datasets or tutorials showing how to find or download public RNASeq datasets that have already been processed. I'm interested in a number of methods that need RNA expression matrices (e.g. isoforms vs. experiments, with RPKMs (?)). Are those easily obtained? I believe the file formats I need are .gct and .res, but I can't be sure.
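For reference, the GCT layout mentioned above is a plain tab-delimited text file: a version line (`#1.2`), a line giving the number of rows and samples, a header row starting with `Name` and `Description`, and then one row per gene or isoform. A minimal Python sketch of writing such a file is below; the function name `write_gct`, the gene and sample names, and the values are all made up for illustration.

```python
import io


def write_gct(handle, row_names, descriptions, sample_names, matrix):
    """Write an expression matrix in GCT v1.2 layout to a text handle."""
    if len(matrix) != len(row_names) or len(descriptions) != len(row_names):
        raise ValueError("row_names, descriptions and matrix must have the same length")
    handle.write("#1.2\n")  # version line
    handle.write(f"{len(row_names)}\t{len(sample_names)}\n")  # number of rows, number of samples
    handle.write("\t".join(["Name", "Description", *sample_names]) + "\n")
    for name, desc, values in zip(row_names, descriptions, matrix):
        if len(values) != len(sample_names):
            raise ValueError(f"row {name} has the wrong number of values")
        handle.write("\t".join([name, desc, *(str(v) for v in values)]) + "\n")


# Toy data (hypothetical genes, samples and values).
buffer = io.StringIO()
write_gct(
    buffer,
    row_names=["geneA", "geneB"],
    descriptions=["na", "na"],
    sample_names=["s1", "s2", "s3"],
    matrix=[[10, 0, 3], [5, 7, 2]],
)
gct_text = buffer.getvalue()
```

Passing an open file instead of the `io.StringIO` buffer would write the same text to disk.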

datasets RNA-Seq • 1.5k views

ADD COMMENT • link 5.9 years ago by Dirk ▴ 100

recount2 has a lot of processed count matrices for different phenotypes. It's not in RPKM, but that's not a good normalisation method anyway.

ADD REPLY • link 5.9 years ago by WouterDeCoster 47k

What's your ultimate goal? Why do the datasets have to be already processed? In which disease area are you interested? Where did you hear about GCT and RES files? Also, as per Wouter, don't use RPKM data if you are also planning to conduct differential expression across samples.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

The ultimate goal is to investigate how general machine learning methods perform on these datasets, compared with popular programs for gene/pathway enrichment (IPA, PANTHER, etc.). By this I mean looking into how read counts for different genes relate to different phenotypes (e.g. simple classification), and how much RNAseq data is needed to accurately predict a larger number of genes (in a manner similar to the L1000).

They have to be processed because I'm still a novice, and want to figure out if this line of investigation is amenable to my methods before I devote too much time to learning all of the existing processing pipelines. The disease/phenotype doesn't matter. I had generally understood that normalization methods (like RPKM) were shown to be preferred for analyses. Should I start with just raw read counts?

ADD REPLY • link 5.9 years ago by Dirk ▴ 100

I see - forgive my very direct questions. Does it have to be RNA-seq? I ask because it is much easier to download microarray gene expression data in a format that is already ready for analysis. For example, if you search for studies at the Gene Expression Omnibus (GEO), you may eventually see a page like this: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55457

If you then click on the Analyse with GEO2R button, you will be brought to a new page where you'll see a tab called R script, which gives you the exact code that you need to produce the normalised (log base 2) dataset, which you can then use for your machine learning.
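If you would rather stay outside R, GEO series pages also offer a "Series Matrix" text file for download. As a rough sketch (not the GEO2R script itself), the Python below parses that layout: metadata lines start with `!`, and the expression table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`. The toy text, probe IDs, sample IDs and values are made up, and whether a real file's values are already log2 depends on the submitter, so check before using them.

```python
import csv
import io

# Toy stand-in for a GEO series matrix file; IDs and values are made up.
TOY_SERIES_MATRIX = (
    '!Series_title\t"toy example"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM0001"\t"GSM0002"\n'
    '"probe_1"\t7.5\t8.25\n'
    '"probe_2"\t3.0\t\n'
    "!series_matrix_table_end\n"
)


def read_series_matrix(handle):
    """Return (sample_ids, {probe_id: [value or None, ...]}) from a series matrix handle."""
    table_lines = []
    in_table = False
    for raw in handle:
        line = raw.rstrip("\r\n")  # keep trailing tabs: they mark empty cells
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            break
        elif in_table and line:
            table_lines.append(line)

    rows = list(csv.reader(table_lines, delimiter="\t"))  # csv strips the quotes
    if not rows:
        raise ValueError("no expression table found")
    sample_ids = rows[0][1:]
    values = {}
    for row in rows[1:]:
        # Empty cells become None so missing values stay visible downstream.
        values[row[0]] = [float(cell) if cell else None for cell in row[1:]]
    return sample_ids, values


sample_ids, values = read_series_matrix(io.StringIO(TOY_SERIES_MATRIX))

# Transpose to one row per sample, the shape most machine learning code expects.
probe_ids = list(values)
samples_by_probe = [
    [values[probe][i] for probe in probe_ids] for i in range(len(sample_ids))
]
```

For a downloaded file, an opened (and if necessary decompressed) text handle can be passed in place of the `io.StringIO` object.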

Again, the RPKM method is not great - in fact, it has come under criticism from different corners and should generally be phased out of research, along with FPKM. Neither of these methods normalises across samples in a dataset, which makes multiple sample comparisons improper.
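One commonly cited symptom of this can be shown with a toy calculation. The Python sketch below uses made-up counts and gene lengths, and assumes the total read count of a sample is simply the sum of its counts. It computes RPKM and, for contrast, TPM for two samples: the RPKM values in each sample add up to a different total, so the same RPKM value does not mean the same share of the transcript pool in both samples, whereas TPM values always add up to one million. This illustrates only one aspect of the criticism and is not a substitute for the between-sample normalisation used by differential expression tools.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million reads, for one sample."""
    total_reads = sum(counts)  # assumption: total reads = sum of the counts given
    return [c * 1e9 / (length * total_reads) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million, for one sample."""
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]  # reads per kb
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]


# Made-up data: three genes, two samples. Only the third gene differs in count.
lengths_bp = [1000, 2000, 500]
sample_a = [100, 200, 50]
sample_b = [100, 200, 700]

rpkm_a, rpkm_b = rpkm(sample_a, lengths_bp), rpkm(sample_b, lengths_bp)
tpm_a, tpm_b = tpm(sample_a, lengths_bp), tpm(sample_b, lengths_bp)

print("RPKM totals:", sum(rpkm_a), sum(rpkm_b))  # the two totals differ
print("TPM totals: ", sum(tpm_a), sum(tpm_b))    # both are one million, up to rounding
```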

If you really want RNA-seq datasets, then I may be able to help with TCGA data, but you may have to contact me through official means for that. I am currently processing the entire TCGA RNA-seq data using updated normalisation methods, and could assist.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

It doesn't have to be RNASeq, but that was my preference, as there seems to be a plethora of data for this platform, and the general consensus seems to be that it's a more flexible and more robust approach. I will definitely look at the microarray data, though. Functionally, it seems to be the same thing at the level I care about.

Thank you for the offer on the RNASeq data! How would I go about contacting you officially?

ADD REPLY • link 5.9 years ago by Dirk ▴ 100

Well, perhaps we can do this here. In which cancer are you particularly interested? I am now planning to put these datasets on my GitHub account ( https://github.com/kevinblighe/TCGA-RNAseq ) because there appears to be reasonable demand out there to get the TCGA RNA-seq data in a usable format.

There's no data there yet, but the ones that I've listed will be there tomorrow at some point. I am aiming to reprocess all TCGA RNA-seq data and will eventually make an announcement here on Biostars.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

Dear Dirk, the files are actually too large to be hosted on GitHub, so I will have to find another hosting server.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k