Question: Example of processed RNASeq datasets
2.7 years ago by
Dirk80 wrote:

I've found a number of informative tutorials on pipelines for processing RNASeq data, but I can't seem to find any datasets or tutorials showing how to find and download public RNASeq datasets that have already been processed. I'm interested in a number of methods that need RNA expression matrices (e.g. isoforms vs. experiments, with RPKM values?). Are these easy to obtain? I believe the file formats I need are .gct and .res, but I can't be sure.
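
For reference, a GCT file (version 1.2, as used by Broad Institute tools such as GenePattern) is plain tab-delimited text: a `#1.2` version line, a line giving the row and column counts, a header line beginning with `Name` and `Description`, then one row per gene or isoform. A minimal Python sketch of writing an expression matrix in that layout follows. The function name `to_gct` and the toy values are hypothetical and only there for illustration.

```python
def to_gct(row_names, sample_names, matrix, descriptions=None):
    """Render an expression matrix as GCT v1.2 text.

    matrix[i][j] is the expression value of feature i in sample j.
    """
    if descriptions is None:
        # GCT needs a Description column; "na" is used here as a filler.
        descriptions = ["na"] * len(row_names)
    if len(matrix) != len(row_names):
        raise ValueError("one matrix row is needed per row name")
    lines = ["#1.2", f"{len(row_names)}\t{len(sample_names)}"]
    lines.append("\t".join(["Name", "Description", *sample_names]))
    for name, desc, values in zip(row_names, descriptions, matrix):
        if len(values) != len(sample_names):
            raise ValueError(f"row {name!r} has the wrong number of values")
        lines.append("\t".join([name, desc, *(str(v) for v in values)]))
    return "\n".join(lines) + "\n"


# Toy data: two features measured in three samples (made-up numbers).
gct_text = to_gct(
    ["geneA", "geneB"],
    ["s1", "s2", "s3"],
    [[10, 0, 5], [3.5, 2, 8]],
)
```

Writing `gct_text` to a file with a `.gct` extension should give something that GCT-aware tools can read, assuming they accept version 1.2.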

datasets rna-seq • 679 views
modified 2.7 years ago • written 2.7 years ago by Dirk80

recount2 has a lot of processed count matrices for different phenotypes. It's not in RPKM, but that's not a good normalisation method anyway.

written 2.7 years ago by WouterDeCoster45k

What's your ultimate goal? Why do the datasets have to be already processed? In which disease area are you interested? Where did you hear about GCT and RES files? Also, as Wouter says, don't use RPKM data if you are planning to conduct differential expression analysis across samples.

modified 2.7 years ago • written 2.7 years ago by Kevin Blighe69k

The ultimate goal is to investigate how general machine learning methods perform on these datasets, compared with popular gene/pathway enrichment programs (IPA, PANTHER, etc.). Here, I mean to look into how read counts for different genes are related to different phenotypes (e.g. simple classification), and how much RNAseq data is needed to accurately predict a larger number of genes (in a manner similar to the L1000).

They have to be processed because I'm still a novice, and I want to figure out whether this line of investigation is amenable to my methods before I devote too much time to learning all of the existing processing pipelines. The disease/phenotype doesn't matter. I had generally understood that normalization methods (like RPKM) were preferred for analyses. Should I start with just raw read counts?

written 2.7 years ago by Dirk80

I see - forgive my very direct questions. Does it have to be RNA-seq? I ask because it is much easier to download microarray gene expression data in a format that is already ready for analysis. For example, if you search for studies at the Gene Expression Omnibus (GEO), you may eventually see a page like this:

If you then click on the Analyse with GEO2R button, you will be brought to a new page where you'll see a tab called R script, which gives you the exact code that you need to produce the normalised (log base 2) dataset, which you can then use for your machine learning.
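
If you would rather work outside R, GEO also provides each series as a plain-text Series Matrix file. In those files, metadata lines start with `!` and the expression table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`. Below is a rough Python sketch of pulling the table out of such text, assuming that layout. The function name `parse_series_matrix` and the embedded example text (including the sample IDs and values) are made up for illustration, and real files may mark missing values differently.

```python
import csv
import io


def parse_series_matrix(text):
    """Return (sample_ids, {probe_id: [values]}) from Series Matrix text."""
    in_table = False
    table_lines = []
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            break
        elif in_table:
            table_lines.append(line)
    if not table_lines:
        raise ValueError("no expression table found")
    # The table is tab-delimited, with identifiers wrapped in double quotes.
    rows = list(csv.reader(io.StringIO("\n".join(table_lines)), delimiter="\t"))
    header, data = rows[0], rows[1:]
    sample_ids = header[1:]  # the first column holds the probe ID (ID_REF)
    values = {}
    for row in data:
        # Empty cells are treated as missing values and stored as None.
        values[row[0]] = [float(v) if v.strip() else None for v in row[1:]]
    return sample_ids, values


# Made-up example text in the Series Matrix layout.
example = (
    '!Series_title\t"Toy series"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM0001"\t"GSM0002"\n'
    '"probe_1"\t7.25\t8.5\n'
    '"probe_2"\t\t6.0\n'
    "!series_matrix_table_end\n"
)
samples, expr = parse_series_matrix(example)
```

The result is a samples-by-probes structure that can be turned into whatever matrix type your machine learning code expects.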

Again, the RPKM method is not great - in fact, it has come under criticism from different corners and should generally be phased out of research, along with FPKM. Neither of these methods normalises across samples in a dataset, which makes comparisons between multiple samples improper.
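
To make the point concrete, here is a toy Python sketch. RPKM divides each count by the gene length (in kb) and by that sample's own total mapped reads (in millions), so a few very highly expressed genes in one sample shift the values of every other gene in that sample. The second function computes median-of-ratios size factors (the idea behind DESeq-style normalisation) as one example of a cross-sample method. It is a simplified illustration, not the DESeq2 implementation, and all names and numbers are made up.

```python
import math
from statistics import median


def rpkm(counts, lengths_bp):
    """RPKM per sample: count * 1e9 / (gene length in bp * sample total)."""
    totals = [sum(col) for col in zip(*counts)]
    return [
        [c * 1e9 / (length * total) for c, total in zip(row, totals)]
        for row, length in zip(counts, lengths_bp)
    ]


def size_factors(counts):
    """Median-of-ratios size factors, one per sample."""
    # Use only genes with a non-zero count in every sample.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # The per-gene geometric mean across samples acts as a pseudo-reference.
    geo_means = [math.exp(sum(math.log(c) for c in row) / len(row)) for row in usable]
    n_samples = len(counts[0])
    return [
        median(row[j] / g for row, g in zip(usable, geo_means))
        for j in range(n_samples)
    ]


# Toy counts: rows are genes, columns are two samples. Genes 0-2 are
# identical in both samples; gene 3 is hugely up in the second sample.
counts = [
    [100, 100],
    [200, 200],
    [300, 300],
    [400, 9400],
]
lengths_bp = [1000, 1000, 1000, 1000]

rpkm_values = rpkm(counts, lengths_bp)
factors = size_factors(counts)
normalised = [[c / f for c, f in zip(row, factors)] for row in counts]
```

With these numbers, the first gene has the same count in both samples, yet its RPKM is ten times lower in the second sample because gene 3 inflates that sample's total. Both size factors come out at about 1, so the normalised counts for the first three genes stay equal across the two samples.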

If you really want RNA-seq datasets, then I may be able to help with TCGA data, but you may have to contact me through official means for that. I am currently processing the entire TCGA RNA-seq data using updated normalisation methods, and could assist.

written 2.7 years ago by Kevin Blighe69k

It doesn't have to be RNASeq, but that was my preference, as there seems to be a plethora of data for this platform and the general consensus seems to be that it's a more flexible and more robust approach. I will definitely look at all of the microarray data, though. It seems to be functionally the same thing at the level I care about, for sure.

Thank you for the offer on the RNASeq data! How would I go about contacting you officially?

modified 2.7 years ago • written 2.7 years ago by Dirk80

Well, perhaps we can do this here. In which cancer are you particularly interested? I am now planning to put these datasets on my GitHub account ( ) because there appears to be reasonable demand out there to get the TCGA RNA-seq data in a usable format.

There's no data there yet, but the ones that I've listed will be there tomorrow at some point. I am aiming to reprocess all TCGA RNA-seq data and will eventually make an announcement here on Biostars.

written 2.7 years ago by Kevin Blighe69k

Dear Dirk, the files are actually too large to be hosted on GitHub, so I will have to find some other hosting server.

written 2.7 years ago by Kevin Blighe69k