I am well familiar with microarray-based transcriptomics but don't have much experience with RNA-Seq.
I am interested in published transcriptomics data, found in the GEO database at the NCBI. For microarray projects, one can download the data in various formats that I can work with. However, for RNA-Seq projects, the GEO database offers only the download as "bedgraph" files. I read and understand what these are, but I am not sure how to use them for analyzing transcriptomics data.
I expected some output with gene names and expression values for the different conditions. What I get instead is a bedgraph file (one track per condition). The GEO data is not human, and the first three columns of the bedgraph file are supposed to contain position information. This is a small section of one of the files:
track type=bedGraph name="TopHat - read coverage"
C36799851 0 27 0
C36799851 27 98 1
C36800049 0 0 0
C36800049 0 1 2
I understand that these files are meant to be displayed in the UCSC genome browser. I tried to upload the file, but got an error message about insufficient memory (the bedgraph files are huge). So, my first two questions are:
- how am I supposed to find the correct genome browser that matches the bedgraph file? I know the organism, but there might be different versions, releases, etc.
- should I use the 'upload' function, and what can I do about the memory problem?
My most important question, however, is more fundamental. Even if I manage to display multiple tracks like this in the browser, how can I make sense of these data, e.g. by searching for genes that show big expression changes between two conditions? There must be a solution without using a genome browser - maybe by mapping the positional information in the bedgraph files to the genes.
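To illustrate the mapping idea, here is a minimal Python sketch that streams through a bedgraph file and computes the mean coverage per base for each gene. It rests on several assumptions: the gene coordinates in `genes` are hypothetical placeholders (a real analysis would need an annotation that uses the same contig names and assembly version as the bedgraph file), the coordinates are 0-based and half-open as in the bedGraph format, and the function names `parse_bedgraph` and `coverage_per_gene` are made up for this example. Mean coverage is only a rough proxy for expression and is not the same as raw read counts.

```python
from collections import defaultdict


def parse_bedgraph(lines):
    """Yield (chrom, start, end, value) tuples, skipping header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


def coverage_per_gene(bedgraph_lines, genes):
    """Return {gene_name: mean coverage per base}.

    genes maps gene_name -> (chrom, start, end), 0-based half-open.
    """
    # Group genes by contig so each bedgraph line is only compared
    # against genes on the same contig.
    by_chrom = defaultdict(list)
    for name, (chrom, gstart, gend) in genes.items():
        by_chrom[chrom].append((gstart, gend, name))

    totals = defaultdict(float)
    for chrom, start, end, value in parse_bedgraph(bedgraph_lines):
        for gstart, gend, name in by_chrom.get(chrom, ()):
            overlap = min(end, gend) - max(start, gstart)
            if overlap > 0:
                totals[name] += overlap * value

    return {
        name: totals[name] / (gend - gstart)
        for name, (chrom, gstart, gend) in genes.items()
        if gend > gstart
    }


# The sample lines from the file excerpt above.
lines = [
    'track type=bedGraph name="TopHat - read coverage"',
    "C36799851 0 27 0",
    "C36799851 27 98 1",
    "C36800049 0 0 0",
    "C36800049 0 1 2",
]

# Hypothetical gene coordinates, for illustration only.
genes = {
    "geneA": ("C36799851", 20, 60),
    "geneB": ("C36800049", 0, 1),
}

result = coverage_per_gene(lines, genes)
print(result)
```

Because the parser reads one line at a time, an open file handle can be passed in place of the `lines` list, so a huge bedgraph file never has to fit in memory. The inner loop over genes is a simple linear scan; for a large annotation, an interval index would be a faster choice.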
If you want to use this data, you should find the original FASTQ files and do the alignment and counting yourself instead of depending on these derived files.
Ok, but I assume the derived files must be good for SOMETHING. Otherwise, they wouldn't offer them for download.
Yes, for visualization on a genome browser. For anything else they are utterly useless. You really should get raw data and get raw counts from that.
Ok. Three aspects: i) as explained by Luis Nassar and the documents he links to, the GEO file as such cannot be displayed in a genome browser; it has to be converted to a 'bigwig' file and put on a publicly accessible web server (which I don't have). If the GEO database offers this file for display in a genome browser, why don't they just offer a bigwig file directly on one of their servers? ii) I tried to get the raw data, but there is no easy path from the GEO entry to a FASTQ file. Maybe there is one, but I just don't see it. iii) With the microarray-based projects, GEO offers a path to the raw data, but they also offer processed data in several formats. I am not talking about 'analyzed data', but some kind of text format that has gene names and intensities for the different conditions.