Question: GEO database - how to use their bedgraph files
Suicyte (Germany) wrote, 4 months ago:

I am well familiar with microarray-based transcriptomics but don't have much experience with RNA-Seq.

I am interested in published transcriptomics data, found in the GEO database at the NCBI. For microarray projects, one can download the data in various formats that I can work with. However, for RNA-Seq projects, the GEO database offers only the download as "bedgraph" files. I read and understand what these are, but I am not sure how to use them for analyzing transcriptomics data.

I expected some output with gene names and expression values for the different conditions. What I get is a bedgraph format (one track per condition). The GEO data is not human, and the first three columns of the bedgraph file are supposed to contain position information. This is a small section of one of the files:

track type=bedGraph name="TopHat - read coverage"
C36799851       0       27      0
C36799851       27      98      1
C36800049       0       0       0
C36800049       0       1       2

I understand that these files are meant to be displayed in the UCSC genome browser. I tried to upload the file, but got an error message about too little memory (the bedgraph files are huge). So, my first two questions are:

  1. How am I supposed to find the correct genome browser that matches the bedgraph file? I know the organism, but there might be different versions, releases, etc.
  2. Should I use the 'upload' function, and what can I do about the memory problem?

My most important question, however, is more fundamental. Even if I manage to display multiple tracks like this in the browser, how can I make sense of these data, e.g. by searching for genes that show big expression changes between two conditions? There must be a solution without using a genome browser - maybe by mapping the positional information in the bedgraph files to the genes.

Any idea?

Tags: rna-seq, bedgraph

If you are interested in using this data, then you should find the original FASTQ files and do the alignment and counting yourself instead of depending on these derived files.

written 4 months ago by genomax

Ok, but I assume the derived files must be good for SOMETHING. Otherwise, they wouldn't offer them for download.

written 4 months ago by Suicyte

Yes, for visualization on a genome browser. For anything else they are utterly useless. You really should get raw data and get raw counts from that.

written 4 months ago by ATpoint

Ok, three aspects: i) As explained by Luis Nassar and the documents he links to, the GEO file as such cannot be displayed in a genome browser; it has to be converted to a 'bigwig' file and put on a publicly accessible web server (which I don't have). If the GEO database offers this file for display in a genome browser, why don't they just offer a bigwig file directly on one of their servers? ii) I tried to get the raw data, but there is no easy path from the GEO entry to some FASTQ file. Maybe there is one, but I just don't get it. iii) With the microarray-based projects, GEO offers a path to the raw data, but they also offer processed data in several formats. I am not talking about 'analyzed data', but some kind of text format that has gene names and intensities for the different conditions.

modified 4 months ago • written 4 months ago by Suicyte
Luis Nassar (UCSC Genome Browser) wrote, 4 months ago:

Hello,

To answer your first question regarding the correct Genome Browser, you will have to check the assembly used in GEO and see if we have the corresponding assembly for that organism as a native assembly. You can see this in the organism gateway page, e.g. https://genome.ucsc.edu/cgi-bin/hgGateway. This page includes an NCBI assembly accession number, which should match the one given in GEO. If you are unsure, you can send us the assembly or GEO page and we can check.

If we do not have the assembly, you have the option of creating an assembly hub (https://genome.ucsc.edu/goldenpath/help/hubQuickStartAssembly.html), though that may not be worth doing if you just want to get to the raw data.

Regarding the second question, we have a limit on the size of files that can be uploaded as custom tracks. Beyond that limit, you would have to create a big* data track, hosted in a remote location. In the case of bedGraph (https://genome.ucsc.edu/goldenPath/help/bedgraph.html), you would convert it to bigWig. See the following page and example for more information (https://genome.ucsc.edu/goldenPath/help/bigWig.html#Ex3). Here is a help page on hosting as well (https://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html#Hosting).
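
Before conversion, the bedGraph usually needs some tidying: the bigWig help page asks for the "track" and "browser" lines to be removed and for the data to be sorted by chromosome and then by start position. A minimal Python sketch of that preparation step follows. The function names clean_bedgraph and write_bedgraph are made up for illustration, and dropping zero-length intervals (such as the "C36800049 0 0 0" line in the question) is my assumption, since they cover no bases. It loads everything into memory, so for the very large GEO files a streaming approach or the Unix sort command shown on the help page is probably the better choice.

```python
import io


def clean_bedgraph(lines):
    """Return sorted (chrom, start, end, value) records from bedGraph lines."""
    records = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and track/browser/comment headers.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        start, end = int(start), int(end)
        # Assumption: intervals with end <= start cover no bases, so drop them.
        if end <= start:
            continue
        records.append((chrom, start, end, float(value)))
    # Sort by chromosome name, then by start position.
    records.sort(key=lambda r: (r[0], r[1]))
    return records


def write_bedgraph(records, handle):
    """Write records as tab-separated bedGraph data lines."""
    for chrom, start, end, value in records:
        handle.write(f"{chrom}\t{start}\t{end}\t{value:g}\n")


# Example data modelled on the snippet in the question (order shuffled).
example = """track type=bedGraph name="TopHat - read coverage"
C36799851       27      98      1
C36799851       0       27      0
C36800049       0       0       0
C36800049       0       1       2
"""

cleaned = clean_bedgraph(example.splitlines())
out = io.StringIO()
write_bedgraph(cleaned, out)
print(out.getvalue(), end="")
```

The cleaned file, together with a chrom.sizes file for the matching assembly, is what the bedGraphToBigWig utility described on the bigWig help page takes as input.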

Once you have the data in the Genome Browser, you can do more than just visualize it. For example, you can intersect the data with gene tracks (if they are available for that assembly) using tools like the Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables).

As you have said, however, if you are just trying to map the data to genes there may be other more direct approaches to take.
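
As a rough illustration of what such a direct approach could look like, here is a small Python sketch that averages bedGraph coverage over gene intervals. Everything about the gene table is an assumption: the gene names and coordinates below are invented, and in practice they would come from an annotation file for the same assembly. Note also that this yields mean read depth per gene, not raw read counts, and it is not normalised between samples, so it is only suitable for a first look and not for a proper differential expression analysis.

```python
from collections import defaultdict


def load_coverage(lines):
    """Parse bedGraph lines into {chrom: [(start, end, value), ...]}."""
    cov = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and track/browser headers.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        cov[chrom].append((int(start), int(end), float(value)))
    return cov


def mean_coverage(cov, chrom, start, end):
    """Mean per-base coverage over the half-open interval [start, end)."""
    total = 0.0
    for s, e, v in cov.get(chrom, []):
        overlap = min(e, end) - max(s, start)
        if overlap > 0:
            total += overlap * v
    return total / (end - start)


bedgraph = """track type=bedGraph name="TopHat - read coverage"
C36799851       0       27      0
C36799851       27      98      1
C36800049       0       1       2
"""

# Hypothetical gene intervals (name -> chrom, start, end); not real annotation.
genes = {
    "geneA": ("C36799851", 0, 98),
    "geneB": ("C36799851", 50, 98),
}

cov = load_coverage(bedgraph.splitlines())
result = {name: mean_coverage(cov, *loc) for name, loc in genes.items()}
for name, value in result.items():
    print(name, round(value, 3))
```

Running the same calculation on the track for each condition and comparing the per-gene values would give a crude picture of which genes change. The linear scan over all intervals for every gene is slow on a full-size file; sorting the intervals and using a binary search or an interval index would be the obvious improvement.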

Hopefully this answers some of your questions. If you have additional questions regarding the Genome Browser, the best way to reach us is to email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

We do periodically check biostars, in which case the UCSC tag is helpful.
