Question: histone mark data from roadmap epigenomics
3.9 years ago by
tonja.r450 wrote:

I have downloaded histone mark data from Roadmap Epigenomics:

but I am not quite sure I understand the content. It is a ChIP-seq analysis with peak calling, so I expected to see the coordinates of the peaks and their heights.

However, the file contains this information: chromosome, start, end, id, something, strand

chr1    9984    10183    B09JPABXX110526:5:1101:19182:47634    0    -

chr1    9990    10189    B09JPABXX110526:5:2207:9781:41112    0    -

I intended to use this data to generate such plots with Gviz, but I am not sure whether the start and end positions will be enough.



chip-seq R • 2.7k views
3.9 years ago by
London, UK
Saulius Lukauskas530 wrote:

Ah, Roadmap data. It's a bit of a headache to understand sometimes, but bear with me:

What you are looking at is the raw (unconsolidated) input, as output by the Pash mapper. It is a misformatted BED file: every value in the start column is one too high. Subtract one from each start coordinate, regardless of strand, i.e. it should be:

chr1    9983    10183    B09JPABXX110526:5:1101:19182:47634    0    -

chr1    9989    10189    B09JPABXX110526:5:2207:9781:41112    0    -
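A minimal sketch of that correction in Python, assuming tab-separated six-column lines like the ones above. The function name `fix_pash_start` is made up for illustration and is not part of any Roadmap tool:

```python
def fix_pash_start(line: str) -> str:
    """Shift the start of one BED line left by one base, ignoring strand."""
    fields = line.rstrip("\n").split("\t")
    # BED starts are 0-based, but this file stores them 1-based,
    # so subtract 1 (and never go below 0).
    fields[1] = str(max(int(fields[1]) - 1, 0))
    return "\t".join(fields)


raw = "chr1\t9984\t10183\tB09JPABXX110526:5:1101:19182:47634\t0\t-"
fixed = fix_pash_start(raw)
chrom, start, end = fixed.split("\t")[:3]
print(fixed)
print(int(end) - int(start))  # interval length after the fix
```

After the shift, end minus start gives the full read length under the usual BED convention (0-based start, exclusive end).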

I am not really sure what the name field encodes or why the score is zero...

Once you fix this (bedtools slop -l 1 -r 0 -g hg19.genome), just pass the fixed BED file to some sort of pileup tool (for instance, MACS).
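To illustrate what "pileup" means here, a toy Python sketch that counts how many reads cover each base. This is only the basic idea, not what MACS actually does (MACS also handles fragment length and background); the `pileup` helper is hypothetical and the intervals are the two corrected reads from above:

```python
from collections import Counter


def pileup(intervals):
    """Count reads covering each base; intervals are 0-based, half-open."""
    depth = Counter()
    for start, end in intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


reads = [(9983, 10183), (9989, 10189)]
depth = pileup(reads)
for pos in (9983, 9989, 10183):
    print(pos, depth[pos])
```

So start and end positions of reads are enough to derive a signal track: the per-base depth is the kind of value a coverage plot shows.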

The way ROADMAP does it:

  1. They shorten all the reads back to 36 bp (so it's consistent across all experiments).
  2. They filter out duplicate reads, and reads that could not have been mapped to the genome at all had they been 36 bp long.
  3. They estimate the fragment length using SPP and run macs2 with the appropriate fragment length (--nomodel --extsize fragment_length) to generate both the peak lists and the signal/fold-change tracks.
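A rough Python sketch of the first two steps, under my own assumptions: reads are (chrom, start, end, strand) tuples in BED coordinates, trimming keeps the 5' end of each read, and a duplicate is a read with identical coordinates and strand. The helper names are hypothetical, the mappability filter and the SPP/macs2 step are not covered, and the actual Roadmap pipeline may differ in the details:

```python
READ_LENGTH = 36  # length the reads are shortened to, as described above


def trim_read(chrom, start, end, strand, length=READ_LENGTH):
    """Keep only the first `length` bases from the 5' end of a read."""
    if end - start <= length:
        return chrom, start, end, strand
    if strand == "-":
        # On the minus strand the 5' end is the right-hand coordinate.
        return chrom, end - length, end, strand
    return chrom, start, start + length, strand


def drop_duplicates(reads):
    """Keep the first read seen for each (chrom, start, end, strand)."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


reads = [
    ("chr1", 9983, 10183, "-"),
    ("chr1", 9983, 10183, "-"),  # exact duplicate of the first read
    ("chr1", 9989, 10189, "-"),
    ("chr1", 500, 700, "+"),     # made-up plus-strand read
]
trimmed = drop_duplicates([trim_read(*r) for r in reads])
for read in trimmed:
    print(read)
```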

I really suggest you skip doing the preprocessing yourself and just get the data from their download page 

Namely, you want to look at section c (peak calling) or section d (signal tracks). Either go with the consolidated reads for your cell line (consolidated = all technical/biological replicates lumped together, recommended), or with the unconsolidated ones (which is what you are looking at at the moment).


Thank you a lot for the explanation! However, it seems that the naming of the consolidated data for CD4 differs from the naming of the unconsolidated data. Is there anything I should watch out for there as well?


written 3.9 years ago by tonja.r450

No, that's expected. The consolidated names use the format "<roadmap epigenome id>.<track>", e.g. E008.H3K56ac for H3K56ac on H9 cells. Unconsolidated names also include the institute and donor IDs. See the spreadsheet of which unconsolidated reads are in consolidated datasets.
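If you need to tell the two naming schemes apart in a script, something like this Python sketch could work. It assumes epigenome IDs always look like "E" followed by three digits, which I am inferring from the E008 example, and `parse_consolidated` is a hypothetical helper:

```python
import re


def parse_consolidated(name):
    """Split '<epigenome id>.<track>' into its parts, or return None."""
    epigenome_id, sep, track = name.partition(".")
    # Assumption: IDs are 'E' plus three digits, e.g. E008.
    if not sep or not track or not re.fullmatch(r"E\d{3}", epigenome_id):
        return None
    return epigenome_id, track


print(parse_consolidated("E008.H3K56ac"))
print(parse_consolidated("not_a_consolidated_name"))
```

Splitting on the first dot only means a track name that itself contains a dot stays in one piece.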

written 3.9 years ago by Saulius Lukauskas530

Sauliau, very informative answer, I did learn something new, thanks.

modified 3.9 years ago • written 3.9 years ago by PoGibas4.7k
3.9 years ago by
United States
Friederike2.8k wrote:

Can you provide the link from where you downloaded the data?

The snippet you're showing looks like a BED file of individual reads, i.e. each line in the file corresponds to a single read. In your case, the reads are overlapping and 200 bp long. Since Roadmap didn't make use of 200 bp long reads, this probably means that the original reads were artificially extended to 200 bp, which probably equals the fragment size that was used. This is a bit unorthodox. I haven't personally worked with Roadmap data, but I have heard that the best place to download it is not the Roadmap page itself, but this place here:

There, you can also scroll down a bit to download the peak regions instead of the reads if that's what you need/want. Does that help?


The data on the page you provided is amazing. However, we want to use the data from the Farh paper: FOXP3+CD25hiCD127lo/− regulatory (Tregs), CD25−CD45RA+CD45RO− naive (Tnaive) and CD25−CD45RA−CD45RO+ memory (Tmem) T cells, and ex vivo phorbol myristate acetate (PMA)/ionomycin stimulated CD4+ T cells separated into IL-17-positive (CD25−IL17A+; TH17) and IL-17-negative (CD25−IL17A−; THstim).
I was able to find them on the NCBI page. As it seems to me now, I can visualize the data almost as I want by clicking on "View Data", but I still have no idea what kind of data they use for it, as some tracks have only SRA and BED formats and others have BED and WIG.


written 3.9 years ago by tonja.r450

Lines 399-405 of the metadata spreadsheet they provide indicate that the cells you mention should be available through the website I recommended.

What kind of data format you will need also depends on the script you want to use for your visualization (i.e. what format does the visualization script require?).

written 3.9 years ago by Friederike2.8k