Histone mark data from Roadmap Epigenomics
7.7 years ago
tonja.r ▴ 600

I am not quite sure I understand the content of the data. It is a ChIP-seq analysis with peak calling, so I expected to see the coordinates of the peaks and their heights.

However, the file contains these columns instead: chromosome, start, end, ID, an unknown field, and strand:

chr1    9984    10183    B09JPABXX110526:5:1101:19182:47634    0    -
chr1    9990    10189    B09JPABXX110526:5:2207:9781:41112    0    -


I intended to use this data to generate plots with Gviz, but I am not sure whether start and end positions alone will be enough.

R ChIP-Seq • 3.8k views
7.7 years ago

Ah, Roadmap data. It's a bit of a headache to understand sometimes, but bear with me:

What you are looking at is the raw (unconsolidated) read data as output by the Pash mapper. It is a misformatted BED file: BED start coordinates are 0-based, so every value in the start column should have one subtracted from it (regardless of strand), i.e. it should be:

chr1    9983    10183    B09JPABXX110526:5:1101:19182:47634    0    -
chr1    9989    10189    B09JPABXX110526:5:2207:9781:41112    0    -
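
The same correction can be sketched in a few lines of Python. This is only an illustration, not what Roadmap or Pash do internally: the function name fix_start is made up, and splitting on any whitespace and clamping the start at zero are my own assumptions.

```python
def fix_start(line: str) -> str:
    """Return one BED line with its start column shifted down by one.

    Assumption: columns are separated by whitespace (tabs or spaces);
    the output is always tab-separated.
    """
    fields = line.split()
    # Subtract one from the start, regardless of strand.
    # Clamp at zero so a start of 0 never becomes negative.
    fields[1] = str(max(0, int(fields[1]) - 1))
    return "\t".join(fields)


raw = "chr1    9984    10183    B09JPABXX110526:5:1101:19182:47634    0    -"
fixed = fix_start(raw)
print(fixed)
```

Only the start column changes; the end, name, score and strand fields are passed through untouched.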


I am not sure what the name field encodes or why the score is zero.

Once you have fixed this (for example with bedtools slop: bedtools slop -l 1 -r 0 -g hg19.genome), just pass the fixed BED file to some sort of pileup tool (for instance, MACS).
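
To show that start and end positions are all a pileup needs, here is a minimal Python sketch of a raw coverage computation. It is not what MACS does (no fragment extension, normalisation or background model), and the function name pileup and the bedGraph-like output are assumptions chosen for illustration.

```python
from collections import defaultdict


def pileup(intervals):
    """Compute read depth from 0-based, half-open (start, end) intervals.

    All intervals are assumed to lie on the same chromosome.
    Returns a list of (start, end, depth) segments with depth > 0.
    """
    delta = defaultdict(int)
    for start, end in intervals:
        delta[start] += 1  # depth rises where a read begins
        delta[end] -= 1    # and falls where it ends
    segments = []
    depth = 0
    positions = sorted(delta)
    for pos, nxt in zip(positions, positions[1:]):
        depth += delta[pos]
        if depth > 0:
            segments.append((pos, nxt, depth))
    return segments


# The two corrected reads from the example above.
reads = [(9983, 10183), (9989, 10189)]
coverage = pileup(reads)
for segment in coverage:
    print(segment)
```

Each segment can be written out as one bedGraph line (chromosome, start, end, value), which is a format most track viewers can plot.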

1. Roadmap shortens all the reads back to 36 bp (so it's consistent across all experiments).
2. They filter out duplicate reads, as well as reads that could not have been mapped to the genome at all had they been 36 bp long.
3. They estimate fragment length using SPP and run macs2 with the appropriate fragment lengths (--nomodel --extsize fragment_length) to generate both the peak lists and the signal/fold-change tracks.
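
Steps 2 and 3 can be roughly sketched in Python as follows. This is an illustration under assumptions, not the Roadmap pipeline itself: "duplicate" is taken to mean identical chromosome, start, end and strand, the mappability filter is not covered, the helper names and the file name are hypothetical, and the fragment length of 200 is a placeholder for whatever SPP would estimate. The MACS2 options used (-t, -f, -g, -n, --nomodel, --extsize, -B, --SPMR) are real callpeak options, but the particular combination is my choice.

```python
def drop_duplicates(reads):
    """Keep the first read for each (chrom, start, end, strand) combination.

    Each read is a BED6 tuple: (chrom, start, end, name, score, strand).
    """
    seen = set()
    kept = []
    for read in reads:
        chrom, start, end, _name, _score, strand = read
        key = (chrom, start, end, strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def macs2_command(bed_path, fragment_length, name):
    """Build (but do not run) a macs2 callpeak command with a fixed fragment length."""
    return [
        "macs2", "callpeak",
        "-t", bed_path,       # treatment reads
        "-f", "BED",          # input format
        "-g", "hs",           # human effective genome size
        "-n", name,           # prefix for output files
        "--nomodel",          # skip MACS2's own fragment-size model
        "--extsize", str(fragment_length),
        "-B", "--SPMR",       # also write signal tracks, scaled per million reads
    ]


reads = [
    ("chr1", 9983, 10183, "readA", 0, "-"),
    ("chr1", 9983, 10183, "readB", 0, "-"),  # positional duplicate of readA
    ("chr1", 9989, 10189, "readC", 0, "-"),
]
unique_reads = drop_duplicates(reads)
command = macs2_command("fixed.bed", 200, "sample")  # hypothetical file and length
print(len(unique_reads), "reads kept")
print(" ".join(command))
```

The command is returned as a list so that it could be handed to subprocess.run without any shell quoting issues.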

I really suggest you skip doing the preprocessing yourself and just get the data from their download page.

Namely, you want to look at section c (peak calling) or section d (signal tracks). Either go with the consolidated reads for your cell line (consolidated = all technical/biological replicates lumped together; recommended), or with the unconsolidated ones (which is what you are looking at right now).


Thanks a lot for the explanation! However, it seems that the naming of the consolidated data for CD4 differs from the naming of the unconsolidated data. Is there anything I should be aware of there as well?


No, that's expected. The consolidated names use the format "<roadmap epigenome id>.<track>", e.g. E008.H3K56ac for H3K56ac on H9 cells. Unconsolidated names also include the institute and donor IDs. See the spreadsheet listing which unconsolidated reads went into which consolidated datasets.
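
If you need to tell the two naming schemes apart programmatically, something like this Python sketch might do. It assumes that epigenome IDs are always "E" followed by three digits and that the track is everything after the first dot; the function name and the unconsolidated-style example name are made up.

```python
import re

# Assumption: epigenome IDs look like E001, E008, ... (an "E" plus three digits).
CONSOLIDATED_NAME = re.compile(r"^(E\d{3})\.(.+)$")


def parse_consolidated(name):
    """Split "<roadmap epigenome id>.<track>" into (epigenome_id, track).

    Returns None when the name does not follow the consolidated format.
    """
    match = CONSOLIDATED_NAME.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


parsed = parse_consolidated("E008.H3K56ac")
print(parsed)

# A made-up name with institute and donor parts is not recognised.
other = parse_consolidated("INSTITUTE.CellType.H3K56ac.DonorX")
print(other)
```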


Sauliau, very informative answer, I did learn something new, thanks.

7.7 years ago

There, you can also scroll down a bit to download the peak regions instead of the reads if that's what you need/want. Does that help?


The data on the page you provided is amazing. However, we want to use the data from the Farh paper: FOXP3+CD25hiCD127lo/- regulatory (Tregs), CD25-CD45RA+CD45RO- naive (Tnaive) and CD25-CD45RA-CD45RO+ memory (Tmem) T cells, and ex vivo phorbol myristate acetate (PMA)/ionomycin-stimulated CD4+ T cells separated into IL-17-positive (CD25-IL17A+; TH17) and IL-17-negative (CD25-IL17A-; THstim).

I was able to find them on the NCBI page. As it seems to me now, I can visualize the data almost as I want by clicking on "View Data", but I still have no idea what kind of data is used for that, as some tracks have only SRA and BED formats and others have BED and WIG.


Lines 399-405 of the metadata spreadsheet they provide indicate that the cells you mention should be available through the website I recommended.

What kind of data format you will need also depends on the script you want to use for your visualization (i.e. what format does the visualization script require?)