Histone mark data from Roadmap Epigenomics
7.7 years ago
tonja.r ▴ 600

I have downloaded histone mark data from Roadmap Epigenomics: CD4+_CD25-_CD45RO+_memory_primary_cells/H3K4me3/GSM772862_BI.CD4+_CD25-_CD45RO+_Memory_Primary_Cells.H3K4me3.Donor_62.bed

but I am not quite sure I understand the content. It is a ChIP-seq analysis with peak calling, so I expected to see the coordinates of the peaks and their heights.

However, the file contains the following columns: chromosome, start, end, id, something, strand

chr1    9984    10183    B09JPABXX110526:5:1101:19182:47634    0    -
chr1    9990    10189    B09JPABXX110526:5:2207:9781:41112    0    -

I intended to use this data to generate plots with Gviz, but I am not sure whether the start and end positions alone will be enough.


R ChIP-Seq • 3.8k views
7.7 years ago

Ah, Roadmap data. It's a bit of a headache to understand sometimes, but bear with me:

What you are looking at is the raw (unconsolidated) read data as output by the Pash mapper. It is a misformatted BED file: every value in the start column should have one subtracted from it (regardless of strand), i.e. it should be:

chr1    9983    10183    B09JPABXX110526:5:1101:19182:47634    0    -
chr1    9989    10189    B09JPABXX110526:5:2207:9781:41112    0    -
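
The correction shown in the two lines above can be sketched in a few lines of Python. This is only an illustration: the function name `fix_start` is made up here, and the records are assumed to be tab-separated, as BED files normally are.

```python
def fix_start(line):
    """Shift the start of one tab-separated BED record down by 1, ignoring strand."""
    fields = line.rstrip("\n").split("\t")
    # Clamp at 0 so a record at the very start of a chromosome stays valid.
    fields[1] = str(max(0, int(fields[1]) - 1))
    return "\t".join(fields)


# The two records from the question (assumed tab-separated).
raw = [
    "chr1\t9984\t10183\tB09JPABXX110526:5:1101:19182:47634\t0\t-",
    "chr1\t9990\t10189\tB09JPABXX110526:5:2207:9781:41112\t0\t-",
]
fixed = [fix_start(record) for record in raw]
for record in fixed:
    print(record)
```

Only the start column changes; the end, name, score and strand fields are passed through untouched.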

I am not really sure what the name field encodes or why the score is zero...

Once you fix this (for example with bedtools slop: bedtools slop -i <input.bed> -g hg19.genome -l 1 -r 0), just pass the fixed BED file to some sort of pileup tool (for instance, MACS).
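
To see why start and end positions are enough for a pileup, here is a minimal sketch that counts how many reads cover each base. It is not what MACS does internally (MACS also handles fragment extension, scaling and background modelling); the function name `pileup` and the single-chromosome simplification are assumptions made for this example.

```python
from collections import Counter


def pileup(intervals):
    """Per-base read depth from half-open (start, end) intervals on one chromosome."""
    depth = Counter()
    for start, end in intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


# The two corrected reads from above, as (start, end) on chr1.
reads = [(9983, 10183), (9989, 10189)]
depth = pileup(reads)

# Depth is 1 where only one read lies and 2 where both overlap.
print(depth[9983], depth[9989], depth[10182], depth[10188])
```

For real data a per-base loop like this is far too slow; it is only meant to show that the signal track is fully determined by the interval coordinates.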

The way ROADMAP does it:

  1. They shorten all the reads back to 36 bp (so it's consistent across all experiments).
  2. They filter duplicate reads, and reads that could not have been mapped to the genome at all had they been 36 bp long.
  3. They estimate the fragment length using SPP and run macs2 with the appropriate fragment length (--nomodel --extsize <fragment_length>) to generate both the peak lists and the signal/fold-change tracks.
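
Steps 1 and 2 above can be illustrated roughly as follows. This is a sketch on BED-style intervals, not Roadmap's actual pipeline: the helper names are hypothetical, the mappability filter and the SPP/macs2 step are not shown, and duplicates are assumed to mean reads sharing chromosome, 5' position and strand.

```python
READ_LEN = 36  # target read length from step 1


def trim_to_read_length(start, end, strand, read_len=READ_LEN):
    """Keep only the read_len bases at the 5' end of a half-open interval."""
    if end - start <= read_len:
        return start, end
    if strand == "+":
        return start, start + read_len
    return end - read_len, end  # minus strand: the 5' end sits at `end`


def drop_duplicates(reads):
    """Keep the first read seen for each (chrom, 5' position, strand)."""
    seen = set()
    kept = []
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        key = (chrom, five_prime, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, end, strand))
    return kept


reads = [
    ("chr1", 9983, 10183, "-"),
    ("chr1", 9983, 10183, "-"),  # exact duplicate of the first read
    ("chr1", 9989, 10189, "-"),
]
trimmed = [
    (chrom, *trim_to_read_length(start, end, strand), strand)
    for chrom, start, end, strand in drop_duplicates(reads)
]
print(trimmed)
```

Trimming is done from the 5' end because that is where the sequenced read starts; for minus-strand reads that means keeping the bases next to the end coordinate.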

I really suggest you skip doing the preprocessing yourself and just get the data from their download page.

Namely, you want to look at section c (peak calling) or section d (signal tracks). Either go with the consolidated reads for your cell line (consolidated = all technical/biological replicates lumped together, recommended), or with the unconsolidated ones (which is what you are looking at right now).


Thanks a lot for the explanation! However, it seems that the naming of the consolidated data for CD4 differs from the naming of the unconsolidated data. Is there anything I should watch out for there as well?


No, that's expected. The consolidated names use the format "<roadmap epigenome id>.<track>", e.g. E008.H3K56ac for H3K56ac on H9 cells. Unconsolidated names have the institute and donor IDs as well. See the spreadsheet of which unconsolidated reads are in which consolidated datasets.
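
As a rough illustration of the two naming schemes, the sketch below splits both kinds of names into their parts. The consolidated format is the one described above; the unconsolidated layout (`<accession>_<institute>.<sample>.<mark>.<donor>.bed`) is only inferred from the single filename in the question and may not hold for every file. Both function names are hypothetical.

```python
def parse_consolidated(name):
    """Split '<roadmap epigenome id>.<track>' into its two parts."""
    epigenome_id, track = name.split(".", 1)
    return epigenome_id, track


def parse_unconsolidated(filename):
    """Split '<accession>_<institute>.<sample>.<mark>.<donor>.bed' (assumed layout)."""
    stem = filename[: -len(".bed")] if filename.endswith(".bed") else filename
    prefix, sample, mark, donor = stem.split(".")
    accession, institute = prefix.split("_", 1)
    return {
        "accession": accession,
        "institute": institute,
        "sample": sample,
        "mark": mark,
        "donor": donor,
    }


consolidated = parse_consolidated("E008.H3K56ac")
unconsolidated = parse_unconsolidated(
    "GSM772862_BI.CD4+_CD25-_CD45RO+_Memory_Primary_Cells.H3K4me3.Donor_62.bed"
)
print(consolidated)
print(unconsolidated)
```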


Sauliau, very informative answer, I did learn something new, thanks.

7.7 years ago

Can you provide the link from where you downloaded the data?

The snippet you're showing looks like a BED file of individual reads, i.e. each line in the file corresponds to a single read. In your case, the reads overlap and are 200 bp long. Since Roadmap didn't use 200 bp reads, this probably means that the original reads were artificially extended to 200 bp, which probably equals the fragment size that was used (see http://informatics.fas.harvard.edu/wp-content/uploads/2014/06/chipseq2.png). This is a bit unorthodox. I haven't personally worked with Roadmap data, but I have heard that the best place to download it is not the Roadmap page itself, but this one: http://egg2.wustl.edu/roadmap/web_portal/processed_data.html#ChipSeq_DNaseSeq

There, you can also scroll down a bit to download the peak regions instead of the reads if that's what you need/want. Does that help?
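
The read-length observation can be checked directly from the coordinates. The sketch below computes the length of the two records from the question under both coordinate conventions; the function names are made up for this example. Under the standard BED convention (0-based, half-open) the intervals are 199 bp, and they only come out at 200 bp if the start is treated as 1-based, which would be consistent with the start column being off by one.

```python
def bed_length(start, end):
    """Length under the BED convention: 0-based start, end exclusive."""
    return end - start


def one_based_inclusive_length(start, end):
    """Length if start is 1-based and both ends are inclusive."""
    return end - start + 1


# (start, end) of the two records shown in the question.
rows = [(9984, 10183), (9990, 10189)]
lengths = [(bed_length(s, e), one_based_inclusive_length(s, e)) for s, e in rows]
for bed_len, inclusive_len in lengths:
    print(bed_len, inclusive_len)
```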


The data on the page you provided is amazing. However, we want to use the data from the Farh paper: FOXP3+CD25hiCD127lo/-regulatory (Tregs), CD25-CD45RA+CD45RO- naive (Tnaive) and CD25-CD45RA-CD45RO+ memory (Tmem) T cells, and ex vivo phorbol myristate acetate (PMA)/ionomycin stimulated CD4+ T cells separated into IL-17-positive (CD25-IL17A+; TH17) and IL-17-negative (CD25-IL17A-; THstim).

I was able to find them on the NCBI page. As it seems to me now, I can visualize the data almost as I want by clicking on "View Data", but I still have no idea what kind of data is used for that, as some tracks have only SRA and BED formats and others have BED and WIG.


Lines 399-405 of the metadata spreadsheet they provide indicate that the cells you mention should be available through the website I recommended.

Which data format you need also depends on the script you want to use for your visualization (i.e. what format does the visualization script require?).
