I have peak files for different histone modifications. How can I see them?
2
0
Entering edit mode
9.7 years ago
Affan ▴ 300

So I have some data from this paper. For example, one of the data files is this peak file for histone modification H3K4me2. Its a .BED file. I'm supposed to take this data and process it using methods of another paper I am reading. I've read the papers and understand what I am supposed to do but since this is my first time working with actual data, I am a bit lost about it.

So if I understand this correctly, the peak files (for example like above) gives me reads of DNA sequences that are bound to that specific histone modification. So in theory I am looking at a lot of peaks when referenced against the genome (which is mm9). I have the reference files in .fa format. Different chromosomes are in separate files.

So what software can I use to deal with this data? What does the .BED format tell me? I am looking to work with in R or MATLAB. Basically, how do I get started?

Secondary question, in the GEO link, at the bottom they have two additional "replicate" text files. What is the important of those files?

Edit: Additionally, if anyone has any papers to a "introductory workflow" for this type of data/processing, please let me know.

Edit (based on alex and devon's replies)

Thanks, so I put it up on the genome browser but I was expecting to see peaks. I don't really see peaks. I think its because the data in the file looks like: chr1 4847005 4847705. Does this mean that the authors already looked at the peaks using raw data and then concluded "between 4847005 and 4847705 all the nucleosomes have this particular histone modification". And that's what I am looking at in the genome browser? Here is a screenshot of the custom track on the genome browser:

< image not found >

visualization • 4.5k views
ADD COMMENT
1
Entering edit mode

I'm actually doing similar analysis. See the review references I posted on my question: Fastq file to ALN file

ADD REPLY
0
Entering edit mode

The paper that the data comes from talks about how it is a higher quality data than previous studies. It's got more sequencing depth. That being said, I am going to download raw data and trying to replicate their peak files for my own sake and learning.

ADD REPLY
3
Entering edit mode
9.7 years ago

The peak files (in BED format) tell you where their peak caller says peaks (significantly higher than expected numbers of aligned reads) are located. You can look at that in IGV or most any other genome browser to see how those peaks might relate to a gene that you're interested in. BED format is a simple text file, so you can open that in pretty much any language if you want.

Regarding how you get started, it depends entirely on what you want to do.

The "replicate" files seem to be a poor substitute for SAM/BAM files.

Honestly, if you want to do anything with this dataset then you'd be better off downloading the raw reads from SRA, mapping them, and then continuing with whatever you want to do.

ADD COMMENT
0
Entering edit mode

Thanks, so I put it up on the genome browser but I was expecting to see peaks. I don't really see peaks. I think its because the data in the file looks like: chr1 4847005 4847705. Does this mean that the authors already looked at the peaks using raw data and then concluded "between 4847005 and 4847705 all the nucleosomes have this particular histone modification". And that's what I am looking at in the genome browser? Here is a screenshot of the custom track on the genome browser:

< image not found >

ADD REPLY
1
Entering edit mode

Sorry, I must have missed your comment!

So the peak files are a bit annoying in that they just mark where they found a peak. They don't actually show anything about how high the peak was, or where its summit is, or what the neighbouring coverage looks like. If you sorted and viewed the "replicate" files (they're basically BED files, though you might have to edit them) then perhaps you could use that to actually see the evidence for the peak calls.

BTW, I wouldn't recommend drawing the conclusion that all nucleosomes have a given modification within a peak region. A peak just means that nucleosomes are significantly more likely to have that modification than those elsewhere.

ADD REPLY
0
Entering edit mode

so all these small black bars (http://i.imgur.com/yWjAv9j.png) means what exactly then? I thought it meant that between the two positions, the nucleosomes all have this modification (given that the data is good and had good enrichment). My goal here is to use this data as follows: split up the mouse genome in 200bp resolution bins, and for each bin determine if a modification is there or not. So I was thinking that using these files which give me "chr1 4847005 4847705" I can safely say that when I am in those positions I can mark a "yes for the histone mod".

Again thank you very much. I don't have a biology/bioinformatics background and I find that talking to someone like yourself helps me understand better.

ADD REPLY
1
Entering edit mode

They're presumably just called peaks. It would be difficult to ever determine that all of the histones in a given stretch of DNA always have some given modification. ChIP-seq will tell you whether a region is enriched for some modification. I've not seen an absolute method, which is unsurprising given that it'd be difficult to accomplish (you'd have to come up with a good way to target a region of interest after cross-linking, after which you could probably use mass spec. to get good quantification, as any other method would depend on equal efficiency of IP/fluorescence/etc., which would be a questionable assumption).

What's the end goal here? That's probably the better place to start.

ADD REPLY
1
Entering edit mode

The data is from the paper I linked above. I am a applied math student who is working with Hidden Markov Models. I am trying to apply the methods of Ernst and Kellis on this dataset. They split up the genome in 200bp and without a priori information, they assign a state to each of these 200bp regions. The states are later identified as biologically significant areas (promoters, active intergenic and so on).

The reason for this question was to answer some questions about their figure 1. As you can see, they have chip seq data for the marks (noted by different colors for different modifications) and the peaks are shown. Then they binarize the data (ie the black bars) to show whether each 200bp region exhibits that mark or not. The threshold for whether a region displays that mark is based on the number of tags (chip seq reads) in that region combined with a poisson distribution.

So I guess, I understand you now. The chip seq data does give you some sense of enrichment for the modifications. However you can never be 100% certain that all the nucleosomes in that peak area (ie "chr1 4847005 4847705") have the histone modification. You'd have to use some sort of probability function to determine that.

Am I correct?

ADD REPLY
1
Entering edit mode

Ah, makes sense now. Yeah, you've got the idea. Making data like this binary works OK for some uses but not others. How well it works will also depend on how good the underlying data is (there are a lot of crappy antibodies out there and those will lead to vastly poorer results if one were to do this).

ADD REPLY
0
Entering edit mode

Well I guess the problem is that I have data that says where the peak is but I don't have the number of reads in that peak region. I guess this could be solved by looking at raw data (or the replicates like you mentioned?).

The data apparently is of really good quality and is the best amongst similar type of data out there.

Anyways, thank you for everything. I voted your answers and accepted the answer.

ADD REPLY
0
Entering edit mode

Devon, sorry to bother you and you've been a great help. I was kinda hoping this notification might lead you to my comment earlier as I am not sure if I got the right understanding. Thanks.

ADD REPLY
1
Entering edit mode
9.7 years ago

To visualize, perhaps look into preparing a custom track on the UCSC Genome Browser.

To do fast and memory efficient set and statistical operations on BED files, consider looking into a command-line toolkit like BEDOPS, instead or R or MATLAB.

R, in particular, is not very memory efficient. For example, GenomicRanges, a package used to work with genomic data, suggests iterating over chunks of large input sets in its documentation.

ADD COMMENT
0
Entering edit mode

Thank you. I've added it as a custom track but I am confused to what I am looking at. Please see my comment above.

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6