Question: How to analyze ChIP-Seq data to determine whether peaks are found on Exons, Introns, or Exon-Intron Junctions?
3.9 years ago by
System170
United States
System170 wrote:

Hello everyone!

I'm relatively new to bioinformatics, so please bear with me. A Ph.D. student in my lab has asked me to analyze ChIP-Seq data and determine whether peaks fall into the exon, intron, or exon-intron junction categories. The gene she is looking at is TNFAIP3 and the mark she wants me to analyze is Pol II. I am using the hg19 human genome assembly.

Unsure how to start, I am thinking of doing the following:

1) Download BED files for TNFAIP3 and get individual files for exons and introns.

2) Annotate the peaks in the BED files with HOMER, using a Python script that is already written.

After this I am unsure how to continue. Am I done once I have these annotations in a txt format that can be opened in Excel? Her second question revolves around analyzing peaks for common sequence motifs, so how would I prepare this information to continue into that?

ADD COMMENTlink modified 3.9 years ago by Joseph Pearson450 • written 3.9 years ago by System170

What is the format of the ChIP-seq data you are talking about? Is it raw reads, mapped reads, or something else?

ADD REPLYlink written 3.9 years ago by Carlo Yague4.6k

I'm not entirely sure how to answer your question, since I had no hand in creating the ChIP-Seq data. I have multiple file formats: for peaks I have a .fastq.gz file, but I also have a .narrowPeaks file (similar to BED format) for each individual protein she wants me to look at. With a bit of googling, I am under the impression that raw reads = fastq format and mapped reads = BED format.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by System170

The .narrowPeaks format is most likely a BED-format file with one or more extra columns indicating scores (heights, statistical significance) for each peak. Maybe from MACS2?

https://github.com/taoliu/MACS/
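
For reference, here is a minimal Python sketch of reading one line of such a file. It assumes the standard ENCODE narrowPeak layout (BED6+4: chrom, start, end, name, score, strand, signalValue, pValue, qValue, summit offset), which is what MACS2 writes. The `NarrowPeak` class, the function name, and the example line are made up for illustration, not taken from your data.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int          # 0-based, as in BED
    end: int            # end-exclusive, as in BED
    name: str
    score: int
    strand: str
    signal: float       # signalValue column
    p_value: float      # stored as -log10(p) in the file
    q_value: float      # stored as -log10(q) in the file
    summit_offset: int  # offset of the summit from start, -1 if not called


def parse_narrowpeak_line(line: str) -> NarrowPeak:
    """Split one tab-delimited narrowPeak line into typed fields."""
    f = line.rstrip("\n").split("\t")
    return NarrowPeak(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        float(f[6]), float(f[7]), float(f[8]), int(f[9]),
    )


# Hypothetical example line (coordinates are invented)
peak = parse_narrowpeak_line("chr6\t1000\t1400\tpeak_1\t250\t.\t12.5\t30.1\t27.4\t120\n")
print(peak.start + peak.summit_offset)  # absolute summit position: 1120
```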

 

ADD REPLYlink written 3.9 years ago by Joseph Pearson450

Also, if you're looking at a single gene, it might be informative to look at the aligned ChIP-seq file on a genome browser.

You'll have to align your fastq file; then you can generate a .wig, .bedgraph, or similar file from your .bam file.

ADD REPLYlink written 3.9 years ago by Joseph Pearson450

Yes, this is correct. I put together a script that uses MACS2 to call peaks from a .sam file, and that's how I generated the .narrowPeak file. I have also already generated a .bigwig file so that I can view my data in IGV. Attached is the data that I'm currently looking at ... However, she wants a database that tells her where each peak is located (whether it is at an exon, intron, or exon-intron junction), and I do not know how to categorize these peaks and load them into a database for her.

This is an image we generated for a previous paper. She eventually wants to look at these enrichment regions and see if there are any common sequence motifs, which I will probably search for using MEME or DREME once I figure out how to actually use the programs. The way she wrote down what she wanted me to do was this: "Generate the site-database of exons, exon/intron junctions or introns for each individual protein within intragenic enrichment". Maybe that will help.
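
On preparing peaks for the motif step: one common approach is to trim each peak to a fixed window around its summit before extracting sequences. Below is a rough Python sketch of that trimming only. The function name and the 50 bp flank are assumptions for illustration, and it assumes narrowPeak-style coordinates where the summit is given as an offset from the peak start (-1 meaning no summit was called).

```python
from typing import Tuple


def summit_window(start: int, end: int, summit_offset: int, flank: int = 50) -> Tuple[int, int]:
    """Return a BED-style window of up to 2*flank bp centred on the peak summit.

    Falls back to the peak midpoint when no summit is recorded (offset -1).
    """
    if summit_offset >= 0:
        center = start + summit_offset
    else:
        center = (start + end) // 2
    # Clip at 0 so the window never runs off the start of the chromosome
    return max(0, center - flank), center + flank


# Invented coordinates, not real peaks
print(summit_window(1000, 1400, 120))  # (1070, 1170)
print(summit_window(1000, 1400, -1))   # (1150, 1250)
```

The resulting windows would still need to be converted to FASTA sequences (for example with `bedtools getfasta` against the hg19 genome) before being given to MEME or DREME.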

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by System170
3.9 years ago by
UNC Chapel Hill
Joseph Pearson450 wrote:

Since you have a NarrowPeak file of ChIP signal peaks, and you have aligned ChIP signal, you can use either file (with slight differences) to measure whether one feature (exon, intron, exon-intron boundaries, intron-exon boundaries, 5' vs 3' exons) tends to have a greater signal (normalized to feature length). Bedtools should be able to do this, once you've constructed your different feature files in BED format (or similar), and I'm sure there are more sophisticated solutions.
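
To make "normalized to feature length" concrete, here is a small Python sketch that reports how many base pairs of peak overlap each feature class per kilobase of that class. It is only an illustration of the idea: the function names and the coordinates are invented, all intervals are assumed to be on the same chromosome in BED-style 0-based half-open coordinates, and peaks (and features) are assumed not to overlap one another, otherwise bases get counted twice.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open as in BED


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def peak_bp_per_kb(peaks: List[Interval], features: List[Interval]) -> float:
    """Peak-covered bp per kb of feature, summed over all features."""
    total_len = sum(end - start for start, end in features)
    covered = sum(overlap_bp(p, f) for p in peaks for f in features)
    return 1000.0 * covered / total_len


# Toy example with invented coordinates
exons = [(100, 200), (300, 400)]
introns = [(200, 300)]
peaks = [(150, 250)]
print(peak_bp_per_kb(peaks, exons))    # 250.0
print(peak_bp_per_kb(peaks, introns))  # 500.0
```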

ADD COMMENTlink written 3.9 years ago by Joseph Pearson450

After some searching, this is what I am thinking of trying next, which is close to what you suggested.

1) Overlap the list of peaks with the list of exons / introns. Though I have no idea where I would get the list of exons / introns, possibly from UCSC?

2) Run intersectBed from bedtools using the following command: intersectBed -a peaks.bed -b introns.bed -wa -wb
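
The overlap logic behind these two steps can be sketched in plain Python for a single gene. This is an illustration under assumptions, not what bedtools does internally: the function name and the coordinates are invented, the exons are assumed to come from one transcript on one chromosome (BED-style 0-based half-open), and a peak is called a "junction" peak when it spans an exon boundary lying inside the gene.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open as in BED


def classify_peak(peak: Interval, exons: List[Interval]) -> str:
    """Label a peak as exon, intron, exon-intron junction, or intergenic."""
    exons = sorted(exons)
    gene_start, gene_end = exons[0][0], exons[-1][1]
    p_start, p_end = peak

    # Peak lies entirely outside the transcript
    if p_end <= gene_start or p_start >= gene_end:
        return "intergenic"

    # Exon boundaries inside the gene are the exon-intron junctions
    junctions = [b for s, e in exons for b in (s, e) if gene_start < b < gene_end]
    if any(p_start < b < p_end for b in junctions):
        return "exon-intron junction"

    # No junction crossed: the peak sits in an exon or in an intron
    if any(p_start < e and s < p_end for s, e in exons):
        return "exon"
    return "intron"


# Toy transcript with invented coordinates
exons = [(100, 200), (300, 400), (500, 600)]
print(classify_peak((120, 180), exons))  # exon
print(classify_peak((220, 280), exons))  # intron
print(classify_peak((180, 220), exons))  # exon-intron junction
print(classify_peak((50, 90), exons))    # intergenic
```

Broad peaks will often cross a junction simply because they are wide, so classifying the summit position instead of the whole peak may be worth considering.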

 

For now, I went ahead and annotated the protein peaks across the entire genome, and this seems to contain everything she is looking for ... except that it is obviously extremely messy.

 

PS: Sorry for being difficult heh.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by System170

The UCSC Table Browser has a number of powerful tools for extracting coordinates, sequences, etc. You can definitely get your list of exons (or introns) for your gene (or region, or genome) from the Table Browser.
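
If you only end up with an exon list, the introns can also be derived from it, since they are just the gaps between consecutive exons of a transcript. A minimal Python sketch, assuming the exons all belong to one transcript and use BED-style 0-based half-open coordinates (the function name and coordinates are made up):

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open as in BED


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of a single transcript."""
    exons = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if next_start > prev_end  # skip touching or overlapping exons
    ]


# Toy transcript with invented coordinates
print(introns_from_exons([(100, 200), (300, 400), (500, 600)]))
# [(200, 300), (400, 500)]
```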

ADD REPLYlink written 3.9 years ago by Joseph Pearson450
3.9 years ago by
Carlo Yague4.6k
Belgium
Carlo Yague4.6k wrote:

For you, probably the easiest way would be to load the narrowPeak file into a compatible genome browser (such as IGV). It will let you visualize the enrichment signal over the whole genome, and then you can zoom in on any gene you like. It's easy, and you can make a nice figure instead of just telling her where Pol II is (Pol II is probably everywhere, by the way).

PS: Always avoid using Excel when dealing with genome-wide data :)

ADD COMMENTlink written 3.9 years ago by Carlo Yague4.6k

I have already loaded the narrowpeak file into IGV. Here is the image I am currently looking at:

Obviously this gives some of the data she wants, but she wants a database of every single peak and whether it is located on an exon, intron, or exon-intron junction. Obviously Excel would be horrid for genome-wide data. I use MySQL for most of our databases, but she isn't very computer oriented, so I wanted to give her an easy-to-read Excel version, and if needed I will get her any other data she needs via MySQL searches.
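
For the hand-off itself, a tab-delimited table is one format that both Excel and a MySQL import can read. Here is a small Python sketch of writing one. The column names, the function name, and the example row are hypothetical, and the category column is assumed to have been filled in by whatever annotation step is used.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

Row = Tuple[str, int, int, str, str]  # chrom, start, end, peak name, category


def write_annotation_table(rows: Iterable[Row], handle: TextIO) -> None:
    """Write one header line plus one tab-delimited line per annotated peak."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["chrom", "start", "end", "peak_name", "category"])
    writer.writerows(rows)


# Invented example row, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_annotation_table([("chr6", 120, 180, "peak_1", "exon")], buffer)
print(buffer.getvalue())
```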

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by System170