Why do peak shifts occur?
6.2 years ago

Why are the positive- and negative-strand tags, after tag generation, shifted so that the peak falls in the middle of the two tag locations, as in the image?

Peak Calling • 4.0k views

Basically, why do the tags extend in the direction of the other polarity tags? And why do they even have to be shifted there?


It is not clear what you are asking...

In the protocol, the regions of DNA where the protein of interest has bound will be 'cut' [excised] and then sequenced - both the coding and non-coding strands are sequenced, and both from their respective 5' end.

When we align these reads back to the genome, we can determine the strand from which each read originated [i.e. coding or non-coding]. When an aligner looks at a read, it is aware that either the read or its reverse-complement may align, and from which of the two aligns we can infer the strand of origin.
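To make the strand inference concrete, here is a toy sketch. It uses exact substring matching against a short made-up reference, which is only an illustration: real aligners use indexed search and tolerate mismatches. The names `reverse_complement` and `infer_strand` and the sequences are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def infer_strand(read: str, reference: str):
    """Toy strand call: '+' if the read matches the reference as-is,
    '-' if only its reverse complement matches, None if neither does."""
    if read in reference:
        return "+"
    if reverse_complement(read) in reference:
        return "-"
    return None


# Made-up reference and reads, for illustration only.
reference = "GATTACAGGCTCAAT"
print(infer_strand("TACAGG", reference))  # read matches directly
print(infer_strand("TGAGCC", reference))  # only the reverse complement matches
```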

For peak merging, the algorithms will look at metrics such as peak height, peak width, peak density, etc. before deciding if 2 peaks relate to the same original protein contact point.

6.2 years ago

You should look at the image a bit more patiently, it contains the answer to your question (if I understood your minimalist question correctly). You should keep in mind that this issue stems from the time when reads were still around 36 bp, not 100 bp as they are today.

We want to know where the yellow bubble has bound to the DNA. We enrich for the DNA-bubble-complex and digest away the bubble. What we're left with is pieces of double-stranded DNA where the bubble was bound. For the sake of simplicity, we can imagine that the bubble had bound exactly in the middle of the fragment, just as it is depicted above.

Now, the DNA fragment that we enriched was longer than 36 bp, say, 500 bp. The long enriched fragments are broken up into smaller pieces (say, 200 bp), which are still longer than what we could sequence with 36 bp reads. So all we were going to see in the raw data were the 36 bp at the 5' ends of those 200 bp fragments. Since the original enriched piece of DNA was double-stranded, we will have fragments from both the forward and the reverse strand. As the image above nicely shows, the region where the yellow bubble was will be in the middle between those ends. If you took the pile-up of those 36 bp tags at face value, you would see the strongest enrichments _around_ the yellow bubble, not at its actual binding site.
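A small simulation can illustrate why the raw pile-up misses the binding site. This is a sketch under simplifying assumptions: every fragment is exactly 200 bp, the site sits near the fragment center with a little jitter, and each fragment yields one 36 bp tag from each strand. The coordinates and the names `tag_intervals`, `pileup` and `mean_position` are made up for illustration.

```python
from collections import Counter

READ_LEN = 36    # tag length, as in the explanation above
FRAG_LEN = 200   # assumed size of the sequenced fragments
SITE = 1000      # hypothetical binding-site coordinate


def tag_intervals(frag_start):
    """Half-open genomic intervals covered by the forward-strand and
    reverse-strand tags read from the two 5' ends of one fragment."""
    frag_end = frag_start + FRAG_LEN
    forward = (frag_start, frag_start + READ_LEN)
    reverse = (frag_end - READ_LEN, frag_end)
    return forward, reverse


def pileup(intervals):
    """Per-base coverage for a list of half-open intervals."""
    coverage = Counter()
    for start, end in intervals:
        for pos in range(start, end):
            coverage[pos] += 1
    return coverage


def mean_position(coverage):
    """Coverage-weighted mean coordinate of a pile-up."""
    return sum(p * c for p, c in coverage.items()) / sum(coverage.values())


# Fragments roughly centered on the site, with +/- 20 bp of jitter.
frag_starts = [SITE - FRAG_LEN // 2 + jitter for jitter in range(-20, 21)]
fwd_cov = pileup([tag_intervals(s)[0] for s in frag_starts])
rev_cov = pileup([tag_intervals(s)[1] for s in frag_starts])

print("forward tags centered near", mean_position(fwd_cov))
print("reverse tags centered near", mean_position(rev_cov))
print("tag coverage at the site:", fwd_cov[SITE] + rev_cov[SITE])
```

In this toy setup the forward tags pile up upstream of the site, the reverse tags downstream, and the site itself gets no tag coverage at all, which is the bimodal picture described above.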

In order to pin down the location of the yellow bubble, the ends of the forward-strand reads and the ends of the reverse-strand reads were therefore shifted towards each other, assuming an average fragment size, e.g. 200 bp. This was just meant to sharpen the signal because without the shift, the signal would be artificially broadened (i.e., it would include the fringes of the original fragment and it would be somewhat bimodal, with the valley in between two peaks actually corresponding to the region that's more likely to contain the actual binding site).
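The shift itself can be sketched in a few lines. This assumes each tag's 5' end is moved by half of an assumed fragment size towards the opposite strand's tags; the 200 bp value, the coordinates and the name `shift_tag` are illustrative, and actual peak callers may estimate the fragment size from the data rather than take it as given.

```python
def shift_tag(five_prime_pos, strand, fragment_size=200):
    """Move a tag's 5' end towards the presumed fragment center.

    Forward-strand tags move downstream (+), reverse-strand tags move
    upstream (-), each by half of the assumed fragment size.
    """
    half = fragment_size // 2
    if strand == "+":
        return five_prime_pos + half
    if strand == "-":
        return five_prime_pos - half
    raise ValueError(f"unknown strand: {strand!r}")


# One hypothetical 200 bp fragment spanning positions 900..1099:
# the forward tag starts at 900, the reverse tag's 5' end is at 1099.
print(shift_tag(900, "+"))   # 1000
print(shift_tag(1099, "-"))  # 999
```

After the shift, both tags land at (nearly) the same coordinate in the middle of the fragment, so forward and reverse signal reinforce each other instead of forming two separate humps.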


Thanks for the Explanation!


you're welcome! glad to see it may have helped.


Is this shifting behaviour of the reads valid for RNA-seq as it is for ChIP-seq?


only if you seek to find binding sites of proteins on transcripts using antibody-based enrichment of your factor of interest as it is binding to RNA. generally, it would be imprecise to say that the reads are shifting -- the reads are always going to represent the ends of the (c)DNA fragments that were put onto the flow cell. the only reason reads used to be shifted computationally for ChIP-seq analysis was that we were interested in the information that was _not_ being captured, i.e. those parts of the fragments where the protein had bound, which often tended to be in the center of those fragments.

for typical RNA-seq, the main information of interest is usually just the abundance of transcripts, which can be determined based on the sequences that we do capture.
