Coverage For Pair-End Rna-Seq, Extend Reads Or Not?
2
2
Entering edit mode
8.1 years ago

HI, all

I have a question when computing region coverage of pair-end RNASeq data.

As showed by the sketch map, when computing region coverage, whether I should use actually mapped reads or extended reads? The coverage for impacted region may be different. I wonder which one is reasonable.

I want to compute and compare coverage for small regions like 100 bp, not as large as a gene. Also, one of the compared sample is sequenced Pair-endly, the other is sequenced single-endly. Which way of computing is more suitable in this case?

I have a Rip-Seq and want to use this RNA-Seq as a control to call peaks. The genome would be binned into 100bp regions and coverage of each region will be computed for both Rip-Seq and RNA-Seq. A following fisher's test would be used to select significantly enriched regions of Rip-Seq.

The region is defined. "How many fragments were sequenced from this region?" is what I want to ask . If one of my defined regions happens to located in the internal region of two ends of a fragment, the coverage of it in RNA-Seq would be 0 if just mapped tags used. However, this region is surely covered by reads.

Thank you very much!

Reference                       ========--------------------------------------------------===========
Actually mapped reads            ^^^^                                                            
Extended reads                   ^^^^^^^                                                  $$Impacted region **** ******* Reference ====================================================================== Actually mapped reads ^^^^$$$$Extended reads ^^^^^^^^^^^^^^^^$$$Impacted region ************************************************************* #ignore the relative ratio of ^ and$

= represents exon bases; - represents intron bases
^ represents left-end reads; \$ represents right-end reads


Update: Add my usage for this kind of data to make question clear.

rna-seq paired-end • 4.4k views
4
Entering edit mode
8.1 years ago
Ryan Thompson ★ 3.5k

Since you explained in a comment that you are working in a peak-finding context, then what you care about is the size of the fragment from which a read or read pair was derived. For paired reads, getting the bounds of the fragment is trivial. For single-end reads, if you want to do the same, you must estimate the fragment size by finding how the peaks on opposite strands are shifted relative to each other (assuming you have non-strand-specific sequencing data). The process is described in the MACS paper and on this page about HOMER. This method was designed for ChIP-Seq data (DNA), but might work for RIP-seq as well. The fragment size represents the "footprint" size of the protein that you IP'ed. If your single-end data are strand-specific, then you have no way to estimate the footprint size, and you will have to guess. If the paired-end and single-end data are for the same IP, then you can use the mean fragment length from the paired end data as the fragment size for the single-end data.

Now, you can fill in the gaps between pairs in your paired-end data, and you can extend your single-end data to the mean fragment length (footprint size), and then your data will be comparable between single and paired-end.

0
Entering edit mode

Thans Ryan. Thanks for the nice explanation of ChIP-Seq peak calling process. My pair-end seq is input and single-end seq is IP result. The length of single-end reads is about 100 bp. The length of pair-end reads is 330 bp with 100 bp sequenced at each end. Is it ok to compare this two types of data?

0
Entering edit mode

Since your paired-end data is the input, you cannot use it to determine your protein's footprint. You'll need to use the peak-shift-size method on the single-end data if you can. Once you have an estimate of the protein's footprint size, probably the best approach is to assign each read a single base location at the estimated midpoint of the fragment that it represents. For example, if you estimate a footprint size of 150 bp, then you should take the 5-prime coordinate of each read and then shift it 75 bp (i.e. half of 150) in the 3-prime direction. Then whatever bin it lands in is the one you would count it in. But also keep splicing in mind. Since your reads are 100 bp long, you can just take the mapping coordinate of the 75th nucleotide in the read (for the example of a 150bp footprint size).

0
Entering edit mode

Thank you very much. I just confused about the shift part, even in analyzing ChIP-Seq data. I think the operation of shifting nucleotides based on the assumption that protein binds to middle of sequenced fragments. However, this can not always be true. So is there any explanation for the usage of shifting?

0
Entering edit mode

The assumption is not that the binding occurs in the middle of the fragments. It is that the binding occurs within the fragment. The most likely binding location is determined not from a single fragment, but from the superposition of multiple fragments.

0
Entering edit mode

The goal of the shifting is to find the midpoint of each sequenced fragment. The distribution of these midpoints should be centered on the binding site of the protein.

Some protocols have a nuclease step that digests any DNA that is not "protected" by being bound to a protein. In that case the estimated fragment size is actually representative of the protein's footprint. For example, micrococcal nuclease digests chromatin into individual nucleosomes, so the fragment length is the length of DNA wrapped around one histone. However, if your protocol does not include such a step (probably most likely for transcription factor ChIP), the fragment midpoints are still unbiased estimators of the binding site location, though the fragment size will be larger than the protein's footprint size.

2
Entering edit mode
8.1 years ago

Generally, I don't think you'll need to extend the reads. For differential expression analysis, ultimately, the number of reads is what is important, not the number of covered bases. For variant calling/finding, having the actual number of covered bases is what is important. In both cases, extending the reads is not involved in the counting process.

0
Entering edit mode

Thanks, Sean. I have two types of RNA-Seq, one is pair-end, the other is single-end. I want to compare the coverage difference at small regions like 100bp. Would this be a problem then?

0
Entering edit mode

Exactly what question are you trying to answer? If the region is pre-defined, you can ask the question, "How many fragments were sequenced from this region?" which makes the question of "filling in" coverage between paired reads moot.

0
Entering edit mode

I have a Rip-Seq and want to use this RNA-Seq as a control to call peaks. The genome would be binned into 100bp regions and coverage of each region will be computed for both Rip-Seq and RNA-Seq. A following fisher's test would be used to select significantly enriched regions of Rip-Seq.

The region is defined. The question you quoted is what I want. If one of my defined regions happens to located in the internal region of two ends of a fragment, the coverage of it in RNA-Seq would be 0 if just mapped tags used. However, this region is surely covered by reads.

0
Entering edit mode

You can ignore the read length in your situtation, I think. You simply need to count molecules, not bases covered.

0
Entering edit mode

Yes, Sean. The things I am counting is the number of reads. For bins located in "impacted regions" (showed in schematic), their true number of reads is 1 if internal regions are filled. If I do not fill internal regions, the number of reads there would be 0. Is that right?

0
Entering edit mode

You should count each read/fragment only once (only one bin). How you do that is up to you.

0
Entering edit mode

Thanks, Sean. I get it. Thank you for the explaination for the key points in analyzing data for different usage.