Normalising ChIP-seq data to RPM using a 1 or 2 count for paired-end data?
3
1
Entering edit mode
6.3 years ago
James Ashmore ★ 3.1k

I want to normalise my paired-end ChIP-seq data (4 samples treated, 4 samples control) to reads per million (RPM). So far my calculation takes the number of reads mapped to a position, divides it by the total number of mapped reads in the sample, and then multiples this by one million. My question is, should the number of mapped paired-end reads be counted as 1 or 2 in the calculation? I believe Samtools reports the total number of mapped reads as half the actual number of paired-end reads which map?

My current pseudo script for normalising to RPM:

• mappedReads = samtools view -c -F 4 input.bam
• scalingFactor = 1000000 / mappedReads
• bedtools genomecov -ibam input.bam -bg - scale scalingFactor -g chrom.sizes > output.bedGraph
• bedGraphToBigWig input.bedGraph chrom.sizes output.bigWig
ChIP-Seq normalisation paired-end • 7.7k views
0
Entering edit mode

For paired end data, rpm=fpm (fragments per million). So, it should be counted as one.

0
Entering edit mode

Just to clarify, you are saying that each pair of reads should be counted as just one? So in a perfect world where 100% of my reads align, the number of total alignments would be half the total number of reads in the data?

0
Entering edit mode

Yup. Count a pair of reads as one. For single end reads each read represents a fragment it is sequenced from whereas for paired-end, a pair of sequence represents the fragment it is sequenced from. As simple as that.

8
Entering edit mode
6.3 years ago
Fidel ★ 2.0k

You can use deepTools to do the normalisation that you want. The output is a bigwig file. It is quite fast if you have multiple cores.

bamCoverage -b file.bam --normalizeUsingRPKM -o file.bw

The program first extends the read to match the paired-end length before computing the coverage. I opted to count paired end reads as 2 and not as 1 to avoid a bias when a read is not properly paired which could be a significant fraction.

0
Entering edit mode

Thank you for the suggestion, I used Bowtie2 to map my reads but indicated to remove discordant and mixed alignments, in this case counting each pair of reads as 1 would be preferred

0
Entering edit mode

One option is to filter your bam file to include only the first mate, then use it with bamCompare.  If all your reads are properly paired this should work.

0
Entering edit mode

I just pushed a commit to github enabling bamCoverage to select reads based on the SAM flag. Now you can do something like this to get what you want:

bamCoverage -b file.bam --normalizeUsingRPKM -o file.bw --samFlag 64

--samFlag 64 means to select only the first mate.

0
Entering edit mode
6.3 years ago
Manvendra Singh ★ 2.1k

I think each read should be counted as 1

or

you can try MAnorm for normalization if it suits for your analysis or comparisons

0
Entering edit mode

Initially I just want to create a coverage track for the normalised data.

0
Entering edit mode

Okay, I agree with Fidel

0
Entering edit mode
6.3 years ago
Ian 5.7k

In case you do not already know MACS2 will generate coverage plots for the non-redundant fragments used in the analysis (by default 1 unique fragment per strand) normalised to millions of reads using the '--SPMR' flag.

0
Entering edit mode

I think MACS does not use the paired-end data to extend the reads. Does it extend the reads?

0
Entering edit mode

MACS2 does not need to extend reads as it knows the start and end of the fragment by using the paired end reads information in the BAM file.

0
Entering edit mode

For this you need to use the BAMPE mode of MACS2 which I am not sure it properly works. We tried it and didn't work, also I see numerous complains in the mailing list. Moreover, the bam file has to be sorted by name. I always use MACS2 for peak calling but not coverage tracks.