Question: FRIP score ATAC-seq
1
gravatar for Lucy
15 months ago by
Lucy40
Lucy40 wrote:

Hi,

I am trying to calculate the fraction of reads in peaks (FRIP score) for my ATAC-seq samples. What I have so far is written below, however I think it is wrong somehow as I am getting a very low score (when I calculate the FRIP using DiffBind, it comes out much better/more like the expected value e.g. 0.37).

val1=$(bedtools intersect -a <sample_BAM> -b <BED_peakfile> -wa -u | wc -l)
val2=$(samtools view <sample_BAM> | wc -l)
FRIP=$(awk "BEGIN {print "${val1}"/"${val2}"}")

For the peak file, I am using my optimal peak list for the particular cell type (generated using IDR).

Can anyone see what I am doing wrong?

Thanks,

Lucy

atac-seq frip • 2.4k views
ADD COMMENTlink modified 15 months ago by chaton20 • written 15 months ago by Lucy40

Hi Lucy,

for the future, please use the formatting bar to indicate code. This improves readability. I did it for you this time.

Code

ADD REPLYlink modified 15 months ago • written 15 months ago by ATpoint26k
3
gravatar for ATpoint
15 months ago by
ATpoint26k
Germany
ATpoint26k wrote:

The problem with your solution is that bedtools intersect uses only the read length, probably 50 or 75bp, to do the intersection, rather than the actual fragment length, which for ATAC-seq should be between 80 and a few hundred bp. The same goes for all bedtools commands as far as I know.

I would use featureCounts from the subread package. It is actually made to count reads per reference regions in order to make count matrices but it also outputs the percentage of reads/fragments assigned to it, which is what you want:

## Make a custom "SAF" file which featureCounts needs:
awk 'OFS="\t" {print $1"-"$2+1"-"$3, $2+1, $3, "+"}' foo.narrowPeak > foo.saf

## run featureCounts (add -T for multithreading)
featureCounts -p -a peaks.saf -F SAF -o out.txt atacseq.bam

This will produce an output to screen with the % of assigned reads (=FRiP)and also a file out.txt.summary, from which you can programatically calculate it for a lot of samples.

FRiP

ADD COMMENTlink modified 15 months ago • written 15 months ago by ATpoint26k

Hello, I don't get your point about bedtools intersect. According to ENCODE, FRiP means "Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads." It's about "reads" rather than "fragment". So, why can not we use bedtools intersect?

ADD REPLYlink written 11 months ago by niuyw30

In paired-end sequencing, we use the word fragment because the two reads that are produced always originate from the same DNA fragment and are therefore not independent of each other as reads from single-end sequencing would be. As FRiP comes from single-end ChIP-seq data, this is why they probably termed it reads. ATAC-seq is most commonly paired-end. You can use BEDtools for paired-end data but it requires more pre-processing of your data, that is why I use featureCounts, being faster and more convinient with plenty of customizable options. Choice is still yours. FRiP is probably not a very objective measure anyway, as it highly depends on how you prefilter your data, e.g. in terms of mapping quality, the definition of a properly-paired reads and the stringency of your peak calling (last sentence thinking aloud).

ADD REPLYlink modified 11 months ago • written 11 months ago by ATpoint26k

I see. Thank you very much!

I processed my data as follows:

step1: align reads to the genome with bowtie2 (mm10, in my case).
step2: remove duplicate reads using Picard
step3: remove non-unique alignment (samtools view -q 30), mitochondrial chromosome, and keep only properly mapped reads (samtools flag 1804).
step4: call peaks with MACS2.

According to the latest ATAC-seq paper (Corces et al. 2017), library size, %mtDNA and enrichment of TSSs should be calculated using reads from step1. FRiP should be calculated using reads from step2. Is that true?

ADD REPLYlink written 11 months ago by niuyw30

I do not know the paper in detail (or rather the methods section), but unlike mtDNA content, I would calculate everything from step3.

ADD REPLYlink written 11 months ago by ATpoint26k

Got it. Thank you :)

ADD REPLYlink written 11 months ago by niuyw30

Great thank you - this worked well.

ADD REPLYlink written 11 months ago by Lucy40

Helloļ¼Œthanks for the code.

I found the results from featureCoutns and intersectBed were close. The bam used here was the final bam for peak calling (same as the ENCODE pipeline).

samtools view -c ${sample}.bam
38439210

bedtools sort -i ${sample}_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -u -nonamecheck -a ${sample}.bam -b stdin -ubam | samtools view -c
428359

featureCounts -p -a M0.A.005_peaks.saf -F SAF -o readCountInPeaks.txt ${sample}.bam
||    Total alignments : 19219605                                             ||
||    Successfully assigned alignments : 230826 (1.2%)                        ||

> 428359/38439210
[1] 0.0111438
> 230826/19219605
[1] 0.01200992

featureCounts counts the number of fragments, while bedtools counts the number of reads. But the number of reads is twice as big as the number of fragments (only considering properly mapped reads).

ADD REPLYlink written 9 months ago by niuyw30
2
gravatar for chaton
15 months ago by
chaton20
chaton20 wrote:

The problem is mainly that you used bedtools intersect the wrong way. If you want to calculate the number of reads covering you bed file, then your reference (file after -a) is the bed file and the query (file after -b) is your ordered bam file. You can also use the -c option and specify a sorted order via a genome reference file (see the help page of the bedtools suite). To count your total number of reads, there is also a -c option in samtools view... I suggest you to proceed like the following:

1) calculate your bed file coverage:

$ sort -k1V -k2,2n bedfile.bed
$ bedtools intersect -a bedfile.bed -b bamfile.bam -c -sorted -g ref.genome | awk '{i+=$n}END{print i}'

where $n is the column where you have your number of reads per bed region (column added by bedtools intersect, depends on the number of columns in your starting bed file), and i will give you the total number of reads covering your enriched regions. Then you just have to divide this value by the total number of reads of your bam file.

2) get your bam file size (nb of reads):

$ samtools view -c bamfile.bam

A little less automatic than with subread maybe but at least you don't need to install anything new... Good luck!

ADD COMMENTlink written 15 months ago by chaton20

Good catch with the switched -a/b, didn't even notice that :)

ADD REPLYlink modified 15 months ago • written 15 months ago by ATpoint26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1626 users visited in the last hour