FRIP score ATAC-seq
2
1
Entering edit mode
5.6 years ago
Lucy ▴ 140

Hi,

I am trying to calculate the fraction of reads in peaks (FRIP score) for my ATAC-seq samples. What I have so far is written below, however I think it is wrong somehow as I am getting a very low score (when I calculate the FRIP using DiffBind, it comes out much better/more like the expected value e.g. 0.37).

val1=$(bedtools intersect -a <sample_BAM> -b <BED_peakfile> -wa -u | wc -l)
val2=$(samtools view <sample_BAM> | wc -l)
FRIP=$(awk "BEGIN {print "${val1}"/"${val2}"}")

For the peak file, I am using my optimal peak list for the particular cell type (generated using IDR).

Can anyone see what I am doing wrong?

Thanks,

Lucy

ATAC-seq FRIP • 14k views
ADD COMMENT
0
Entering edit mode

Hi Lucy,

for the future, please use the formatting bar to indicate code. This improves readability. I did it for you this time.

Code

ADD REPLY
11
Entering edit mode
5.6 years ago
ATpoint 82k

The problem with your solution is that bedtools intersect uses only the read length, probably 50 or 75bp, to do the intersection, rather than the actual fragment length, which for ATAC-seq should be between 80 and a few hundred bp. The same goes for all bedtools commands as far as I know.

I would use featureCounts from the subread package. It is actually made to count reads per reference regions in order to make count matrices but it also outputs the percentage of reads/fragments assigned to it, which is what you want:

## Make a custom "SAF" file which featureCounts needs:
awk 'OFS="\t" {print $1"-"$2+1"-"$3, $2+1, $3, "+"}' foo.narrowPeak > foo.saf

## run featureCounts (add -T for multithreading)
featureCounts -p -a peaks.saf -F SAF -o out.txt atacseq.bam

This will produce an output to screen with the % of assigned reads (=FRiP)and also a file out.txt.summary, from which you can programatically calculate it for a lot of samples.

FRiP

ADD COMMENT
2
Entering edit mode

Just a small note in case someone has the same problem as I did.

I'm using the R implementation of the subread package for this. I had to modify the awk command so that it includes the chromosome as a column for it to work.

## Make a custom "SAF" file which featureCounts needs:
awk 'OFS="\t" {print $1"-"$2+1"-"$3, $1, $2+1, $3, "+"}' foo.narrowPeak > foo.saf

From the link above:

The SAF annotation format has five required columns, including GeneID, Chr, Start, End and Strand. These columns can be in any order. More columns can be included in the annotation. Columns are tab-delimited. Column names are case insensitive. GeneID column may contain integers or character strings. Chromosomal names included in the Chr column must match those used inclued in the mapping results, otherwise reads will fail to be assigned. Users may provide a SAF annotation in the form of a data frame or a file using the annot.ext argument.

ADD REPLY
0
Entering edit mode

Hello, I don't get your point about bedtools intersect. According to ENCODE, FRiP means "Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads." It's about "reads" rather than "fragment". So, why can not we use bedtools intersect?

ADD REPLY
0
Entering edit mode

In paired-end sequencing, we use the word fragment because the two reads that are produced always originate from the same DNA fragment and are therefore not independent of each other as reads from single-end sequencing would be. As FRiP comes from single-end ChIP-seq data, this is why they probably termed it reads. ATAC-seq is most commonly paired-end. You can use BEDtools for paired-end data but it requires more pre-processing of your data, that is why I use featureCounts, being faster and more convinient with plenty of customizable options. Choice is still yours. FRiP is probably not a very objective measure anyway, as it highly depends on how you prefilter your data, e.g. in terms of mapping quality, the definition of a properly-paired reads and the stringency of your peak calling (last sentence thinking aloud).

ADD REPLY
0
Entering edit mode

I see. Thank you very much!

I processed my data as follows:

step1: align reads to the genome with bowtie2 (mm10, in my case).
step2: remove duplicate reads using Picard
step3: remove non-unique alignment (samtools view -q 30), mitochondrial chromosome, and keep only properly mapped reads (samtools flag 1804).
step4: call peaks with MACS2.

According to the latest ATAC-seq paper (Corces et al. 2017), library size, %mtDNA and enrichment of TSSs should be calculated using reads from step1. FRiP should be calculated using reads from step2. Is that true?

ADD REPLY
0
Entering edit mode

I do not know the paper in detail (or rather the methods section), but unlike mtDNA content, I would calculate everything from step3.

ADD REPLY
0
Entering edit mode

Got it. Thank you :)

ADD REPLY
0
Entering edit mode

Great thank you - this worked well.

ADD REPLY
0
Entering edit mode

Helloļ¼Œthanks for the code.

I found the results from featureCoutns and intersectBed were close. The bam used here was the final bam for peak calling (same as the ENCODE pipeline).

samtools view -c ${sample}.bam
38439210

bedtools sort -i ${sample}_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -u -nonamecheck -a ${sample}.bam -b stdin -ubam | samtools view -c
428359

featureCounts -p -a M0.A.005_peaks.saf -F SAF -o readCountInPeaks.txt ${sample}.bam
||    Total alignments : 19219605                                             ||
||    Successfully assigned alignments : 230826 (1.2%)                        ||

> 428359/38439210
[1] 0.0111438
> 230826/19219605
[1] 0.01200992

featureCounts counts the number of fragments, while bedtools counts the number of reads. But the number of reads is twice as big as the number of fragments (only considering properly mapped reads).

ADD REPLY
2
Entering edit mode
5.6 years ago
chaton ▴ 20

The problem is mainly that you used bedtools intersect the wrong way. If you want to calculate the number of reads covering you bed file, then your reference (file after -a) is the bed file and the query (file after -b) is your ordered bam file. You can also use the -c option and specify a sorted order via a genome reference file (see the help page of the bedtools suite). To count your total number of reads, there is also a -c option in samtools view... I suggest you to proceed like the following:

1) calculate your bed file coverage:

$ sort -k1V -k2,2n bedfile.bed
$ bedtools intersect -a bedfile.bed -b bamfile.bam -c -sorted -g ref.genome | awk '{i+=$n}END{print i}'

where $n is the column where you have your number of reads per bed region (column added by bedtools intersect, depends on the number of columns in your starting bed file), and i will give you the total number of reads covering your enriched regions. Then you just have to divide this value by the total number of reads of your bam file.

2) get your bam file size (nb of reads):

$ samtools view -c bamfile.bam

A little less automatic than with subread maybe but at least you don't need to install anything new... Good luck!

ADD COMMENT
0
Entering edit mode

Good catch with the switched -a/b, didn't even notice that :)

ADD REPLY

Login before adding your answer.

Traffic: 2971 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6