Using bedtools intersect on multiple bed files to retain peaks in 2 out of 3 files
3
0
Entering edit mode
4.5 years ago
camerond ▴ 190

I have 4 standard bed files, a large general file and three smaller files.

I'd like to run bedtools intersect to retain the peaks in the general file that overlap with peaks in at least two of the three smaller files keeping only the peak boundary limits of the general file peaks. This is the general command I'm looking at for this:

bedtools intersect -wa -a general.bed -b file1.bed file2.bed file3.bed

However, the flags for restricting the output based on the files passed to -b relate only to the proportion of overlap in base pairs between the files (i.e. using -F -f -r etc.). I don't see a way to restrict the general file based on overlaps with peaks contained in a set number of files sent to -b - I hope this makes sense.

I have considered creating a consensus peak set of the smaller files first using Diffbind retaining peaks contained in 2/3 files, then intersecting this consensus peak file with the general file, but I fear this method may drop/miss peaks that I'd like to retain.

Any suggestions regarding whether I can do this using bedtools, or an alternative method would be greatly appreciated.

bed bedtools • 3.6k views
ADD COMMENT
2
Entering edit mode
4.5 years ago

It can be done in pairwise manner no?

g: general

f: file

  • g-f1 overlaps, then resulting file overlaps with f2 --> overlaps1
  • g-f1 overlaps, then resulting file overlaps with f3 --> overlaps2
  • g-f2 overlaps, then resulting file overlaps with f3 --> overlaps3

Then take the unique peaks present after merging the three resulting files: overlaps1/2/3.

ADD COMMENT
0
Entering edit mode

@benformatics Yes - I was hoping there was something streamlined for this but perhaps it doesn't exist. Many thanks for the suggestion.

ADD REPLY
0
Entering edit mode
4.5 years ago
camerond ▴ 190

As suggested above by @benformatics I used the following in Snakemake:

rule bedtools_intersect_pairwise_method:
    input:
        general="general_file.bed",
        rep1="file1.bed"
        rep2="file2.bed"
        rep3="file3.bed"
    output:
        "FINAL_pairwise.bed"
    shell:
        """
        bedtools intersect -wa -a {input.general} -b {input.rep1} > gen_rep1_ovrlps.bed;
        bedtools intersect -wa -a {input.general} -b {input.rep2} > gen_rep2_ovrlps.bed;
        bedtools intersect -wa -a gen_rep1_ovrlps.bed -b {input.rep2} > dual_ovrlps_1.bed;
        bedtools intersect -wa -a gen_rep1_ovrlps.bed -b {input.rep3} > dual_ovrlps_2.bed;
        bedtools intersect -wa -a gen_rep2_ovrlps.bed -b {input.rep3} > dual_ovrlps_3.bed;
        cat dual_ovrlps_1.bed dual_ovrlps_2.bed dual_ovrlps_3.bed > all_ovrlps.bed;
        sort-bed all_ovrlps.bed > all_ovrlps.srtd.bed;
        bedtools merge -i all_ovrlps.srtd.bed > {output};
        rm *ovrlps*;
        """
ADD COMMENT
0
Entering edit mode
4.0 years ago
jrmerritt6 • 0

Use bedtools intersect. Here I'm using the narrowPeak files from MACS2, but this can be adapted to any file that has chr, st, end as the first three columns.

Concatenate narrowPeak files, coordinate sort, then merge peaks within 10 bp

cat ${SampleName}_1_peaks.narrowPeak ${SampleName}_2_peaks.narrowPeak ${SampleName}_1_peaks.narrowPeak > ${SampleName}_123
sort -k1,1 -k2,2n -k3,3n ${SampleName}_123 > ${SampleName}_sort
cat ${SampleName}_sort | awk '{print $1"\t"$2"\t"$3"\t"$7}' > ${SampleName}_sort.bed
bedtools merge -d 10 -c 4 -o mean -i ${SampleName}_sort.bed > ${SampleName}_mergedpeaks.bed

Write out bed file with showing the number of replicates that support peaks in summary file ${SampleName}_mergedpeaks.bed

bedtools intersect -wa -c \ 
    -a ${SampleName}_mergedpeaks.bed \
    -b ${SampleName}_1_peaks.narrowPeak ${SampleName}_2_peaks.narrowPeak ${SampleName}_3_peaks.narrowPeak \
    -sorted \
    -F 1.0 > ${SampleName}_mergedpeak_replicates.bed

To double check your results, you could also write out all of the peaks in each file that overlap with the merged peak files:

bedtools intersect -wa -wb \
    -a ${SampleName}_mergedpeaks.bed \
    -b ${SampleName}_1_peaks.narrowPeak ${SampleName}_2_peaks.narrowPeak ${SampleName}_3_peaks.narrowPeak \
    -sorted \
    -filenames \
    -F 1.0 > ${SampleName}_mergedpeak_verbose_replicates.bed

Filter ${SampleName}_mergedpeak_replicates.bed file for peaks that have greater than or equal to 2 samples, sort, then remove the column with the number of replicates:

awk '$5 >= 2 {print}' ${SampleName}_mergedpeak_replicates.bed > ${SampleName}_peak_replicates_filter.bed
sort -k1,1 -k2,2n -k3,3n ${SampleName}_peak_replicates_filter.bed > ${SampleName}_peak_replicates_sort.bed
cat ${SampleName}_peak_replicates_sort.bed | awk '{print $1"\t"$2"\t"$3}' > ${SampleName}_peak_replicates.bed
ADD COMMENT

Login before adding your answer.

Traffic: 3019 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6