How to correctly use bedtools merge for annotated .bed files?
2.2 years ago

Hello,

I have two annotated .bed files that each contain 26 columns: the first 3 columns are the standard chromosome, start position, and end position, while the remaining columns contain additional information.

I want to merge these two annotated .bed files while retaining the information in columns 4-26. Specifically, even if the values in the first 3 columns of two rows are the same, if any value in the subsequent columns differs, I want these rows to be preserved as two separate rows. I've tried running commands like

cat BINDetectv3//beds/_Ctr_bound.bed | bedtools sort | bedtools merge -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26 -o distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct -s > compiled_homer_ctr_bound.bed'

but I keep getting a merged .bed file where it seems like, if multiple rows share the same chr, chr_start, and chr_end positions, they merge into one row, leaving me with the additional columns for the merged row containing multiple values separated by commas.

How can I avoid this?

Any help is massively appreciated (I apologize in advance if this is a very simplistic question)!

bedtools linux bed • 1.7k views

If you can have multiple lines for overlapping regions, why merge in the first place? Is your goal just to get rid of duplicate lines? Or do you need some special kind of region merging?


I'm not sure if this helps with context, but the annotated .bed files contain information on TF footprints as the first three columns and then the additional columns contain the TF symbol, info on the MACS2 peak within which these TFs are falling, etc. So I think the reason I might have potentially multiple rows with the same first three values (chr #, chr_start, chr_end positions) is that there are some TFs with very similar footprint sequences, so multiple TFs are being associated with a single footprint.

I want a single .bed file that I can use for downstream analysis, such as visualizing where footprints fall (e.g. in the IGV browser) or later steps of the TOBIAS pipeline, which I am following to analyze my ATAC-seq results.

I hope this makes sense, but please let me know if there's any other info I can provide to clarify/if I am going about thinking about this in the wrong way.
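One quick way to check whether several rows really do share the same coordinates is to count rows per (chr, start, end) triple. This is only a sketch on a made-up toy file: `footprints.bed`, its 4 columns, and the output name `shared_coords.txt` are all assumptions for illustration, not the real 26-column data.

```shell
# Hypothetical toy file: 4 tab-separated columns (chrom, start, end, TF symbol).
printf 'chr1\t100\t120\tCEBPA\nchr1\t100\t120\tCEBPE\nchr1\t300\t320\tSPI1\n' > footprints.bed

# Count rows per (chrom, start, end) and keep only coordinates that occur more than once.
cut -f1-3 footprints.bed | sort | uniq -c | awk '$1 > 1' > shared_coords.txt

cat shared_coords.txt
```

Each line of the output is a count followed by the coordinates that more than one row shares, which would be consistent with several TFs being assigned to a single footprint.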


I don't think merging makes sense here. (By "merging" I mean actually merging the regions together; to keep the two apart, I'll call combining files "concatenating".)

You can concatenate your footprints if you'd like.

cat BINDetectv3//beds/_Ctr_bound.bed | bedtools sort > compiled_homer_ctr_bound.bed
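If you also want to drop rows that are exact duplicates (identical in every column) while keeping rows that share coordinates but differ in annotation, the same concatenate-and-sort step can be sketched with standard coreutils. The file names `a.bed` and `b.bed` and their contents are made up for illustration (4 columns instead of 26), so treat this as a sketch rather than a drop-in command.

```shell
# Hypothetical toy inputs: tab-separated, 4 columns instead of 26.
printf 'chr1\t100\t120\tCEBPA\nchr1\t300\t320\tSPI1\n' > a.bed
printf 'chr1\t100\t120\tCEBPE\nchr1\t300\t320\tSPI1\nchr1\t50\t70\tCEBPA\n' > b.bed

# Concatenate, sort by chromosome, then numeric start and end.
# Ties fall back to whole-line comparison, so identical rows end up adjacent
# and uniq removes only rows that match in every column.
cat a.bed b.bed | sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n | uniq > combined.bed

cat combined.bed
```

Note that `sort -u` combined with `-k` would compare only the key columns and silently drop rows with the same coordinates but different annotations, which is why `uniq` is run as a separate step here.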

If anything, it sounds like the only regions you would actually want to merge are overlapping footprints of the same TF since these are likely to have the same annotation columns as well (if I understand, you want to keep different annotations on separate rows, so if you had CEBPA and CEBPE footprints that overlapped, you would still want them on two rows.)

If there really is that much overlap among the footprints of an individual TF, you could merge each TF's file before concatenating, but I am not sure this would do much, since I believe footprints of the same TF don't usually overlap each other. (I just tested this on my TOBIAS footprint file for CEBPA: 11175 TFBS before merging and 11169 TFBS after merging.)
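To get a feel for what a before/after count like that measures, here is a small awk stand-in for merging the intervals of a single TF. It is not bedtools itself: it only handles the three coordinate columns, and the file `cebpa.bed` and its contents are invented for the example. It treats intervals that overlap or touch end-to-start as one region, which I believe matches the default behaviour of `bedtools merge`.

```shell
# Hypothetical single-TF footprint file (chrom, start, end).
printf 'chr1\t100\t120\nchr1\t110\t130\nchr1\t300\t320\nchr2\t100\t120\n' > cebpa.bed

# Sort by chromosome and start, then fold each interval into the previous
# one when it starts at or before the previous end on the same chromosome.
sort -t "$(printf '\t')" -k1,1 -k2,2n cebpa.bed |
awk 'BEGIN { FS = OFS = "\t" }
     NR > 1 && $1 == cur_chrom && $2 + 0 <= cur_end {
         if ($3 + 0 > cur_end) cur_end = $3 + 0   # extend the current region
         next
     }
     NR > 1 { print cur_chrom, cur_start, cur_end }  # previous region is finished
     { cur_chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0 }
     END { if (NR > 0) print cur_chrom, cur_start, cur_end }' > cebpa.merged.bed

echo "before: $(grep -c . cebpa.bed)  after: $(grep -c . cebpa.merged.bed)"
```

If the two counts are nearly identical on your real per-TF files, as they were for mine, merging before concatenating is unlikely to change anything meaningful.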


Thank you so much! Earlier, since I was pretty unsure of how to use bedtools correctly, I ended up concatenating the files (like you showed above) and then opened the file in Excel to manually sort before reconverting back to a .bed file, but your command is certainly much simpler!
