Question: Combine .bed files and keep columns / headers
0
gravatar for fr
17 days ago by
fr100
fr100 wrote:

I'm combining multiple .bed files. For instance:

#file1
chr1    1   2    2
chr1    10  11    3
chr1    50  51   4

#file2
chr1    1   2   10
chr1    10  11   8
chr10   2   3   8

#file3
chr1    1   2   1
chr1    50  51   2
chr10   2   3   9

All files have the same structure for the 4 columns: chr start end filename

I want to combine all of them (have a lot of them!) such that the column names are kept for columns 4 and higher.

chrm start end file1 file2 file3
chr1    1   2   2    10    1
chr1    10  11  3    8      NA
chr1    50  51  4    NA   2   
chr10   2   3   NA  2      9

I have tried this approach but the end result doesn't have column names and doesn't show the multiple matching columns (i.e. output only has 4 columns, with everything concatenated vertically).

I have also tried this approach which seems to work a bit better, but doesn't have column names either - and for a large number of files, it becomes very hard to assign them. It also seems that it fails for chr 10 - 22 as indicated here.

Could someone please help me get the last output? I would really appreciate it!

Thanks a lot in advance

bedmap bedops bedtools • 134 views
ADD COMMENTlink modified 17 days ago by ATpoint17k • written 17 days ago by fr100
2
gravatar for Alex Reynolds
17 days ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

You could use the second part of my answer to get NaN values, and just cat a header at the end with awk or some other scripted approach.

ADD COMMENTlink written 17 days ago by Alex Reynolds28k

Thanks. I was able to get it working with dummy data above, but not with my own data. I hope to get there soon! Thanks a lot

ADD REPLYlink modified 17 days ago • written 17 days ago by fr100
1

Perhaps compare the structure and content of the dummy data with the structure and content of a small snippet (head) of your own data. This should help guide you in preparing your own data or modifying the commands to use your data as-is.

ADD REPLYlink written 17 days ago by Alex Reynolds28k

Indeed. Thanks a lot, both your post and @ATpoint's below were quite useful. I got it working but for my purposes I think --partition was better than merge

ADD REPLYlink written 17 days ago by fr100
2
gravatar for ATpoint
17 days ago by
ATpoint17k
Germany
ATpoint17k wrote:

I think this two-step process should do it:

bedtools multiinter -empty -i <all*bed*files>.bed -g chromSizes.txt \
  | awk 'OFS="\t" {if($5 == "none") next} {print}' \
  | tee >(cut -f1-3 > coords.bed) \
  | cut -f6- \
  | awk 'OFS="\t" {gsub("^0$","NA");print}' > values.txt

paste coords.bed values.txt > output.txt

Essentially, it exploits bedtools multiinter's -empty function to output null for intervals that are empty in some files. As this is only possibe together with a genome file, one simply awks out all intervals empty in any file. One then separates the coordinates from the rest to be able to 1) exclude unwanted output from multiinter and 2) to be able to do a simply gsub replacing zeros with NA. Eventually, paste the two parts together.

ADD COMMENTlink modified 17 days ago • written 17 days ago by ATpoint17k

Thanks so much @ATpoint, I'm almost there. However, I get only 1 and NA in output.txt, and don't know how to intersect it with the initial input files. Also, is the order of columns 4 and so on the same as that specified in -i [file1.bed, file2.bed, file3.bed]? It doesn't seem that the output is sorted, so I just want to make sure

ADD REPLYlink written 17 days ago by fr100

With your dummy data I get exactly the output you gave as the desired one.

ADD REPLYlink written 17 days ago by ATpoint17k

I really don't understand why, but now I can't even get the NA. I have used \t as delimiters in my input, and now I'm getting

#cat output.txt
chr1    1   2   1   1   1
chr1    2   5   1   0   0
chr1    10  11  1   1   0
chr1    50  51  1   1   1
chr10   2   3   0   0   1
ADD REPLYlink written 17 days ago by fr100
1

chr1 2 5 1 0 0 this line does not make sense as it is not in your dummy input. Maybe you have left-over files from previous attempts to run the code snippet, make sure you clean up properly.

ADD REPLYlink written 17 days ago by ATpoint17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 897 users visited in the last hour