Question

Filtering counts from BEDTOOLS depending on the overlapped size

0

Entering edit mode

9.1 years ago

Verónica Olmos ▴ 10

Hi everyone!

I'm working with some data related to mouse whole genome in Bedtools, but I'm having some problems with this.

I get two different outputs. One of them (head) looks like this:

chr1    0       1000    .       -1      -1      .       -1      .       .       .       .       0
chr1    1000    2000    .       -1      -1      .       -1      .       .       .       .       0
chr1    2000    3000    .       -1      -1      .       -1      .       .       .       .       0
chr1    3000    4000    .       -1      -1      .       -1      .       .       .       .       0
chr1    4000    5000    .       -1      -1      .       -1      .       .       .       .       0
chr1    5000    6000    .       -1      -1      .       -1      .       .       .       .       0
chr1    6000    7000    .       -1      -1      .       -1      .       .       .       .       0
chr1    7000    8000    .       -1      -1      .       -1      .       .       .       .       0
chr1    8000    9000    .       -1      -1      .       -1      .       .       .       .       0
chr1    9000    10000   .       -1      -1      .       -1      .       .       .       .       0

The last field is the number of overlapped bases with the reference.

And the other file (head) looks like this:

chr1    0       1000    0
chr1    1000    2000    0
chr1    2000    3000    0
chr1    3000    4000    0
chr1    4000    5000    0
chr1    5000    6000    0
chr1    6000    7000    0
chr1    7000    8000    0
chr1    8000    9000    0
chr1    9000    10000   0

The last field is the number of times an element overlaps in the reference.

My problem is that I want to exclude the elements with a size <= 10 and > 0. I mean, when this happens, I want to substract 1 from the count file.

I could manage to make some trials with just one chromosome using some AWK and Python, but now the files are from whole genomes.

My initial strategy (for 1 chr) was:

Filtering values with 0 < size <= 10
Isolating the start coordinates from the filter and comparing them with the start coordinates in the file with the counts
Creating an array with the indices
Substract one from the column with counts with the help of these indices

I was trying to follow a similar strategy with the files containing the whole genome, but it was useless.

Does anyone know a faster/easier way to do it?

Thanks in advance!

genome bedtools python • 3.6k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by Verónica Olmos ▴ 10

0

Entering edit mode

Which command are you running to getting two different outputs ? Can you explain what is your data and what are trying to do ?

ADD REPLY • link 9.1 years ago by GouthamAtla 12k

0

Entering edit mode

The reference file is the mouse genome divided into 1k windows.

The file I'm overlapping is a histone mark.

In bedtools, I type something like this:

bedtools intersect -a mm9.1k.genome.txt -b H3K9ac.txt -wao > wg_H3K9ac.txt

bedtools intersect -a mm9.1k.genome.txt -b H3K9ac.txt -c > count_wg_H3K9ac.txt

So I get the files I described in the post.

I'm trying to organize data like these (I have several other files too, but I already dealed with them) to create an integrative table to work stadistically with the data.

Since wg_H3K9ac.txt and count_wg_H3K9ac.txt have different number of lines I cannot correct directly the file with counts.

The output I'm interested in includes the information with the counts across the genome windows considering (in the sense of substracting) the cases where the size is so small that can be approximated to zero.

ADD REPLY • link 9.1 years ago by Verónica Olmos ▴ 10

0

Entering edit mode

Veronica, as Geek_y said, your question is very unclear. Which bedtools command are you using? Is it intersect? If so, try using (wo option).

I want to subtract 1 from the count file. What is count file?

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by PoGibas 5.1k

0

Entering edit mode

Sorry for my explanations.

The count file is the one I get with this command:

bedtools intersect -a mm9.1k.genome.txt -b H3K9ac.txt -c > count_wg_H3K9ac.txt

ADD REPLY • link 9.1 years ago by Verónica Olmos ▴ 10

0

Entering edit mode

It is still unclear what are you trying to do here.

Do you want to find how many overlaps each coordinate has? For example:

chr1    0       1000    1
chr1    1001       2000    10
chr1    2001       3000    8
...

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by PoGibas 5.1k

Ram · Answer 1 · 2015-04-07

0

Entering edit mode

9.1 years ago

GouthamAtla 12k

There should be no issues if the number of lines are different.

Sort both the files (sort -k1,1 -k2,2n)
Use -wo as mentioned by @Pgibas. You can also play around with -f and -r to choose percent of overlap.

And in your 1K regions, I think it should be 1-1000, 1001-2000, 2001-3000 etc.

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by GouthamAtla 12k

Ram · Answer 2 · 2015-04-09

You could use BEDOPS bedmap for this, using standard I/O streams:

Assume a sorted reference BED file containing windows over your genome, called ref.bed. If necessary:

$ sort-bed ref.unsorted.bed > ref.bed

Assume a sorted map BED file containing histone marks or other regions you want to count, called map.bed. Similarly, if necessary:

$ sort-bed map.unsorted.bed > map.bed

Assume both files are 0-indexed, half-open. (Why? Because we want to calculate their lengths correctly. See: What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems)

You could filter out regions less than eleven bases long with awk and pipe the filtered regions to count with bedmap, like so:

$ awk '($3-$2)>10' map.bed \
  | bedmap --echo --count --delim '\t' ref.bed - \
  > answer.bed

Note the use of the hyphen character in bedmap. BEDOPS tools use the convention used in Unix utilities to handle standard input from upstream processes. Its use here denotes that map data will come in from whatever filtered elements come out of awk.

The file answer.bed will have all the columns of ref.bed, with an additional column for the number or --count of filtered elements from map.bed, which overlap windows in the reference set.