Question: Filtering counts from BEDTOOLS depending on the overlapped size
0
gravatar for Verónica Olmos
4.0 years ago by
European Union
Verónica Olmos10 wrote:

Hi everyone!

I'm working with some data related to mouse whole genome in Bedtools, but I'm having some problems with this.

I get two different outputs. One of them (head) looks like this:

chr1    0       1000    .       -1      -1      .       -1      .       .       .       .       0
chr1    1000    2000    .       -1      -1      .       -1      .       .       .       .       0
chr1    2000    3000    .       -1      -1      .       -1      .       .       .       .       0
chr1    3000    4000    .       -1      -1      .       -1      .       .       .       .       0
chr1    4000    5000    .       -1      -1      .       -1      .       .       .       .       0
chr1    5000    6000    .       -1      -1      .       -1      .       .       .       .       0
chr1    6000    7000    .       -1      -1      .       -1      .       .       .       .       0
chr1    7000    8000    .       -1      -1      .       -1      .       .       .       .       0
chr1    8000    9000    .       -1      -1      .       -1      .       .       .       .       0
chr1    9000    10000   .       -1      -1      .       -1      .       .       .       .       0

The last field is the number of overlapped bases with the reference.

And the other file (head) looks like this:

chr1    0       1000    0
chr1    1000    2000    0
chr1    2000    3000    0
chr1    3000    4000    0
chr1    4000    5000    0
chr1    5000    6000    0
chr1    6000    7000    0
chr1    7000    8000    0
chr1    8000    9000    0
chr1    9000    10000   0

The last field is the number of times an element overlaps in the reference.

My problem is that I want to exclude the elements with a size <= 10 and > 0. I mean, when this happens, I want to substract 1 from the count file.

I could manage to make some trials with just one chromosome using some AWK and Python, but now the files are from whole genomes.

My initial strategy (for 1 chr) was:

  1. Filtering values with 0 < size <= 10
  2. Isolating the start coordinates from the filter and comparing them with the start coordinates in the file with the counts
  3. Creating an array with the indices
  4. Substract one from the column with counts with the help of these indices

I was trying to follow a similar strategy with the files containing the whole genome, but it was useless.

Does anyone know a faster/easier way to do it?

Thanks in advance!

python bedtools genome • 1.4k views
ADD COMMENTlink modified 4.0 years ago by Alex Reynolds28k • written 4.0 years ago by Verónica Olmos10

Which command are you running to getting two different outputs ? Can you explain what is your data and what are trying to do ?

ADD REPLYlink written 4.0 years ago by geek_y9.4k

The reference file is the mouse genome divided into 1k windows.

The file I'm overlapping is a histone mark.

In bedtools, I type something like this:

bedtools intersect -a mm9.1k.genome.txt -b H3K9ac.txt -wao > wg_H3K9ac.txt

bedtools intersect -a mm9.1k.genome.txt -b H3K9ac.txt -c > count_wg_H3K9ac.txt

So I get the files I described in the post.

I'm trying to organize data like these (I have several other files too, but I already dealed with them) to create an integrative table to work stadistically with the data.

Since wg_H3K9ac.txt and count_wg_H3K9ac.txt have different number of lines I cannot correct directly the file with counts.

The output I'm interested in includes the information with the counts across the genome windows considering (in the sense of substracting) the cases where the size is so small that can be approximated to zero.

ADD REPLYlink written 4.0 years ago by Verónica Olmos10

Veronica, as Geek_y said, your question is very unclear. Which bedtools command are you using? Is it intersect? If so, try using (wo option).

I want to substract 1 from the count file. What is count file?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by PoGibas4.8k

Sorry for my explanations.

The count file is the one I get with this command:

bedtools intersect -a mm9.1k.genome.txt -b H3K9ac.txt -c > count_wg_H3K9ac.txt
ADD REPLYlink written 4.0 years ago by Verónica Olmos10

It is still unclear what are you trying to do here.

Do you want to find how many overlaps each coordinate has? For example:

chr1    0       1000    1
chr1    1001       2000    10
chr1    2001       3000    8
...
ADD REPLYlink written 4.0 years ago by PoGibas4.8k
0
gravatar for geek_y
4.0 years ago by
geek_y9.4k
Barcelona/CRG/London/Imperial
geek_y9.4k wrote:

There should be no issues if the number of lines are different.

1. Sort both the files (sort -k1,1 -k2,2n)

2. Use -wo as mentioned by @Pgibas. You can also play around with -f and -r to choose percent of overlap. 

And in your 1K regions, I think it should be 1-1000, 1001-2000, 2001-3000 etc. 

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by geek_y9.4k
0
gravatar for Alex Reynolds
4.0 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

You could use BEDOPS bedmap for this, using standard I/O streams:

Assume a sorted reference BED file containing windows over your genome, called ref.bed. If necessary:

$ sort-bed ref.unsorted.bed > ref.bed

Assume a sorted map BED file containing histone marks or other regions you want to count, called map.bed. Similarly, if necessary:

$ sort-bed map.unsorted.bed > map.bed

Assume both files are 0-indexed, half-open. (Why? Because we want to calculate their lengths correctly. See: "What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems".)

You could filter out regions less than eleven bases long with awk and pipe the filtered regions to count with bedmap, like so:

$ awk '($3-$2)>10' map.bed \
  | bedmap --echo --count --delim '\t' ref.bed - \
  > answer.bed

Note the use of the hyphen character in bedmap. BEDOPS tools use the convention used in Unix utilities to handle standard input from upstream processes. Its use here denotes that map data will come in from whatever filtered elements come out of awk.

The file answer.bed will have all the columns of ref.bed, with an additional column for the number or --count of filtered elements from map.bed, which overlap windows in the reference set.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1243 users visited in the last hour