Hi everyone!
I'm working with mouse whole-genome data in bedtools, and I've run into a problem.
I get two different outputs. One of them (head) looks like this:
chr1 0 1000 . -1 -1 . -1 . . . . 0
chr1 1000 2000 . -1 -1 . -1 . . . . 0
chr1 2000 3000 . -1 -1 . -1 . . . . 0
chr1 3000 4000 . -1 -1 . -1 . . . . 0
chr1 4000 5000 . -1 -1 . -1 . . . . 0
chr1 5000 6000 . -1 -1 . -1 . . . . 0
chr1 6000 7000 . -1 -1 . -1 . . . . 0
chr1 7000 8000 . -1 -1 . -1 . . . . 0
chr1 8000 9000 . -1 -1 . -1 . . . . 0
chr1 9000 10000 . -1 -1 . -1 . . . . 0
The last field is the number of overlapped bases with the reference.
And the other file (head) looks like this:
chr1 0 1000 0
chr1 1000 2000 0
chr1 2000 3000 0
chr1 3000 4000 0
chr1 4000 5000 0
chr1 5000 6000 0
chr1 6000 7000 0
chr1 7000 8000 0
chr1 8000 9000 0
chr1 9000 10000 0
The last field is the number of times an element overlaps in the reference.
My problem is that I want to exclude the elements with an overlap size > 0 and <= 10. That is, every time this happens, I want to subtract 1 from the corresponding count in the count file.
I managed to run some trials on a single chromosome using AWK and Python, but now the files cover the whole genome.
My initial strategy (for 1 chr) was:
- Filtering values with 0 < size <= 10
- Isolating the start coordinates from the filter and comparing them with the start coordinates in the file with the counts
- Creating an array with the indices
- Subtracting one from the counts column with the help of these indices
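As a minimal sketch of these steps in Python: instead of comparing start coordinates and building an index array, the rows can be matched on the full window (chromosome, start, end), which also works when several chromosomes are in the same file. The function names and the sample rows below are made up for illustration. The sketch assumes whitespace-separated columns, that the first three columns of both files are the window, and that the last column of the first file is the overlap size.

```python
from collections import Counter


def count_small_overlaps(overlap_lines, max_size=10):
    """Count, per window, the rows whose overlap size is in (0, max_size]."""
    small = Counter()
    for line in overlap_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        size = int(fields[-1])  # assumed: last column = overlapped bases
        if 0 < size <= max_size:
            small[(fields[0], fields[1], fields[2])] += 1
    return small


def correct_counts(count_lines, small):
    """Yield count rows with the small overlaps subtracted."""
    for line in count_lines:
        fields = line.split()
        if not fields:
            continue
        chrom, start, end = fields[:3]
        # A Counter returns 0 for windows that had no small overlaps.
        corrected = int(fields[3]) - small[(chrom, start, end)]
        yield f"{chrom}\t{start}\t{end}\t{corrected}"


# Made-up rows for illustration only (not real data).
overlap_rows = [
    "chr1 0 1000 . -1 -1 . -1 . . . . 0",
    "chr1 1000 2000 chr1 1500 1508 . 0 . . . . 8",
    "chr1 1000 2000 chr1 1600 1900 . 0 . . . . 300",
]
count_rows = ["chr1 0 1000 0", "chr1 1000 2000 2"]

small = count_small_overlaps(overlap_rows)
corrected_rows = list(correct_counts(count_rows, small))
print("\n".join(corrected_rows))
```

Both functions accept any iterable of lines, so open file handles can be passed in place of the lists. Only the windows that have small overlaps are kept in memory.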
I tried to follow a similar strategy with the whole-genome files, but it didn't get me anywhere.
Does anyone know a faster/easier way to do it?
Thanks in advance!
Which command are you running to get the two different outputs? Can you explain what your data is and what you are trying to do?
The reference file is the mouse genome divided into 1 kb windows.
The file I'm overlapping is a histone mark.
In bedtools, I type something like this:
So I get the files I described in the post.
I'm trying to organize data like these (I have several other files too, but I have already dealt with them) to create an integrative table for statistical analysis.
Since wg_H3K9ac.txt and count_wg_H3K9ac.txt have different numbers of lines, I cannot correct the count file directly.
The output I'm interested in is the count per genome window, after subtracting the cases where the overlap is so small that it can be approximated to zero.
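One possible way around the different line counts is to skip the count file and recount directly from the overlap file, counting only the rows whose overlap is larger than 10 bases. This is a minimal sketch, not a tested pipeline. It assumes that all rows for the same window are adjacent in the overlap file, that windows with no overlap still appear once with 0 in the last column (as in the head shown earlier), and that columns are whitespace-separated. The function names and sample rows are made up for illustration.

```python
from itertools import groupby


def window_key(line):
    """Return the (chrom, start, end) window from the first three columns."""
    fields = line.split()
    return (fields[0], fields[1], fields[2])


def count_large_overlaps(overlap_lines, min_size=10):
    """Yield one row per window with the number of overlaps > min_size."""
    rows = (line for line in overlap_lines if line.strip())
    # groupby only merges consecutive rows, so rows for the same window
    # are assumed to be adjacent in the input.
    for (chrom, start, end), group in groupby(rows, key=window_key):
        n = sum(1 for line in group if int(line.split()[-1]) > min_size)
        yield f"{chrom}\t{start}\t{end}\t{n}"


# Made-up rows for illustration only (not real data).
demo_rows = [
    "chr1 0 1000 . -1 -1 . -1 . . . . 0",
    "chr1 1000 2000 chr1 1500 1508 . 0 . . . . 8",
    "chr1 1000 2000 chr1 1600 1900 . 0 . . . . 300",
]
recounted = list(count_large_overlaps(demo_rows))
print("\n".join(recounted))
```

Because the rows are processed as a stream, only one window is held in memory at a time, which should matter for whole-genome files.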
Veronica, as Geek_y said, your question is very unclear. Which bedtools command are you using? Is it intersect? If so, try using the -wo option.
"I want to subtract 1 from the count file." What is the count file?
Sorry for my explanations.
The count file is the one I get with this command:
It is still unclear what you are trying to do here.
Do you want to find how many overlaps each coordinate has? For example: