Question

Count regions that overlap with python

0

Entering edit mode

3.3 years ago

ManuelDB ▴ 110

I have a dataframe with the coordinates of large mutations (CNVs)

 chr start end
1   200   1000
1   400   800
1  600   1500

How can I identify zones that overlap and count occurrences?

The result for the dataframe given above would be something like this

 start end occurrences 
200    800    2
600    1000   2
600    800    3

In the HPC where I work, I have a few python libraries (e.i. pandas) and Bedtools. How could I do this?

CNV • 2.9k views

ADD COMMENT • link updated 3.3 years ago by Alex Reynolds 36k • written 3.3 years ago by ManuelDB ▴ 110

1

Entering edit mode

That's what you looking for!

ADD REPLY • link 3.3 years ago by brunobsouzaa ▴ 840

4

Entering edit mode

how would you do that with bedtools intersect ?

ADD REPLY • link 3.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

It's bad practice to ask a question without an output example. How exactly is output supposed to be formatted. Check bedtools intersect as suggested, especially the counting option -wo. By the way, package managers like conda do not require root access so you always have the option to install most software you want with that, even on HPCs.

ADD REPLY • link 3.3 years ago by ATpoint 89k

0

Entering edit mode

I don't really know what you mean by "without an output example". I have provided the output I would like to achieve. The HPC I am using is the Genomic England Research Environment. In this HPC you cannot install anything if this is not previously provided by them. Not even using Conda.

I have checked the -wo option. But this mention that you need to bed file. I only have one. Not sure how to do what you mean.

ADD REPLY • link 3.3 years ago by ManuelDB ▴ 110

0

Entering edit mode

Your partitioning in the example output is incorrect or at least inconsistent with the starting example dataframe. Nonetheless, BEDOPS bedops --partition is an easy way to do this correctly.

ADD REPLY • link 3.3 years ago by Alex Reynolds 36k

score 0 · Answer 1 · 2022-07-25

0

Entering edit mode

3.3 years ago

Alex Reynolds 36k

Here's how you can do this with BEDOPS bedops --partition and bedmap --count, if you have that toolkit available on your HPC or can install it via the Bioconda channel.

Start by creating a sorted BED file from your dataframe text file:

tail -n+2 CNVs.txt | sort-bed - > CNVs.bed

Then partition the CNVs and map the CNVs against the partitions, count-ing the number of CNVs that fall over a partition:

bedmap --echo --count --delim '\t' <(bedops --partition CNVs.bed) CNVs.bed > answer.bed

If you just want the last three columns, i.e., to strip out the chromosome (not recommended):

cut -f2- answer.bed > answer.txt

ADD COMMENT • link 3.3 years ago by Alex Reynolds 36k

0

Entering edit mode

I will check this on my PC out of curiosity. I am not sure I have that tool in the HPC I am using.

ADD REPLY • link 3.3 years ago by ManuelDB ▴ 110

0

Entering edit mode

conda install -c bioconda bedops

If you need software on the HPC, you should have an administrator or IT department who can put that software in place for you.

ADD REPLY • link 3.3 years ago by Alex Reynolds 36k

score 0 · Answer 2 · 2022-07-25

I have developed this

 lines = [(200,1000),
         (400,800),
         (600,1500),
        ]


all_numbers = [(x[0],"O") for x in lines] + [(x[1],"C") for x in lines]

all_numbers = sorted(all_numbers, key=lambda x: x[0])


now_open = 0

for first_tuple, second_tuple in  zip(all_numbers, all_numbers[1:]):

    begin, what_side = first_tuple
    end, _ = second_tuple

    if what_side == "O":
        now_open += 1
    elif what_side == "C":
        now_open -= 1


    print(begin, end, now_open)

With Python ... don't need any python library.