Question

A efficient way to simulate bedtools interset with python

0

Entering edit mode

21 months ago

Manuel ▴ 40

I need to select the ROI of a gvcf using a bed file. This is the approach I have developed so far but I think it is quite inefficient

My bed tool is parsed like this

 [('chr1', 36931694, 36931909, 'CSF3R.exon.17.line.1.chr1.36931697.36932509'), ('chr1', 36931910, 36932116, 'CSF3R.exon.17.line.1.chr1.36931697.36932509')]

and the code that does the interset is this

def roi(gvcf_data,bedregions):
    '''Interset. Take ROI of the gVCF using
       the bed file'''
    rows = []
    for region in bedregions:

        for index, row in gvcf_data.iterrows():
            if (region[0] == row['CHROM']) & (row['POS'] in range (region[1],region[2])):
                rows.append([row['CHROM'], row['POS'],row['DP'], region[3]])
    return pd.DataFrame(rows, columns=["CHROM", "POS", "DP", "Probe_name"])

This application will be deployed on the computer of one sequencer and I cannot install or do anything to finally use bedtools

Bedtools • 1.4k views

ADD COMMENT • link updated 21 months ago by Istvan Albert 100k • written 21 months ago by Manuel ▴ 40

2

Entering edit mode

Not what you want to hear but reinventing high-performance software in a non-performant language with the help of strangers on the internet is not what I'd call a clever investment of time. You should rather come up with a workaround to either find a way deploying software on that computer or outsource the computation to a machine you have rights to. Variant selection is extremely efficient via something like bedtools, tabix or bcftools. Why exactly can’t you compile existing software on that machine? I mean, imagine you tomorrow realize that you need a samtools equivalent or aligner on that computer. Not going to implement that, right?

ADD REPLY • link 21 months ago by ATpoint 82k

2

Entering edit mode

Thank for your time.

There are many reasons I could give you to answer your question. Here are some of them,

1) First I am the trainee (but also the only bioinformatician) working in a clinical lab which means that my knowledge is limited.

2) The computer is in a sequencer that cost at the moment $128K.

3) Patient data cannot be sent to an external machine via the internet, all computers in my hospital are window. What is more, I have to have a serious discussion with the IT of our hospital because I was not able to even run a python application or similar programming language. So I cannot install a virtual environment to have for example Linux.

I have just developed a small application that after finishing the Illumina software that generates among other files, the gVCF files. And I have been called to create a report that informs the names of the probes of which one of their bases is lower than a cutoff. For this, I need to interset that gVCF file.

I don't think I'll waste too much time writing a few lines in this community to see if someone has a better idea. From these strangers (as you called them) I have learnt a lot, I have received a lot of help and I have helped

ADD REPLY • link 21 months ago by Manuel ▴ 40

0

Entering edit mode

Patient data cannot be sent to an external machine via the internet

But you surely have a network backup in place, no? So maybe you can access the files from the backup instead directly on the sequencing computer? After all, the machine running your tool doesn't need to be connected to the internet, if you pack your software and its dependencies into a container and deploy it on a computer that is part of the trusted network.

As far as your Python question is concerned, you could try to use cgranges instead (see the python subfolder in die repo for an example).

ADD REPLY • link 21 months ago by Matthias Zepper 4.6k

0

Entering edit mode

21 months ago

Manuel ▴ 40

Thanks all for your answers.

I have developed some code using dictionaries that is quite efficient and doesn't need any library

   f = open(file_path, 'r')
lines = f.readlines()
chr2base2index = dict()
for index,line in enumerate(lines):
    if (len(line)) == 1:
            break
    if line[0] == '#':            
            continue
    handle = line.strip().split()
    chrm, base = handle[0], int(handle[1])
    if chrm not in chr2base2index:
        chr2base2index[chrm] = dict()
    if base not in chr2base2index[chrm]:
        chr2base2index[chrm][base] = index

filtered_lines = []
for chrm, start, end, probe_name in bedregions:
    if chrm not in chr2base2index:
        print(f'Chromosome {chrm} not found')
        continue
    for base in range(start, end):
        index = chr2base2index[chrm].get(base, None)
        if index != None:
            filtered_lines.append('\t'.join(lines[index].strip().split() + [probe_name]))

ADD COMMENT • link 21 months ago by Manuel ▴ 40

0

Entering edit mode

your code does not look quite generic enough, it seems super odd to use a dictionary for the sole reason to map a chromosome to a single coordinate,

and then there seems to be a disconnect there, why would the line index of the file have any relationship to the start position?

the line index is simply the number of the line, it is not connected to what coordinate is stored at that line

writing an efficient and correct interval intersect code is a much harder task - it requires a dedicated data structure

ADD REPLY • link 21 months ago by Istvan Albert 100k

0

Entering edit mode

Sorry, you are right, I will change that. I copied and pasted a old versions

This now works

ADD REPLY • link 21 months ago by Manuel ▴ 40

score 2 · Accepted Answer · 2022-07-29

Even with Python you can get decent performance out of various interval tree implementations

https://pypi.org/project/intervaltree/

or

https://bx-python.readthedocs.io/en/latest/lib/bx.intervals.html

But of course, details matter, just how many intervals you have, as the overheads may accumulate. I would guess the pure Python interval tree would work up to, say ... perhaps millions of intervals. I would run some tests.