2.2 years ago
ManuelDB ▴ 80

I am working in the research environment of Genomics England (which means the number of tools is very limited). I have a pandas data frame, and one of its columns contains gene IDs. Some are repeated.

I have a long bed file with all human exons. I want to get the exons of the genes that match that data frame. What is the best way to do this? I can use bedtools, shell commands and python commands only.

This is one step of an application I am developing.

The bed file looks like this

    #chr1 start end Gene_ID Exon_ID
    1    1      10  IDA     ID1
    1    10     20  IDA     ID2
    1    20     30  IDA     ID3
    2    1      10  IDB     ID1
    2    20     20  IDB     ID2
    2    30     30  IDB     ID3

If my data frame contains the gene IDB, the result should be

    2    1      10  IDB     ID1
    2    20     20  IDB     ID2
    2    30     30  IDB     ID3

I am thinking of getting the unique gene IDs, writing them to a list, and then doing the query with a shell script.

Something like this

    grep -Fw -f words myfile

Do you have a better idea?

If the Gene_ID values in both words and myfile match exactly, I would use Unix join instead of grep -F; it should be much faster than grep. Note that join expects both inputs to be sorted on the join field.
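Since the gene IDs already live in a pandas data frame, the same exact-match idea can also be sketched in plain Python. This is not join itself, just a set lookup on the Gene_ID field, so an ID only counts as a hit when it equals the fourth column, not when it appears somewhere else on the line. The function names, the toy data, and the assumption that Gene_ID is the fourth whitespace-separated field are illustrative; swap in your own column (for example something like `df["Gene_ID"].dropna()`, if that is what your column is called).

```python
import io


def unique_gene_ids(gene_column):
    """Return the distinct gene IDs from any iterable (e.g. a data frame column)."""
    return {str(gene).strip() for gene in gene_column}


def filter_bed_by_gene(bed_lines, gene_ids, gene_field=3):
    """Yield BED lines whose gene field (0-based index) is exactly in gene_ids."""
    for line in bed_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and the header line
        fields = stripped.split()  # split on tabs or runs of spaces
        if len(fields) > gene_field and fields[gene_field] in gene_ids:
            yield line.rstrip("\n")


# Toy BED content copied from the question (hypothetical stand-in for the real file).
bed_text = """#chr1 start end Gene_ID Exon_ID
1    1      10  IDA     ID1
1    10     20  IDA     ID2
1    20     30  IDA     ID3
2    1      10  IDB     ID1
2    20     20  IDB     ID2
2    30     30  IDB     ID3
"""

# Stand-in for the data frame column: repeated IDs and one ID absent from the BED.
genes_in_frame = ["IDB", "IDB", "IDX"]

wanted = unique_gene_ids(genes_in_frame)
matches = list(filter_bed_by_gene(io.StringIO(bed_text), wanted))

for row in matches:
    print(row)
```

For the real data, pass an open file handle instead of the `io.StringIO` object; the file is streamed line by line, so the full BED never has to fit in memory, and no sorting is needed.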


